Help me fix the syntax of my FOR loop

I want to compute the percentage of each video game genre, but this part of the code does not return all the results.

porcent = pd.DataFrame(base.Genre.unique())

totalPorcGenre = len(base)

for porcen in porcent:
    contagemPorcen = base[base['Genre']==porcen].shape[0]
    print ('Gênero {}: {:0.2f}%'.format(porcen, contagemPorcen * 100/ totalPorcGenre))
Result:
Gênero 0: 0.00%

1 answer

I'm going to give you two approaches to solving the problem. The first follows your logic, iterating over a DataFrame. The second uses pandas' own methods. Let's go!

Example DataFrame

Before getting into the solution, I built a small sample of your DataFrame to use as an example.

import pandas as pd

data = {'Rank':[1,2,3,4],
        'Name':['Wii Sports', 'Super Mario', 'Mario Kart', 'Wii Sports R'],
        'Genre':['Sports', 'Plataform', 'Racing', 'Sports']}

base = pd.DataFrame(data)
base.head()

# output
    Rank    Name            Genre
0   1       Wii Sports      Sports
1   2       Super Mario     Plataform
2   3       Mario Kart      Racing
3   4       Wii Sports R    Sports

Solution 1 - Iterating through Dataframe

Following your iteration logic: to iterate properly over a DataFrame we must use an iteration method, a generator such as iterrows or itertuples. iterrows yields the index and the row holding that row's data, similar to what Python's enumerate does. Iterating directly over a DataFrame, as your loop does, yields its column labels instead (here the single column 0), which is why you got "Gênero 0: 0.00%". Keeping the same variable names, your code would look like this:

porcent = pd.DataFrame(base.Genre.unique())
totalPorcGenre = len(base)

for index, row in porcent.iterrows():
    contagemPorcen = base[base['Genre']==row[0]].shape[0]
    print ('Gênero {}: {:0.2f}%'.format(row[0], contagemPorcen * 100/ totalPorcGenre))

# output
Gênero Sports: 50.00%
Gênero Plataform: 25.00%
Gênero Racing: 25.00%
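
As a side note, here is a small sketch of why the original loop printed only "Gênero 0" and of a variant that skips the intermediate DataFrame altogether. The names labels, total and percentages are my own, not from the question, and the sample data is the same reduced example used above.

```python
import pandas as pd

base = pd.DataFrame({'Genre': ['Sports', 'Plataform', 'Racing', 'Sports']})
porcent = pd.DataFrame(base.Genre.unique())

# Iterating over a DataFrame directly walks its column labels, not its rows.
# porcent has a single unnamed column, so the only label is 0.
labels = list(porcent)

# Looping over the array returned by unique() gives the genre names themselves.
total = len(base)
percentages = {}
for genre in base['Genre'].unique():
    count = (base['Genre'] == genre).sum()  # number of rows with this genre
    percentages[genre] = count * 100 / total
    print('Gênero {}: {:0.2f}%'.format(genre, percentages[genre]))
```

This keeps the loop-based logic but avoids both the extra DataFrame and iterrows.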

Solution 2 - Using pandas native methods

However, I do not think this is the best approach. pandas has a multitude of methods that make manipulating rows and columns more efficient. These methods are vectorized, so they avoid explicit Python loops and improve performance. On small datasets the difference is imperceptible, but on larger ones it becomes evident. Talk is cheap, show me the code.

# In your code you created a DataFrame with a single column in order to iterate over it.
porcent = pd.DataFrame(base.Genre.unique())

# Instead, we use groupby to create
# a new DataFrame with a new statistic: count
group_genre = pd.DataFrame({'count': base.groupby(['Genre']).size()})
group_genre

# output

Genre       count   
Plataform   1
Racing      1
Sports      2

After that, just add a new column with the metric of interest: each genre's share of the total (multiply by 100 to express it as a percentage).

count_total = base.shape[0]
count_total

# output
4

group_genre['percent'] = group_genre['count']/count_total
group_genre

# output
Genre       count   percent
Plataform   1       0.25
Racing      1       0.25
Sports      2       0.50

The advantage of this approach over simply iterating, calculating and printing a number on the screen is that your main statistics stay available in memory, so you can continue your exploration with other calculations, histograms or other charts.
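
For reference, a more compact variant of the same idea, offered as a sketch rather than as the way to do it: value_counts(normalize=True) returns each genre's share directly, and multiplying by 100 turns it into a percentage. The name genre_percent is my own.

```python
import pandas as pd

# Same reduced sample as in the example above.
base = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Name': ['Wii Sports', 'Super Mario', 'Mario Kart', 'Wii Sports R'],
    'Genre': ['Sports', 'Plataform', 'Racing', 'Sports'],
})

# normalize=True returns relative frequencies (fractions that sum to 1);
# multiplying by 100 converts them to percentages.
genre_percent = base['Genre'].value_counts(normalize=True) * 100
print(genre_percent.round(2))
```

The result is a Series indexed by genre, so it can still be reused for plots or further calculations.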

  • The first solution worked, thanks for the help @Leonardoborges

  • Could you explain to me what the parameter axis=None is?

  • In pandas, we usually deal with DataFrames (DF), which are matrix-like data structures, that is, they have rows and columns. So when we want to apply an aggregation to a DF, such as an average (mean), pandas has to know: should the operation run along the rows or along the columns? axis=0 (the default) runs along the rows, collapsing them into one result per column. axis=1 runs along the columns, giving one result per row.

  • It's easier if you think of an Excel table. Imagine a table with several rows and columns. If you want the average over all rows, use mean with axis=0 and you get a new row holding the average of each column. With axis=1, you get a new column holding the average of each row.

  • And what would axis=None be?

  • Personally, I do not know of a pandas method that uses axis=None; it is either 0 (the default) or 1. The methods that take this argument (axis) are usually aggregation methods (mean, sum, count, etc.), that is, they combine several rows or columns at once, so it only makes sense to choose which axis to apply them on. Do you know any method that accepts axis=None?

  • axis=None and axis=0 seem to be the same thing to me; they return the same result.

  • 0 and None are distinct objects in Python. I would need to test it, but I believe that passing None instead of zero raises an error. See the documentation: apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html mean: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean. sum: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html All of them use axis=0 or 1; None is not an option.

  • Even so, thanks for the help with the code; college is intense this semester.
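
To make the axis discussion in the comments concrete, here is a minimal sketch on a made-up two-column table (the DataFrame df and the names col_means and row_means are hypothetical, not from the question). axis=None is deliberately left out, since how it is handled depends on the method and the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 collapses the rows: one mean per column (a -> 2.0, b -> 20.0).
col_means = df.mean(axis=0)

# axis=1 collapses the columns: one mean per row (5.5, 11.0, 16.5).
row_means = df.mean(axis=1)

print(col_means)
print(row_means)
```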
