Help with the syntax of my for loop

Question

Asked 7 years, 1 month ago

I want to print the percentage of each video game genre, but this part of the code does not return all the results.

porcent = pd.DataFrame(base.Genre.unique())

totalPorcGenre = len(base)

for porcen in porcent:
    contagemPorcen = base[base['Genre']==porcen].shape[0]
    print ('Gênero {}: {:0.2f}%'.format(porcen, contagemPorcen * 100/ totalPorcGenre))

Result:
Gênero 0: 0.00%

1 answer

by Leonardo Borges • **171** points · Answer 1 · 2019-03-31T18:28:08+00:00

I'm going to give you two approaches to solving the problem. The first follows your logic, iterating over a DataFrame. The second uses some of pandas' own methods. Let's go!

Example DataFrame

Before getting into the solution, I built a small sample of your DataFrame to use as an example.

import pandas as pd

data = {'Rank':[1,2,3,4],
        'Name':['Wii Sports', 'Super Mario', 'Mario Kart', 'Wii Sports R'],
        'Genre':['Sports', 'Plataform', 'Racing', 'Sports']}

base = pd.DataFrame(data)
base.head()

# output
    Rank    Name            Genre
0   1       Wii Sports      Sports
1   2       Super Mario     Plataform
2   3       Mario Kart      Racing
3   4       Wii Sports R    Sports

Solution 1 - Iterating through the DataFrame

Following your iteration logic: a for loop written directly over a DataFrame walks through its column labels, not its rows, which is why your code printed only Gênero 0 (the label of the single unnamed column). To iterate over the rows we must use an iteration method, a generator such as iterrows or itertuples. iterrows yields the index and the data of each row, similar to what Python's enumerate does. That way, using the same variable names, your code would look like this:

porcent = pd.DataFrame(base.Genre.unique())
totalPorcGenre = len(base)

for index, row in porcent.iterrows():
    contagemPorcen = base[base['Genre']==row[0]].shape[0]
    print ('Gênero {}: {:0.2f}%'.format(row[0], contagemPorcen * 100/ totalPorcGenre))

# output
Gênero Sports: 50.00%
Gênero Plataform: 25.00%
Gênero Racing: 25.00%
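
As a minimal sketch of why the original loop printed a single line, the snippet below compares what a plain for loop sees on the DataFrame with what it sees on the array returned by unique(). The small base DataFrame and the names labels and genres are assumptions made for illustration, not part of the original question.

```python
import pandas as pd

# Small sample standing in for the real dataset (assumption for illustration).
base = pd.DataFrame({'Genre': ['Sports', 'Plataform', 'Racing', 'Sports']})
porcent = pd.DataFrame(base.Genre.unique())

# Iterating over a DataFrame directly yields its column labels, not its rows.
# The single unnamed column gets the default label 0.
labels = list(porcent)

# Iterating over the array returned by unique() yields the genre names themselves.
genres = list(base.Genre.unique())

for genre in genres:
    # Summing a boolean mask counts the rows that match this genre.
    count = (base['Genre'] == genre).sum()
    print('Gênero {}: {:0.2f}%'.format(genre, count * 100 / len(base)))
```

In other words, dropping the pd.DataFrame(...) wrapper and looping over base.Genre.unique() directly is another way to keep the original loop almost unchanged.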

Solution 2 - Using pandas native methods

However, I do not think iterating is the best approach. pandas has a multitude of methods that make manipulating rows and columns more efficient. These methods are built so that you do not need to write explicit iterations/loops, which improves performance. On datasets with few rows the difference is imperceptible, but on larger ones it becomes more evident. Talk is cheap, show me the code.

# In your code you created a DataFrame with a single column to iterate over.
porcent = pd.DataFrame(base.Genre.unique())

# Instead, we use groupby to create
# a new DataFrame with a new statistic: count
group_genre = pd.DataFrame({'count': base.groupby(['Genre']).size()})
group_genre

# output

Genre       count   
Plataform   1
Racing      1
Sports      2

After that, just add a new column with the new metric of interest, the share of each game genre (stored here as a fraction; multiply by 100 to get a percentage).

count_total = base.shape[0]
count_total

# output
4

group_genre['percent'] = group_genre['count']/count_total
group_genre

# output
Genre       count   percent
Plataform   1       0.25
Racing      1       0.25
Sports      2       0.50

The advantage of this approach, compared with simply iterating, calculating and printing a number on the screen, is that your main statistics of interest stay available in memory. You can continue your exploration from there, performing other calculations or plotting histograms and other charts.
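
As a hedged alternative to the groupby steps, the same shares can be sketched with Series.value_counts and its normalize parameter. This is a different technique from the one used in the answer, and the small base DataFrame and the name percent_by_genre are assumptions made for illustration.

```python
import pandas as pd

# Small sample standing in for the real dataset (assumption for illustration).
base = pd.DataFrame({'Genre': ['Sports', 'Plataform', 'Racing', 'Sports']})

# normalize=True returns fractions that sum to 1;
# multiplying by 100 turns them into percentages.
percent_by_genre = base['Genre'].value_counts(normalize=True) * 100

# The result is a Series indexed by genre, so it stays available for further analysis.
for genre, pct in percent_by_genre.items():
    print('Gênero {}: {:0.2f}%'.format(genre, pct))
```

The result is a Series rather than a DataFrame, which is enough when only the percentages are needed; the groupby version is more convenient when the counts and other statistics should sit side by side.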