I'm going to give you two approaches to solving the problem. The first follows your logic, iterating over a DataFrame. The second uses pandas' own methods. Let's go!
Example DataFrame
Before getting into the solution, I built a small sample of your DataFrame to use as an example.
import pandas as pd
data = {'Rank':[1,2,3,4],
'Name':['Wii Sports', 'Super Mario', 'Mario Kart', 'Wii Sports R'],
'Genre':['Sports', 'Plataform', 'Racing', 'Sports']}
base = pd.DataFrame(data)
base.head()
# output
   Rank          Name      Genre
0     1    Wii Sports     Sports
1     2   Super Mario  Plataform
2     3    Mario Kart     Racing
3     4  Wii Sports R     Sports
Solution 1 - Iterating through the DataFrame
Following your iteration logic: to iterate properly over a DataFrame we must use one of its iteration methods, a generator such as iterrows or itertuples. With iterrows, the generator yields the index and the data of each row, similar to what Python's enumerate does. Keeping your variable names, your code would look like this:
porcent = pd.DataFrame(base.Genre.unique())
totalPorcGenre = len(base)
for index, row in porcent.iterrows():
    # Count the rows of the original DataFrame that belong to this genre
    contagemPorcen = base[base['Genre'] == row[0]].shape[0]
    print('Genre {}: {:0.2f}%'.format(row[0], contagemPorcen * 100 / totalPorcGenre))
# output
Genre Sports: 50.00%
Genre Plataform: 25.00%
Genre Racing: 25.00%
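The same loop can also be written with itertuples, the other iteration method mentioned above. This is only a sketch under my own assumptions: the names genres, total and results, and the 'Genre' column label on the helper DataFrame, are not in your code; I chose them for illustration.

```python
import pandas as pd

base = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Name': ['Wii Sports', 'Super Mario', 'Mario Kart', 'Wii Sports R'],
    'Genre': ['Sports', 'Plataform', 'Racing', 'Sports'],
})

# Helper DataFrame with one row per distinct genre (column name is an assumption)
genres = pd.DataFrame({'Genre': base['Genre'].unique()})
total = len(base)

results = {}
# itertuples yields one namedtuple per row; fields are accessed by column name
for row in genres.itertuples(index=False):
    count = (base['Genre'] == row.Genre).sum()
    results[row.Genre] = count * 100 / total
    print('Genre {}: {:0.2f}%'.format(row.Genre, results[row.Genre]))
```

Accessing row.Genre by name avoids depending on the positional row[0] lookup, and the results dictionary keeps the numbers around instead of only printing them.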
Solution 2 - Using pandas native methods
However, I don't think this is the best approach. pandas has a multitude of methods that make manipulating rows and columns more efficient. These methods are built so that you don't need explicit iterations/loops, which improves performance. On datasets with few rows the difference is imperceptible, but on larger ones it becomes evident. Talk is cheap, show me the code.
# In your code you created a DataFrame with a single field to iterate over.
porcent = pd.DataFrame(base.Genre.unique())
# Instead, we use groupby to create
# a new DataFrame with a new statistic: count
group_genre = pd.DataFrame({'count': base.groupby(['Genre']).size()})
group_genre
# output
Genre      count
Plataform      1
Racing         1
Sports         2
After that, just add a new column with the metric of interest: the share of each game genre, expressed as a fraction of the total (multiply by 100 to get a percentage).
count_total = base.shape[0]
count_total
# output
4
group_genre['percent'] = group_genre['count']/count_total
group_genre
# output
Genre      count  percent
Plataform      1     0.25
Racing         1     0.25
Sports         2     0.50
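As a shorter variant of the same idea, value_counts with normalize=True computes the fraction per genre in a single call. This is a minimal sketch that swaps groupby for value_counts; the names shares and percent are mine, chosen for illustration.

```python
import pandas as pd

base = pd.DataFrame({'Genre': ['Sports', 'Plataform', 'Racing', 'Sports']})

# Fraction of rows per genre, ordered from most to least frequent
shares = base['Genre'].value_counts(normalize=True)

# Convert the fractions to percentages, rounded to two decimal places
percent = (shares * 100).round(2)
print(percent)
```

The result is a Series indexed by genre, so it can be turned into a DataFrame or plotted just like group_genre.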
The interesting thing about this approach, compared with simply iterating, calculating and printing a number on the screen, is that your main statistics of interest stay available in memory, so you can continue your exploration, perform other calculations, or build histograms and other plots.
The first solution worked, thanks for the help @Leonardoborges
– user143390
Could you explain to me what the parameter axis=None is?
– user143390
In pandas we usually deal with DataFrames (DF), which are a matrix-like data structure, that is, we have rows and columns. So when we want to apply an aggregating operation to a DF, such as an average (mean), pandas has to ask: should the operation run along the rows or along the columns? axis=0 (the default) aggregates down the rows, giving one result per column; axis=1 aggregates across the columns, giving one result per row.
– Leonardo Borges
It's easier if you think of an Excel table. Imagine a table with several rows and columns where you want the average over all rows: use mean with axis=0 and you get a new row holding the average of each column. With axis=1, you would get a new column holding the average of each row's values.
– Leonardo Borges
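To make the Excel analogy concrete, here is a small sketch on a made-up table; the DataFrame df and its columns a and b are hypothetical, invented only to show the two directions.

```python
import pandas as pd

# Hypothetical 3x2 table used only for illustration
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 collapses the rows: one mean per column (the "new row")
col_means = df.mean(axis=0)

# axis=1 collapses the columns: one mean per row (the "new column")
row_means = df.mean(axis=1)

print(col_means)
print(row_means)
```

Here col_means is indexed by the column names, while row_means has one entry per original row.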
And what would axis=None be?
– user143390
I personally don't know of a pandas method that uses axis=None; it is either 0 (the default) or 1. Usually the methods that take this argument (axis) are aggregation methods (mean, sum, count, etc.), that is, they aggregate several rows or columns at once in some way. For that, it only makes sense to choose which axis you want the operation applied along. Do you know any method that accepts axis=None?
– Leonardo Borges
axis=None and axis=0 are the same thing (it seems to me); they return the same result.
– user143390
0 and None are distinct objects in Python. I would need to test it, but I believe that if I pass None instead of zero, we get an error. See the documentation: apply: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html mean: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean. sum: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html All of them use axis=0 or 1; as far as I can see, None is not among the options.
– Leonardo Borges
Anyway, thanks for the help with the code. College is brutal this semester.
– user143390