Arithmetic operations when some DataFrame data is not int in Python (pandas)

Question

Asked 5 years, 3 months ago

Viewed 348 times

1

I am working with data from IBGE and have two spreadsheets from which I need to compute a percentage.

The formula is very simple:

percentage = (dividend / divisor) * 100

For example, take these two DataFrames:

import pandas as pd

data1 = {'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
         'prod_1': [576, 456, 789, 963]}
divisor = pd.DataFrame(data1)

data2 = {'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
         'prod_2': [123, '-', 231, '-']}
dividendo = pd.DataFrame(data2)

When I apply the formula to get the percentages:

quociente = ( dividendo['prod_2'] / divisor['prod_1'] ) * 100

I get the following error, which is expected:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

The problem is: how do I work around this to get the percentages while ignoring the cells that contain '-'?

Using for and if is out of the question, since there are about 70 tables with 500 rows each. Besides, I am told it is not good programming practice in pandas/Python.

In the end, I will need to merge all these spreadsheets into a single one containing the 70 tables I want, but I am stuck on computing the percentages efficiently.

1

What result do you expect to receive from '-' / 456?

– Augusto Vasques

2020/04/21 at 20:16

Well, according to IBGE, a '-' or 'X' indicates that data collection was insufficient or that there was no production. In this case, for '-', the right result would be the same '-', or any text indicating this.

– R. C. Junior

2020/04/21 at 22:19

3 answers

2

One solution is to build two Series, one for prod_1 and another for prod_2, and convert them to a numeric type with pandas.to_numeric(), passing errors='coerce'. This forces invalid values to become NaN; a Series that ends up containing NaN is stored as numpy.float64.

import pandas as pd

data1 = {
  'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
  'prod_1': [576, 456, 789, 963]
}

data2 = {
  'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
  'prod_2': [123, '-', 231, '-']
}

dividendo = pd.Series(data2['prod_2'])
divisor = pd.Series(data1['prod_1'])

dividendo = pd.to_numeric(dividendo, errors='coerce')
divisor = pd.to_numeric(divisor, errors='coerce')

print(dividendo / divisor * 100)

Resulting:

0    21.354167
1          NaN
2    29.277567
3          NaN
dtype: float64

Test the code on Repl.it: https://repl.it/repls/LightgreenYellowishPcboard
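Building on this answer, here is a minimal sketch of the same coercion applied to the columns of the original DataFrames, matched by the 'local' label rather than by row position. The names p1, p2, percentual and exibicao are illustrative and not from the answer. The last step, which puts '-' back for display, is an assumption based on the asker's comment that the marker should be preserved.

```python
import pandas as pd

divisor = pd.DataFrame({
    'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
    'prod_1': [576, 456, 789, 963],
})
dividendo = pd.DataFrame({
    'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
    'prod_2': [123, '-', 231, '-'],
})

# Index both columns by 'local' so the division matches rows by label.
p1 = divisor.set_index('local')['prod_1']
p2 = pd.to_numeric(dividendo.set_index('local')['prod_2'], errors='coerce')

# '-' was coerced to NaN, so those rows yield NaN instead of raising TypeError.
percentual = (p2 / p1 * 100).rename('percentual')

# Display-only copy (illustrative): show '-' where no percentage exists.
exibicao = percentual.round(2).astype(object).where(percentual.notna(), '-')
print(exibicao)
```

Keeping percentual numeric and building a separate display copy means later arithmetic or merging still works on real numbers.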

by Arthur Bacci • 97 points · Answer 1 · 2020-04-21T21:35:57+00:00

Using a for loop is never out of the question! What your code does is the same as a for loop; you just don't write the loop yourself. The best way is to use a for loop:

import pandas as pd

data1 = {'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
         'prod_1': [576, 456, 789, 963]}
divisor = pd.DataFrame(data1)

data2 = {'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
         'prod_2': [123, '-', 231, '-']}
dividendo = pd.DataFrame(data2)

quociente = []

# Walk both columns in parallel, one row at a time.
for p2, p1 in zip(dividendo['prod_2'], divisor['prod_1']):
    try:
        quociente.append(p2 / p1 * 100)
    except TypeError:
        # Non-numeric markers such as '-' cannot be divided; store 0 instead.
        quociente.append(0)

But what I would recommend most is that you clean the DataFrame first.
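One possible way to do that cleaning, sketched under assumptions: the helper limpar and its chave parameter are hypothetical names, and it assumes every column other than 'local' is meant to be numeric. It relies on pandas.to_numeric with errors='coerce', so markers such as '-' become NaN.

```python
import pandas as pd


def limpar(df, chave='local'):
    """Return a copy where every column except `chave` is numeric.

    Hypothetical helper: values that cannot be parsed (e.g. '-') become NaN.
    """
    limpo = df.copy()
    colunas = limpo.columns.drop(chave)
    limpo[colunas] = limpo[colunas].apply(pd.to_numeric, errors='coerce')
    return limpo


divisor = limpar(pd.DataFrame({
    'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
    'prod_1': [576, 456, 789, 963],
}))
dividendo = limpar(pd.DataFrame({
    'local': ['São Paulo', 'Rio de Janeiro', 'Curitiba', 'Salvador'],
    'prod_2': [123, '-', 231, '-'],
}))

# After cleaning, the original one-line formula works without a loop.
quociente = dividendo['prod_2'] / divisor['prod_1'] * 100
print(quociente)
```

Once the tables are cleaned this way, the same helper could be applied to each of the roughly 70 tables before any arithmetic.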

by R. C. Junior • 63 points · Answer 2 · 2020-04-22T02:10:17+00:00

Well, I really liked the proposed solutions. I tested each of them and saw that they worked for what I wanted to do.

But users Augusto Vasques and Arthur Bacci gave me an idea, so I changed the approach a little and got the desired result as follows:

import pandas as pd

data1 = {'local': ['São Paulo', 'Rio de Janeiro', 'Salvador'],
         'prod_1': [576, 456, 963]}
df1 = pd.DataFrame(data1)

data2 = {'local': ['São Paulo', 'Curitiba'],
         'prod_2': [123, 231]}
df2 = pd.DataFrame(data2)

# how='outer' keeps rows that appear in only one of the two tables.
nova = df1.merge(df2, on='local', how='outer')

Getting the following result:

            local  prod_1  prod_2
0       São Paulo   576.0   123.0
1  Rio de Janeiro   456.0     NaN
2        Salvador   963.0     NaN
3        Curitiba     NaN   231.0

For the NaN results I apply .fillna('-') or .fillna('X'), filling them with whichever marker is correct for the situation.

Also, I thank everyone who contributed an answer or a comment. This is the solution I ended up using.
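As a rough sketch of how the merge, the percentage and the final fillna could fit together: the column name percentual and the variable relatorio are illustrative, and the outer join is an assumption inferred from the NaN rows in the result shown above. The idea is to keep the merged table numeric and only replace NaN with '-' in a separate copy meant for the final report.

```python
import pandas as pd

df1 = pd.DataFrame({'local': ['São Paulo', 'Rio de Janeiro', 'Salvador'],
                    'prod_1': [576, 456, 963]})
df2 = pd.DataFrame({'local': ['São Paulo', 'Curitiba'],
                    'prod_2': [123, 231]})

# Outer join: places missing from either table get NaN in that table's column.
nova = df1.merge(df2, on='local', how='outer')

# NaN propagates through the arithmetic, so no loop or if is needed.
nova['percentual'] = nova['prod_2'] / nova['prod_1'] * 100

# Report copy (illustrative): cast to object so '-' can sit next to numbers.
relatorio = nova.astype(object).fillna('-')
print(relatorio)
```

Filling with '-' only at the end matters because a column that already contains the string marker can no longer be divided, which is the error from the question.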