How to change abbreviated values in a DF using Pandas in Python

Asked

Viewed 53 times

2

I have the following problem, I have a df with more than 50 columns, in these columns, some have value of type "35,57B", "6,85T". How do I make iteratively and replace the examples cited above so that they stay as follows "3557000000","6850000000000"?

Exemplo de como é o DF

I tried something like

for col in df:
    df.col = (df.col.replace(r'[KMBT]+$', '', regex=True).astype(float) * df.col.str.extract(r'[\d\.]+([KMBT]+)', expand=False).fillna(1).replace(['K','M','B','T'], [10**3, 10**6, 10**9, 10**12]).astype(int))

But I was unsuccessful, and had the following mistake:

AttributeError: 'DataFrame' object has no attribute 'col'

Thanks for the answers with them it was possible to elaborate the solution. Which was the following:

for col in df.columns[1:]:    
    df[col] = (df[col].str.replace(',','.').replace('N/A','0').replace(r'[kMBT%]+$', '', regex=True).astype(float) * df[col].str.extract(r'[\d\.]+([kMBT%]+)', expand=False).fillna(1).replace(['k','M','B','T',"%"], [10**3, 10**6, 10**9, 10**12,1]).astype(float))
    df[col] = df[col].astype(str)
    df[col] = df[col].str.replace('.',',')
  • Can provide an example dataset?

2 answers

3


One option is to define a function, use list comprehension and iterate over df.columns:

def transform_str(value:str):
    return value.replace(",","").replace("B","000000").replace("T","0000000000")

for col in df.columns:
    df[col]=[transform_str(k) for k in df[col]]
  • Good afternoon! Then your solution generates the following output 3562000000, 6850000000000 only the values are for example 35,63B and 6,87T, the data is not all in the same format, has data like "48,1B". For this reason I was looking for something that multiplied the existing value by, for example, replacing "B" with "10**9". But thank you so much gave me an insight into how to solve!

  • 1

    I get it. But these were the examples you gave. To have a complete solution, I suggest sharing the database or a part of it.

2

In addition to the response of Lucas (that I even prefer), you can keep the basis of your code

Creating the test data frame

import pandas as pd
dados = ["35,57B", "6,85T"]
df = pd.DataFrame({'Dados': dados})
df

Displaying the data frame

     Dados
0   35,57B
1    6,85T

I added an extra replace and in regex I put a comma in place of the point, astype changed to int in both conditions

df['Dados'] = df['Dados'].str.replace(',','').replace(r'[KMBT]+$', '', regex=True).astype('int64') * df['Dados'].str.extract(r'[\d\,]+([KMBT]+)', expand = False).replace(['K','M','B','T'], [10**3, 10**6, 10**9, 10**12]).astype('int64')

If you want to save as string

df['Dados'] = df['Dados'].astype(str)

Exit:

0      3557000000000
1    685000000000000
  • Good Afternoon! I was trying to make your solution iteratively, I ended up falling into the same problem I had before. The code would look something like: for col in df.columns:
 df[col] = df[col].str.replace(',','').replace(r'[KMBT]+$', '', regex=True).astype('int64') * df[col].str.extract(r'[\d\,]+([KMBT]+)', expand = False).replace(['K','M','B','T'], [10**3, 10**6, 10**9, 10**12]).astype('int64') But my first column of df is one of code, which I didn’t want to iterate on. You know what I can do to get this one to start in the first column instead of the 0 column?

  • @Dataworm1523, good afternoon! for col in df.columns.drop(['Coluna_que_você_não_deseja']): I think something like that would work

Browser other questions tagged

You are not signed in. Login or sign up in order to post.