Is it possible to replace certain values with NA in pandas without the use of loops?

I was studying data cleaning and saw that columns that should hold strings sometimes contain int values, and vice versa. The author of the publication I was reading solves this with a for loop that replaces those values with NaN, as follows.

# Detecting numbers 
cnt=0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED']=np.nan
    except ValueError:
        pass
    cnt+=1

But aren't loops too slow for large volumes of data? Is there another way to do it?

  • I did not understand this code. Why does it cast the value to an integer if it then replaces the value with NaN? It seems the program would do the same thing without the line int(row). Why not use np.where for that purpose?

  • The int(row) is only there to test whether row can be converted to an integer. If it cannot, an exception is raised. The big problem with this code, besides iterating item by item, is that it assumes the index is numeric, starts at 0 (zero), and is sequential, which is not always true.
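A minimal sketch of the index problem raised in the comment above, assuming a hypothetical frame whose index is neither zero-based nor sequential. The helper name `is_int_like` and the sample data are illustrative only. Collecting the real index labels from `Series.items()` avoids the manual counter:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the index labels are 10, 20, 30, not 0, 1, 2.
df = pd.DataFrame({"OWN_OCCUPIED": ["Y", "12", "N"]}, index=[10, 20, 30])

def is_int_like(value):
    # True when int() accepts the value, False otherwise.
    try:
        int(value)
        return True
    except (ValueError, TypeError):
        return False

# Series.items() yields (index label, value) pairs, so the labels are always right.
numeric_labels = [
    label for label, value in df["OWN_OCCUPIED"].items() if is_int_like(value)
]
df.loc[numeric_labels, "OWN_OCCUPIED"] = np.nan

print(df)
```

This is still an element-by-element loop, so it only addresses the index assumption, not the speed concern.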

1 answer

We have two scenarios:

  1. Columns that have integers that should be strings
  2. Columns that have strings that should be integers

For the first case, where for example the integer 1 has to become the string "1", the solution is simple: just use astype.

Example

df["coluna"] = df["coluna"].astype(str)
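As a small illustration of the line above, here is a sketch with made-up data in the "coluna" column. After `astype(str)` every element is a Python string, including the ones that started as integers:

```python
import pandas as pd

# Hypothetical column mixing integers and strings.
df = pd.DataFrame({"coluna": [1, "a", 3]})

df["coluna"] = df["coluna"].astype(str)

# Every element is now a str.
print(df["coluna"].tolist())
print(all(isinstance(value, str) for value in df["coluna"]))
```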

For the second case we have two possibilities:

a. All values that have to be converted from string to int (float) can be converted

b. Some (or several, or all) values that have to be converted from string to int (float) cannot be converted

In case a, where all values can be converted, just use the same approach already described:

df["coluna"] = df["coluna"].astype(int)
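A short sketch of why this only works when every value is convertible, using made-up Series named `ok` and `bad`. `astype(int)` succeeds on clean data and raises `ValueError` as soon as one value cannot be parsed:

```python
import pandas as pd

# All values convertible: astype(int) works.
ok = pd.Series(["1", "2", "3"])
print(ok.astype(int).tolist())

# One value ("a") is not convertible: astype(int) raises ValueError.
bad = pd.Series([1, "a", 3, "4"])
try:
    bad.astype(int)
    failed = False
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```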

For case b, where some values cannot be converted, see the example:

Create a function that converts a value to int, or returns NaN when it cannot

import numpy as np
import pandas as pd

def to_int(row):
    # Return the value as an int, or NaN when it cannot be converted.
    try:
        return int(row)
    except (ValueError, TypeError):
        return np.nan

Using the function in a dataframe

df = pd.DataFrame({"A": [1, "a", 3, "4"]})

print(df)

   A
0  1
1  a
2  3
3  4

df["A"] = df["A"].apply(to_int)

print(df)

     A
0  1.0
1  NaN
2  3.0
3  4.0
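Note that `apply` still calls the Python function once per element. As a possible alternative (not part of the original answer), pandas provides `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN without a user-written loop. A minimal sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, "a", 3, "4"]})

# Values that cannot be parsed as numbers become NaN.
df["A"] = pd.to_numeric(df["A"], errors="coerce")

print(df)
```

One difference from `to_int`: `pd.to_numeric` also accepts strings such as "1.5", which `int()` rejects, so the two are not exactly equivalent.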

EDITED 10/08/2021 - reason: comment below

If the 10 (from the comment below) is stored as an integer, do this:

df["A"].apply(lambda x: x if isinstance(x, str) else np.nan)

If the 10 is stored as a string, do something like:

def only_strings(row):
    # Return NaN for values that look like integers; keep everything else.
    try:
        int(row)
        return np.nan
    except ValueError:
        return row

and call with

df["A"] = df["A"].apply(only_strings)
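For larger data, a vectorized variant of the same idea might look like the sketch below. It is an assumption-laden alternative, not the answer's method: it builds a boolean mask with `pd.to_numeric(..., errors="coerce")` and blanks the number-like entries with `Series.mask`. The sample values and the name `is_number` are made up for illustration.

```python
import pandas as pd

# Hypothetical yes/no column polluted by a numeric string and an integer.
df = pd.DataFrame({"A": ["yes", "10", 10, "no"]})

# True where the value can be parsed as a number.
is_number = pd.to_numeric(df["A"], errors="coerce").notna()

# Series.mask replaces entries where the condition is True with NaN.
df["A"] = df["A"].mask(is_number)

print(df)
```

Unlike the `int()` test, this mask also treats strings such as "1.5" as numbers, so check that this matches the cleaning rule you want.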

End of edit

  • I guess I couldn't express my doubt properly in the question, but your answer helped me a lot anyway; now I know how to clean strings out of numeric columns. When I spoke of int values in string columns, that was actually a mistake: the value is a number stored as a string. For example, the number 10 in a column where the value should be yes or no.

  • I updated the post.

  • @Lucaslopes, as I understood it, this can be done using apply with a function that removes elements that are out of context. If that is not the case, please edit the question to clarify it.
