Replacing NaN values with another column's value from the next non-NaN row

3

I have a DataFrame with several columns (only two are shown in this post). I need to fill the NaN values of one column with certain values from another. See below:

Creating the test DataFrame

>>> import pandas as pd

>>> df = pd.DataFrame({"base": [2, 2, 3, 3, 4, 4, 5, 5], "valores":[3, None, 100, 3, None, None, 15, None]})

>>> df
   base  valores
0     2      3.0
1     2      NaN
2     3    100.0
3     3      3.0
4     4      NaN
5     4      NaN
6     5     15.0
7     5      NaN

The output I expect:

>>> df
   base  valores
0     2      3.0
1     2      3.0   # value of the base column at index 3
2     3    100.0
3     3      3.0
4     4      5.0   # value of the base column at index 6
5     4      5.0   # value of the base column at index 6
6     5     15.0
7     5      NaN   # no later value

That is, each NaN in valores should be replaced with the value of base from the next row whose valores is valid. If the last row is NaN, it stays NaN.

What I tried

I tried fillna(), which fills NaN with a fixed value or, with method='bfill', with the next non-NaN value of the same column, as below:

>>> df["valores"].fillna(method='bfill')
0      3.0
1    100.0
2    100.0
3      3.0
4     15.0
5     15.0
6     15.0
7      NaN

I also tried fillna() taking the values from the "base" column, as below:

>>> df["valores"].fillna(df["base"])
0      3.0
1      2.0
2    100.0
3      3.0
4      4.0
5      4.0
6     15.0
7      5.0
Name: valores, dtype: float64

However, the values filled in come from the same index, not from the next valid row.

I need to combine the two behaviors, or find another way to get the result.

Other ideas

Additionally, other methods I thought could help are isna() and notna():

>>> df["valores"].isna()
0    False
1     True
2    False
3    False
4     True
5     True
6    False
7     True
Name: valores, dtype: bool

2 answers

3


You can build a temporary Series that holds the values of the base column only where valores is not null, using .mask, .isna and .bfill. With that Series stored in a variable, pass it to fillna to replace the null values of the valores column.

temp = df['base'].mask(df['valores'].isna()).bfill()
df['valores'] = df['valores'].fillna(temp)
df.head(10)

# output
    base    valores
0   2       3.0
1   2       3.0
2   3       100.0
3   3       3.0
4   4       5.0
5   4       5.0
6   5       15.0
7   5       NaN

Here mask returns a Series of the same length as the DataFrame, with null wherever the condition (df['valores'].isna()) is true. The intermediate results are shown step by step below.

mask with isna

df['base'].mask(df['valores'].isna())
# output:
0    2.0
1    NaN
2    3.0
3    3.0
4    NaN
5    NaN
6    5.0
7    NaN

mask, isna and bfill

df['base'].mask(df['valores'].isna()).bfill()
# output:
0    2.0
1    3.0
2    3.0
3    3.0
4    5.0
5    5.0
6    5.0
7    NaN
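
The same steps can be wrapped in a small helper. This is only a sketch: the function name fill_from_next_valid and its parameters are hypothetical, and it uses where(notna()), which should be equivalent to mask(isna()) here.

```python
import pandas as pd


def fill_from_next_valid(df, target, source):
    """Fill NaN in `target` with `source` taken from the next row where `target` is valid.

    Hypothetical helper, not part of pandas.
    """
    # Keep `source` only on rows where `target` has a value, then back-fill
    # so every row sees the `source` of the next valid row.
    candidates = df[source].where(df[target].notna()).bfill()
    # Rows that already have a value keep it; only NaN rows take the candidate.
    return df[target].fillna(candidates)


df = pd.DataFrame({
    "base": [2, 2, 3, 3, 4, 4, 5, 5],
    "valores": [3, None, 100, 3, None, None, 15, None],
})
result = fill_from_next_valid(df, "valores", "base")
print(result)
```

Returning a new Series instead of assigning in place leaves the original column untouched, so the result can be inspected before overwriting df["valores"].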

2

One possibility is to build a dictionary group from the data frame df in which the keys are the distinct values of the column base and each value is the list of df indexes whose base equals the key and whose valores is not NaN.

Then apply the function replace() to the rows of df through transform(), keeping the result from the first to the penultimate row. replace() takes three parameters:

  • val: the row to be transformed.
  • g: the dictionary defined above.
  • df: the data frame itself.

For every row of df:

  • check whether valores in the row val is null.
    • If so, look in g for the next base value that has a valid valores.
    • If there is no such base value, return the row unchanged.
    • Otherwise, replace the NaN with the base value at the index stored in g.

Testing with the example

import pandas as pd


df = pd.DataFrame({
    "base": [2, 2, 3, 3, 4, 4, 5, 5], 
    "valores":[3, None, 100, 3, None, None, 15, None]
})


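def replace(val, g, df):
    # val is one row of df, accessed by column label.
    val = val.copy()
    if pd.isna(val["valores"]):
        # Next base key with a valid "valores"; relies on base being sorted ascending.
        i = next((k for k in g if k > val["base"]), None)
        if i is None:
            return val
        # Take base at the last index stored for that key (a scalar, not a Series).
        val["valores"] = df.loc[g[i][-1], "base"]
    return val


group = df[df["valores"].notna()].groupby("base").groups             #{2: [0], 3: [2, 3], 5: [6]}

df["valores"] = df.transform(replace, 1, group, df)[0:-1]["valores"]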

print(df)

#   base  valores
#0     2      3.0
#1     2      3.0
#2     3    100.0
#3     3      3.0
#4     4      5.0
#5     4      5.0
#6     5     15.0
#7     5      NaN
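
The lookup in g picks the next larger base key, so it depends on base being sorted in ascending order. As a hedged alternative, here is a sketch of a plain loop that walks the rows from last to first and does not rely on that ordering; the names filled and next_base are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "base": [2, 2, 3, 3, 4, 4, 5, 5],
    "valores": [3, None, 100, 3, None, None, 15, None],
})

filled = []
next_base = float("nan")  # base of the nearest later row with a valid "valores"

# Walk the rows backwards so the "next valid" row is already known at each step.
for base, valor in zip(reversed(df["base"].tolist()), reversed(df["valores"].tolist())):
    if pd.isna(valor):
        # Missing value: use the base of the next valid row (NaN if there is none).
        filled.append(next_base)
    else:
        # Valid value: keep it and remember this row's base for earlier rows.
        filled.append(valor)
        next_base = base

# The list was built in reverse order, so flip it before assigning.
df["valores"] = filled[::-1]
print(df)
```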
