Dataframe Pandas - How to use a previous value other than NA for calculation

Asked

Viewed 84 times

0

Having a dataframe as the following example:

df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
   A  B      C
0  A  1  100.0
1  B  2    NaN
2  C  3    NaN
3  D  4  182.0
4  E  5    NaN

I would like to calculate, without using apply, the difference in column 'C', from row 3 to row 0.

Detail is that it is not fixed this space between lines with Nan, so I can not use the df.C.shift(2) for example.

3 answers

1


Use the function diff() to calculate the difference between lines. How NaN should be ignored, just remove them first with df[~df['C'].isnull()]:

>>> df[~df['C'].isnull()]['C'].diff()
0     NaN
3    82.0
Name: C, dtype: float64

If you want to join df original, just make a pd.merge for index:

diff_series = df[~df['C'].isnull()]['C'].diff()
diff_series.name = 'C diff'
pd.merge(
    df,
    diff_series,
    left_index=True,
    right_index=True,
    how='left'
)

And the result will be:

    A   B   C      C diff
0   A   1   100.0  NaN
1   B   2   NaN    NaN
2   C   3   NaN    NaN
3   D   4   182.0  82.0
4   E   5   NaN    NaN
  • 1

    I believe using df['C'].notnull() would be more intuitive than the denial of isnull(). You could also avoid the use of merge attributing the diff_series directly to a new column to the DF, as the intersection between them is made by the indices (which have been preserved), something like: df['C diff] = diff_series. In fact, it’s an answer :)

0

Not using "apply" is inefficient - basically you have to look at the cell-by-cell content of the column, going up from "3". You can use math.isnan to test if the value is nan.

A function for this using a explicit for and if could be:


def diff_first_previous_valid(series, index):
    for value in series[index - 1::-1]:
        if not math.isnan(value):
            return series[index] - value
    raise ValueError("No previous valid value")

And, using this interactively:

In [241]: diff_first_previous_valid(df["C"], 3)                                                                                                                                    
Out[241]: 82.0

Using apply, the way can be to create an intermediate column - that doubles each value down if the next value is Nan - in case we need a function too, to "stretch" the valid values down.

This "approach" is better if you will repeat the various operation (=a few dozen) times in the same column - otherwise, fetch the number in each operation is better:


def fill_nans(col, first=0):
    previous_val = first
    result = []
    for v in col:
        if math.isnan(v):
            result.append(previous_val)
        else:
            result.append(v)
            previous_val = v
    return result
In [245]: x = fill_nans(df["C"])                                                                                                                                                   

In [246]: x[3] - x[2]                                                                                                                                                              
Out[246]: 82.0
  • why create a fill_nans function if it is already implemented in pandas? See fillna

  • 1

    I had not seen that the Pandas fillna already has this functionality. But it is good sometimes for people to see that not everything is magic, and that logic and basic elements of the programming language can solve any problem, and not wait every time there is a magic in Pandas.

0

NOTE: The solution presented below NAY will meet if existing numbers are repeated.

Defining dataframe

>>> df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])

Creating a new column, copying column C and replacing Nan with the last valid observation

>>> df['D'] = df['C'].fillna(method='ffill')

>>> df
   A  B      C      D
0  A  1  100.0  100.0
1  B  2    NaN  100.0
2  C  3    NaN  100.0
3  D  4  182.0  182.0
4  E  5    NaN  182.0

Calculating differences

>>> df['diferenca'] = df['D'].diff()

Upshot

>>> df

   A  B      C      D  diferenca
0  A  1  100.0  100.0        NaN
1  B  2    NaN  100.0        0.0
2  C  3    NaN  100.0        0.0
3  D  4  182.0  182.0       82.0
4  E  5    NaN  182.0        0.0

Leaving only the results

>>> import numpy as np

>>> df['diferenca'].replace({0: np.nan}, inplace=True)

>>> df
   A  B      C      D  diferenca
0  A  1  100.0  100.0        NaN
1  B  2    NaN  100.0        NaN
2  C  3    NaN  100.0        NaN
3  D  4  182.0  182.0       82.0
4  E  5    NaN  182.0        NaN

Returning to note initial...

>>> df = pd.DataFrame([['A',1,182],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])

>>> df
   A  B      C
0  A  1  182.0
1  B  2    NaN
2  C  3    NaN
3  D  4  182.0
4  E  5    NaN

>>> df['D'] = df['C'].fillna(method='ffill')

>>> df
   A  B      C      D
0  A  1  182.0  182.0
1  B  2    NaN  182.0
2  C  3    NaN  182.0
3  D  4  182.0  182.0
4  E  5    NaN  182.0

>>> df['diferenca'] = df['D'].diff()

>>> df
   A  B      C      D  diferenca
0  A  1  182.0  182.0        NaN
1  B  2    NaN  182.0        0.0
2  C  3    NaN  182.0        0.0
3  D  4  182.0  182.0        0.0
4  E  5    NaN  182.0        0.0

>>> df['diferenca'].replace({0: np.nan}, inplace=True)

>>> df
   A  B      C      D  diferenca
0  A  1  182.0  182.0        NaN
1  B  2    NaN  182.0        NaN
2  C  3    NaN  182.0        NaN
3  D  4  182.0  182.0        NaN
4  E  5    NaN  182.0        NaN
>>>

In this case there would be no results

Browser other questions tagged

You are not signed in. Login or sign up in order to post.