Dataframe Pandas - How to use a previous value other than NA for calculation

Question

Dataframe Pandas - How to use a previous value other than NA for calculation

Asked 4 years, 3 months ago

Viewed 84 times

0

Having a dataframe as the following example:

df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])
   A  B      C
0  A  1  100.0
1  B  2    NaN
2  C  3    NaN
3  D  4  182.0
4  E  5    NaN

I would like to calculate, without using apply, the difference in column 'C', from row 3 to row 0.

Detail is that it is not fixed this space between lines with Nan, so I can not use the df.C.shift(2) for example.

3 answers

1

Use the function diff() to calculate the difference between lines. How NaN should be ignored, just remove them first with df[~df['C'].isnull()]:

>>> df[~df['C'].isnull()]['C'].diff()
0     NaN
3    82.0
Name: C, dtype: float64

If you want to join df original, just make a pd.merge for index:

diff_series = df[~df['C'].isnull()]['C'].diff()
diff_series.name = 'C diff'
pd.merge(
    df,
    diff_series,
    left_index=True,
    right_index=True,
    how='left'
)

And the result will be:

    A   B   C      C diff
0   A   1   100.0  NaN
1   B   2   NaN    NaN
2   C   3   NaN    NaN
3   D   4   182.0  82.0
4   E   5   NaN    NaN

1

I believe using df['C'].notnull() would be more intuitive than the denial of isnull(). You could also avoid the use of merge attributing the diff_series directly to a new column to the DF, as the intersection between them is made by the indices (which have been preserved), something like: df['C diff] = diff_series. In fact, it’s an answer :)

– Terry

2021/04/15 at 20:39

Browser other questions tagged python pandas numpy

You are not signed in. Login or sign up in order to post.

by jsbueno • **30,668** points · Answer 1 · 2021-04-14T20:59:58+00:00

Not using "apply" is inefficient - basically you have to look at the cell-by-cell content of the column, going up from "3". You can use math.isnan to test if the value is nan.

A function for this using a explicit for and if could be:


def diff_first_previous_valid(series, index):
    for value in series[index - 1::-1]:
        if not math.isnan(value):
            return series[index] - value
    raise ValueError("No previous valid value")

And, using this interactively:

In [241]: diff_first_previous_valid(df["C"], 3)                                                                                                                                    
Out[241]: 82.0

Using apply, the way can be to create an intermediate column - that doubles each value down if the next value is Nan - in case we need a function too, to "stretch" the valid values down.

This "approach" is better if you will repeat the various operation (=a few dozen) times in the same column - otherwise, fetch the number in each operation is better:


def fill_nans(col, first=0):
    previous_val = first
    result = []
    for v in col:
        if math.isnan(v):
            result.append(previous_val)
        else:
            result.append(v)
            previous_val = v
    return result

In [245]: x = fill_nans(df["C"])                                                                                                                                                   

In [246]: x[3] - x[2]                                                                                                                                                              
Out[246]: 82.0

by Paulo Marques • **3,739** points · Answer 2 · 2021-04-14T22:33:41+00:00

NOTE: The solution presented below NAY will meet if existing numbers are repeated.

Defining dataframe

>>> df = pd.DataFrame([['A',1,100],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])

Creating a new column, copying column C and replacing Nan with the last valid observation

>>> df['D'] = df['C'].fillna(method='ffill')

>>> df
   A  B      C      D
0  A  1  100.0  100.0
1  B  2    NaN  100.0
2  C  3    NaN  100.0
3  D  4  182.0  182.0
4  E  5    NaN  182.0

Calculating differences

>>> df['diferenca'] = df['D'].diff()

Upshot

>>> df

   A  B      C      D  diferenca
0  A  1  100.0  100.0        NaN
1  B  2    NaN  100.0        0.0
2  C  3    NaN  100.0        0.0
3  D  4  182.0  182.0       82.0
4  E  5    NaN  182.0        0.0

Leaving only the results

>>> import numpy as np

>>> df['diferenca'].replace({0: np.nan}, inplace=True)

>>> df
   A  B      C      D  diferenca
0  A  1  100.0  100.0        NaN
1  B  2    NaN  100.0        NaN
2  C  3    NaN  100.0        NaN
3  D  4  182.0  182.0       82.0
4  E  5    NaN  182.0        NaN

Returning to note initial...

>>> df = pd.DataFrame([['A',1,182],['B',2,None],['C',3,None],['D',4,182],['E',5,None]], columns=['A','B','C'])

>>> df
   A  B      C
0  A  1  182.0
1  B  2    NaN
2  C  3    NaN
3  D  4  182.0
4  E  5    NaN

>>> df['D'] = df['C'].fillna(method='ffill')

>>> df
   A  B      C      D
0  A  1  182.0  182.0
1  B  2    NaN  182.0
2  C  3    NaN  182.0
3  D  4  182.0  182.0
4  E  5    NaN  182.0

>>> df['diferenca'] = df['D'].diff()

>>> df
   A  B      C      D  diferenca
0  A  1  182.0  182.0        NaN
1  B  2    NaN  182.0        0.0
2  C  3    NaN  182.0        0.0
3  D  4  182.0  182.0        0.0
4  E  5    NaN  182.0        0.0

>>> df['diferenca'].replace({0: np.nan}, inplace=True)

>>> df
   A  B      C      D  diferenca
0  A  1  182.0  182.0        NaN
1  B  2    NaN  182.0        NaN
2  C  3    NaN  182.0        NaN
3  D  4  182.0  182.0        NaN
4  E  5    NaN  182.0        NaN
>>>

In this case there would be no results