Using resample on pandas with intermittent date variable

Question

Asked 5 years, 7 months ago

I have a dataset in which one column is the year and the other columns hold the total number of formal workers in a city (one column per city). My goal is simply to aggregate my annual data into three-year periods. Here is a reproducible example of what I have done so far:

import random
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Ano': range(1890,1920),
                   'A': [random.choice(range(0,50)) for k in range(0,30)],
                   'B': [random.choice(range(0,50)) for k in range(0,30)],
                   'C': [random.choice(range(0,50)) for k in range(0,30)]},
                  index=range(0,30))
# drop one year on purpose to replicate the fact that my data does not have information for every year
df = df[df['Ano']!=1907]

df['Ano'] = [datetime.strptime(str(k), '%Y') for k in df['Ano']]

df.set_index('Ano', inplace=True)
print(df.resample('3T').sum())

Problem:

I used '3T' based on what I saw in the pandas documentation, but I don't think I interpreted it correctly, since this command runs for a long time and then crashes my computer.

I managed to solve it. What is the correct procedure in this case: post the answer so that other users can see it, or delete the question?

– Lucas

2019/12/09 at 16:25

If you are willing to help, posting the answer is better!

– nosklo

2019/12/09 at 17:09

1 answer

by Lucas • **3,858** points · Answer 1 · 2019-12-10T01:16:15+00:00

Well, the solution was just to do:

print(df.resample('3A').sum())

In the documentation I mentioned in the question, a note links to the table of offset strings (offset aliases). So you just need to identify the time unit of your date variable and downsample using the number that corresponds to the new period. In my case, I went from annual ("A") to triennial ("3A"); "T", by contrast, stands for minutes, so "3T" asked for three-minute bins. Other cases are analogous. See that table of offset aliases for reference.
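To illustrate the difference between the two aliases, here is a minimal sketch using made-up annual data shaped like the data in the question. It first counts how many three-minute bins "3T" would have to create, then does the three-year aggregation. The helper `resample_triennial` is hypothetical; it exists only because recent pandas releases spell the year-end alias "YE" instead of "A", so it tries the newer spelling first and falls back to the older one.

```python
import pandas as pd

# Illustrative data (made up): one row per year, 1890-1919, with 1907 missing.
years = [y for y in range(1890, 1920) if y != 1907]
df = pd.DataFrame(
    {"A": 1, "B": 2},
    index=pd.to_datetime([str(y) for y in years], format="%Y"),
)

# "3T" means 3-minute bins: count how many bins would span the data.
span = df.index.max() - df.index.min()
n_minute_bins = int(span / pd.Timedelta(minutes=3)) + 1
print(n_minute_bins)  # several million rows, hence the long run time

def resample_triennial(frame):
    """Sum annual rows into 3-year bins (hypothetical helper)."""
    try:
        return frame.resample("3YE").sum()  # newer pandas spelling
    except ValueError:
        return frame.resample("3A").sum()   # older pandas spelling

tri = resample_triennial(df)
print(tri)
```

Year-end bins are labelled by their right edge, so it is worth printing the result to check which years land in each bin.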

As for the series being intermittent, I chose to create rows for the missing years, fill their column values with zero and only then perform the aggregation.
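One possible way to create the missing years is sketched below. It assumes a datetime index like the one in the question and uses made-up values. The complete index is built by hand from the known range 1890 to 1919, and `reindex` with `fill_value=0` adds the absent rows. This is only an illustration of the idea, not necessarily the exact steps I used.

```python
import pandas as pd

# Illustrative annual data with 1907 missing (values are made up).
years = [y for y in range(1890, 1920) if y != 1907]
df = pd.DataFrame(
    {"A": list(range(len(years)))},
    index=pd.to_datetime([str(y) for y in years], format="%Y"),
)

# Complete yearly index for the whole period, including 1907.
full_index = pd.to_datetime([str(y) for y in range(1890, 1920)], format="%Y")

# Years absent from df become new rows filled with zeros.
filled = df.reindex(full_index, fill_value=0)

print(filled.loc["1907"])
```

For a plain sum the zero rows do not change the totals, but they would matter for aggregations such as mean or count.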

EDIT: A less manual and more elegant way to deal with series with intermittent dates was given in @lmonferrari's answer to this question: Is there any way for pd.Grouper, when used for time frequencies, to add rows even when there are no records in a time interval?