Grouping and aggregating data

Question

Asked 9 years, 2 months ago

I have the following CSV file (12 million records):

UF Municipio   Cod  NIS         Valor Data  
MA IMPERATRIZ  803  16361947271 45.00 01/01/2011  
MA IMPERATRIZ  803  74629273937 15.00 01/01/2011  
BA RUY BARBOSA 3845 16481166579 50.00 01/02/2011  
BA RUY BARBOSA 3845 16481166579 50.00 01/03/2011  
MG IPATINGA    653  73639474937 10.00 01/03/2011  
MG IPATINGA    653  83733638376 20.00 01/03/2011  
MG IPATINGA    653  52648747648 25.00 01/03/2011  
...

I need to group the data by date (Data), state (UF) and municipality (Municipio), and for each group count the NIS entries and sum the Valor column. For the data above, the desired result would be:

Data       UF Municipio   Quant. Valor  
01/01/2011 MA IMPERATRIZ  002    60.00  
01/02/2011 BA RUY BARBOSA 001    50.00  
01/03/2011 BA RUY BARBOSA 001    50.00  
01/03/2011 MG IPATINGA    003    55.00  
...

The result should be written to a new CSV file.
To count or sum the values I use the code below (which works):

Conta_NIS = csvPanda.groupby(['Data', 'UF', 'Municipio']).NIS.count()  
Soma_Valor = csvPanda.groupby(['Data', 'UF', 'Municipio']).Valor.sum()

But how can I include both aggregations (count and sum) in the same output, so that I can export it to a new CSV file?

Thanks in advance, everyone!

Would a solution in awk or Perl work for you?

– JJoao

2016/05/18 at 11:33

1 answer

by Sandro • **145** points · Answer 1 · 2016-05-24T13:31:05+00:00

I was able to find the solution using pandas' "groupby". I created two separate groupings over the same fields: one summing the Valor column and the other counting the distinct NIS values.

Bf_valor = csvPanda.groupby(['Data', 'UF', 'Municipio']).Valor.sum()
BF_NIS = csvPanda.groupby(['Data', 'UF', 'Municipio']).NIS.nunique()
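Note that the question used count() while this answer uses nunique(). A minimal sketch of the difference, using a few rows from the question and (as an assumption, only for illustration) grouping by UF and Municipio without Data, so that the repeated NIS falls into a single group:

```python
import pandas as pd

# A few rows from the question; the BA NIS appears twice.
df = pd.DataFrame({
    "UF": ["BA", "BA", "MG", "MG", "MG"],
    "Municipio": ["RUY BARBOSA", "RUY BARBOSA",
                  "IPATINGA", "IPATINGA", "IPATINGA"],
    "NIS": [16481166579, 16481166579,
            73639474937, 83733638376, 52648747648],
})

# Illustrative grouping only: Data is left out on purpose.
grouped = df.groupby(["UF", "Municipio"]).NIS

rows_per_group = grouped.count()        # counts every non-null NIS row
distinct_per_group = grouped.nunique()  # counts each NIS once per group
```

With the full Data, UF, Municipio grouping the two give the same numbers for the sample in the question; they differ only when one NIS repeats inside a group.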

Then I created two DataFrames from these results:

Df_value = pandas.DataFrame(Bf_valor)
DF_NIS = pandas.DataFrame(BF_NIS)

Finally I concatenated them into a single data set:

frames = [Df_value, DF_NIS]
Df_bf_payment = pandas.concat(frames, axis=1)

It worked. Thank you all very much!
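For reference, both aggregations can probably also be done in a single groupby followed by to_csv. This is only a sketch, assuming pandas 0.25 or later (for named aggregation); the output column names Quant and Valor follow the desired result in the question, while the variable names resumo and csv_text and the small in-memory sample are illustrative, not part of the original code:

```python
import pandas as pd

# Sample rows copied from the question (the real file has ~12 million).
csvPanda = pd.DataFrame({
    "UF": ["MA", "MA", "BA", "BA", "MG", "MG", "MG"],
    "Municipio": ["IMPERATRIZ", "IMPERATRIZ", "RUY BARBOSA", "RUY BARBOSA",
                  "IPATINGA", "IPATINGA", "IPATINGA"],
    "NIS": [16361947271, 74629273937, 16481166579, 16481166579,
            73639474937, 83733638376, 52648747648],
    "Valor": [45.0, 15.0, 50.0, 50.0, 10.0, 20.0, 25.0],
    "Data": ["01/01/2011", "01/01/2011", "01/02/2011", "01/03/2011",
             "01/03/2011", "01/03/2011", "01/03/2011"],
})

# One groupby, two named aggregations: distinct NIS count and Valor sum.
resumo = (
    csvPanda.groupby(["Data", "UF", "Municipio"])
    .agg(Quant=("NIS", "nunique"), Valor=("Valor", "sum"))
    .reset_index()  # turn the group keys back into ordinary columns
)

# Without a path, to_csv returns the CSV text; pass a file name
# (for example a hypothetical "resumo.csv") to write it to disk instead.
csv_text = resumo.to_csv(index=False)
```

Replacing "nunique" with "count" would reproduce the counting used in the question rather than the distinct count used in this answer.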