How to create a DataFrame in pandas from two columns plus the count of one of them?

Good afternoon, everyone. I'm working on a data analysis project and I'm stuck at a specific point.

For context, I have a DataFrame with [1000000+ rows x 29 columns].

In this DataFrame each row corresponds to one purchase, with information about the customer and the product. I want to create a new df with 3 columns: the customer ID, the account creation date (which is fixed for each user, so it repeats unchanged every time that customer makes a purchase) and the number of purchases that customer has made.

I thought of combining value_counts() (to get the number of purchases per customer) with drop_duplicates() (so that each ID appears only once, with its corresponding creation date).

The problem is that value_counts() returns a Series sorted by number of occurrences, while drop_duplicates() returns the rows in the order they appear in the original DF, so the two results do not line up.
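To illustrate the mismatch described above: the result of value_counts() is indexed by customer ID, so it can be matched to the deduplicated rows by ID rather than by position. The following is only a minimal sketch with made-up data; the column names customer_id, created_at and product are assumptions, not the real columns.

```python
import pandas as pd

# Made-up sample data; column names are hypothetical.
df = pd.DataFrame({
    "customer_id": [10, 20, 10, 30, 20, 10],
    "created_at": ["2020-01-01", "2020-02-01", "2020-01-01",
                   "2020-03-01", "2020-02-01", "2020-01-01"],
    "product": ["a", "b", "c", "d", "e", "f"],
})

# Series indexed by customer_id, sorted by frequency.
counts = df["customer_id"].value_counts()

# One row per customer, in order of first appearance.
unique_customers = (
    df[["customer_id", "created_at"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# map() looks each ID up in the index of `counts`,
# so the differing sort order no longer matters.
unique_customers["purchases"] = unique_customers["customer_id"].map(counts)

print(unique_customers)
```

Matching on the ID this way avoids any row-by-row Python loop.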

This would be easy to solve with a plain loop such as for n in range(len(df)):, but since the DF has more than 1000000 records that would take far too long.

Does anyone have an idea how to build this df? Thank you very much!

(P.S. I'm not sure whether my explanation was clear; if it wasn't, I'll try to elaborate.)

  • Add an example of the expected output and some sample data (it can be fake data that just mimics the original DataFrame), so people can help you faster. Cheers!

1 answer

It would help if you posted the DataFrame, but you could do something like this: df.groupby(['ID_Cliente', 'Data_Criação'], as_index=False)['Compra'].sum().sort_values(by='Compra', ascending=False)

This assumes Data_Criação is the field with the repeated values and Compra is the field you want to sum.
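As a rough sketch of how that groupby behaves, here is a small made-up DataFrame. The data is invented, and it assumes Compra holds 1 for every purchase row, which is the only case where summing it yields a purchase count.

```python
import pandas as pd

# Invented sample data; assumes Compra is 1 for each purchase row.
df = pd.DataFrame({
    "ID_Cliente": [1, 2, 1, 3, 2, 1],
    "Data_Criação": ["2020-01-01", "2020-02-01", "2020-01-01",
                     "2020-03-01", "2020-02-01", "2020-01-01"],
    "Compra": [1, 1, 1, 1, 1, 1],
})

# One row per (customer, creation date) pair, with Compra summed,
# ordered from the most to the fewest purchases.
result = (
    df.groupby(["ID_Cliente", "Data_Criação"], as_index=False)["Compra"]
    .sum()
    .sort_values(by="Compra", ascending=False)
)

print(result)
```

Because the creation date is constant per customer, grouping by both columns produces the same groups as grouping by the ID alone, while keeping the date in the output.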

  • Thank you, my friend! It was even simpler than I imagined. I just did: data_grouped = data.groupby(['user._id', 'user.created_at'], as_index=False) and then data_grouped = data_grouped['_id'].count()
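For reference, a minimal sketch of that count-based version on invented data. The column names user._id, user.created_at and _id come from the comment above; the values and the purchases column name are made up. One thing worth knowing is that count() skips missing values in the counted column, whereas size() counts every row in the group.

```python
import pandas as pd

# Invented sample data; one purchase row has a missing _id on purpose.
data = pd.DataFrame({
    "user._id": [1, 1, 2, 2, 3],
    "user.created_at": ["2020-01-01", "2020-01-01", "2020-02-01",
                        "2020-02-01", "2020-03-01"],
    "_id": ["a", "b", "c", None, "e"],
})

keys = ["user._id", "user.created_at"]

# count(): number of non-missing _id values per customer.
by_count = data.groupby(keys, as_index=False)["_id"].count()

# size(): number of rows per customer, regardless of missing values.
by_size = data.groupby(keys).size().reset_index(name="purchases")

print(by_count)
print(by_size)
```

If the counted column never contains missing values, both approaches give the same numbers.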
