Good afternoon, everyone. I’m working on a data analysis project and I’m stuck at a specific point.
To give some context, I have a DataFrame with [1000000+ rows x 29 columns].
In this DataFrame each row corresponds to one purchase, with information about the customer and the product. I want to create a new df with 3 columns: the customer ID, the account creation date (which is fixed for each customer, so it is repeated unchanged every time that customer makes a purchase) and the number of purchases that customer has made.
My idea was to combine the result of value_counts() (to obtain the number of purchases per customer) with the result of drop_duplicates() (so that each ID appears only once, with its corresponding creation date).
The problem is that value_counts() returns a Series sorted by the number of occurrences, while drop_duplicates() returns the rows in the order they appear in the original DF, so when I put the two together the values do not line up.
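The mismatch described above seems to come from combining the two results by position. A minimal sketch of one way around it, assuming hypothetical column names customer_id and created_at (the real names are not given here) and made-up toy data: value_counts() returns a Series indexed by customer ID, so the count can be looked up by ID rather than by row order.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "customer_id": [10, 20, 10, 30, 20, 10],
    "created_at": ["2020-01-05", "2019-03-17", "2020-01-05",
                   "2021-07-01", "2019-03-17", "2020-01-05"],
    "product": ["a", "b", "c", "a", "d", "e"],
})

# Series whose index is the customer ID and whose values are purchase counts.
counts = df["customer_id"].value_counts()

# One row per customer, in order of first appearance in df.
customers = (
    df[["customer_id", "created_at"]]
    .drop_duplicates(subset="customer_id")
    .reset_index(drop=True)
)

# Look up each customer's count by ID, so the sort order of `counts` no longer matters.
customers["n_purchases"] = customers["customer_id"].map(counts)

print(customers)
```

Because the lookup is keyed on the ID, it should not matter that the two intermediate results are ordered differently.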
This would be easier to solve if I simply wrote a for n in range(len(df)): loop, but since the DF has more than 1000000 records that would take far too long.
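As a possible alternative to a row-by-row loop, a grouped aggregation can build the three columns in one vectorized pass. This is only a sketch under assumptions: the column names customer_id and created_at and the toy data are hypothetical, and it relies on the creation date being identical on every row of a given customer, as stated above.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions, not the real schema.
df = pd.DataFrame({
    "customer_id": [10, 20, 10, 30, 20, 10],
    "created_at": ["2020-01-05", "2019-03-17", "2020-01-05",
                   "2021-07-01", "2019-03-17", "2020-01-05"],
    "product": ["a", "b", "c", "a", "d", "e"],
})

summary = (
    df.groupby("customer_id", sort=False)  # sort=False keeps first-appearance order
    .agg(
        created_at=("created_at", "first"),   # first non-null creation date per customer
        n_purchases=("created_at", "size"),   # number of rows (purchases) per customer
    )
    .reset_index()
)

print(summary)
```

Here "size" counts every row in the group, while "first" takes the first non-null date, so a customer with a missing date on some rows would still get a value.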
Does anyone have an idea of how to build this df? Thank you very much!
(P.S. I’m not sure whether my explanation is clear; if it isn’t, I’ll try to explain it better.)
Please add an example of the expected output along with some sample data (it can be fake data that just mimics the original dataframe), so people can help you faster. Cheers!
– lmonferrari