Create column in dataframe that contains the list of row values except null values (Nan)

Asked

Viewed 95 times

0

Code and file used: https://github.com/CaioEuzebio/Python-DataScience-MachineLearning/tree/master/SalesLogistics

I’m working on an analysis using pandas. Basically I need to sort orders by quantity of products, and that contain the same products.

Example: I have order 1 and order 2, both have product A and product B. Using as key the list of products and quantity of products I will create a pivot that will index this combination of products and will return me the order who own the same products.

The purpose of the analysis is to obtain a dataframe like the one below:

dfFinal 

listProds            Ordens  NumProds
[prod1,prod2,prod3]     1       3
                        2
                        3
[prod1,prod3,prod5]     7       3
                        15
                        25
[prod5]                 8       1
                        3           

So far the code is like this.

Setting as index the column 'Order' for the first pivot to be made

df1.index=df1['Ordem']
df3 = df1.assign(col=df1.groupby(level=0).Produto.cumcount()).pivot(columns='col', values='Produto')

With this pivot get the dataframe below.

df3 = 

col 0   1   2   3   4   5   6   7   8   9   ... 54  55  56  57  58  59  60  61  62  63
Ordem                                                                                   
10911KD YIZ12FF-A   YIZ12FF-A   YIIE2FF-A   YIR72FF-A   YIR72FF-A   YIR72FF-A   NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
124636  HYY32ZY-A   HYY32ZY-A   NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1719KD5 YI742FF-A   YI742FF-A   YI742FF-A   YI742FF-A   NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22215KD YI762FF-A   YI762FF-A   YI762FF-A   YI762FF-A   YI762FF-A   YI762FF-A   YI6E2FF-A   YI6E2FF-A   YI6E2FF-A   NaN ... NaN NaN NaN NaN NaN 

When I finish running the code, Nan values appear, and I need to remove them from the lines, so it does not influence the analysis I’m doing.

1 answer

0

Be as simple as possible with the code. It will help you for both debugging and readability of the code, in case someone else needs to analyze. I recommend reading the pandas library, that there are several methods that inhibit verbose.

import pandas as pd    

# arquivos .csv não especificam a variável
df = pd.read_csv('Sales.csv', dtype={
        "Linha" : int,
        "Quantidade" : int,
        })
# transformar em datetime para facilitar análises de data
df['Data'] = pd.to_datetime(df['Data'], format='%d/%m/%Y')
df.groupby(['Ordem', 'Produto']).first()

Browser other questions tagged

You are not signed in. Login or sign up in order to post.