How to join three or more CSV files with something like PROCV and concatenating certain columns

Asked

Viewed 70 times

-3

Arquivo1.csv

Codigo  Nome    Catego  Geral                   Model   Ano
A1      JOGO    LAMA    FZ25-A|YS250|FA250AB    44CF    2017
A2      EMBLE   LAMA    FZ25-A|YS250|FA250AB    44CF    2017
A3      TUBO    ADMI    FZ25-A|YS250|FA250AB    44CF    2017
A4      PRESI   ADMI    FZ25-A|YS250|FA250AB    44CF    2017
A5      PARA    ADMI    FZ25-A|YS250|FA250AB    44CF    2017
A6      AR      ADMI    FZ25-A|YS250|FA250AB    44CF    2017
A7      PRESI   ADMI    FZ25-A|YS250|FA250AB    44CF    2017
A8      CILI    CABE    FZ25-A|YS250|FA250AB    44CF    2017

Arquivo2.csv

Codigo  Nome    Catego  Geral                   Model   Ano
A1      JOGO    LAMA    FZ25-A|YS250|FA250AB    BC51    2018
A2      EMBLE   LAMA    FZ25-A|YS250|FA250AB    BC51    2018
B1      CHAVE   LAMA    FZ25-A|YS250|FA250AB    BC51    2018
B2      CHAVE   LAMA    FZ25-A|YS250|FA250AB    BC51    2018
B3      MANU    LAMA    FZ25-A|YS250|FA250AB    BC51    2018
B4      AMOR    LAMA    FZ25-A|YS250|FA250AB    BC51    2018
B5      GUIA    CABE    FZ25-A|YS250|FA250AB    BC51    2018

Arquivo3.csv

Codigo  Nome    Catego  Geral                   Model   Ano
A8      CILI    CABE    FZ25-A|YS250|FA250AB    BC54    2021
B5      GUIA    CABE    FZ25-A|YS250|FA250AB    BC54    2021
C1      ANEL    CABE    FZ25-A|YS250|FA250AB    BC54    2021
C2      PARA    CABE    FZ25-A|YS250|FA250AB    BC54    2021
A6      AR      ADMI    FZ25-A|YS250|FA250AB    BC54    2021
A7      PRESI   ADMI    FZ25-A|YS250|FA250AB    BC54    2021
A3      TUBO    ADMI    FZ25-A|YS250|FA250AB    BC54    2021
B2      CHAV    LAMA    FZ25-A|YS250|FA250AB    BC54    2021

I did something like this:

import tabula 
import tabulate
import pandas as pd
import numpy as np
import csv
import csvs
import ssl
from tabula import read_pdf
from pandas import read_csv
from pandas import Series, DataFrame
primeiro = pd.read_csv('/content/Arquivo1.csv')
segundo = pd.read_csv('/content/Arquivo2.csv')
terceiro = pd.read_csv('/content/Arquivo3.csv')
df = pd.merge(primeiro, segundo, on='Codigo', how='outer') # Aqui faço algo do tipo PROCV
df ['Model'] = df ['Model_x']. map(str) + ' | ' + df ['Model_y']. map(str) # Aqui concateno coluna Model
df ['Ano'] = df ['Ano_x']. map(str) + ' | ' + df ['Ano_y']. map(str) # Aqui concateno coluna Ano
df.drop('Model_x', axis=1, inplace=True) # Deleto o que sobrou
df.drop('Ano_x', axis=1, inplace=True)
df.drop('Model_y', axis=1, inplace=True)
df.drop('Ano_y', axis=1, inplace=True)
df.to_csv (r'/content/Junta.csv', index = False, header = True, encoding='cp1252', na_rep = 'NaN') # Salvo
somados = pd.read_csv('/content/Junta.csv') # Leio trocando para somados
df = pd.merge(somados, terceiro, on='Codigo', how='outer') # Repito tudo de novo
df ['Model'] = df ['Model_x']. map(str) + ' | ' + df ['Model_y']. map(str)
df ['Ano'] = df ['Ano_x']. map(str) + ' | ' + df ['Ano_y']. map(str)
df.drop('Model_x', axis=1, inplace=True)
df.drop('Ano_x', axis=1, inplace=True)
df.drop('Model_y', axis=1, inplace=True)
df.drop('Ano_y', axis=1, inplace=True)
df.to_csv (r'/content/Junta3.csv', index = False, header = True, encoding='cp1252', na_rep = 'NaN')
df

If you do this you will get the following result: inserir a descrição da imagem aqui

But the correct result is to shift the red frames of the image and then delete everything that was left in yellow, see in detail in the image below: inserir a descrição da imagem aqui

But the correct result is shown below. So how do I get this result below:

Arquivo Junta3.csv

Codigo  Nome    Catego  Geral                   Model       Ano
A1      JOGO    LAMA    FZ25-A|YS250|FA250AB    44CF|BC51   2017|2018
A2      EMBLE   LAMA    FZ25-A|YS250|FA250AB    44CF|BC51   2017|2018
A3      TUBO    ADMI    FZ25-A|YS250|FA250AB    44CF|BC54   2017|2021
A4      PRESI   ADMI    FZ25-A|YS250|FA250AB    44CF        2017
A5      PARA    ADMI    FZ25-A|YS250|FA250AB    44CF        2017
A6      AR      ADMI    FZ25-A|YS250|FA250AB    44CF|BC54   2017|2021
A7      PRESI   ADMI    FZ25-A|YS250|FA250AB    44CF|BC54   2017|2021
A8      CILI    CABE    FZ25-A|YS250|FA250AB    44CF|BC54   2017|2021
B1      CHAVE   LAMA    FZ25-A|YS250|FA250AB    BC51        2018
B2      CHAVE   LAMA    FZ25-A|YS250|FA250AB    BC51|BC54    2018|2021
B3      MANU    LAMA    FZ25-A|YS250|FA250AB    BC51        2018
B4      AMOR    LAMA    FZ25-A|YS250|FA250AB    BC51        2018
B5      GUIA    CABE    FZ25-A|YS250|FA250AB    BC51|BC54   2018|2021
C1      ANEL    CABE    FZ25-A|YS250|FA250AB    BC54        2021
C2      PARA    CABE    FZ25-A|YS250|FA250AB    BC54        2021

Please, the challenge is great, in Excel I’ve done and worked, so I mean using Python with Pandas is also possible, who can help thank you

  • Could you detail the criteria you use to join the three tables?

  • I believe you’re looking for the command Concat

  • @Luiznotari, I understand the purpose, but I want to know specifically the practical side of the activity. What should the algorithm do? When finding two chaves equal example A1 the fields Nome, Catego and Geral will always have the same values in the three tables or not? Only fields Model and Ano will be concatenated with | or do you have any other consideration? The problem is simple, but we need the exact specifications because the time you throw the code into your data the algorithm might encounter something unexpected and fail.

  • Luiz Notari, don’t tear up the questions

1 answer

0


My solution proposal involves the use of functions concat, groupby, map and drop_duplicates.
First merge all your files with concat, with the groupby for Code concatenate the column strings Year and Model, with the map recreate the columns Year and Model with all values found in groupby and finally, use the drop_duplicates to exclude duplicate occurrences.

df = pd.concat([df1,df2,df3], ignore_index = True)
df['Ano'] = df["Ano"].astype('str')

d = df.groupby('Codigo').agg(
                              Ano = ('Ano', lambda x: '|'.join(x)),
                              Model = ('Model', lambda x: '|'.join(x))
                            )
df['Ano'] = df['Codigo'].map(d['Ano'])
df['Model'] = df['Codigo'].map(d['Model'])
df = df.drop_duplicates()

#resultado
df.head()
       Codigo   Nome   Catego  Geral                 Model      Ano
0      A1       JOGO   LAMA    FZ25-A|YS250|FA250AB  44CF|BC51  2017|2018
1      A2       EMBLE  LAMA    FZ25-A|YS250|FA250AB  44CF|BC51  2017|2018
2      A3       TUBO   ADMI    FZ25-A|YS250|FA250AB  44CF|BC54  2017|2021
3      A4       PRESI  ADMI    FZ25-A|YS250|FA250AB  44CF       2017
4      A5       PARA   ADMI    FZ25-A|YS250|FA250AB  44CF       2017
  • @Luiznotari If my answer has been satisfactory, please consider accepting it. This is important because it makes it clear to other users that their problem is solved. No link How and why to accept an answer? shows how to do it :)

Browser other questions tagged

You are not signed in. Login or sign up in order to post.