Compare all Rows and Columns of two Df, update and add differences at the end of Rows and Columns

Asked

Viewed 1,113 times

1

I have two csv separated by tab. Both have the same number of Rows and Columns. The first column POS has the same Unique values in both df. Differences (or not), occur in the values (strings) of the columns col1:col4.

I thought I’d run a query, something like: query = subset_pl(subset_pl.isin(subset_ad)) and from that continue with the code, but I got stuck there...

import pandas as pd

subset_ad = pd.read_csv('subset_ad.csv', sep='\t')
subset_ad.set_index('POS')
subset_ad

POS         col1    col2    col3    col4
28355991    A       A       A       A              
28356037    A       A       A       A              
28356130    A       A       A       A              
28356246    A       A       A       A   



subset_pl = pd.read_csv('subset_pl.csv', sep='\t')
subset_pl.set_index('POS')
subset_pl

POS         col1    col2    col3    col4
28355991    A       B       A       A            
28356037    B       B       B       A              
28356130    A       B       A       A              
28356246    A       A       B       A            

What I intend to achieve is: compare subset_ad with subset_pl, upgrade subset_ad with the value of subset_pl maintaining the value of subset_ad separated by , (A,B for example), if there are different values and count these differences both in Rows and Columns by adding an Row and an extra column (cont_col, cont_row) to display the count of cells that have changed...

The output would be something like:

subset_ad

POS         col1    col2    col3    col4    cont_row
28355991    A       A,B     A       A       1       
28356037    A,B     A,B     A,B     A       3       
28356130    A       A,B     A       A       1       
28356246    A       A       A,B     A       1       
cont_col    1       3       2       0      

Any direction will be welcome!

  • 1

    Colleague, your question is not at all clear. To begin with, its values are strings in the "x/x" format where x seems to always be 0, 1 or .. How is this compared? Is it string comparison? If so, what was its difficulty? Second, that column soma_rows, for example, where does this value come from? Is it a sum? If so, sum of what?! Why does the end result have a comma? Anyway, I suggest you provide an example simple of your problem, perhaps with two rows and columns, and explain in detail. If no one can help you.

  • Hello friend. I followed your suggestion and changed the example... At first all the data are of the same string type! I was reading here, and I thought if the option df.merge would not be possible in this case... So that by merging it would add the values of subset_pl at the subset_ad only where these are different as shown above in the desired output in subset_ad[1,1]

  • Now yes! : ) I will answer.

1 answer

2


You can do it like this:

  1. Read both files in different Dataframe’s.
  2. Concatene the Dataframes directly, using the operator +. You can include a string with the comma in the middle. This operator will run for each "cell" of the table.
  3. Scroll over the rows and columns counting the differences. Where there are no differences (and if desired), unify the letters in one.

Code example:

import pandas as pd

# Lê o primeiro arquivo
subset_ad = pd.read_csv('subset_ad.csv', sep='\t', index_col=False)
subset_ad = subset_ad.set_index('POS')

print('subset_ad:')
print('-' * 20)
print(subset_ad)
print('-' * 20)

# Lê o segundo arquivo
subset_pl = pd.read_csv('subset_pl.csv', sep='\t', index_col=False)
subset_pl = subset_pl.set_index('POS')

print('\n')
print('subset_pl:')
print('-' * 20)
print(subset_pl)
print('-' * 20)

# Concatena os arquivos separando o conteúdo por uma vírgula
df = subset_ad + ',' + subset_pl

# Adiciona a coluna e a linha de somatório
df['cont_row'] = [0 for _ in range(len(df))]
df.loc['cont_col'] = [0 for _ in df.columns[:-1]] + ['']

print('\n')
print('subset_ad + subset_pl:')
print('-' * 20)
print(df)
print('-' * 20)

# Contabiliza as diferenças (unificando a letra onde não houver diferença)
for index, row in df.iterrows():
    if index != 'cont_col':
        for col in df.columns[:-1]:
            val = row[col]
            letters = val.split(',')
            if letters[0] == letters[1]:
                df.set_value(index, col, letters[0])
            else:
                cnt = df.loc[index]['cont_row']
                df.set_value(index, 'cont_row', cnt + 1)

                cnt = df.loc['cont_col'][col]
                df.set_value('cont_col', col, cnt + 1)

print('\n')
print('resultado final:')
print('-' * 20)
print(df)
print('-' * 20)

Result of this code:

subset_ad:
--------------------
         col1 col2 col3 col4
POS
28355991    A    A    A    A
28356037    A    A    A    A
28356130    A    A    A    A
28356246    A    A    A    A
--------------------


subset_pl:
--------------------
         col1 col2 col3 col4
POS
28355991    A    B    A    A
28356037    B    B    B    A
28356130    A    B    A    A
28356246    A    A    B    A
--------------------


subset_ad + subset_pl:
--------------------
         col1 col2 col3 col4 cont_row
POS
28355991  A,A  A,B  A,A  A,A        0
28356037  A,B  A,B  A,B  A,A        0
28356130  A,A  A,B  A,A  A,A        0
28356246  A,A  A,A  A,B  A,A        0
cont_col    0    0    0    0
--------------------


resultado final:
--------------------
         col1 col2 col3 col4 cont_row
POS
28355991    A  A,B    A    A        1
28356037  A,B  A,B  A,B    A        3
28356130    A  A,B    A    A        1
28356246    A    A  A,B    A        1
cont_col    1    3    2    0
--------------------
  • 1

    Thank you very much Luiz! Although I wanted to count Rows and cols differently, the answer without this step would be much simpler than I was imagining. It was just a simple concatenation of df = subset_ad + ',' + subset_pl. Worked Perfectly!

Browser other questions tagged

You are not signed in. Login or sign up in order to post.