How to map values between pandas DataFrames

-1

I am trying to create a new pandas column based on two other DataFrames.

This first dataset is where I get the values:

GenPart_pdgID = 
       0   1   2   3   4   5     6     7     8     9     10    11    12     13
0     -4   4  23  23  23  23  23.0 -11.0  11.0  11.0  22.0 -11.0  -4.0 -413.0
1      1  -1  23  23  23 -11  11.0   2.0  21.0  -3.0  11.0 -11.0  11.0  -11.0
2     -1   1  23  21  23  21  23.0  22.0 -13.0  13.0  21.0  21.0   NaN    NaN 
3     -1   1  23  21  23  21  23.0  23.0  23.0 -13.0  13.0  13.0  22.0  -13.0 
4      2  21  23   2  23  23  23.0  23.0  23.0 -11.0  11.0 -11.0  22.0   11.0
...   ..  ..  ..  ..  ..  ..   ...   ...   ...   ...   ...   ...   ...    ...
2734   3  -3  23  23 -11  11  11.0  22.0 -11.0   3.0  -3.0  11.0 -11.0    NaN
2735   1  -1  23  23  23  23  23.0 -13.0  13.0   NaN   NaN   NaN   NaN    NaN
2736   2  -2  23  23 -11  11  22.0  22.0 -11.0 -11.0  22.0  11.0   NaN    NaN
2737  -2  21  23  -2  23  23 -13.0  13.0   3.0  -1.0   1.0  -1.0  -2.0  221.0

The second contains two columns and has the same number of rows as GenPart_pdgID:

ele_genIdx = 
         0     1
0      9.0  11.0
1      NaN   NaN
2      NaN   NaN
3      NaN   NaN
4     13.0  11.0
...    ...   ...
2733   NaN   NaN
2734   8.0   6.0
2735   NaN   NaN
2736  -1.0   NaN
2737   NaN   NaN

That is, the first column of ele_genIdx says which column to take from the dataframe GenPart_pdgID. For example, rows 0 and 4 contain the values 9 and 13 respectively, so from GenPart_pdgID I take row 0, column 9, then row 4, column 13, and so on.

Note: for rows that contain NaN, nothing should be returned, since there is no value to look up.

2 answers

0

From what was presented, the DataFrame does not seem to be huge, so the solution below should meet your needs.

Loading libraries

import pandas as pd
import numpy as np

Creating Test Dataframe

df1 = pd.DataFrame({0: [1,2,3], 1:[4,5,6], 2:[7,8,9]})
df2 = pd.DataFrame({0: [0,1,np.nan], 1: [1,2,np.nan]})

print(df1)

   0  1  2
0  1  4  7
1  2  5  8
2  3  6  9


print(df2)

     0    1
0  0.0  1.0
1  1.0  2.0
2  NaN  NaN

Changing name of df2 columns

df2.columns = ["idx0", "idx1"]

print(df2)

   idx0  idx1
0   0.0   1.0
1   1.0   2.0
2   NaN   NaN

Note: the columns are renamed so that they do not receive the suffixes _x and _y when merging.

Merging dataframes

df3 = pd.merge(df1, df2, left_index=True, right_index=True)

print(df3)

   0  1  2  idx0  idx1
0  1  4  7   0.0   1.0
1  2  5  8   1.0   2.0
2  3  6  9   NaN   NaN
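As a side note, the rename and merge steps above could probably be collapsed into one line. This is only a sketch of an alternative, not a replacement for the steps shown: it assumes both frames share the same row index and uses add_prefix and join instead of assigning to columns and calling pd.merge.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
df2 = pd.DataFrame({0: [0, 1, np.nan], 1: [1, 2, np.nan]})

# add_prefix turns the integer labels 0 and 1 into "idx0" and "idx1";
# join then aligns the two frames on their shared row index.
df3 = df1.join(df2.add_prefix("idx"))
print(df3)
```

The resulting df3 has the same columns as the merged frame above (0, 1, 2, idx0, idx1).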

Assigning values

df3["valor1"] = df3.apply(lambda row: row[int(row["idx0"])] if not np.isnan(row["idx0"]) else np.nan, axis=1)
df3["valor2"] = df3.apply(lambda row: row[int(row["idx1"])] if not np.isnan(row["idx1"]) else np.nan, axis=1)

print(df3)

   0  1  2  idx0  idx1  valor1  valor2
0  1  4  7   0.0   1.0     1.0     4.0
1  2  5  8   1.0   2.0     5.0     8.0
2  3  6  9   NaN   NaN     NaN     NaN

Performance may suffer on very large DataFrames, because apply with axis=1 runs a Python function once per row. However, as said earlier, this looks like a small DataFrame with only about 2,700 rows.
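If performance does become a concern, one option is to do the lookup with NumPy indexing instead of apply. The following is a minimal sketch under a few assumptions: the helper name lookup_by_position is hypothetical, both frames have the same number of rows in the same order, and the index values are column positions (which matches the question, where the columns are labelled 0 to 13). Missing or out-of-range indices produce NaN.

```python
import numpy as np
import pandas as pd

def lookup_by_position(values: pd.DataFrame, idx: pd.Series) -> pd.Series:
    """For each row i, return values.iloc[i, idx[i]], or NaN when idx[i] is unusable."""
    arr = values.to_numpy(dtype=float)
    pos = idx.to_numpy(dtype=float)
    # Replace NaN with -1 so every entry can be range-checked the same way.
    pos = np.where(np.isnan(pos), -1, pos).astype(int)
    valid = (pos >= 0) & (pos < arr.shape[1])
    out = np.full(len(pos), np.nan)
    rows = np.flatnonzero(valid)
    # Pair each valid row number with its own column position.
    out[rows] = arr[rows, pos[rows]]
    return pd.Series(out, index=idx.index)

df1 = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})
df2 = pd.DataFrame({0: [0, 1, np.nan], 1: [1, 2, np.nan]})

valor1 = lookup_by_position(df1, df2[0])
valor2 = lookup_by_position(df1, df2[1])
print(valor1.tolist())  # [1.0, 5.0, nan]
print(valor2.tolist())  # [4.0, 8.0, nan]
```

On the test data this gives the same valor1 and valor2 as the apply version, without building a merged frame first.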

  • Thank you so much for the answer. In fact this dataset is small, but I work with a number of other datasets with various entries. Anyway, I’ll try to use what you sent. Thanks again

-1

You can do it this way:

col = pd.Series(np.diag(val_df.loc[idx_df[0].dropna().index, idx_df[0].dropna()]))

where idx_df corresponds to ele_genIdx and val_df corresponds to GenPart_pdgID.

In the case of your Dataframe, print(col) prints

0    11.0
1    11.0
dtype: float64

Explanation:

  • In the first column of idx_df, remove all NaN values and return the index of the remaining values:
                                  idx_df[0].dropna().index
  • In the first column of idx_df, remove all NaN values and return the remaining values:
                                                            idx_df[0].dropna()
  • Look up the values of val_df whose rows are the indices of the non-null values of idx_df and whose columns are the non-null values of idx_df:
                       val_df.loc[idx_df[0].dropna().index, idx_df[0].dropna()]
  • Return the main diagonal of the obtained values:
               np.diag(val_df.loc[idx_df[0].dropna().index, idx_df[0].dropna()])
  • Turn the result into a Series:
      pd.Series(np.diag(val_df.loc[idx_df[0].dropna().index, idx_df[0].dropna()]))
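The steps above can be traced on a tiny example to see why the diagonal is the part that matters. This is only an illustrative sketch: the two small frames are made-up stand-ins for GenPart_pdgID and ele_genIdx, and the cast to int is an added assumption so that the lookup does not rely on float values matching integer column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for GenPart_pdgID and ele_genIdx.
val_df = pd.DataFrame([[10, 11, 12], [20, 21, 22], [30, 31, 32]])
idx_df = pd.DataFrame({0: [2, np.nan, 0]})

picked = idx_df[0].dropna().astype(int)   # rows 0 and 2 survive, with values 2 and 0
block = val_df.loc[picked.index, picked]  # 2x2 block: rows [0, 2] by columns [2, 0]
print(block)

# Entry (i, i) of the block pairs the i-th surviving row with its own column,
# so the diagonal holds (row 0, col 2) and (row 2, col 0).
diag = np.diag(block.to_numpy())
print(diag)  # [12 30]
```

Note that the intermediate block has k x k entries for k non-missing rows, so its size grows quadratically with the number of rows that have an index.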

To avoid calculating idx_df[0].dropna() twice, you can assign it to a variable before executing the command:

mask = idx_df[0].dropna()
col = pd.Series(np.diag(val_df.loc[mask.index, mask]))

Optionally, you can use

col = pd.Series(np.diag(val_df.loc[mask.index, mask]), index=mask.index)

where print(col) will print

0    11.0
4    11.0
dtype: float64

instead of the previous output.
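If the goal is a new column with the same length as the original data, the result can be reindexed so that skipped rows come back as NaN. The sketch below also filters out indices that are not column labels, since the question's data shows a -1.0 entry (row 2736), which .loc would likely reject with a KeyError. The small frames and the name new_col are hypothetical, and the cast to int is an added assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for GenPart_pdgID and ele_genIdx.
val_df = pd.DataFrame([[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]])
idx_df = pd.DataFrame({0: [2, np.nan, 0, -1]})

mask = idx_df[0].dropna().astype(int)
# Keep only entries that name an existing column label (this drops the -1).
mask = mask[mask.isin(val_df.columns)]

col = pd.Series(np.diag(val_df.loc[mask.index, mask].to_numpy()), index=mask.index)
# Reindex to the full row index so rows without a usable index become NaN.
new_col = col.reindex(idx_df.index)
print(new_col.tolist())  # [12.0, nan, 30.0, nan]
```

new_col can then be assigned directly as a column, because it shares the row index of the original frames.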

  • Wow, that's amazing code. It helped me a lot, thank you very much.
