Find equal names(string) in two distinct dataframes and add a column in the second data frame with another data from the first

Asked

Viewed 66 times

1

Hello.

I have two data frames:

Times with 629 lines.

rank  prev_rank  name        league                 off def     spi  
0   1   1   Manchester City Barclays Premier League 3.34    0.24    95.24  
1   2   2   Liverpool   Barclays Premier League 2.95    0.28    92.96  
2   3   3   Bayern Munich   German Bundesliga   3.29    0.46    92.43  
3   4   4   Paris Saint-Germain French Ligue 1  2.88    0.47    89.55  
4   5   7   Real Madrid Spanish Primera Division    2.79    0.46    88.98  


**E partidas** com 27122 linhas.

date    league_id   league  team1   team2   spi1    spi2    prob1   prob2   probtie ... importance1 importance2 score1  score2  xg1 xg2 nsxg1   nsxg2   adj_score1  adj_score2  
0   2016-08-12  1843    French Ligue 1  Bastia  Paris Saint-Germain 51.16   85.68   0.0463  0.8380  0.1157  ... 32.4    67.7    0.0 1.0 0.97    0.63    0.43    0.45    0.00    1.05  
1   2016-08-12  1843    French Ligue 1  AS Monaco   Guingamp    68.85   56.48   0.5714  0.1669  0.2617  ... 53.7    22.9    2.0 2.0 2.45    0.77    1.75    0.42    2.10    2.10  
2   2016-08-13  2411    Barclays Premier League Hull City   Leicester City  53.57   66.81   0.3459  0.3621  0.2921  ... 38.1    22.2    2.0 1.0 0.85    2.77    0.17    1.25    2.10    1.05  
3   2016-08-13  2411    Barclays Premier League Crystal Palace  West Bromwich Albion    55.19   58.66   0.4214  0.2939  0.2847  ... 43.6    34.6    0.0 1.0 1.11    0.68    0.84    1.60    0.00    1.05  
4   2016-08-13  2411    Barclays Premier League Everton Tottenham Hotspur   68.02   73.25   0.3910  0.3401  0.2689  ... 31.9    48.0    1.0 1.0 0.73    1.11    0.88    1.81    1.05    1.05  

I need to compare the team name (team1) of df2(matches) to the name df1(times) and when I find the corresponding add in a list team_id[] the value contained in "rank" in df1(times) and then add that list in df2 (matches). Including the new variable "team1_id" in data frame 2.

And then do the same pro team2 process generating "team2_id".

I tried a few ways and the last one was this:

team1_id = []
for i in range(0,27121):
    for n in range(0,628):
        if data['team1'].values[i] == times['name'].values[n]:
            team1_id.append(times['rank'][(n)])

But it returned only 24647 values even though there were no blank values in team1 and all exist in the other df. And when checking the values are not sorted correctly, it seems that the first line was not included.

team2_id = []
for i in range(0,27121):
    for n in range(0,628):
        if data['team2'].values[i] == times['name'].values[n]:
            team2_id.append(times['rank'][(n)])

This is correct the first records checked more also with a smaller number of records than expected.

  • Tete use merge: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

1 answer

0

You can try something like:

def insert_rank2(df1, df2):
    if not 'rank' in df2.columns:
        df2['rank'] = pd.Series()
    for team in df1.name:
        index = df2[df2['team1'] ==  team].index
        df2.iloc[index]['rank'] = df1[df1['name'] == team]['rank'].values[0]
    return df2

In this case, you enter by index the ranking where the names of the teams in the two tables are the same. In addition to the code being smaller, more readable, the execution is faster.

Make the most of working with Dataframe, working with list (array) or matrix calculations (as I did now), because instead of going line by line, you insert (or calculate) block, which reduces time with large bases.

  • The above code generated the variable "rank" in df2 plus with all Nan values.

  • Does the code still give problem? In the small sample you sent me, it worked well.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.