Return rows from a dataframe that are not in another dataframe

Asked

Viewed 48 times

1

Basically, I have a Dataframe (Database) that contains all the data registered in my system:

data = {"Id": ["01", "02", "03", "04",'05'],"Fruta": ['Maçã','Abacaxi','Banana','Laranja','Morango']}
base_dados = pd.DataFrame(data)
display(base_dados)

    Id  Fruta
0   01  Maçã
1   02  Abacaxi
2   03  Banana
3   04  Laranja
4   05  Morango

And I have another Dataframe (Production) with the items produced that contain items that are already in the database, as new items that are not there, in the example I quoted a Watermelon and the Melon.

data = {"Id": ["01", "03",'05','06','07'],"Fruta": ['Maçã','Banana','Morango','Melancia','Melao']}
producao = pd.DataFrame(data)
display(producao)

    Id  Fruta
0   01  Maçã
1   03  Banana
2   05  Morango
3   06  Melancia
4   07  Melao 

What I need is to capture in python through pandas, capture only the items that are not in the database yet so I can be doing his treatment in the future.

Example of expected result:

    Id  Fruta
0   06  Melancia
1   07  Melao

2 answers

2


You can do it this way:

producao.loc[~producao['Id'].isin(base_dados['Id'])].reset_index(drop=True)

output from the above code:

   Id     Fruta
0  06  Melancia
1  07     Melao

In short: the expression producao['Id'].isin(base_dados['Id']) returns a Boolean column of all Ids of the table producao listed in the column Id table base_dados. How we want the Idwhich are NOT listed in the table base_dados, we use the operator ~ to invert the boolean values in the column.

Then just pass that column to produção.loc. The reset_index i did just so that the resulting Dataframe index is equal to your expected output example.

  • Friend, that’s exactly it, it worked perfectly. Thank you!!

1

You can use the following code:

base_dados2 = base_dados[["Id","Fruta"]].where(base_dados["Id"]=="01").dropna()

In short: the dataframe base_dados2 receives the columns "Id" and "Fruit" of the dataframe base_data where it has only the condition you want, in the case where the "Id" is equal to "01" and the dropna() at the end serves to the base_dados2 receive only the values that satisfy the condition.

  • 1

    I believe this is the idea, but in '''Where(base_data["Id"]="01"). dropna()'', I won’t have the ID number

  • 1

    This is just an example, you can put the condition you want inside the Where, even comparing the values of two dataframes, it doesn’t need to be the column Id can be the column you want.

  • 1

    In the case I did so: base_dados2 = base_dados[["Id","Fruit"]]. Where(base_dados[["Id","Fruit"]].==producao[["Id","Fruit"]]). dropna() But returns error "Can only compare identically-labeled Series Objects"

  • 1

    I think you want something like this: base_dados2 = base_data[["Id","Fruit"]. Where(~base_data["Fruit"].isin(production["Fruit"]). dropna()

  • 1

    Friend, first thank you for the attention you are giving me, the code you put last worked bringing the items you already had from base_data, I wanted to get the production items that are NOT in base_data, as example would bring the items: Id Fruit 0 06 Watermelon 1 07 Melao

  • To here to help partner, try this, I think now I understand what you want: data2 = production[["Id","Fruit"]]. Where(~production["Fruit"].isin(base_data["Fruit"]). dropna()

Show 1 more comment

Browser other questions tagged

You are not signed in. Login or sign up in order to post.