Filter column value and extract from these rows in another column the value and write to a new column

Question

Filter column value and extract from these rows in another column the value and write to a new column

Asked 4 years, 5 months ago

Viewed 80 times

-4

I need some help, I’m getting laid. I have the title as shown in the table on the left and need to transform with Pandas in the format of the table in yellow, will the account Pandas?

I am with 3 columns and more than 1000 rows. There are randomly rows that have information different from all the others. In those in question, has in the first column all the same value, see below that whole line:

col1      col2      col3
-------------------------------------------------
FIG.      Item      Nomes diferentes nesta coluna

In all these rows you have the column with the value "FIG."; that is, what differentiates these lines from the others.

I need to filter these lines (with "FIG.") and take the value of this line which is a text that is in the third column and create a fourth column with that value. Can you help me with that? It would look something like this:

col1      col2      col3                               col4
------------------------------------------------------------------------------------
FIG.      Item      Nomes diferentes nesta coluna      Nomes diferentes nesta coluna

After that I want to go back to DataFrame which contains all lines before filtering but which stays with this fourth column.

I did that and the filter worked:

df_mask = df['FIG.'] =='FIG.' filtered_df = df[df_mask] lista = filtered_df

And that worked, too:

df = pd.DataFrame(lista,columns=['FIG.','Item','Nomes diferentes nesta coluna'])

But I’m unable to create a new column with the values "Different names in this column".

1

Luiz, the description is a little confusing to understand. You could update your post by placing 3 lines of your original dataframe and the output you expect?

– Paulo Marques

2021/03/03 at 20:52
Luiz Notari, don’t tear up the questions.

– Augusto Vasques

2021/05/30 at 16:03

2 answers

0

The great difficulty of using Pandas is to think that you can do everything in one line. It may be possible, but it is not readable and is difficult to maintain.

So let’s split up

Creating a test base

import pandas as pd
import numpy as np

df = pd.DataFrame({"FigField": ["FIG", "",  "",  "", "FIG",  "",  "",  "", "FIG",  "",  ""], "CatFilme": ["Categoria 1", "filme a",  "filme b",  "filme c", "Categoria 2",  "filme x",  "filme y",  "filme z", "Categoria 3",  "filme h",  "filme k"]})

We have:

   FigField     CatFilme
0       FIG  Categoria 1
1                filme a
2                filme b
3                filme c
4       FIG  Categoria 2
5                filme x
6                filme y
7                filme z
8       FIG  Categoria 3
9                filme h
10               filme k

Creating auxiliary column "zeroed"

df["ColAux"] = np.nan

We have

   FigField     CatFilme  ColAux
0       FIG  Categoria 1     NaN
1                filme a     NaN
2                filme b     NaN
3                filme c     NaN
4       FIG  Categoria 2     NaN
5                filme x     NaN
6                filme y     NaN
7                filme z     NaN
8       FIG  Categoria 3     NaN
9                filme h     NaN
10               filme k     NaN

Copying category to auxiliary column only where there is FIG

df["ColAux"] = df.apply(lambda row: row["CatFilme"] if row["FigField"] == "FIG" else np.nan, axis=1)

We have

   FigField     CatFilme       ColAux
0       FIG  Categoria 1  Categoria 1
1                filme a          NaN
2                filme b          NaN
3                filme c          NaN
4       FIG  Categoria 2  Categoria 2
5                filme x          NaN
6                filme y          NaN
7                filme z          NaN
8       FIG  Categoria 3  Categoria 3
9                filme h          NaN
10               filme k          NaN

Filling the Nan of the auxiliary column with the category

df["ColAux"].fillna(method="ffill", inplace=True)

We have

   FigField     CatFilme       ColAux
0       FIG  Categoria 1  Categoria 1
1                filme a  Categoria 1
2                filme b  Categoria 1
3                filme c  Categoria 1
4       FIG  Categoria 2  Categoria 2
5                filme x  Categoria 2
6                filme y  Categoria 2
7                filme z  Categoria 2
8       FIG  Categoria 3  Categoria 3
9                filme h  Categoria 3
10               filme k  Categoria 3

Deleting lines containing FIG

df = df[df["FigField"] != "FIG"]

We have

   FigField CatFilme       ColAux
1            filme a  Categoria 1
2            filme b  Categoria 1
3            filme c  Categoria 1
5            filme x  Categoria 2
6            filme y  Categoria 2
7            filme z  Categoria 2
9            filme h  Categoria 3
10           filme k  Categoria 3

Can still be done reset_index() case you find interesting.

@Luiznotari, good that it worked. So mark the answer as accepted.

– Paulo Marques

2021/04/19 at 18:10

Browser other questions tagged python pandas

You are not signed in. Login or sign up in order to post.

by Paulo Marques • **3,739** points · Answer 1 · 2021-03-05T00:09:29+00:00

I believe you’re looking stack() or unstack()

To remember what each one does, the stack will leave your data frame longer, and the unstack wider.

Since I don’t have access to the data, follow an example.

>>> import pandas as pd

>>> frutas = ['banana', 'abacaxi', 'ameixa', 'banana', 'abacaxi']
>>> valores = [1.0, 1.0, 2.0, 5.0, 3.5]

>>> df = pd.DataFrame({"fruta": frutas, "valor": valores})

>>> df
     fruta  valor
0   banana    1.0
1  abacaxi    1.0
2   ameixa    2.0
3   banana    5.0
4  abacaxi    3.5

Creating the group

>>> df.groupby("fruta").agg('sum')
         valor
fruta
abacaxi    4.5
ameixa     2.0
banana     6.0

Stacking

>>> df.groupby("fruta").agg('sum').stack()
fruta
abacaxi  valor    4.5
ameixa   valor    2.0
banana   valor    6.0
dtype: float64

Unstacking

>>> df.groupby("fruta").agg('sum').unstack()
       fruta
valor  abacaxi    4.5
       ameixa     2.0
       banana     6.0
dtype: float64
>>>

Creating group by both columns

>>> df.groupby(["fruta", "valor"]).agg({'valor': 'sum'})
               valor
fruta   valor
abacaxi 1.0      1.0
        3.5      3.5
ameixa  2.0      2.0
banana  1.0      1.0
        5.0      5.0

Choosing the column for Unstack

>>> df.groupby(["fruta", "valor"]).agg({'valor': 'sum'}).unstack()
        valor
valor     1.0  2.0  3.5  5.0
fruta
abacaxi   1.0  NaN  3.5  NaN
ameixa    NaN  2.0  NaN  NaN
banana    1.0  NaN  NaN  5.0

Filling Nan with zero

>>> df.groupby(["fruta", "valor"]).agg({'valor': 'sum'}).unstack(fill_value=0)
        valor
valor     1.0  2.0  3.5  5.0
fruta
abacaxi   1.0  0.0  3.5  0.0
ameixa    0.0  2.0  0.0  0.0
banana    1.0  0.0  0.0  5.0

See the documentation of the two: stack and Unstack

In time: this is just an example with no specific application and served only to explore the methods stack() and unstack() who will help you with your case.