Extract text from a Python string

Question

Asked 3 years, 11 months ago

Viewed 50 times

-2

I have a DataFrame with several columns, one of which is Description. Somewhere in the middle of each description is the product code, and I would like to extract it into a new column. For example:

Description: "This is the AA-123.456 description of the product..." New column "AA-123.456"

Does anyone know how to do this in Python?

Regular expression?

– FourZeroFive

2021/08/09 at 21:00
@Scoring, try to specify your question better; the way it is, no one will be able to help. If you could include your code, we could see what is going on and what you need. From the description of your problem, it seems that regular expressions (regex) will help you.

– William

2021/08/09 at 21:30

1 answer

by Paulo Marques • **3,739** points · Answer 1 · 2021-08-09T22:27:41+00:00

If the codes follow a fixed format like the one in your example, LL-NNN.NNN, where L is an uppercase letter and N is a digit, and the hyphen and the dot always appear in those positions, you can use the str.extract method with the regular expression ([A-Z]{2}-\d{3}\.\d{3})

Note: when it comes to regular expressions, there is always a more comprehensive one.
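To make the pattern easier to check in isolation, here is a minimal sketch using only the standard re module. The helper name find_code and the stricter STRICT_CODE variant with word boundaries are assumptions added for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: two uppercase letters, a hyphen,
# three digits, a literal dot, three digits.
CODE = re.compile(r"([A-Z]{2}-\d{3}\.\d{3})")

# Hypothetical stricter variant: \b requires a word boundary on each side,
# so a code embedded in a longer token is not picked up.
STRICT_CODE = re.compile(r"\b([A-Z]{2}-\d{3}\.\d{3})\b")


def find_code(text, pattern=CODE):
    """Return the first product code found in text, or None if there is none."""
    match = pattern.search(text)
    return match.group(1) if match else None


print(find_code("This is the AA-123.456 description of the product..."))
print(find_code("um errado AA-12.3.456"))
print(find_code("ref XAA-123.4567"))
print(find_code("ref XAA-123.4567", STRICT_CODE))
```

Whether the looser or the stricter pattern is the right one depends on how the codes actually appear in your descriptions.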

See the example below:

Importing the library

import pandas as pd

Creating a test DataFrame

df = pd.DataFrame({
    "cod": [1, 2, 3, 4],
    "descricao": [
        "Um texto qualquer AA-123.456 e segue com mais coisa",
        "Outro texto com o codigo BB-232.444 e vamos lá",
        "Por fim um produto CC-666.888 e fim",
        "um errado AA-12.3.456",
    ],
})

print(df)

   cod                                          descricao
0    1  Um texto qualquer AA-123.456 e segue com mais ...
1    2     Outro texto com o codigo BB-232.444 e vamos lá
2    3                Por fim um produto CC-666.888 e fim
3    4                              um errado AA-12.3.456

Creating the new column

df["novaColuna"] = df["descricao"].str.extract(r'([A-Z]{2}-\d{3}\.\d{3})')

print(df)

   cod                                          descricao  novaColuna
0    1  Um texto qualquer AA-123.456 e segue com mais ...  AA-123.456
1    2     Outro texto com o codigo BB-232.444 e vamos lá  BB-232.444
2    3                Por fim um produto CC-666.888 e fim  CC-666.888
3    4                              um errado AA-12.3.456         NaN

Note that no code was extracted for the last item, because its text does not match the pattern.
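Rows without a match end up as NaN, so it can be useful to list them for review. The following is a small sketch under that assumption; it uses a shortened two-row version of the test data, and the variable name sem_codigo is hypothetical.

```python
import pandas as pd

# Shortened version of the test data: one valid code, one malformed code.
df = pd.DataFrame({
    "cod": [1, 2],
    "descricao": [
        "Por fim um produto CC-666.888 e fim",
        "um errado AA-12.3.456",
    ],
})

# expand=False returns a Series when the pattern has a single capture group.
df["novaColuna"] = df["descricao"].str.extract(
    r"([A-Z]{2}-\d{3}\.\d{3})", expand=False
)

# Rows where no code was found hold NaN, so isna() selects them.
sem_codigo = df[df["novaColuna"].isna()]
print(sem_codigo["cod"].tolist())
```

From there you could fix the descriptions by hand or try a more permissive pattern on just those rows.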