Extract text from a Python string


-2

I have the following problem. I have a DataFrame (df) with several columns, one of which is Description. Somewhere in the middle of each description is the product code, and I would like to extract it into a new column. For example:

Description: "This is the AA-123.456 description of the product..." New column "AA-123.456"

Does anyone know how to do this in Python?

  • Regular expression?

  • @Scoring, try to specify your question better; as it stands, no one will be able to help. It would help if you posted your code so we can see what is going on and what you need. From your description, it seems that regular expressions (regex) will help.

1 answer

0


If the codes follow a fixed pattern as in your example, LL-NNN.NNN, where L is an uppercase letter and N is a digit, and both the hyphen and the dot always appear in those specific positions, you can use the method str.extract with the regular expression ([A-Z]{2}-\d{3}\.\d{3})

Note: when it comes to regular expressions, there is always a more comprehensive one.
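To make the pieces of the pattern easier to follow, here is a minimal sketch using only the standard re module on a single string. The sample sentence is the one from the question; the variable names are just for illustration.

```python
import re

# Pattern from the answer, piece by piece:
#   [A-Z]{2}  two uppercase letters
#   -         a literal hyphen
#   \d{3}     three digits
#   \.        a literal dot (escaped, so it does not mean "any character")
#   \d{3}     three digits
pattern = re.compile(r"([A-Z]{2}-\d{3}\.\d{3})")

match = pattern.search("This is the AA-123.456 description of the product...")
code = match.group(1) if match else None
print(code)  # AA-123.456

# A code with the dot in the wrong place does not match at all
print(pattern.search("um errado AA-12.3.456"))  # None
```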

See the example below:

Importing library

import pandas as pd

Creating Test Dataframe

df = pd.DataFrame({"cod": [1, 2, 3, 4], "descricao": ["Um texto qualquer AA-123.456 e segue com mais coisa", "Outro texto com o codigo BB-232.444 e vamos lá", "Por fim um produto CC-666.888 e fim", "um errado AA-12.3.456"]})

print(df)

   cod                                          descricao
0    1  Um texto qualquer AA-123.456 e segue com mais ...
1    2     Outro texto com o codigo BB-232.444 e vamos lá
2    3                Por fim um produto CC-666.888 e fim
3    4                              um errado AA-12.3.456

Creating new column

df["novaColuna"] = df["descricao"].str.extract(r'([A-Z]{2}-\d{3}\.\d{3})')

print(df)

   cod                                          descricao  novaColuna
0    1  Um texto qualquer AA-123.456 e segue com mais ...  AA-123.456
1    2     Outro texto com o codigo BB-232.444 e vamos lá  BB-232.444
2    3                Por fim um produto CC-666.888 e fim  CC-666.888
3    4                              um errado AA-12.3.456         NaN

Note that for the last item the code was not extracted, because it does not match the pattern above.
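If you want to check which rows ended up without a code, something like the sketch below may help. It assumes the same column name as the test DataFrame (descricao) and uses expand=False so that str.extract returns a Series instead of a one-column DataFrame; the names codes and missing are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "descricao": [
        "Por fim um produto CC-666.888 e fim",
        "um errado AA-12.3.456",
    ]
})

# With a single capture group, expand=False yields a Series
codes = df["descricao"].str.extract(r"([A-Z]{2}-\d{3}\.\d{3})", expand=False)

# Rows where the pattern was not found come back as NaN
missing = df[codes.isna()]
print(missing)
```

This makes it easy to inspect the descriptions that did not match before deciding whether the pattern needs to be loosened.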

  • Paulo, good morning! I applied your solution to my DF and it extracted the codes, but some of my codes have 10 characters and others have 11. Example: SE-XXX-.XXX or SE-XXX-XXXX. For those with 11 characters, the last one is not shown. If I change extract(r'([A-Z]{2}-\d{3}.\d{3})') to extract(r'([A-Z]{2}-\d{3}.\d{4})'), then all the ones with 10 characters come back as NaN.

  • Paulo, I made only this adjustment, .str.extract(r'([A-Z]{2}-\d{3}.\w+)'), and it worked. Thank you very much for your help.

  • In this case, the unescaped dot . will match any character, so a space in that position would be accepted too, and for LL-NNN.NNNX the \w+ will also take the X. Use (r'([A-Z]{2}-\d{3}\.\d{3,4})') instead. That way you get 3 or 4 digits at the end...
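A small sketch of the difference discussed in the comments above, comparing the pattern with the unescaped dot and \w+ against the one with \. and \d{3,4}. The three descriptions are made-up test data, not taken from the asker's DataFrame.

```python
import pandas as pd

# Hypothetical descriptions: a 10-character code, an 11-character code,
# and a malformed one with a space instead of the dot and a trailing letter
s = pd.Series([
    "code SE-123.456 here",
    "code SE-123.4567 here",
    "code SE-123 456X here",
])

# Unescaped dot matches any character; \w+ takes letters as well as digits
loose = s.str.extract(r"([A-Z]{2}-\d{3}.\w+)", expand=False)

# Escaped dot and 3 or 4 digits at the end
strict = s.str.extract(r"([A-Z]{2}-\d{3}\.\d{3,4})", expand=False)

print(pd.DataFrame({"loose": loose, "strict": strict}))
```

Both patterns handle the 10- and 11-character codes, but only the loose one accepts the malformed third entry; the strict one leaves it as NaN.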
