Python 3.6 regular expression for whole-phrase extraction


I need to extract only the phrases such as ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B, for example. That is, I need to get only the course name, the city, the shift, the SISU and the group name from the following string:

string = </li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>

The string is huge; that's just a piece of it. I managed to write an expression, but it's returning mangled results, and it's also not picking up accented letters - for example, the accented "Ó" in HISTÓRIA. The expression I wrote was this:

cursos = re.findall(([A-Z])\w+g)

I need it to output this:

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A

But it returns this to me:

GEOGRAFIA - JUIZ DE FORA - DIURNO - SISU - GRUPO (it's not capturing which group it is)

and in HISTÓRIA, for example, it doesn't capture the accented "Ó".

  • Can you also tell us what URL you're fetching the HTML from, please? It would be easier to help you.

  • In this case I don't need the URLs, just the phrases. I've already extracted the URLs with another expression because I need them somewhere else. It's only the phrases.

  • Do you NEED to use Python 2 for this? It's much better to use Python 3 - for a start, you won't have problems with accented characters.

  • (Does your keyboard not have the ' " ' character? Quotes are missing in both the HTML snippet and the Python code.)

3 answers

3

I've been waiting for someone who actually knows regex to respond, but I'm going to give you a different solution (and, in many cases, a better one):

from bs4 import BeautifulSoup as bs

string = '</li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>'

soup = bs(string, 'html.parser')
aEles = soup.findAll('a')
texts = '\n'.join(i.text for i in aEles if i.text != '')
print(texts)

This will print:

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A
ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B

  • A regular expression for this is not difficult - but the answer is correct: it is wrong on several levels to rely on regular expressions to extract content from HTML markup. The right approach is to use a tool that parses HTML and extracts content while "knowing" what is content, like BeautifulSoup.

  • @jsbueno I agree, thanks. Maybe I'll do it with regex once I get home.

  • With regex you don't need to - I've already done it below. :-) The right way really is with BeautifulSoup.

  • @jsbueno ha, sorry, I hadn't noticed. So the OP already has both solutions.

2

(1) Regular expressions are not the most suitable tool for extracting HTML content - it is best to use an HTML parser that does this, like the BeautifulSoup shown in Miguel's answer, or the html.parser module from the standard Python library itself: https://docs.python.org/3/library/html.parser.html (In Python 2 the module is HTMLParser instead of html.parser - but, I insist, you shouldn't be using Python 2: it will leave you 10 years behind in functionality and convenience, including the handling of accented characters.)
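For reference, a minimal sketch of that standard-library route, using an html.parser.HTMLParser subclass to collect the text inside the <a> tags (it assumes the same string variable with the HTML snippet from the question):

from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Collects the text content of every <a>...</a> tag."""
    def __init__(self):
        super().__init__()
        self._in_a = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

    def handle_data(self, data):
        # Only keep non-empty text that appears inside an <a> tag
        if self._in_a and data.strip():
            self.texts.append(data.strip())

parser = LinkTextParser()
parser.feed(string)  # "string" holds the HTML snippet from the question
print('\n'.join(parser.texts))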

(2) That said, the problem with your regular expression is that you are focusing on the wrong thing - instead of looking for the phrases themselves, which can have many variations, it is much easier to look for what is around the phrase, which is fixed (the <a> and </a> tags). If there are more links than the ones of interest, you can start complicating your regular expression to get only the content of <a> inside <li>, for example (and then you'll understand why the recommendation is NOT to use regular expressions for this) - or, after extracting all the content from the <a> tags, use a plain Python filter with "for" and "if" to keep only what interests you; this can be more readable and easier than a complex regexp (there is a small sketch of this after the example output below).

With all that said, the regular expression that retrieves everything inside the <a> tags, which you can use with the findall method, is:

re.findall(r"<a.*?>(.*?)</a", string)

The output I get for the HTML snippet you pasted is:

['ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO A',
 'ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO B']

(That's in Python 2.7 - in Python 3, the accented characters of the excerpt already appear correctly in the representation.)
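And, for completeness, a minimal sketch of the "for"/"if" filtering mentioned in item (2), applied to the findall result above (the "SISU" substring check is just an assumed example criterion - swap it for whatever actually distinguishes the phrases you want; "string" is again the HTML snippet from the question):

import re

# All the text inside <a>...</a> tags, as in the expression above
links = re.findall(r"<a.*?>(.*?)</a", string)

# Plain "for"/"if" filter: keep only the phrases of interest
cursos = [texto for texto in links if "SISU" in texto]

for curso in cursos:
    print(curso)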

-1

Using regex with HTML formatted this way, you can do it like this:

regex = r'(?<=<a href=http://www.ufjf.br/cdara/sisu-2/sisu-20\d{2}-\da-edicao/lista-de-espera-sisu-\d)(?:[\s\S]*?>)([-\x41-\x5A\xC0-\xDC\s]*?)(?:</a>)'
cursos = re.findall(regex, string)
  • You generated an automatic regular expression and posted it without paying attention to what it does: in this case, the regular expression matches exactly the characters that are in the example phrase from the question, but it will fail if a phrase contains different letters. This answer is incorrect.
