Python 3.6 regular expression for whole-phrase extraction

Question

Asked 8 years, 5 months ago

I need to extract only the phrases of the form ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B, for example. That is, I need to get only the course name, the city, the shift, the SISU and the group name from the following string:

string = </li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>

The string is huge; this is just a piece of it. I managed to write an expression, but it returns strange things, and it also doesn't pick up accented letters, for example the accented "Ó" in HISTÓRIA. The expression I wrote was:

cursos = re.findall(([A-Z])\w+g)

I need it to return this:

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A

But it returns this:

GEOGRAFIA - JUIZ DE FORA - DIURNO - SISU - GRUPO (it is not capturing which group it is)

and in HISTÓRIA, for example, it does not capture the accented "Ó".

Can you also say which URL you're fetching the HTML from? It would be easier to help you.

– Miguel

2017/02/28 at 16:37
In this case I don't need the URLs, only the phrases. I've already extracted the URLs with another expression because I need them in a separate place. I just need the phrases themselves.

– SasukeUchiha

2017/02/28 at 16:47
Do you NEED to use Python 2 for this? Python 3 is much better: for a start, you won't have problems with accented characters.

– jsbueno

2017/03/01 at 13:10
(Does your keyboard have no " key? The quotes are missing in both the HTML snippet and the Python code.)

– jsbueno

2017/03/01 at 13:25

3 answers

by Miguel • **29,306** points · Answer 1 · 2017-02-28T20:02:08+00:00

I've been waiting for someone who actually knows regex to respond, but I'm going to give you a different solution (and in many cases a better one):

from bs4 import BeautifulSoup as bs

string = '</li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=70>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=71>ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li><li><a href=http://www.ufjf.br/cdara/sisu-2/sisu-2017-1a-edicao/lista-de-espera-sisu-3/?id_curso=46A&id_grupo=72>'

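soup = bs(string, 'html.parser')
# Collect every <a> element (find_all is the current name for findAll)
aEles = soup.find_all('a')
# Join the non-empty link texts, one per line
texts = '\n'.join(i.text for i in aEles if i.text != '')
print(texts)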

This will print:

ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A
ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B

by jsbueno • **30,668** points · Answer 2 · 2017-03-01T13:27:45+00:00

(1) Regular expressions are not the most suitable tool for extracting HTML content. It is best to use an HTML parser for that, such as BeautifulSoup, shown in Miguel's answer, or the html.parser module from the Python standard library itself: https://docs.python.org/3/library/html.parser.html (In Python 2 the module is HTMLParser instead of html.parser, but, I insist, you shouldn't be using Python 2: it leaves you 10 years behind in functionality and ease of use, including the handling of accented characters.)

(2) That said, the problem with your regular expression is that you are approaching it from the wrong direction: instead of looking for the phrases themselves, which may have many variations, it is much easier to look for what surrounds each phrase, which is fixed (the <a> and </a> tags). If there are more links than the ones you care about, you can start complicating your regular expression to get only the content of <a> within <li>, for example (and then you'll understand why the recommendation is NOT to use regular expressions for this), or, after extracting the content of all the <a> tags, use an ordinary Python filter with "for" and "if" to keep only what interests you (this can be more readable and easier than a complex regexp).

With all that said, the regular expression to retrieve everything inside the <a> tags, which you can use with the findall method, is:

re.findall(r"<a.*?>(.*?)</a", string)

The output I get for the HTML snippet you pasted is:

['ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO A',
 'ADMINISTRA\xc3\x87\xc3\x83O - JUIZ DE FORA - NOTURNO - SISU - GRUPO B']

(This is in Python 2.7; in Python 3 the accented characters in this snippet already appear correctly in the representation.)
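As an illustration of points (1) and (2), here is a minimal Python 3 sketch that uses the standard library's html.parser to collect the text of every <a> tag and then filters the results with a plain "for" and "if". The class name AnchorTextParser, the shortened sample HTML and the " - SISU - " filter are assumptions made for this example, not part of the original answer.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Hypothetical helper: collects the text inside each <a>...</a>."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False   # True while we are inside an <a> element
        self._buffer = []         # pieces of text seen inside the current <a>
        self.texts = []           # final list of non-empty link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            text = "".join(self._buffer).strip()
            if text:
                self.texts.append(text)


# Shortened, made-up sample in the same shape as the HTML from the question,
# plus one unrelated link to show the filtering step.
html = (
    '<ul>'
    '<li><a href="/">Home</a></li>'
    '<li><a href="?id_curso=46A&id_grupo=70">'
    'ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO A</a></li>'
    '<li><a href="?id_curso=46A&id_grupo=71">'
    'ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li>'
    '</ul>'
)

parser = AnchorTextParser()
parser.feed(html)
parser.close()

# Plain "for" + "if" filter: keep only the links that look like course phrases.
phrases = []
for text in parser.texts:
    if " - SISU - " in text:
        phrases.append(text)

for phrase in phrases:
    print(phrase)
```

Keeping the extraction (all link texts) separate from the filter (which links matter) means the filter can be changed without touching the parsing code.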

by Iron Man • **768** points · Answer 3 · 2017-03-01T14:33:05+00:00

Using a regex on HTML formatted this way, you can do it like this:

import re

regex = r'(?<=<a href=http://www.ufjf.br/cdara/sisu-2/sisu-20\d{2}-\da-edicao/lista-de-espera-sisu-\d)(?:[\s\S]*?>)([-\x41-\x5A\xC0-\xDC\s]*?)(?:</a>)'
cursos = re.findall(regex, string)
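To see why the character class matters, here is a simplified Python 3 sketch (not the exact pattern above: it drops the lookbehind on the URL). It writes the same range with literal characters, where A-Z is \x41-\x5A and À-Ü is \xC0-\xDC, and compares it with a plain [A-Z] class. The sample string and the name PHRASE are made up for the illustration.

```python
import re

# Uppercase ASCII letters, accented capitals from À (\xC0) to Ü (\xDC),
# hyphens and whitespace, captured between a closing ">" and "</a>".
PHRASE = re.compile(r">([-A-ZÀ-Ü\s]+)</a>")

# Hypothetical sample in the same shape as the HTML from the question.
sample = (
    "<li><a href=?id_grupo=70>"
    "HISTÓRIA - JUIZ DE FORA - DIURNO - SISU - GRUPO A</a></li>"
    "<li><a href=?id_grupo=71>"
    "ADMINISTRAÇÃO - JUIZ DE FORA - NOTURNO - SISU - GRUPO B</a></li>"
)

cursos = PHRASE.findall(sample)
for curso in cursos:
    print(curso)

# A plain [A-Z] class stops at the accented letter, which is the symptom
# described in the question.
ascii_only = re.findall(r"[A-Z]+", "HISTÓRIA")
print(ascii_only)
```

This only works for phrases written entirely in capitals; link text containing lowercase letters or digits would need a wider class.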