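You can use the expression &#8220;(\w.+)&#8221;, which matches a word character (a digit, an uppercase or lowercase letter, or an underscore) followed by any run of characters, located between &#8220; and &#8221; (the HTML entities for the opening and closing curly quotes).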
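#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Sample data: each paragraph wraps its text in curly-quote entities
dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>
<p>&#8220;Linha 3 &#8221;</p>
"""

# Raw string so that \w reaches the regex engine unchanged
regex = re.compile(r"&#8220;(\w.+)&#8221;", re.MULTILINE)
matches = regex.findall(dados)

if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']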
As you can see, a list is returned. To access a specific value, index into it:
print(matches[0])
# Output: Linha 1
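One caveat, sketched here as an aside rather than as part of the original answer: `.+` is greedy, so when two quoted strings share a line, the pattern captures everything from the first opening entity to the last closing one. A lazy quantifier (`.+?`) stops at the nearest closing entity instead. The sample line and the names `line`, `greedy` and `lazy` are assumptions made up for illustration.

```python
import re

# Hypothetical input: two quoted strings on the same line
line = '<p>&#8220;Linha 1&#8221; e &#8220;Linha 2&#8221;</p>'

# Greedy: .+ runs to the last closing entity on the line
greedy = re.compile(r"&#8220;(\w.+)&#8221;")
# Lazy: .+? stops at the first closing entity it can reach
lazy = re.compile(r"&#8220;(\w.+?)&#8221;")

greedy_matches = greedy.findall(line)
lazy_matches = lazy.findall(line)

print(greedy_matches)
# Output: ['Linha 1&#8221; e &#8220;Linha 2']
print(lazy_matches)
# Output: ['Linha 1', 'Linha 2']
```

Either way, `\w.+` and `\w.+?` both need at least two characters between the entities, so a single-character quote would not match on its own.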
Note: regular expressions are not recommended for handling HTML/XML structures. The correct approach is to use a parser such as BeautifulSoup, which serves scraping very well.
Here is an example:
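#!/usr/bin/env python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # BeautifulSoup 4 (package beautifulsoup4)

url = 'http://answall.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item
for li in soup.find_all('li'):
    for a in li.find_all('a', href=True):  # skip anchors without href
        print("%-45s: %s" % (a.text, a['href']))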
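For comparison, the three sample paragraphs from the regex example can also be read with a parser. The sketch below is only one possible approach and deliberately uses the standard library's `html.parser` instead of BeautifulSoup, so that it runs without extra installs. The class name `ParagraphText` and the variable `linhas` are hypothetical names chosen for illustration. Because the parser decodes character references, it behaves the same whether the quotes arrive as &#8220;/&#8221; entities or as literal curly quotes.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8220; for us
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>
<p>&#8220;Linha 3 &#8221;</p>
"""

parser = ParagraphText()
parser.feed(dados)
parser.close()

# Strip the surrounding curly quotes (U+201C and U+201D) from each paragraph
linhas = [p.strip("\u201c\u201d") for p in parser.paragraphs]
print(linhas)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']
```

With BeautifulSoup the same idea would be shorter, since it already exposes the text of each tag; the standard-library version is used here only to keep the sketch dependency-free.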
Did the regex work for you?
– stderr
No, and neither did the BeautifulSoup module, I believe for lack of knowledge on my part. But I came close to what I wanted with both.
– Vinicius
Depending on what it is, I think I can help. If it is something that fits this question, edit it; if not, create a new question explaining what you are trying to achieve. If you can, mark the answer as accepted.
– stderr
@Qmechanic73 I would like to mark the answer as accepted. How do I do that?
– Vinicius
To mark an answer as accepted, click the check mark on the left side of the answer; its color will change from gray to green.
– stderr