Scraping with Python

Question

Asked 10 years, 7 months ago

How to capture one or more phrases from a website with Python and/or regular expressions?

I want everything that starts with <p>&#8220; and ends with &#8221;</p>

Example:

<p>&#8220;frasefrasefrasefrasefrasefrasefrasefrase.&#8221;</p>

How to proceed?

Did the regex work for you?

– stderr

2015/01/27 at 21:44
No, and neither did the BeautifulSoup module, I believe for lack of knowledge on my part. But I came close to what I wanted with both.

– Vinicius

2015/01/27 at 23:48
Depending on what it is, I think I can help. If it fits this question, edit it; if not, create a new question explaining what you are trying to achieve. And if you can, mark the answer as accepted.

– stderr

2015/01/28 at 00:16
@Qmechanic73 I would like to mark the answer as accepted. How do I do that?

– Vinicius

2015/01/28 at 04:28
To mark an answer as accepted, click the check mark on the left side of the answer; its color will change from gray to green.

– stderr

2015/01/28 at 11:03

1 answer

by stderr • **30,356** points · Answer 1 · 2015-01-27T00:05:12+00:00

You can use the expression #8220;(\w.+)&#8221, which captures whatever sits between #8220; and &#8221 as long as it starts with a word character (a letter, upper or lower case, a digit, or an underscore) followed by one or more characters of any kind.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
"""

# Raw string so that \w reaches the regex engine unchanged
regex = re.compile(r"#8220;(\w.+)&#8221", re.MULTILINE)
matches = regex.findall(dados)

if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']

As you can see, a list is returned. To access a specific value, index into it:

print(matches[0])
# Output: Linha 1
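One caveat: `.+` is greedy, so the pattern only behaves as shown when each phrase sits on its own line. The sketch below uses a hypothetical one-line sample (not from the question) to compare the greedy pattern with a lazy variant, `.+?`, which stops at the first closing entity. The names `greedy` and `lazy` are just for illustration.

```python
import re

# Hypothetical sample: two quoted phrases on the same line
data = '<p>&#8220;Linha 1&#8221;</p><p>&#8220;Linha 2&#8221;</p>'

# Greedy: .+ runs to the LAST closing entity on the line
greedy = re.compile(r"&#8220;(\w.+)&#8221;")
# Lazy: .+? stops at the FIRST closing entity after each opening one
lazy = re.compile(r"&#8220;(.+?)&#8221;")

greedy_matches = greedy.findall(data)
lazy_matches = lazy.findall(data)

print(greedy_matches)  # a single match spanning both phrases
print(lazy_matches)    # ['Linha 1', 'Linha 2']
```

If the page might put several phrases on one line, the lazy form is probably the safer choice.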

Note: Regular expressions are not recommended for handling HTML/XML structures. The right approach is to use a parser such as BeautifulSoup, which works very well for scraping.

Here is an example:

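#!/usr/bin/env python3

# Python 3 version: BeautifulSoup 4 is imported from the bs4 package
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'http://answall.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item
for li in soup.find_all('li'):
    for a in li.find_all('a', href=True):  # skip anchors without an href
        print("%-45s: %s" % (a.text, a['href']))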
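
Applied to the markup from the question, the parser approach could look roughly like the sketch below. To avoid a third-party dependency it uses the standard library's `html.parser` rather than BeautifulSoup, and the class name `QuoteParagraphParser` is made up for this example. It assumes the phrases are always the full text of a `<p>` element and that paragraphs are not nested.

```python
from html.parser import HTMLParser


class QuoteParagraphParser(HTMLParser):
    """Collect the text of <p> elements wrapped in curly quotes (illustrative)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &#8220; and &#8221;
        # arrive in handle_data already decoded to “ and ”
        super().__init__()
        self.in_p = False
        self.buffer = []
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.buffer)
            # Keep only paragraphs that start with “ and end with ”
            if len(text) >= 2 and text[0] == "\u201c" and text[-1] == "\u201d":
                self.phrases.append(text[1:-1])


dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
"""

parser = QuoteParagraphParser()
parser.feed(dados)
parser.close()
print(parser.phrases)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']
```

Unlike the regex, this does not care whether the phrases share a line or whether the `<p>` tags carry attributes.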