Scraping with Python

5

How can I capture one or more phrases from a website with Python and/or regular expressions?

I want everything that starts with <p>&#8220; and ends with &#8221;</p>

Example:

<p>&#8220;frasefrasefrasefrasefrasefrasefrasefrase.&#8221;</p>

How should I proceed?

  • Did the regex work for you?

  • No, and neither did the BeautifulSoup module, I believe for lack of knowledge on my part. But I came close to what I wanted with both.

  • Depending on what it is, I think I can help. If it fits within this question, edit it; if not, create a new question explaining what you are trying to achieve. And if this answer solved your problem, you can mark it as accepted.

  • @Qmechanic73 I would like to mark the answer as accepted. How do I do that?

  • To mark an answer as accepted, click the check mark on the left side of the answer; its color will change from gray to green.

1 answer

3


You can use the expression #8220;(\w.+)&#8221, which captures the text between #8220; and &#8221. The group (\w.+) matches one word character (a letter, digit, or underscore) followed by one or more characters of any kind except a newline.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
"""

# Raw string so that \w reaches the regex engine as written.
regex = re.compile(r"#8220;(\w.+)&#8221", re.MULTILINE)
matches = regex.findall(dados)

if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']

As you can see, a list is returned. To access a specific value, index into it:

print(matches[0])
# Output: Linha 1
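
One caveat with the expression above: .+ is greedy, so it only behaves well when each line holds a single quoted phrase, as in the sample data. A minimal sketch of the difference follows, assuming a hypothetical input line with two phrases (the sample string and the variable names greedy and lazy are made up for illustration):

```python
import re

# Hypothetical sample: two quoted phrases share a single line.
dados = '<p>&#8220;Frase 1&#8221; e &#8220;Frase 2&#8221;</p>'

# Greedy: .+ runs on to the LAST &#8221; on the line.
greedy = re.findall(r"&#8220;(\w.+)&#8221;", dados)

# Non-greedy: .+? stops at the FIRST &#8221; after each opening quote.
lazy = re.findall(r"&#8220;(\w.+?)&#8221;", dados)

print(greedy)  # one match spanning both phrases
print(lazy)    # one match per phrase
```

If several phrases can appear on the same line, the non-greedy form is probably the safer choice.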


Note: Regular expressions are not recommended for handling HTML/XML structures. The correct approach is to use a parser such as BeautifulSoup, which serves the purpose of scraping very well.

Here is an example:

#!/usr/bin/env python

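from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

url = 'http://answall.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item.
for li in soup.find_all('li'):
    # href=True skips anchors without an href, avoiding a KeyError.
    for a in li.find_all('a', href=True):
        print("%-45s: %s" % (a.text, a['href']))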
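
For the specific case in the question (paragraphs wrapped in &#8220; and &#8221;), a parser-based version could look roughly like the sketch below. To keep it free of third-party packages it uses the standard library's html.parser rather than BeautifulSoup, so treat it as a substitute, not as what the answer above does. The class name QuotedParagraphs and the sample input are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class QuotedParagraphs(HTMLParser):
    """Collect the text of <p> elements wrapped in curly quotes."""

    def __init__(self):
        # The default convert_charrefs=True decodes &#8220; and &#8221;
        # into the actual curly quote characters before handle_data runs.
        super().__init__()
        self.in_p = False
        self.buffer = []
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.buffer).strip()
            # Keep only paragraphs that open with U+201C and close with U+201D.
            if text.startswith("\u201c") and text.endswith("\u201d"):
                self.phrases.append(text[1:-1])


# Hypothetical sample input; the middle paragraph has no quotes.
dados = """<p>&#8220;Linha 1&#8221;</p>
<p>Sem aspas</p>
<p>&#8220;Linha 3 &#8221;</p>
"""

parser = QuotedParagraphs()
parser.feed(dados)
parser.close()
print(parser.phrases)  # ['Linha 1', 'Linha 3 ']
```

To run it against a real page, feed the parser the decoded HTML of that page instead of the sample string.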
  • 1

    The reason for not using regex to parse HTML is explained in a very famous answer on the English Stack Overflow: http://stackoverflow.com/questions/1732348/regex-match-open-tags-exceptit-xhtml-self-contained-tags/1732454#1732454
