How to extract the text from a page?

I am creating a web scraping program and would like to know if there is any way to extract the text from a site in this way:

This is the HTML:

<div class="s-sidebarwidget s-sidebarwidget__yellow s-anchors s-anchors__default sidebar-help" id="how-to-format" style="">

    <h4 class="s-sidebarwidget--header mb0">
        Como formatar
            <a href="#wmd-input" class="js-back-to-edit-field s-sidebarwidget--action d-none md:d-inline">back <svg aria-hidden="true" class="svg-icon va-middle iconArrowUpSm" width="14" height="14" viewBox="0 0 14 14"><path d="M3 9h8L7 5z"></path></svg></a>
    </h4>
    <div class="s-sidebarwidget--content d-block">


        <p>
            <span class="dingus">►</span> create code fences with backticks ` or tildes ~
            </p><div class="bg-black-050 p8 bar-sm ff-mono my4 wmx2">
                ```<br>
                like so<br>
                ```
            </div>
        <p></p>
        <p>
            <span class="dingus">►</span> add language identifier to highlight code
            </p><div class="bg-black-050 p8 bar-sm ff-mono my4 wmx2">
                ```python<br>
                <span class="fc-blue-600">def</span> function(foo):<br>
                <span class="fc-blue-600">&nbsp;&nbsp;&nbsp;&nbsp;print</span>(foo)<br>
                ```
            </div>
        <p></p>
        <p><span class="dingus">►</span> coloque retornos entre os parágrafos</p>
        <p><span class="dingus">►</span> para quebra de linha adicione 2 espaços no final</p>
        <p><span class="dingus">►</span> <i>_itálico_</i> ou <b>**negrito**</b></p>
            <p><span class="dingus">►</span> recue o código em 4 espaços</p>
    <p><span class="dingus">►</span> escapes de acentos graves <code>`parecido _portanto_`</code></p>

        <p><span class="dingus">►</span> destaque colocando &gt; no início da linha</p>
        <p><span class="dingus">►</span> para fazer links</p>
        <p>&lt;http://foo.com&gt;<br>[foo](http://foo.com)<br>&lt;a href="http://foo.com"&gt;foo&lt;/a&gt;</p>

        <p class="ar">
            <a href="/editing-help" target="_edithelp">ajuda na formatação »</a><br>
                        <a href="/questions/how-to-ask">ajuda para perguntas »</a>

        </p>

    </div>
</div>

This is what I see (and what I want to extract):

     Como formatar

► create code fences with backticks ` or tildes ~
```
like so
```

► add language identifier to highlight code
```python
def function(foo):
    print(foo)
```

► coloque retornos entre os parágrafos

► para quebra de linha adicione 2 espaços no final

► _itálico_ ou **negrito**

► recue o código em 4 espaços

► escapes de acentos graves `parecido _portanto_`

► destaque colocando > no início da linha

► para fazer links

<http://foo.com>
[foo](http://foo.com)
<a href="http://foo.com">foo</a>

Basically, I would like to extract the text from a site as if I had pressed Ctrl+A, Ctrl+C.

I tried it this way, but the result was not what I expected: BeautifulSoup(req, 'html.parser').get_text

Is there any way to turn HTML into "text"?

  • Yes, there are a few ways to do this. You can use regex or CGI, the latter being the better one for handling HTML/web pages.

  • Your question is too vague about how far you have gotten. If you are starting from scratch, you will need to make an HTTP request to fetch the page and then parse the content, for example with BeautifulSoup. There are a thousand ways to do it, though; as I said, your question is very vague.

  • I've tried using bs4 and requests, but I still can't turn the HTML into text (as in my example).

  • Edit your question and add your Python code, so we can show you where the problem is and help you.

2 answers

1


I figured out a way to do this using the html2text module:

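import requests
import html2text

# Download the page and take its HTML source as a string
html = requests.get('https://pastebin.com/').text
# Convert the HTML into Markdown-style plain text
print(html2text.html2text(html))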
  • You still need to explain how to install the module, since your answer may be useful to future visitors.

1

Assuming you are using bs4, you could use its own get_text method (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>foo</li><li>bar</li><li>baz</li></ul>", 'html.parser')

# Join all text fragments, using a newline as the separator
print(soup.get_text('\n'))

The '\n' is the separator placed between the text fragments that remain once the tags are removed.
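On a real page, the whitespace between tags also counts as text, so get_text('\n') alone tends to produce many blank lines. A small sketch of one way around that, assuming bs4 is installed: the strip=True argument of get_text trims each fragment and skips the empty ones. The HTML snippet here is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet with indentation between tags, as on a real page
html = """
<div>
    <p><span>►</span> first item</p>
    <p><span>►</span> second item</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and skips the ones that end up empty
text = soup.get_text("\n", strip=True)
print(text)
```

Note that the separator is still inserted around inline tags such as <span>, so each ► marker ends up on its own line rather than next to its text.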

Or use the .strings generator (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings) to iterate over the individual strings:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>foo</li><li>bar</li><li>baz</li></ul>", 'html.parser')

# .strings is a generator that yields each text fragment in document order
for string in soup.strings:
    print(repr(string))
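Both get_text and .strings treat every tag alike, so neither knows that a <p> or <br> should start a new line while a <span> or <b> should not. If the goal is something closer to what copying from the browser gives, one option is to break lines only at block-level tags. Below is a minimal sketch using only the standard library's html.parser; the names BlockTextExtractor, BLOCK_TAGS and html_to_text are illustrative assumptions, not part of any library, and the set of tags is a guess that would need extending for real pages.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as line-breaking; extend as needed
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "br"}


class BlockTextExtractor(HTMLParser):
    """Collects text, starting a new line only at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [[]]  # one list of raw text fragments per output line

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.lines.append([])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.lines.append([])

    def handle_data(self, data):
        self.lines[-1].append(data)

    def text(self):
        out = []
        for parts in self.lines:
            # Glue inline fragments together, then collapse whitespace
            line = " ".join("".join(parts).split())
            if line:
                out.append(line)
        return "\n".join(out)


def html_to_text(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


sample = (
    '<h4>Como formatar</h4>'
    '<p><span class="dingus">►</span> <i>_itálico_</i> ou <b>**negrito**</b></p>'
    '<div>```<br>like so<br>```</div>'
)
print(html_to_text(sample))
```

This sketch collapses all whitespace, so indentation such as the &nbsp; runs in the question's code sample is lost, and it does not skip <script> or <style> contents. For full pages, a dedicated converter such as html2text is likely the more robust choice.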
