I am creating a web scraping program and would like to know if there is any way to extract the text from a site in this way:
This is the html:
<div class="s-sidebarwidget s-sidebarwidget__yellow s-anchors s-anchors__default sidebar-help" id="how-to-format" style="">
<h4 class="s-sidebarwidget--header mb0">
Como formatar
<a href="#wmd-input" class="js-back-to-edit-field s-sidebarwidget--action d-none md:d-inline">back <svg aria-hidden="true" class="svg-icon va-middle iconArrowUpSm" width="14" height="14" viewBox="0 0 14 14"><path d="M3 9h8L7 5z"></path></svg></a>
</h4>
<div class="s-sidebarwidget--content d-block">
<p>
<span class="dingus">►</span> create code fences with backticks ` or tildes ~
</p><div class="bg-black-050 p8 bar-sm ff-mono my4 wmx2">
```<br>
like so<br>
```
</div>
<p></p>
<p>
<span class="dingus">►</span> add language identifier to highlight code
</p><div class="bg-black-050 p8 bar-sm ff-mono my4 wmx2">
```python<br>
<span class="fc-blue-600">def</span> function(foo):<br>
<span class="fc-blue-600"> print</span>(foo)<br>
```
</div>
<p></p>
<p><span class="dingus">►</span> coloque retornos entre os parágrafos</p>
<p><span class="dingus">►</span> para quebra de linha adicione 2 espaços no final</p>
<p><span class="dingus">►</span> <i>_itálico_</i> ou <b>**negrito**</b></p>
<p><span class="dingus">►</span> recue o código em 4 espaços</p>
<p><span class="dingus">►</span> escapes de acentos graves <code>`parecido _portanto_`</code></p>
<p><span class="dingus">►</span> destaque colocando > no início da linha</p>
<p><span class="dingus">►</span> para fazer links</p>
<p><http://foo.com><br>[foo](http://foo.com)<br><a href="http://foo.com">foo</a></p>
<p class="ar">
<a href="/editing-help" target="_edithelp">ajuda na formatação »</a><br>
<a href="/questions/how-to-ask">ajuda para perguntas »</a>
</p>
</div>
</div>
This is what I see in the browser (and what I want to extract):
Como formatar
► create code fences with backticks ` or tildes ~
```
like so
```
► add language identifier to highlight code
```python
def function(foo):
print(foo)
```
► coloque retornos entre os parágrafos
► para quebra de linha adicione 2 espaços no final
► _itálico_ ou **negrito**
► recue o código em 4 espaços
► escapes de acentos graves `parecido _portanto_`
► destaque colocando > no início da linha
► para fazer links
<http://foo.com>
[foo](http://foo.com)
<a href="http://foo.com">foo</a>
Basically, I would like to extract the text from a site as if I had pressed Ctrl+A, Ctrl+C.
I tried the following, but the output was not what I expected: BeautifulSoup(req, 'html.parser').get_text
Is there any way to turn HTML into plain text?
Yes, there are a few ways to do this. You can use regex or CGI, the latter being the better one for handling HTML/web pages.
– Leonardo Oliveira
Your question is too vague about how far you have gotten. If you are starting from scratch, you will need to fetch the page with an HTTP request, and to parse the content you could use BeautifulSoup, for example. There are a thousand ways to do this, though; as I said, your question is very vague.
– Lucas Miranda
I’ve tried using bs4 and requests, but I still can’t turn the HTML into text (as in my example).
– João
Edit your question and include your Python code, so we can show you where the problem is and help you.
– Carlos H Marques
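For reference, here is a minimal sketch of one way to approximate the "Ctrl+A, Ctrl+C" result using only Python's standard library. It is not the approach from the question: it swaps BeautifulSoup for html.parser.HTMLParser, and the names TextExtractor, html_to_text, BLOCK_TAGS and SKIP_TAGS are hypothetical, chosen for illustration. It assumes that a line break belongs at every <br> and at the edges of a small, non-exhaustive set of block-level tags. It does not evaluate CSS, so elements hidden by a stylesheet (such as the "back" link in the sample, which carries a d-none class) would still appear in the output. Separately, the attempt quoted in the question references get_text without parentheses, which yields the method object rather than a string; it needs to be called as get_text().

```python
from html.parser import HTMLParser

# Assumption: a small, non-exhaustive set of tags that browsers render on their own line.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "ul", "ol", "pre", "table", "tr"}
# Tags whose contents are never shown as page text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text nodes, adding newlines at <br> and block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &gt;, &amp;, etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string, one rendered line per output line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        '<div><h4>Como formatar</h4>'
        '<p><span class="dingus">►</span> create code fences</p>'
        '<div>```<br>\nlike so<br>\n```</div></div>'
    )
    print(html_to_text(sample))
```

To use this on a live page, the HTML string would first have to be fetched, for example with requests as mentioned in the comments, and then passed to html_to_text. Note that collapsing whitespace also removes code indentation, which matches the expected output shown in the question.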