Pandas error 302; read_html()


I need to import an HTML table using pandas, but when I try to do so it returns an error.

import pandas as pd

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/megasena/!ut/p/a1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOLNDH0MPAzcDbwMPI0sDBxNXAOMwrzCjA0sjIEKIoEKnN0dPUzMfQwMDEwsjAw8XZw8XMwtfQ0MPM2I02-AAzgaENIfrh-FqsQ9wNnUwNHfxcnSwBgIDUyhCvA5EawAjxsKckMjDDI9FQE-F4ca/dl5/d5/L2dBISEvZ0FBIS9nQSEh/pw/Z7_HGK818G0K8DBC0QPVN93KQ10G1/res/id=historicoHTML/c=cacheLevelPage/=/'

tabela_megasena = pd.read_html(url)

Below is information regarding the error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-20-d9298fc7ed7e> in <module>()
----> 1 tabela_megasena = pd.read_html(url)

32 frames
/usr/lib/python3.7/urllib/request.py in http_error_302(self, req, fp, code, msg, headers)
    743                 len(visited) >= self.max_redirections):
    744                 raise HTTPError(req.full_url, code,
--> 745                                 self.inf_msg + msg, headers, fp)
    746         else:
    747             visited = new.redirect_dict = req.redirect_dict = {}

HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found

Could you shed some light on why this infinite loop is happening and, if possible, how to get around it?

  • Go straight to the question and don’t put compliments or thanks in the posts. See: https://answall.com/help/behavior

1 answer


The problem is caused by multiple redirects (a 30x error).

Looking at the HTML, there are many rowspan and colspan attributes, and these will mess up the dataframe.
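
Just to illustrate the redirect part, a minimal sketch (assuming the page still answers at the url from the question): requests follows the 302 chain that urllib refuses to, so the download itself succeeds; what remains is the table layout, which the steps below deal with.

import requests

# url = the same "historicoHTML" address used in the question
resposta = requests.get(url)

print(resposta.status_code)                       # 200 once the redirects are resolved
print([r.status_code for r in resposta.history])  # the 30x hops requests followed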

The solution presented uses:

Installing libraries

pip install requests
pip install beautifulsoup4
pip install rows
pip install "rows[html]"
pip install pandas

Note: you may also need a complementary parser such as html5lib or another.

Loading libraries

import pandas as pd
import re
import requests
import rows

from bs4 import BeautifulSoup
from io import BytesIO

Loading the page

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/megasena/!ut/p/a1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOLNDH0MPAzcDbwMPI0sDBxNXAOMwrzCjA0sjIEKIoEKnN0dPUzMfQwMDEwsjAw8XZw8XMwtfQ0MPM2I02-AAzgaENIfrh-FqsQ9wNnUwNHfxcnSwBgIDUyhCvA5EawAjxsKckMjDDI9FQE-F4ca/dl5/d5/L2dBISEvZ0FBIS9nQSEh/pw/Z7_HGK818G0K8DBC0QPVN93KQ10G1/res/id=historicoHTML/c=cacheLevelPage/=/'

response = requests.get(url)
html = response.content

Using BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

tabela = soup.find("table")

# remove nested tables inside the main table
for tag in tabela.find_all('table'):
    _ = tag.replace_with('')

# find the rows of the remaining table
soup_tr = tabela.find_all("tr")

HACK

lista_tr = list(soup_tr)
lista_tr[0] = lista_tr[1]   # overwrite the empty first <tr> with the first data row (see the notes below)

Note 1: the first item in the list is an empty row (<tr>). Note 2: the first data row is copied into position 0 because rows ignores the first item (I don't know why).

Turning the list into a string and stripping HTML comments

s = "".join([str(l) for l in lista_tr])
s = "<table>" + s + "</table>"
s = re.sub("(<!--.*?-->)", "", s, flags=re.DOTALL)

Using the rows library

table = rows.import_from_html(BytesIO(bytes(s, encoding='utf-8')))

Putting it into a dataframe

df = pd.DataFrame.from_records(table.__dict__["_rows"])  # _rows holds the parsed records inside the rows Table

The dataframe will be:

>>> df.head()

   0                   1           2     3     4     5     6   ...    15    16    17    18   19   20 21
0   2  Belo Horizonte, MG  18/03/1996   9.0  37.0  39.0  41.0  ...  None  0,00  0,00  0,00  NAO  SIM
1   3        Brasília, DF  25/03/1996  10.0  11.0  29.0  30.0  ...  None  0,00  0,00  0,00  NAO  SIM
2   4     Santo André, SP  01/04/1996   1.0   5.0   6.0  27.0  ...  None  0,00  0,00  0,00  SIM  SIM
3   5        Brasília, DF  08/04/1996   1.0   2.0   6.0  16.0  ...  None  0,00  0,00  0,00  SIM  SIM
4   6        Brasília, DF  15/04/1996   7.0  13.0  19.0  22.0  ...  None  0,00  0,00  0,00  SIM  SIM

[5 rows x 22 columns]

Note 3: TODO: apply the column names.
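
One possible way to close that TODO, as a sketch only: it assumes the header captions parsed by rows are still reachable through table.fields; if not, placeholder labels are assigned to be renamed by hand.

# Hypothetical: reuse the field names rows parsed from the header row;
# fall back to placeholder labels if they are unavailable or mismatched.
try:
    df.columns = list(table.fields.keys())
except (AttributeError, ValueError):
    df.columns = [f"col_{i}" for i in range(df.shape[1])]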

  • Thank you very much for your cooperation, Paulo. At first I hit a problem with "OverflowError: Python int too large to convert to C long", which I solved, but now there is another one: "AttributeError: 'HTMLParser' object has no attribute 'unescape'". If I manage to solve it I will bring the result back here.

  • Is it on this line: soup = BeautifulSoup(html, 'lxml')?

  • import rows.plugins as plugins File "C:\Users\atendimentopcp300_01\Desktop\Antony Blue Challenge\venv\lib\site-packages\rows\plugins\__init__.py", line 24, in <module> from . import plugin_html as html File "C:\Users\atendimentopcp300_01\Desktop\Antony Blue Challenge\venv\lib\site-packages\rows\plugins\plugin_html.py", line 43, in <module> unescape = HTMLParser().unescape AttributeError: 'HTMLParser' object has no attribute 'unescape' Process finished with exit code 1

  • Did you install "rows[html]"?
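
If that AttributeError comes from Python 3.9+, where HTMLParser.unescape() was removed, one hypothetical workaround is to restore the old name before importing rows (upgrading rows to a version that no longer relies on it is the cleaner fix):

import html
import html.parser

# Hypothetical shim: re-expose the removed method so that
# "unescape = HTMLParser().unescape" inside rows' plugin_html keeps working.
if not hasattr(html.parser.HTMLParser, "unescape"):
    html.parser.HTMLParser.unescape = lambda self, s: html.unescape(s)

import rows  # import rows only after the shim is in place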
