Python Encoding Problem

I’m trying to extract the hashtags used on some Instagram profiles with the following code:

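   import re

   import requests

   # Download the raw HTML of the profile page
   req = requests.get("https://www.instagram.com/globorural/")
   texto = req.text

   # A hashtag is '#' followed by word characters; the class also allows
   # a literal backslash, which is why escapes such as \u00f4 are captured
   pattern = re.compile(r'#[\w\\]*')
   lista_hashtags = pattern.findall(texto)

   print(lista_hashtags)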


['#ffffff',
 '#262626',
 '#globorural',
 '#c7c7c7',
 '#globorural',
 '#agronegocio',
 '#corteva',
 '#publi',
 '#parceriapaga',
 '#T\\u00f4naGR',
 '#Agro',
 '#Campo',
 '#Rural',
 '#GloboRural',
 '#Cannabis',
 '#RevistaGloboRural',
 '#Edi\\u00e7\\u00e3oDeNovembro',
 '#T\\u00f4naGR',
 '#Cenoura',
 '#Horta',
 '#Mossor\\u00f3',
 '#EngenhariaAgr\\u00f4noma',
 '#MulheresNoAgro',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#Ovelhas',
 '#Caprinos',
 '#MulheresNoAgro',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#Braqui\\u00e1ria',
 '#Plantio',
 '#Almenara',
 '#MinasGerais',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#receitadocampo',
 '#minhareceitanaGR',
 '#T\\u00f4naGR',
 '#Cavalos',
 '#Equinos',
 '#PasseioACavalo',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#Agronomia',
 '#MulheresDoAgro',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#Brasil',
 '#China',
 '#Bolsonaro',
 '#XiJiping',
 '#Agro',
 '#Campo',
 '#GloboRural',
 '#T\\u00f4naGR',
 '#Bovinocultura',
 '#Pecu\\u00e1ria',
 '#Leite',
 '#MedicinaVeterin\\u00e1ria',
 '#BomDespacho',
 '#MinasGerais',
 '#GloboRural',
 '#VidaNoCampo',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#receitasdocampo',
 '#minhareceitanaGR',
 '#T\\u00f4naGR',
 '#VidaNoCampo',
 '#GloboRural',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#T\\u00f4naGR',
 '#M\\u00e1quinasAgr\\u00edcolas',
 '#M\\u00e1quinasPesadas',
 '#VidaNoCampo',
 '#GloboRural',
 '#Rural',
 '#Agro',
 '#T\\u00f4naGR']

But I’m having trouble converting some characters, like ç (\u00e7) or ô (\u00f4). What is the best way to solve this: using some special type of encoding, or using replace on the values that were not converted correctly?
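
For reference, the replace route could look roughly like the sketch below. It assumes every escape has the six-character \uXXXX form seen in the output above; `decode_escapes` is a hypothetical helper name, not something from a library, and characters outside the Basic Multilingual Plane (emoji written as surrogate pairs) are not handled.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits (e.g. \u00f4)
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_escapes(tag):
    # Hypothetical helper: swap each escape for the character whose code point it names
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), tag)


# A few values copied from the output above
lista_hashtags = ['#T\\u00f4naGR', '#Edi\\u00e7\\u00e3oDeNovembro', '#Agro']

print([decode_escapes(tag) for tag in lista_hashtags])
# ['#TônaGR', '#EdiçãoDeNovembro', '#Agro']
```

Tags without any escape pass through unchanged, so the helper can be applied to the whole list.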

  • Welcome to Stack Overflow in English. Please click edit and translate the question.

  • I think the double backslash is the problem. '\\u00f4' is the six-character string \u00f4, while '\u00f4' is the accented letter "ô".
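
The difference described in that comment can be checked directly. The snippet below is only a sketch: it uses one value from the question's output, and the `unicode_escape` step assumes the escaped string is pure ASCII, which holds for the tags shown but not necessarily for arbitrary page text.

```python
escaped = '#T\\u00f4naGR'  # backslash, u, 0, 0, f, 4: six separate characters
real = '#T\u00f4naGR'      # Python resolves the escape to the single letter ô

print(len(escaped), len(real))  # 12 7
print(escaped == real)          # False

# Because the escaped form is ASCII-only, it can be turned into bytes and
# decoded with the standard unicode_escape codec to recover the real text
fixed = escaped.encode('ascii').decode('unicode_escape')
print(fixed)          # #TônaGR
print(fixed == real)  # True
```

If the text already contains real accented characters, `encode('ascii')` raises `UnicodeEncodeError`, so this route only suits strings that are fully escaped.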
