I’m trying to pull the hashtags used in some Instagram profiles using the code:
import pandas as pd
import requests
import re
req = requests.get("https://www.instagram.com/globorural/")
texto = req.text
pattern = re.compile(r'#[\w\\]*')
lista_hashtags = pattern.findall(texto)
print(lista_hashtags)
['#ffffff',
'#262626',
'#globorural',
'#c7c7c7',
'#globorural',
'#agronegocio',
'#corteva',
'#publi',
'#parceriapaga',
'#T\\u00f4naGR',
'#Agro',
'#Campo',
'#Rural',
'#GloboRural',
'#Cannabis',
'#RevistaGloboRural',
'#Edi\\u00e7\\u00e3oDeNovembro',
'#T\\u00f4naGR',
'#Cenoura',
'#Horta',
'#Mossor\\u00f3',
'#EngenhariaAgr\\u00f4noma',
'#MulheresNoAgro',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#Ovelhas',
'#Caprinos',
'#MulheresNoAgro',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#Braqui\\u00e1ria',
'#Plantio',
'#Almenara',
'#MinasGerais',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#receitadocampo',
'#minhareceitanaGR',
'#T\\u00f4naGR',
'#Cavalos',
'#Equinos',
'#PasseioACavalo',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#Agronomia',
'#MulheresDoAgro',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#Brasil',
'#China',
'#Bolsonaro',
'#XiJiping',
'#Agro',
'#Campo',
'#GloboRural',
'#T\\u00f4naGR',
'#Bovinocultura',
'#Pecu\\u00e1ria',
'#Leite',
'#MedicinaVeterin\\u00e1ria',
'#BomDespacho',
'#MinasGerais',
'#GloboRural',
'#VidaNoCampo',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#receitasdocampo',
'#minhareceitanaGR',
'#T\\u00f4naGR',
'#VidaNoCampo',
'#GloboRural',
'#Rural',
'#Agro',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#T\\u00f4naGR',
'#M\\u00e1quinasAgr\\u00edcolas',
'#M\\u00e1quinasPesadas',
'#VidaNoCampo',
'#GloboRural',
'#Rural',
'#Agro',
'#T\\u00f4naGR']
But I’m having trouble converting some characters, like ç (\u00e7) or ô (\u00f4). What is the best way to solve this problem: using some special type of encoding, or using replace to fix the values that were not converted correctly?
Welcome to Stack Overflow in English. Please click edit and translate the question.
– Luiz Augusto
I think the double backslash is the problem. '\\u00f4' is the string \u00f4, while '\u00f4' is the accented letter "ô".
– epx
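A minimal sketch of one way to handle this, assuming the page source contains JSON-style escapes (a literal backslash followed by `u` and four hex digits): decode those escapes first, then match hashtags with a pattern that no longer allows backslashes. The names `decode_escapes` and `extract_hashtags` and the `sample` string are hypothetical and only for illustration; with the code from the question, `req.text` would be passed in place of `sample`.

```python
import re

# Matches a JSON-style escape: a literal backslash, "u", then four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it stands for."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def extract_hashtags(text):
    """Decode the escapes first, then match hashtags without allowing backslashes."""
    return re.findall(r'#\w+', decode_escapes(text))

# Hypothetical sample that mimics the escaped text seen in the page source.
sample = r'"text": "#T\u00f4naGR #Edi\u00e7\u00e3oDeNovembro\n#Mossor\u00f3"'
hashtags = extract_hashtags(sample)
print(hashtags)
```

This sketch still matches CSS color codes such as #ffffff, since they also fit `#\w+`, and it does not combine emoji that are escaped as surrogate pairs; both would need extra handling.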