How to remove blanks

I have a list of values and need to remove the blank spaces from them. I am using replace, but it only removes the space at the end of the string, not the one at the beginning, right after the minus sign. Is that space not a regular space character?

import time
import pandas as pd
import lxml
import html5lib
from bs4 import BeautifulSoup
from pandas import DataFrame
import numpy as np
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = "https://www.sunoresearch.com.br/acoes/itsa4/"

option = Options()
option.headless = True
driver = webdriver.Firefox(options=option)  # pass the options, otherwise headless is ignored

driver.get(url)
time.sleep(10)

driver.find_element_by_xpath('//*[@id="demonstratives"]/div[2]/div[1]/div[1]/ng-select/div').click()
driver.find_element_by_xpath('//*[@id="demonstratives"]/div[2]/div[1]/div[1]/ng-select/div/ul/li[2]').click()
driver.find_element_by_xpath('//*[@id="demonstratives"]/div[2]/div[1]/div[2]/ng-select/div').click()
driver.find_element_by_xpath('//*[@id="demonstratives"]/div[2]/div[1]/div[2]/ng-select/div/ul/li[4]').click()
driver.find_element_by_xpath('//*[@id="demonstratives"]/div[2]/div[2]/div/button[2]').click()
element = driver.find_element_by_xpath('//*[@id="demonstratives"]/div[3]/div[2]')
html_content = element.get_attribute('outerHTML')

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find(name='table')

df_full = pd.read_html(str(table))[0]
pd.set_option('display.max_columns', None)
df_full= df_full.T.shift(-2,axis=0).T
dm=df_full[['Descrição','1T2020', '4T2019','3T2019','2T2019','1T2019']]
df=pd.DataFrame(dm)
df.loc[:,'1T2020']= df['1T2020'].apply(lambda x: str(x).replace(".",""))
df.loc[:,'1T2020']= df['1T2020'].apply(lambda x: str(x).replace(" ",""))
df.loc[:,'1T2020']= df['1T2020'].apply(lambda x: str(x).replace(",","."))
df.loc[:,'1T2020']= df['1T2020'].apply(lambda x: str(x).replace("M",""))

print(df)

driver.quit()

Shouldn't the space after the minus sign have been removed as well? How can I remove it?

  • What does the printed output look like without the replace calls?

  • It would be, in row 0: |Net Revenue 1,162.0 M| and in row 1: |Costs - 773.0 M|

1 answer


If you open the site https://www.sunoresearch.com.br/acoes/itsa4 and inspect some of these values in the browser's developer tools, you will see that the page uses &nbsp;, the HTML entity for the no-break space (which is not the same character as the regular space).

Basically, Unicode defines several different "whitespace" characters, and when you write ' ' you are referring to only one of them (and, as you may have guessed, it is not the no-break space).
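To make the difference concrete, here is a small sketch. The sample string is an assumption modelled on the values shown on the page, with the no-break space written as an escape so it is visible in the source:

```python
# Sample value (assumed): NO-BREAK SPACE after the minus, regular space before "M"
sample = "-\u00a0773,0 M"

# Removing only the regular space leaves the no-break space in place
only_space_removed = sample.replace(" ", "")

print(repr(only_space_removed))  # '-\xa0773,0M'
print("\u00a0" == " ")           # False: they are different characters
print("\u00a0".isspace())        # True: but both count as whitespace
```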

Anyway, a way to remove the no-break space would be:

df.loc[:,'1T2020']= df['1T2020'].apply(lambda x: str(x).replace('\u00a0', ''))

The \uxxxx notation takes the code point value in hexadecimal; here I used \u00a0, which corresponds to the no-break space (a code point is the number Unicode assigns to each character).
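As a side note, Python accepts a few other spellings for the same character. The sketch below (the variable name is only illustrative) checks that they are all equivalent:

```python
# Four spellings of the same character, U+00A0 NO-BREAK SPACE
nbsp_variants = ['\u00a0', '\xa0', '\N{NO-BREAK SPACE}', chr(0xA0)]

# All of them compare equal and share the same code point
print(all(v == '\u00a0' for v in nbsp_variants))  # True
print(hex(ord(nbsp_variants[0])))                 # 0xa0
```

Any of these can be passed to replace; \u00a0 and \N{NO-BREAK SPACE} tend to be the most readable.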


You could also reduce all these replace calls to a single one by using the re module:

import re
def limpar_campo(s):
    return re.sub(r'[\u00a0 .M]', '', s).replace(',', '.')

df.loc[:,'1T2020']= df['1T2020'].apply(limpar_campo)

So, when calling sub, I replace the no-break space, the space, the dot, and the M with "nothing" (which is the same as removing them), and then I replace the comma with a dot.

Or, if you want to be more "generic" and remove any whitespace character (including the regular space and the no-break space), you can use \s:

def limpar_campo(s):
    return re.sub(r'[\s.M]', '', s).replace(',', '.')

Keep in mind that the \s shortcut also matches characters such as TAB and line breaks, among several others.
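If the end goal is a numeric column, the same cleanup can feed directly into float. This is only a sketch: the helper name para_float is hypothetical (it is not part of the code above), and it assumes every value follows the "1.162,0 M" pattern seen on the page:

```python
import re

def para_float(s):
    # Remove any whitespace (including NO-BREAK SPACE), the thousands dots
    # and the "M" suffix, then turn the decimal comma into a dot
    limpo = re.sub(r'[\s.M]', '', str(s)).replace(',', '.')
    return float(limpo)

print(para_float('-\u00a03.718,0 M'))  # -3718.0
print(para_float('1.162,0 M'))         # 1162.0
```

It could then be applied the same way as limpar_campo, for example with df['1T2020'].apply(para_float).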


If you want to see the code points of a string and the name of each character, you can use ord and the unicodedata module:

from unicodedata import name

def mostrar_chars(s):
    for c in s:  # print the code point in hexadecimal and the character name
        print(f'{ord(c):04X} {name(c)}')

# \u00a0 is written as an escape so the no-break space after the minus sign is visible
mostrar_chars('-\u00a03.718,0 M')

In the example above I used one of the strings returned by the site. The result was:

002D HYPHEN-MINUS
00A0 NO-BREAK SPACE
0033 DIGIT THREE
002E FULL STOP
0037 DIGIT SEVEN
0031 DIGIT ONE
0038 DIGIT EIGHT
002C COMMA
0030 DIGIT ZERO
0020 SPACE
004D LATIN CAPITAL LETTER M

Note that the second character is the no-break space and the second-to-last is the "traditional" space (which is why only the latter was removed by replace(" ", "")).
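For longer strings, reading the full character dump can get tedious. A possible variation, shown here only as a sketch with a hypothetical helper name, lists just the whitespace characters that are not the regular space:

```python
from unicodedata import name

def espacos_incomuns(s):
    # Return (index, code point, name) for every whitespace character other than U+0020.
    # The default 'UNKNOWN' avoids a ValueError for unnamed characters such as TAB.
    return [(i, f'U+{ord(c):04X}', name(c, 'UNKNOWN'))
            for i, c in enumerate(s)
            if c.isspace() and c != ' ']

print(espacos_incomuns('-\u00a03.718,0 M'))
# [(1, 'U+00A0', 'NO-BREAK SPACE')]
```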

  • I think if you use "\s" in the regex, it already matches the NBSP too.

  • @jsbueno Yes, I edited the reply to make this clearer. Thank you! :-)
