Import csv with pandas. Column values with semicolon. What to do?

Asked

Viewed 2,351 times

-2

I’m importing CSV files with pandas to do a data wipe. And I’m finding a problem. Some values have ';' .

For example: df[0][1] = "STREET A ; "

I import csv through: df = pd.read_csv('GUAXUPE.csv', header = None, Sep = ',', encoding = 'utf-8')

And despite having placed the separator as the comma, the pandas recognizes the ' ' as a separator and adds a field where it should not, deregulating part of the table.

I have already searched on the internet, I read the part of the documentation of pandas that speaks of read_csv() and I found no answer. Someone has an idea ?

3 answers

0

In the engine command add the engine parameter, it would look like this:

df = pd.read_csv('GUAXUPE.csv', header = None, sep = ',', encoding = 'utf-8',engine='c')

In the read_csv documentation it says that the Python engine automatically recognizes tabs, the C engine does not. I believe it can function.

follows the doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

  • I tried. It didn’t work. You keep recognizing the semicolon as a separator.

  • I’m trying to use the python CSV module as well. The function that converts csv to a dictionary ends up deleting two columns where these characters appear that are interfering with the table formation. On the one hand, it solves , but on the other hand it doesn’t,,

0

I solved the mystery! The problem was not pandas. But rather the way he was supervising the generated tables. When he opened the file. csv in libre office, the program initially asked to choose which characters would be considered as separator. And automatically, the semicolon was selected. I supervised the tables by the notebook jupyter, and it’s all right !

0

One way around is to join the column with the information to the left of ";" with the column containing the information to the right of ";". Would something like this:

df['dados_esquerda']+=";"+df['dados_direita']

In case you didn’t want to name the columns, I believe this could be done like this:

df.loc[:,1]+=";"+df.loc[:,2]

In the above example, I assumed that the second column contains the data to the left of the ";" and the third column the data to the right of the ";".

Later, you can delete the unwanted column with the del command:

del df[2] #ou del df['dados_direita']
  • But the problem is that it’s a gigantic dataframe. I actually wrote a script to handle several csv files. imported the data of cnpj , cnae , from the website of the IRS ; and I’m treating them. And to do that, which you suggested, I would need to know exactly which rows and which columns, the problem happens.

  • The problem is when the panda imports the CSV. when I pass read_csv() it recognizes the point and comma , even in quotes , as a separator.

  • I got to import using the python csv module , and then turning the dictionary into a dataframe. It got better . But still generates a problem formatting the columns...

  • I had the same problem when I generated a dictionary with all the CNAES and their meanings, and made replacements of it in the CNAE column. The dictionary had spacing with the TAB. And when replacing the column, the pandas read the TAB as a separator. This was easy to solve. I used Vscode with crl down arrow shift and eliminated all tab spacings at once.

  • In fact you only need to know in which columns the problem happens. You could paste a line of csv here so I can better understand how it is?

  • "00317713000183",""1","WANDERLEIA SABBAG DUARTE","PRIVILEGIUS MULTI-MARCAS","04","20190124",""63",""""","""",""2135","19941129","4781400","AVENIDA","CONDE RIBEIRO DO VALLE","1243","LETRA: A;","IRMAOS RICCIARDI","37800000","MG","4573","GUAXUPE"35",""35514071","""""",""[email protected]","50","0.0","01","6","20070701","20121231",""N",""","" "00317713000183","WANDERLEIA SABBAG DUARTE","PRIVILEGIUS MULTI-MARCAS","04","20190124","63",""""",""2135","19941129","4781400","AVENIDA","CONDE RIBEIRO DO VALLE",""1243","LETRA: A;",",""IRMAOS RICCIARDI","37800000","MG","4573"

  • This is one of the lines of one of the csv that gives problem . Note that one of the values is "LETTER:A;"

  • But it is not all values of the column that this happens. In this csv there are 10 lines only that has a point and comma value, .

  • It is a csv for each municipality in Brazil . In total , 26 GB.

  • Apparently, all fields have double quotes at the beginning and end of the value. In this case, do a test using as separator "," instead of the simple comma. If it doesn’t work, try using Sep='["],["]' Another alternative is to import csv using a new line as a separator. There on the giant line that will be created, run a script to remove ";". Save CSV again and then reimport.

  • I tried, and it didn’t work. The pandas insists on recognizing the semicolon as a separator, even receiving the comma or any other element as a separator. It also recognizes TAB spacing as a separator. I had a problem with it in the past. But it was easy to solve. It was a dictionary with all CNAES codes and their meaning. I cut all spacing at once.

Show 6 more comments

Browser other questions tagged

You are not signed in. Login or sign up in order to post.