All data from one row of the dataframe ends up in the first column when using Python pandas

I'm reading a CSV file into a pandas dataframe, and on line 10 all the data ends up in the first column, like this:

Resulting dataframe

How can I solve this and split the row correctly? Only the first number should be in the ID column.

The line in the csv file is like this:

9009101002,"Apple iPhone XS Smartphone 256GB 4G Screen 5.8"""Front 12MP camera 7MP iOS 12 Golden", 32 934,102,401

It was generated automatically by the system.

CSV file

To read the file, I use the following line of code:

df = pd.read_csv("prods_tab.csv", encoding='latin-1', sep=',')

Note: When opening the file in notepad, it is presented as follows:

ID. Forn.,Prod. DESC.,SKU.,GRP. MERC. 3,COD. MARC.
,,,,
302100012,GELADEIRA FROST FREE INVERTER IB53X ELECTROLUX 454 LITROS INOX,100312-,996,302
302100012,GELADEIRA FROST FREE DB84 ELECTROLUX 598 LITROS BRANCO,89 721 ?,,
302100012,Frigobar Electrolux RE80 79 Litros Classe A 110V Branco,?1920-- 63,996,302
,,,,
ID. Forn.,Prod. DESC.,SKU.,GRP. MERC. 3,COD. MARC.
302100012,Geladeira Electrolux SS72X Side by Side Frost Free 504 Litros 2 Portas Classe A 127V Inox,18228 5,996,302
,,,,
ID. Forn.,Prod. DESC.,SKU.,GRP. MERC. 3,COD. MARC.
,,,,
"9009101002,""Smartphone Apple iPhone XS 256GB 4G Tela 5,8"""""""" Câmera 12MP Frontal 7MP iOS 12 Dourado"",   32 934,102,401"
9030121093,SMARTPHONE SAMSUNG GALAXY NOTE 8 N950F 64GB 2CHIPS PRETO,4??349 5,102,607
320621093,BONECA MULTIKIDS BUSH BABY WORLD SHIMMIES BR106,4342I,766,481
320621093,Brinquedo Kit de Voley Disney Princesas Líder 759 ,3 1---24-,766,481
9030121093,SMARTPHONE SAMSUNG GALAXY A8+ A730 64GB 2CHIPS DOURADO,  1 92501 ,,607
,,,,

Note 2: I have already handled the other lines; only this one is left.
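
A quick way to check what the parser actually sees on that row is to feed it to Python's standard csv module, which uses the same default quote character and doubled-quote escaping as the read_csv call above. This is only a minimal sketch: the string is row 10 copied from the Notepad view, and the variable names are illustrative.

```python
import csv

# Row 10 exactly as it appears in the Notepad view of the file.
raw = r'"9009101002,""Smartphone Apple iPhone XS 256GB 4G Tela 5,8"""""""" Câmera 12MP Frontal 7MP iOS 12 Dourado"",   32 934,102,401"'

# First pass: the outer quotes make the parser treat the whole line as one field.
outer = next(csv.reader([raw]))
print(len(outer))  # 1

# Second pass: parsing that single field again recovers the five columns.
inner = next(csv.reader([outer[0]]))
print(len(inner), inner[0])  # 5 9009101002
```

If the first count is 1, the row was exported wrapped in an extra pair of quotes, which would explain why everything lands in the first column.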

  • There is a comma there, isn't there? On line 10, was it added by you or generated by the system?

  • How is your CSV structured (especially line 10)? Please edit the question and add this information.

  • Reinforcing what @Alexciuffa said, add the top 10-12 lines of your csv to the question.

  • @Rafaelrotiroti it was generated by the system.

  • @Alexciuffa edited, added the information.

  • @Sidon edited.

  • Can you add the line where you call read_csv()? It's easier to help when we can see the parameters you pass.

  • @Rafael done. :)

  • With the data you posted I could not reproduce the error here. I built a prods_tab.csv file with the header and the line you posted, called df = pd.read_csv("prods_tab.csv", encoding='latin-1', sep=',') and it worked normally. Try sharing the file, or a slice of the data, so the error can be reproduced. There must be something else in that CSV that is slipping past you.

  • I ran some tests here too, and the problem really is the CSV. There is no consistent separator: sometimes it is a TAB, sometimes a comma. Ideally only one should be used. I converted everything to TABs and it worked. Is this CSV private, or is it just a test dataset?

  • @Rafael it is part of a test I'm taking for a trainee position at a company. Could you explain in more detail how to do the conversion to TABs?

  • I simply replaced the separators with TABs, but if your dataset is large and has many different separators, that isn't worth the effort. Also, is the first image, with the data in columns, the actual dataset or just how the system displays it? I ask because it looks different in an Excel spreadsheet.

  • @Rafael the dataset is small. I removed the first image and replaced it with the dataframe as it actually appears. That image was confusing, but it was a copy of the dataframe shown in the Jupyter notebook. Anyway, I don't know how to fix this and I need to hand it in :(

  • So you're using Excel. Go to the folder where the file is, right-click it and choose Edit. It will open in Notepad, and you'll see what the data really looks like. Copy the same snippet as in your example above and add it to the question.

  • @Rafael I did that and noticed that the problem line is wrapped in quotes... Could that have something to do with it?

  • Yes. Whenever you are going to solve a data analysis problem, open the files in Notepad first. I have just posted a fuller answer.

1 answer

Pandas is a great tool, but sometimes the data is not quite "normalized", and that was your case. I didn't understand why those quotes were there, but they were one of the problems. Your header also appeared three times in the file; try to keep it only once, always at the very beginning. By the way, you can quote each column name in the header, but always take care with the separators: even a stray white space can be enough to cause trouble. Here is the read_csv call I used:

# error_bad_lines=False was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) is its replacement
df = pd.read_csv('dados.csv', engine='python', on_bad_lines='skip', sep=',')

And the dataset:

"ID. Forn.","Prod. DESC.","SKU.","GRP. MERC. 3","COD. MARC."
302100012,GELADEIRA FROST FREE INVERTER IB53X ELECTROLUX 454 LITROS INOX,100312-,996,302
302100012,GELADEIRA FROST FREE DB84 ELECTROLUX 598 LITROS BRANCO,89 721 ?,,
302100012,Frigobar Electrolux RE80 79 Litros Classe A 110V Branco,?1920-- 63,996,302
302100012,Geladeira Electrolux SS72X Side by Side Frost Free 504 Litros 2 Portas Classe A 127V Inox,18228 5,996,302
9009101002, smartphone Apple iPhone XS 256GB 4G Tela 5,8 Câmera 12MP Frontal 7MP iOS 12 Dourado,   32 934,102,401
9030121093,SMARTPHONE SAMSUNG GALAXY NOTE 8 N950F 64GB 2CHIPS PRETO,4??349 5,102,607
320621093,BONECA MULTIKIDS BUSH BABY WORLD SHIMMIES BR106,4342I,766,481
320621093,Brinquedo Kit de Voley Disney Princesas Líder 759 ,3 1---24-,766,481
9030121093,SMARTPHONE SAMSUNG GALAXY A8+ A730 64GB 2CHIPS DOURADO,  1 92501 ,,607
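
If editing the file by hand is not an option, the same cleanup can probably be done in code. The sketch below uses only the standard csv module and assumes the three problems seen in the Notepad view: blank `,,,,` rows, the header repeated mid-file, and a row wrapped in an extra pair of quotes. `HEADER`, `clean_rows` and `sample` are hypothetical names made up for this example. One caveat about the hand-edited dataset above: the iPhone row still has an unquoted comma in "5,8", so it has six fields, and a skip-bad-lines setting would most likely drop it rather than load it.

```python
import csv
import io

# Column names as they appear in the original file.
HEADER = ["ID. Forn.", "Prod. DESC.", "SKU.", "GRP. MERC. 3", "COD. MARC."]


def clean_rows(lines):
    """Yield data rows only: unwrap fully quoted lines, skip blanks and repeated headers."""
    for row in csv.reader(lines):
        if len(row) == 1:
            # The whole line was wrapped in quotes: parse the single field again.
            row = next(csv.reader([row[0]]))
        if not any(cell.strip() for cell in row):
            continue  # rows such as ",,,,"
        if row == HEADER:
            continue  # header repeated in the middle of the file
        yield row


# A few lines taken from the Notepad view in the question.
sample = [
    "ID. Forn.,Prod. DESC.,SKU.,GRP. MERC. 3,COD. MARC.",
    ",,,,",
    "302100012,Frigobar Electrolux RE80 79 Litros Classe A 110V Branco,?1920-- 63,996,302",
    "ID. Forn.,Prod. DESC.,SKU.,GRP. MERC. 3,COD. MARC.",
    r'"9009101002,""Smartphone Apple iPhone XS 256GB 4G Tela 5,8"""""""" Câmera 12MP Frontal 7MP iOS 12 Dourado"",   32 934,102,401"',
]

rows = list(clean_rows(sample))

# Write the header once, followed by the cleaned rows, into an in-memory CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(HEADER)
writer.writerows(rows)
cleaned = buf.getvalue()
print(cleaned)
```

For the real file, the lines would come from `open("prods_tab.csv", encoding="latin-1", newline="")` instead of `sample`, and `cleaned` can then be handed to pandas with `pd.read_csv(io.StringIO(cleaned))`. Because the writer re-quotes the description that contains a comma, the iPhone row should be kept with the ID alone in the first column.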
  • Rafael, you were right... Those quotes were the problem. I received the dataset as-is and hadn't noticed them. Thank you very much!!
