How to manipulate CSV data in Python?

I need to manipulate CSV data without using pandas or numpy. The file has specific columns I need to work with, so what is the best way to go through those columns, read the data in each one, and work with it?

Example:

My csv file has columns:

A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,X,Z

Python reads the file in rows, but I only need to work with columns A (full name), B (age), C (city), K (post) and S (salary). With some columns I will have to do calculations, such as finding who is older or how many people belong to the same city; others I only need to display.

While searching, I found a suggestion to create a new file that excludes the columns I don't need, keeping only the ones that will be used. But it is generating an error.

    import csv

    with open('data.csv') as stream, \
         open('resultado.csv', 'w') as output:

        reader = csv.DictReader(stream, delimiter=',')
        writer = csv.DictWriter(output, delimiter=',',
                                fieldnames=['nome', 'Idade', 'Cidade', 'Cargo', 'Salario'])

        writer.writeheader()

        for row in reader:
            del row[D,E,F,G,H,I,J,L,M,N,O,P,Q,R,T,U,V,X,Z]
            writer.writerow(row)

Can you help me with this error? And what is the best way to go through the file so I can isolate the columns and work with them?

Thank you!

1 answer

CSV files are always read line by line. However, unless the file is really large, all of the data fits in memory (and if it doesn't fit, a specialized system is required; even the pandas library depends on putting all the data in memory).

In particular, I don't see how creating a new file would help in this case: in the example given, you would only be renaming the columns, but you would still have a column for "name"; it would just have the title "name" instead of "A". So let's focus on having the columns in memory, and you can work from there (reading a file like this takes negligible time, so it's fine to re-read the data every time you run the program).

As a rule, in modern Python, isolating a table into columns that will be treated separately is most practical with pandas itself, which has this practically ready. Since you explicitly mentioned that you do not want to use pandas or numpy, the simple way is to read all the data with the csv module and then, with a "list of lists" in hand, where each element is a row, transpose the data. A practical way of transposing is with the zip function, but let's go step by step, so it's understandable.
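As a side note, the one-line transposition with zip mentioned above looks like this (using a small invented table, not data from the question):

```python
# Transposing a "list of lists" in one line with zip:
# the * unpacks each row as a separate argument, and zip
# then groups the values by position, i.e. by column.
tabela = [[1, 2, 3], [4, 5, 6]]
colunas = list(zip(*tabela))
print(colunas)  # [(1, 4), (2, 5), (3, 6)]
```

The drawback is that this gives plain tuples with no column names, which is why the code below builds a dictionary instead.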

Instead of using zip to simply transpose the data, which can be done in one line, I will write a few lines of code that will:

  1. Create a dictionary that will be your final data structure. Each key in the dictionary will be a column title, and each value will be a list of the data in that column. For this, the code will use the first line of the CSV file.
  2. Iterate over the data rows with a for, and then use the zip function to match each value in a row with the corresponding column list in the created dictionary.

The zip function does exactly this: given two or more iterable objects, it yields one element from each of them on every iteration. Since a for in Python allows more than one loop variable, this works very well: in practice, a for using zip can simultaneously traverse the lists in the data structure we created and the values of the current row. We append each value to its column's list and move on to the next column. At the end of the row, the outer for repeats, picking up the same lists in the data dictionary, but now with the values of the next row:

Before going straight to the CSV file, to make this more didactic, here is an example of the idea in interactive mode:

In [31]: tabela = [[1, 2, 3], [4, 5, 6]]                                                                           

In [32]: dados = {"a": [], "b": [], "c": []}                                                                       

In [33]: for linha in tabela: 
    ...:     for coluna_dados, valor in zip(dados.items(), linha): 
    ...:         print(coluna_dados, valor) 
    ...:         coluna_dados[1].append(valor) 
    ...:                                                                                                           
('a', []) 1
('b', []) 2
('c', []) 3
('a', [1]) 4
('b', [2]) 5
('c', [3]) 6

In [34]: print (dados)     
{'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}

And the code to do the same thing, but with the data from the CSV file:

from collections import OrderedDict
import csv

with open('data.csv') as stream:
    reader = csv.reader(stream)

    # the first row has the column titles; each one becomes a key
    data = OrderedDict((column_name, []) for column_name in next(reader))

    # the loop must stay inside the "with" block,
    # otherwise the file is already closed when it runs
    for row in reader:
        for column, value in zip(data.values(), row):
            column.append(value)
At this point in the code, the variable data is the dictionary described above: each column of the original CSV file has a key with its title, and all of its values are in a list.

I used OrderedDict above to ensure that the code works in any version of Python; from Python 3.7 on, plain dictionaries preserve insertion order, so a normal dict can be used instead of OrderedDict in this code. (In older versions, a normal dict would not guarantee the column order.)
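Once the columns are in memory, the calculations mentioned in the question (who is oldest, how many people belong to the same city) become simple list operations. A minimal sketch with invented sample data, assuming names are in column "A", ages in "B" and cities in "C" as in the question:

```python
from collections import Counter

# invented sample data, in the same shape that the loop above produces
data = {
    "A": ["Ana Silva", "Bruno Costa", "Carla Dias"],
    "B": ["34", "28", "41"],
    "C": ["Lisboa", "Porto", "Lisboa"],
}

# CSV values are read as strings, so convert the ages before comparing
ages = [int(a) for a in data["B"]]
oldest = data["A"][ages.index(max(ages))]
print(oldest)  # Carla Dias

# counting how many people belong to each city
print(Counter(data["C"]))  # Counter({'Lisboa': 2, 'Porto': 1})
```

The same pattern works for any column: convert the strings to the type you need, then use ordinary list functions (max, sum, Counter, etc.) on it.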

Pandas

In projects that have no restriction on the use of pandas, pandas' native DataFrame structure already provides access by columns naturally: the DataFrame also works as a mapping, where the title of each column maps to a Series with its data:

import pandas as pd
data = pd.read_csv("meuarquivo.csv")
print(data["A"])