Import csv - 18 million lines in R

2

How can I import a dataset with 20 million rows and 24 variables? Two of the variables are strings, but they are being imported as numeric, which drops their leading zeros. I am using the following command:

base <- read.csv("base.csv", header = TRUE, sep = ";", dec = ".", quote = "", encoding = "UTF-8", stringsAsFactors = FALSE)

2 answers

6


This is a job for readr!

readr reads the file and automatically guesses the type of each variable. Here is an example:

library(readr)
sc <- read_csv2(file = '/media/backup/Microdados/sc2017.csv')
nrow(sc)

This is the RAIS file for the entire state of Santa Catarina. It is in CSV format and is available here.

On top of that, readr is much faster and more efficient than read.csv.

Just so you can see how it looks at the end:

[image: screenshot of the result]
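If the automatic guess still turns the code columns into numbers, the types can be declared explicitly. The sketch below is only an illustration: the column names `id_code` and `zip_code` and the tiny temporary file are hypothetical stand-ins for the real data. It uses read_delim with delim = ";" rather than read_csv2, because read_csv2 assumes a comma as the decimal mark, while the file in the question uses a dot.

```r
library(readr)

# Small stand-in for the real file: semicolon-separated, with two code
# columns whose leading zeros must survive (column names are hypothetical).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id_code;zip_code;value",
             "00123;01001;10",
             "04567;00020;25"), tmp)

# Force the two code columns to character; let readr guess the rest.
base <- read_delim(
  tmp,
  delim = ";",
  col_types = cols(
    id_code  = col_character(),
    zip_code = col_character(),
    .default = col_guess()
  )
)

# The leading zeros are kept because the column was never parsed as a number.
base$id_code
```

With the real file, only the path and the two column names should need to change.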

  • Thank you! This function worked, and at the end it reported which lines had problems. I found a string that started with "" and pulled the entire rest of the line from that point into a single variable. The read_csv2 function is also available through the 'Import Dataset' command in RStudio's Environment pane. Clicking it shows a preview of the first 200 lines and lets you change the type of each variable. Very practical!

  • @Nataliasoares if this answer solved your problem, click the tick to accept it and close the question.

1

Use the colClasses argument of read.csv to declare the class of each column.
Example:

colClasses = c("character", "character", "complex", "factor", "factor", "character", "integer", "integer", "numeric", "character", "character", "Date", "integer", "logical")

My suggestion is to set colClasses = rep("character", 24) and then convert each column to the appropriate class afterwards.
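A minimal sketch of that suggestion, assuming a small four-column stand-in file with hypothetical column names (for the real data the vector would be rep("character", 24)):

```r
# Small stand-in for the real file (column names are hypothetical).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id_code;zip_code;value;count",
             "00123;01001;10.5;3",
             "04567;00020;25.0;7"), tmp)

# Read every column as character so nothing is coerced on import.
base <- read.csv(tmp, header = TRUE, sep = ";", dec = ".", quote = "",
                 colClasses = rep("character", 4),
                 stringsAsFactors = FALSE)

# Convert only the columns that really are numeric;
# the code columns stay as character and keep their leading zeros.
base$value <- as.numeric(base$value)
base$count <- as.integer(base$count)

str(base)
```

Reading everything as character first means no column is coerced on import, so the conversions can be checked one column at a time.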

  • Thank you, Marcio! I have noted this rep("character", 24) function.
