Importing and cleaning several text files in R


I’m working on a set of interviews I got from a database at work, each stored in a separate text file (a plain file with a multi-line transcription).

Each interview will become a variable in a data frame. The metadata for each interview lives in a separate data frame. I have to import the texts into that data frame (one whole interview per row). There are more than 700 text files to import.

The issue is that there are many repeated lines in the original interview files and, given the number of files, importing them one by one and running x <- unique(x) on each is out of the question. I wrote the code below to try to import them:

library(plyr)     # ldply()
library(readr)    # read_file()

files <- list.files(path = "path/to/the/files", pattern = "*.txt", full.names = TRUE)
folder <- "path/to/the/folder"   # (not used below)

clean_text <- function(myfile){
  n <- length(myfile)
  # Strip everything after a newline in all elements except the last
  myfile <- c(stringr::str_remove_all(myfile[-n], "[\n].*"),
              myfile[n])
  myfile <- unique(myfile)                    # drop duplicate lines
  myfile <- paste0(myfile, collapse = "\n")   # collapse back into one string
  myfile
}

texts <- ldply(clean_text(files), read_file)

But instead of getting something like this (the result I’m expecting), all I have is the paths to my files:

View(texts)
                     V1
path/to/my/files/file1 path/to/my/files/file2

I’m open to any solution.

Update 1

Here’s the workflow I haven’t been able to implement:

  1. Read 700 text files from a directory (these are plain text files with lines separated by \n).
  2. Delete the repeated lines.
  3. Import them into R, e.g. as a vector or a data frame column in which each text is one element.

Update 2

I managed to get as far as having the texts in a list, where each element is a character vector:

 str(teste)
List of 3
 $ : chr [1:205] "Hello " ...
 $ : chr [1:581] "hello little buns " ...
 $ : chr [1:849] "- Hello everybody," ...

But I can’t get each of these list elements into a single cell of a data frame. I’ve already searched Google.
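
A minimal sketch of that step, assuming teste is the list shown above and that the purrr package is available (the names here are illustrative):

library(purrr)

# Collapse each character vector into a single string,
# then store each interview in one cell of a one-column data frame
collapsed <- map_chr(teste, paste, collapse = "\n")
interviews_df <- data.frame(text = collapsed, stringsAsFactors = FALSE)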

If anyone could help, I’d be grateful.

  • Welcome to Stack Overflow! Unfortunately, this question cannot be reproduced by anyone trying to answer it. Please take a look at this link to see how to ask a reproducible question in R, so that people who want to help you can do so in the best possible way.

  • Hello, thank you, but I can’t upload the data because it belongs to the university I work at; I can only use it with proper citation in my work. Other than that, there isn’t much more I can add.

  • Then maybe the best option is to hire an R consultant and sign an NDA with them. It will be very hard to find a free solution on the internet with so little information about your problem. After all, we don’t even know the structure of the text files, for example, where the name, age, gender, and interview information for the subjects comes from.

  • I don’t need any of this. I just need to put the 700 texts, without duplicate lines, into a data frame in which each variable (row) is one of the texts. Everything else I can do myself.

  • Plain text with duplicate lines -> data frame cell (no duplicate lines): V1 / text1 without duplicate lines / text2 without duplicate lines / text3 without duplicate lines. Doing them one by one is easy; I can’t find a solution for 700 texts at a time.

  • Hi Marcus, I put an update in the post, in case you can help.

  • For unique lines, if you are on Linux or on a system that has the Unix/Linux awk command, awk '!_[$0]++' infile > outfile is faster than R’s unique().

  • Hi, thank you so much! Can it be run on all the files at once?


2 answers


If I understood the question (which I doubt), maybe the following code can solve the problem. The function below uses the Unix/Linux awk command to remove duplicate lines of text. Cleaned files are written under the same names as the input files, prefixed with pref = "out".

clean_text2 <- function(file, pref = "out"){
  # Write the cleaned file next to the original, with `pref` prepended to its name
  out <- file.path(dirname(file), paste0(pref, basename(file)))
  # awk '!_[$0]++' prints each line only the first time it appears
  args <- c("'!_[$0]++'", file, ">", out)
  system2("awk", args)
}

lapply(files, clean_text2)
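
As a variant, system2() can redirect the output itself through its stdout argument instead of passing ">" as a shell token; a sketch under the same assumptions (a Unix-like system with awk on the PATH):

clean_text2b <- function(file, pref = "out"){
  out <- file.path(dirname(file), paste0(pref, basename(file)))
  # stdout = out sends awk's standard output straight to the cleaned file
  system2("awk", c("'!_[$0]++'", file), stdout = out)
}

lapply(files, clean_text2b)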
  • Thank you so much for the help. I tried running the above code on a directory with 3 files (01.txt, 02.txt and 03.txt), in the following sequence: files_captions <- dir(path="/path/to/Texts", full.names=TRUE); folder_captions <- "/path/to/Texts"; lapply(files_captions, clean_text2). I received the following error: sh: out/path/to/Texts/01.txt: No such file or directory (and the same for 02.txt and 03.txt)?

  • @user12774787 Try it again now, with the modification.

  • Thanks again. The output this time was: > str(teste05) List of 50 $ : int 2 $ : int 2 $ : int 2


Thank you all. Apparently I was able to solve the problem with the following function. I tested it on a sample of 10 files and it worked; I’ll test it on all of them next and report back.

library(readr)    # read_lines()
library(purrr)    # map(), map_chr()

files <- list.files(path = "path/to/your/files", pattern = "*.txt", full.names = TRUE)

import.and.clean <- function(myfiles){
  tmp  <- lapply(myfiles, read_lines)                      # one character vector of lines per file
  tmp2 <- purrr::map(tmp, unique)                          # drop duplicate lines within each file
  tmp3 <- purrr::map_chr(tmp2, paste0, collapse = "\n")    # collapse each file into one string
  return(tmp3)
}

my.vector <- import.and.clean(files)
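
If the final goal is the data frame described in the question, the returned vector can then be stored as a single column, for example:

interviews <- data.frame(text = my.vector, stringsAsFactors = FALSE)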
