Importing and cleaning several text files in R


I’m working on a set of interviews I got from a database at work, each stored in a separate text file (a plain file with a multi-line transcription).

Each interview will become a variable in a data frame. The metadata for each interview lives in a separate data frame. I have to import the texts into that data frame (one whole interview per row). There are more than 700 text files to import.

The issue is that there are many repeated lines in the original interview files and, given the number of files, importing them one by one and running x <- unique(x) on each is out of the question. I wrote the code below to try to import them:

library(plyr)     # ldply()
library(readr)    # read_file()

files <- list.files(path = "path/to/the/files", pattern = "*.txt", full.names = TRUE)
folder <- "path/to/the/folder"   # (not used below)

clean_text <- function(myfile){
  n <- length(myfile)
  # Strip everything after a newline in all elements except the last
  myfile <- c(stringr::str_remove_all(myfile[-n], "[\n].*"),
              myfile[n])
  myfile <- unique(myfile)                    # drop duplicate lines
  myfile <- paste0(myfile, collapse = "\n")   # collapse back into one string
  myfile
}

texts <- ldply(clean_text(files), read_file)

But instead of getting something like this (the result I’m expecting), all I have is the paths to my files:

View(texts)
                     V1
path/to/my/files/file1 path/to/my/files/file2

I’m open to any solution.

Update 1

Here’s the workflow I haven’t been able to implement:

  1. Read 700 text files from a directory (these are plain text files with lines separated by \n).
  2. Delete the repeated lines.
  3. Import them into R, e.g. as a vector or a data frame column in which each text is one element.

Update 2

I managed to get as far as having the texts in a list, where each element is a character vector:

 str(teste)
List of 3
 $ : chr [1:205] "Hello " ...
 $ : chr [1:581] "hello little buns " ...
 $ : chr [1:849] "- Hello everybody," ...

But I can’t get each of these list elements into a single cell of a data frame. I’ve already searched Google.
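
A minimal sketch of that step, assuming teste is the list shown above and that the purrr package is available (the names here are illustrative):

library(purrr)

# Collapse each character vector into a single string,
# then store each interview in one cell of a one-column data frame
collapsed <- map_chr(teste, paste, collapse = "\n")
interviews_df <- data.frame(text = collapsed, stringsAsFactors = FALSE)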

If anyone could help, I’d be grateful.

  • Welcome to Stack Overflow! Unfortunately, this question cannot be reproduced by anyone trying to answer it. Please take a look at this link to see how to ask a reproducible question in R, so that people who want to help you can do so in the best possible way.

  • Hello, thank you, but I can’t upload the data because it belongs to the university I work at; I can only use it with proper citation in my work. Other than that, there isn’t much more I can add.

  • Then maybe the best option is to hire an R consultant and sign an NDA with them. It will be very hard to find a free solution on the internet with so little information about your problem. After all, we don’t even know the structure of the text files, for example, where the name, age, gender, and interview information for the subjects comes from.

  • I don’t need any of this. I just need to put the 700 texts, without duplicate lines, into a data frame in which each variable (row) is one of the texts. Everything else I can do myself.

  • Plain text with duplicate lines -> data frame cell (no duplicate lines): V1 / text1 without duplicate lines / text2 without duplicate lines / text3 without duplicate lines. Doing them one by one is easy; I can’t find a solution for 700 texts at a time.

  • Hi Marcus, I put an update in the post, in case you can help.

  • For unique lines, if you are on Linux or on a system that has the Unix/Linux awk command, awk '!_[$0]++' infile > outfile is faster than R’s unique().

  • Hi, thank you so much! Can it be run on all the files at once?


2 answers


If I understood the question (which I doubt), maybe the following code can solve the problem. The function below uses the Unix/Linux awk command to remove duplicate lines of text. Cleaned files are written under the same names as the input files, prefixed with pref = "out".

clean_text2 <- function(file, pref = "out"){
  # Write the cleaned file next to the original, with `pref` prepended to its name
  out <- file.path(dirname(file), paste0(pref, basename(file)))
  # awk '!_[$0]++' prints each line only the first time it appears
  args <- c("'!_[$0]++'", file, ">", out)
  system2("awk", args)
}

lapply(files, clean_text2)
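
As a variant, system2() can redirect the output itself through its stdout argument instead of passing ">" as a shell token; a sketch under the same assumptions (a Unix-like system with awk on the PATH):

clean_text2b <- function(file, pref = "out"){
  out <- file.path(dirname(file), paste0(pref, basename(file)))
  # stdout = out sends awk's standard output straight to the cleaned file
  system2("awk", c("'!_[$0]++'", file), stdout = out)
}

lapply(files, clean_text2b)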
  • Thank you so much for the help. I tried running the above code on a directory with 3 files (01.txt, 02.txt and 03.txt), in the following sequence: files_captions <- dir(path="/path/to/Texts", full.names=TRUE); folder_captions <- "/path/to/Texts"; lapply(files_captions, clean_text2). I received the following error: sh: out/path/to/Texts/01.txt: No such file or directory (and the same for 02.txt and 03.txt)?

  • @user12774787 Try it again now, with the modification.

  • Thanks again. The output this time was: > str(teste05) List of 50 $ : int 2 $ : int 2 $ : int 2


Thank you all. Apparently I was able to solve the problem with the following function. I tested it on a sample of 10 files and it worked; I’ll test it on all of them next and report back.

library(readr)    # read_lines()
library(purrr)    # map(), map_chr()

files <- list.files(path = "path/to/your/files", pattern = "*.txt", full.names = TRUE)

import.and.clean <- function(myfiles){
  tmp  <- lapply(myfiles, read_lines)                      # one character vector of lines per file
  tmp2 <- purrr::map(tmp, unique)                          # drop duplicate lines within each file
  tmp3 <- purrr::map_chr(tmp2, paste0, collapse = "\n")    # collapse each file into one string
  return(tmp3)
}

my.vector <- import.and.clean(files)
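
If the final goal is the data frame described in the question, the returned vector can then be stored as a single column, for example:

interviews <- data.frame(text = my.vector, stringsAsFactors = FALSE)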
