I’m working with a set of interviews I pulled from a database at work, each stored in a separate text file (a plain file with a multi-line transcription).
Each interview will be part of a data frame as a variable. The metadata for each interview is in a different data frame, and I have to import the texts into it (one whole interview per line). There are more than 700 text files to import.
The issue is that the original interview files contain many repeated lines and, given the number of files, importing them one by one and running x <- unique(x)
on each is out of the question. I wrote the code below to try to import them:
library(plyr)   # ldply()
library(readr)  # read_file()

files <- list.files(path = "path/to/the/files", pattern = "*.txt", full.names = TRUE)
folder <- "path/to/the/folder"

clean_text <- function(myfile) {
  n <- length(myfile)
  myfile <- c(stringr::str_remove_all(myfile[-n], "[\n].*"),
              myfile[n])
  myfile <- unique(myfile)
  myfile <- paste0(myfile, collapse = "\n")
  myfile
}

texts <- ldply(clean_text(files), read_file)
But this doesn’t give me the texts; all I have is the paths to my files:
View(texts)
V1
path/to/my/files/file1 path/to/my/files/file2
I’m open to any solution.
Update
Here’s the workflow I’ve been unable to get working:
- Read 700 text files from a directory (plain-text files with lines separated by \n)
- Delete the repeated lines
- Import them into R, as a vector or a data frame column in which each text is one element.
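A minimal sketch of those three steps in base R (assuming each file fits in memory; the directory path is a placeholder):

```r
# Sketch: read each file, drop repeated lines, collapse each interview
# back into a single string, and store one interview per row.
# "path/to/the/files" is a placeholder for the real directory.
files <- list.files(path = "path/to/the/files", pattern = "\\.txt$",
                    full.names = TRUE)

read_dedup <- function(f) {
  lines <- readLines(f, warn = FALSE)    # one element per line of the file
  paste(unique(lines), collapse = "\n")  # keep the first occurrence of each line
}

texts <- vapply(files, read_dedup, character(1), USE.NAMES = FALSE)
interviews <- data.frame(text = texts, stringsAsFactors = FALSE)
```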
Update 2
I’ve made some progress: I now have the texts in a list, where each element is a character vector:
str(teste)
List of 3
$ : chr [1:205] "Hello " ...
$ : chr [1:581] "hello little buns ...
$ : chr [1:849] "- Hello everybody," ...
But I can’t put each of these list elements into a single cell of a data frame. I’ve already searched all over Google.
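One way to do that, sketched with a small stand-in list (since the real teste isn’t available): collapse each character vector into a single string, then build the data frame from the resulting vector.

```r
# Stand-in for the real `teste` list of character vectors.
teste <- list(c("Hello", "Hello", "line two"),
              c("hi there", "hi there", "bye"))

# Collapse each element into one string, dropping repeated lines.
collapsed <- vapply(teste,
                    function(x) paste(unique(x), collapse = "\n"),
                    character(1))

# One interview per row; each cell holds a whole text.
interviews <- data.frame(text = collapsed, stringsAsFactors = FALSE)
```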
If anyone could help, I’d be grateful.
Welcome to Stack Overflow! Unfortunately, this question cannot be reproduced by anyone trying to answer it. Please take a look at this link to see how to ask a reproducible question in R, so that people who wish to help you can do so in the best possible way.
– Marcus Nunes
Hello, thank you, but I can’t upload the data because it belongs to the university I work at; I can only use it with citation. Beyond that, there’s not much more I can add.
– RodLL
Then maybe the best option is to hire an R consultant and sign an NDA with them. I find it very difficult for you to get a free solution on the internet with so little information about your problem. After all, we don’t even know the structure of the text files — for example, where the subjects’ name, age, gender, and interview information comes from.
– Marcus Nunes
I don’t need any of that; I just need to put the 700 texts, without duplicate lines, into a data frame in which each variable (line) is one of the texts. Everything else I can do myself.
– RodLL
Plain text with duplicate lines -> data frame cell (no duplicate lines). V1: text1 without duplicate lines, text2 without duplicate lines, text3 without duplicate lines. Doing it one by one is easy; I can’t find a solution for 700 texts at a time.
– RodLL
Hi Marcus, I put an update in the post, in case you can help.
– RodLL
For unique lines, if you are on Linux or a system that has the Unix/Linux command
awk
, then awk '!_[$0]++' infile > outfile
is faster than R’s unique.
– Rui Barradas
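The one-liner above cleans a single file; a sketch of looping it over every .txt file in a directory (the path and the _clean suffix are placeholders):

```shell
# Apply the awk duplicate-line filter to every .txt file in a directory,
# writing a cleaned copy next to each original. Paths are placeholders.
for f in path/to/the/files/*.txt; do
  [ -e "$f" ] || continue                       # skip if the glob matched nothing
  awk '!seen[$0]++' "$f" > "${f%.txt}_clean.txt"
done
```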
Hi, thank you so much! Can it be run on all the files at once?
– RodLL