How can I identify the page number of a .pdf by text written on it?

I have a .pdf with 120 pages. Each page is a certificate, and the only difference between them is the name of the participant.

I also have a .csv with the list of participants' names and e-mail addresses (I used this list to generate the .pdf, and I plan to use the e-mail addresses to send the certificates from R).

How can I separate each page (certificate) into a new .pdf and save it under the attendee's name?

I saw functions like pdf_subset from library(pdftools), but how can I identify the page number by something written on the page?

library(pdftools)

# extract page 1
pdf_subset('certificados.pdf',
           pages = 1, output = "Carlos dos Santos.pdf")

It's a pretty repetitive job, so I would rather not do it manually.

I also thought of splitting the file page by page, then searching for the name in each .pdf and renaming the file accordingly. But I don't know how to do that yet either.

Here is a sample certificate: https://drive.google.com/file/d/1iwgW6kMT7C9Xee5SM65vz-D8B26bpavz/view?usp=sharing

Example of the .csv:

nome,email
Prof. Dr. Thiado Souza,[email protected]
Prof. Dr. Marcelo José,mjose@gmail
Ricado Augusto,[email protected]
Carlos José,[email protected]
  • 1

    Hi @Rxt, all good? Can you share the .pdf? It can be a fake one (or with the names omitted), so that I can try some code. Since you want to "separate each page (certificate) into a new .pdf", I think it is easier to do from the command line. Do you use Ubuntu? Ubuntu has ImageMagick, which should do this easily (I just don't know how to loop over the pages, but that's a matter of searching). After that, I imagine the rest is easier.

  • @Guilhermeparreira thanks for your attention! I added the file in question. Yes, I actually use Mint, but it shouldn't be any different, right?

2 answers

1


Since the certificates follow the same order as the csv file:

library(pdftools)

arq <- read.csv('./rxt.csv')
nomes <- as.character(arq$nome)

cria_pdf <- function(n, i){
  # paste0() avoids the space that paste() would insert before '.pdf'
  pdf_subset('certificado-teste.pdf',
             pages = i, output = paste0(n[i], '.pdf'))
}

lapply(seq_along(nomes),cria_pdf, n = nomes)

This creates one file per certificate in the working directory, each named after the corresponding name in the csv.
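One caveat when naming files after people: a name may contain characters that are not valid in file names, and a person registered twice would overwrite their own file. Below is a minimal sketch of a helper for that, using base R only. The name make_output_names is hypothetical (it is not part of pdftools), and the set of replaced characters is an assumption about what common file systems reject.

```r
# Hypothetical helper: turn participant names into safe, unique file names.
make_output_names <- function(nomes) {
  nomes <- trimws(as.character(nomes))
  # replace characters that are commonly rejected in file names
  nomes <- gsub('[/\\\\:*?"<>|]', "_", nomes)
  # repeated names get a numeric suffix so that no file is overwritten
  nomes <- make.unique(nomes, sep = "_")
  paste0(nomes, ".pdf")
}

make_output_names(c("Carlos Jose", "Carlos Jose", " Ana/Maria "))
```

Each element of the result could then be passed as the output argument of pdf_subset.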

-1

I believe the following function splits the input pdf into pages, stores each page in its own file, and renames those files with the names from the csv file. The function's inputs are:

  1. file - the name of the pdf file;
  2. nomes - the data.frame with a column 'nome'.

It needs the pdftools package to process pdf files and the stringi package to remove accents and special characters, as in this answer.

library(pdftools)

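rename_file <- function(from, to){
  # file.rename() reports a failure (e.g. across file systems) with a
  # warning and a FALSE result, not with an error, so catch both
  out <- tryCatch(file.rename(from, to),
                  error = function(e) FALSE,
                  warning = function(w) FALSE
  )
  if(!isTRUE(out)){
    # fallback: copy to the target file, then remove the original
    out <- tryCatch(file.copy(from, to),
                    error = function(e) e,
                    warning = function(w) w
    )
    if(inherits(out, "error")){
      stop(out)
    }
    if(inherits(out, "warning")){
      warning(out)
      out <- FALSE
    }
    # only delete the original if the copy really succeeded
    out <- isTRUE(out) && file.exists(to)
    if(out && file.exists(from)) unlink(from)
  }
  out
}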

fun_split_pdf <- function(file, nomes){
  # read the pdf data into a list of data.frames, one per page
  data_list <- pdf_data(file)
  # keep the 'text' column of each data.frame
  # and collapse it into a single string per page with paste()
  Text <- unlist(lapply(data_list, function(x) paste(x[['text']], collapse = ' ')))
  # remove accents, if any
  Text <- stringi::stri_trans_general(Text, "Latin-ASCII")
  # find the page of the pdf on which each csv name appears
  nome <- as.character(nomes[['nome']])
  i <- vapply(nome, function(x){
    x <- stringi::stri_trans_general(x, "Latin-ASCII")
    # fixed = TRUE matches the name literally ("Prof. Dr." contains regex metacharacters)
    j <- grep(x, Text, fixed = TRUE)
    # first matching page, or NA when the name is not found
    if(length(j) > 0) j[1] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
  # drop names that were not found and repeated registrations
  keep <- !is.na(i) & !duplicated(i)
  i <- i[keep]
  # final names of the output files, paired with the page numbers in i
  nms <- paste0(nome[keep], '.pdf')

  # split into a temporary directory
  tmp <- tempfile()
  pdf_split(file, output = tmp)

  # rename the temporary files; list.files() returns them in page order
  pattern <- paste0(basename(tmp), "_.*\\.pdf")
  tmp_fls <- list.files(path = dirname(tmp), pattern = pattern)
  tmp_fls <- file.path(dirname(tmp), tmp_fls)
  sapply(seq_along(i), function(k){
    rename_file(tmp_fls[[i[k]]], nms[[k]])
  })
}

fl <- list.files(pattern = '\\.pdf')
fl
#[1] "certificado-teste.pdf"

nomes <- read.csv("pdf_teste.csv")

fun_split_pdf(fl, nomes)
#[1] TRUE TRUE TRUE TRUE
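The matching step on its own can be tried without any pdf. The following is a minimal sketch, assuming the text of each page is already available as a character vector with one element per page (for example from pdftools::pdf_text). The helper find_pages and the toy vector paginas are hypothetical names used only for illustration, and accent removal is left out to keep the example in base R.

```r
# Hypothetical helper: for each name, return the first page whose text contains it.
find_pages <- function(nomes, page_text) {
  # lower-case and squeeze runs of whitespace (including line breaks) into one space
  normalize <- function(x) tolower(gsub("\\s+", " ", trimws(x)))
  page_text <- normalize(page_text)
  vapply(normalize(as.character(nomes)), function(nome) {
    # fixed = TRUE treats the name as plain text, not as a regular expression
    hits <- grep(nome, page_text, fixed = TRUE)
    if (length(hits) > 0) hits[1] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

# toy input standing in for the text of a three-page pdf
paginas <- c("Certificate\nawarded to Ricado  Augusto",
             "Certificate\nawarded to Prof. Dr. Thiado Souza",
             "Certificate\nawarded to Someone Else")
find_pages(c("Prof. Dr. Thiado Souza", "Ricado Augusto", "Not Registered"), paginas)
```

A name that is missing from the pdf yields NA instead of an empty result, so the output is always a plain integer vector with one entry per name. Note that this is substring matching: a short name that is contained in a longer one would match the longer one's page as well.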
  • I'm getting this error: Error in nomes[["nome"]][i] : invalid subscript type 'list', but the column "nome" is character.

  • @Rxt What does print(i) show right after the sapply? If it gives a list and not a vector, run i <- unlist(i) right after the sapply.

  • It returns the names and the positions: $Nome e sobrenome [1] 1 $Outro nome e sobrenome [1] 2. If I add i <- unlist(i), I get this error: Error in file.rename(tmp_fls, nms) : 'from' and 'to' are of different lengths. I also noticed that some people registered twice. Could that be the cause?

  • 1

    @Rxt Maybe. Perhaps i <- unique(unlist(i)) solves the problem. Could you post a link to a file with these repeated registrations?

  • After removing the duplicates or using unique(), the result is FALSE for all values.

  • Applying your answer to the examples I posted, I got the message: Warning messages: 1: In file.rename(tmp_fls, nms) : cannot rename file '/tmp/RtmpwzCKgS/file7e8e15c01a2e_0001.pdf' to 'Prof. Dr. Thiado Souza.pdf', reason 'Invalid cross-device link'

  • 1

    @Rxt This is a system error message. One solution is to copy to the target file and then unlink the original file. I can't post code right now, but tomorrow morning (Portugal is on GMT) I'm sure I can.

  • @Rxt It turned out not to be yesterday after all; see the answer now.
