How to identify the page number of a .pdf by something written on it?

Question


Asked 4 years, 10 months ago

Viewed 104 times

-1

I have a .pdf with 120 pages; each page is a certificate, and the only difference between them is the participant's name.

I also have a .csv with the list of participants' names and e-mail addresses (I used this list to generate the .pdf, and I will try to use the e-mails to send the certificates from R).

How can I separate each page (certificate) into a new .pdf and save it with the attendee's name?

I saw functions like pdf_subset from library(pdftools), but how can I identify the page number by something written on the page?

library(pdftools)

# extract page 1
pdf_subset('certificados.pdf',
           pages = 1, output = "Carlos dos Santos.pdf")

It’s a pretty repetitive job, I wouldn’t want to do it manually.

I also thought of splitting the file page by page, then searching for the name in each .pdf and renaming the file accordingly. But I don't know how to do that yet either.

Here is a sample certificate: https://drive.google.com/file/d/1iwgW6kMT7C9Xee5SM65vz-D8B26bpavz/view?usp=sharing

Example of the .csv:

nome,email
Prof. Dr. Thiado Souza,[email protected]
Prof. Dr. Marcelo José,mjose@gmail
Ricado Augusto,[email protected]
Carlos José,[email protected]

1

Hey @RxT, how's it going? Can you make the .pdf available? It can be a fake one (or with the names omitted), so that I can try some code. Since you want to "separate each page (certificate) into a new .pdf", I think it is easier to do from the command line. Do you use Ubuntu? Ubuntu has ImageMagick, which should do this easily (I just don't know how to loop over the pages, but that's a matter of searching). After that, I imagine the rest is easier.

– Guilherme Parreira

2020/09/02 at 12:15
@Guilhermeparreira thanks for your attention! I added the file in question! Yes, I actually use Mint, but that shouldn't make a difference, right?

– RxT

2020/09/02 at 14:08

2 answers

1

Since the certificates follow the same order as the csv file:

library(pdftools)

arq <- read.csv('./rxt.csv')
nomes <- as.character(arq$nome)

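# write page i of the pdf to a file named after the i-th participant
cria_pdf <- function(n, i){
  # paste0() avoids the space that paste() would insert before '.pdf'
  pdf_subset('certificado-teste.pdf',
             pages = i, output = paste0(n[i], '.pdf'))
}

lapply(seq_along(nomes), cria_pdf, n = nomes)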

This creates one file per page in the working directory, each named after the corresponding name in the csv.
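Since the names go straight into file names, it may be worth cleaning them first. The following is only a sketch in base R: make_pdf_names is a hypothetical helper (it is not part of pdftools), and the sample names are made up. It replaces characters that are not allowed in file names on common systems and makes repeated names unique, so that one certificate does not overwrite another.

```r
# Hypothetical helper: turn participant names into safe, unique file names.
make_pdf_names <- function(nomes) {
  nomes <- trimws(as.character(nomes))
  # replace characters that are invalid in file names on common systems
  limpos <- gsub('[/\\\\:*?"<>|]', "_", nomes, perl = TRUE)
  # make.unique() appends .1, .2, ... to repeated names
  paste0(make.unique(limpos), ".pdf")
}

# made-up names, including a duplicate and a name with a slash
nomes <- c("Carlos dos Santos", "Ricado Augusto ", "Carlos dos Santos", "Ana/Bia")
arquivos <- make_pdf_names(nomes)
print(arquivos)
```

The resulting vector could then be used as the output argument of pdf_subset. Comparing length(nomes) with pdf_info('certificado-teste.pdf')$pages before splitting is a cheap way to check that the csv and the pdf at least have the same number of entries.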


by Rui Barradas • **15,422** points · Answer 1 · 2020-09-02T16:32:58+00:00

I believe the following function splits the input pdf into pages, storing each page in its own file, and renames those files with the names from the csv file. The function's inputs are

file - the name of the pdf file;
nomes - the data.frame with a column 'nome'.

It needs the pdftools package to process the pdf files and the stringi package to remove accents and special letters, as is done in this answer.

library(pdftools)

rename_file <- function(from, to){
  # try a plain rename first; on failure file.rename() returns FALSE
  # with a warning (e.g. when moving across file systems)
  out <- tryCatch(file.rename(from, to),
                  error = function(e) e,
                  warning = function(w) w
  )
  if(!isTRUE(out)){
    # fall back to copying the file and deleting the original
    out <- tryCatch(file.copy(from, to),
                    error = function(e) e,
                    warning = function(w) w
    )
    if(inherits(out, "error")){
      stop(out)
    }
    if(inherits(out, "warning")){
      warning(out)
    }
    if(file.exists(to)) {
      if(file.exists(from)) unlink(from)
      out <- TRUE
    } else out <- FALSE
  }
  out
}

fun_split_pdf <- function(file, nomes){
  # read the pdf data into a list of data.frames, one per page
  data_list <- pdf_data(file)
  # keep the 'text' column of each data.frame
  # and collapse it into a single string per page with paste()
  Text <- unlist(lapply(data_list, function(x) paste(x[['text']], collapse = ' ')))
  # remove accents, if any
  Text <- stringi::stri_trans_general(Text, "Latin-ASCII")
  # find the first page of the pdf on which each csv name appears
  i <- sapply(as.character(nomes[['nome']]), function(x){
    x <- stringi::stri_trans_general(x, "Latin-ASCII")
    # fixed = TRUE: match the name literally, not as a regular expression
    j <- grep(x, Text, fixed = TRUE)
    if(length(j) > 0) j[1] else NA_integer_
  })
  # sort the 'nome' column of the csv by page number
  # (this assumes every page matches exactly one name)
  nms <- as.character(nomes[['nome']])[order(i)]
  # final names of the output files
  nms <- paste0(nms, '.pdf')

  # split the pdf into temporary files, one per page
  tmp <- tempfile()
  pdf_split(file, output = tmp)

  # rename the temporary files, which are listed in page order
  pattern <- paste0(basename(tmp), "_.*\\.pdf")
  tmp_fls <- list.files(path = dirname(tmp), pattern = pattern)
  tmp_fls <- file.path(dirname(tmp), tmp_fls)
  sapply(seq_along(tmp_fls), function(i){
    rename_file(tmp_fls[[i]], nms[[i]])
  })
}

fl <- list.files(pattern = '\\.pdf')
fl
#[1] "certificado-teste.pdf"

nomes <- read.csv("pdf_teste.csv")

fun_split_pdf(fl, nomes)
#[1] TRUE TRUE TRUE TRUE
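
The name-to-page matching can also be tried on its own, without a pdf. This is a minimal sketch in base R: find_pages is a hypothetical helper, and page_text is made-up text standing in for what pdftools::pdf_text() would return (one string per page). A name that appears on no page gets NA and is dropped from the final ordering.

```r
# Hypothetical helper: for each name, the index of the first page containing it.
find_pages <- function(nomes, page_text) {
  vapply(as.character(nomes), function(x) {
    # fixed = TRUE: the "." in "Dr." is matched literally, not as a regex wildcard
    j <- grep(x, page_text, fixed = TRUE)
    if (length(j) > 0) j[1] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

# made-up page texts, deliberately in a different order than the names
page_text <- c("We certify that Ricado Augusto attended the course",
               "We certify that Carlos dos Santos attended the course",
               "We certify that Prof. Dr. Marcelo Jose attended the course")
nomes <- c("Prof. Dr. Marcelo Jose", "Ricado Augusto",
           "Carlos dos Santos", "Maria Silva")

pages <- find_pages(nomes, page_text)
print(pages)

# names sorted by the page they were found on; unmatched names are dropped
by_page <- nomes[order(pages, na.last = NA)]
print(by_page)
```

Checking for NA in pages before splitting shows which participants were not found, for example because of a typo in the csv.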