Divide bibliographic references into columns in R

Question

Asked 4 years, 11 months ago

I have a data frame (df) that contains several bibliographic references. I want to split these references into the following columns: "Author", "Title", "Journal", "Journal data" (volume, pages) and "Year". The problem is that the references do not seem to follow a fixed pattern.

Here is a sample of the df:

ref<- data.frame(artigo=c("AZEVEDO, L. S. ; NASCIMENTO, E. F. ; CANDEIAS, A. L. B. . Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 8, p. 95-105, 2019.",
                          "BERGER, R. ; SILVA, J. A. A. ; FERREIRA, R. L. C. ; CANDEIAS, A. L. B. ; RUBILAR, R. . Índices de vegetação para a estimativa do Índice de Área Foliar em plantios clonais de Eucalyptus saligna Smith. CIÊNCIA FLORESTAL (ONLINE), v. 29, p. 885, 2019.",
                          "AZEVEDO, L. S. ; CANDEIAS, ANA LÚCIA BEZERRA . Quantitative and qualitative analysis of the IHS fusion technique applied to Landsat8 satellite images. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 9, p. 21-29, 2019.",
                          "SILVA, JADSON FREIRE ; MIRANDA, RODRIGO QUEIROGA ; CANDEIAS, ANA LÚCIA BEZERRA . Uma nova forma de análise bibliométrica ? NAILS (Network Analysis Interface for Literature Studies): Procedimentos essenciais para pesquisadores brasileiros. Revista Brasileira de Meio Ambiente, v. 7, p. 13-28, 2019.",
                          "Oliveira, Claudianne Brainer de Souza ; CANDEIAS, ANA LÚCIA BEZERRA ; TAVARES JUNIOR, J. R. . Utilização de índices físicos a partir de imagens OLI ? TIRS para o mapeamento de uso e cobertura da terra no entorno do aeroporto internacional do Recife/Guararapes ? Gilberto Freire. REVISTA BRASILEIRA DE GEOGRAFIA FÍSICA, v. 12, p. 1039-1053, 2019.",
                          "SANTOS, AMANDA PEREIRA; SILVA, EDER BATISTA DA ; CANDEIAS, ANA LÚCIA BEZERRA ; COSTA, MARIA APARECIDA TENÓRIO DA . Educação critica: uma aliança entre Educação Ambiental e M-learning. EDUCAÇÃO (SANTA MARIA. ONLINE), v. 44, p. 86, 2019.",
                          "SILVA, JADSON FREIRE ; PAZ, YENÊ MEDEIROS ; LIMA-SILVA, Pedro Paulo ; PEREIRA, João Antônio dos Santos ; CANDEIAS, ANA LÚCIA BEZERRA . Índices de vegetação do Sensoriamento Remoto para processamento de imagens na faixa do visível (RGB). JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 9, p. 228-239, 2019.",
                          "ALEXANDRE, Fernando da Silva ; CANDEIAS, ANA LÚCIA BEZERRA ; GOMES, Daniel Dantas Moreira . Modelagem cartográfica para a delimitação das paisagens da bacia hidrográfica do Alto Curso do Rio Mundaú - Pernambuco/Alagoas, Nordeste, Brasil. REVISTA BRASILEIRA DE GEOGRAFIA FÍSICA, v. 12, p. 2489-2502, 2019."))

Is there any possibility of making this division using stringr, for example?

I managed to work with a single reference...

referencia.1<- c("AZEVEDO, L. S. ; NASCIMENTO, E. F. ; CANDEIAS, A. L. B. . Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 8, p. 95-105, 2019.")
str_sub(referencia.1, start = -5, end = -2 )

[1] "2019"

...but not with the entire dataset

str_sub(ref, start = -5, end = -2)

Warning message:
  In stri_sub(string, from = start, to = end) :
  argument is not an atomic vector; coercing

Also, how can I extract these values into separate columns? For example, how do I put the "2019" I extracted into a "year" column?

Since the most complicated part seems to be identifying the authors in the string, I would do this in stages. First, I would extract the last four digits of each reference to get the year. From what is left, I would extract the text starting at "v." to get the journal data. From what is left after that, I would extract everything after the last period to get the journal name. From what is left of that, I would again extract everything after the last period to get the article title. Whatever remains at the end would be the authors.

– Marcus Nunes

2020/08/26 at 12:08
Naturally, if any of these references does not follow exactly the pattern of the given examples (e.g. an article title containing a period), this method will fail.

– Marcus Nunes

2020/08/26 at 12:10
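
The staged approach described in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two sample strings are copied from the question, the variable names (`x`, `resto`, `dados`, `partes`) are hypothetical, and the regular expressions assume every reference ends in ", YEAR." and contains ", v. N".

```r
library(stringr)

# Sample input: two references copied from the question.
x <- c(
  "AZEVEDO, L. S. ; NASCIMENTO, E. F. ; CANDEIAS, A. L. B. . Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 8, p. 95-105, 2019.",
  "AZEVEDO, L. S. ; CANDEIAS, ANA LÚCIA BEZERRA . Quantitative and qualitative analysis of the IHS fusion technique applied to Landsat8 satellite images. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 9, p. 21-29, 2019."
)

# 1. Year: the four digits just before the final period.
ano   <- str_extract(x, "\\d{4}(?=\\.$)")
resto <- str_remove(x, ", \\d{4}\\.$")

# 2. Journal data: everything from "v. <digit>" to the end.
dados <- str_extract(resto, "v\\. \\d.*$")
resto <- str_remove(resto, ", v\\. \\d.*$")

# 3. Journal name: the text after the last period.
periodico <- str_trim(str_extract(resto, "[^.]+$"))
resto     <- str_remove(resto, "\\.[^.]+$")

# 4. Title: again the text after the last period; what remains is the authors.
titulo  <- str_trim(str_extract(resto, "[^.]+$"))
autores <- str_trim(str_remove(resto, "\\.[^.]+$"))

partes <- data.frame(autores, titulo, periodico, dados,
                     ano = as.integer(ano), stringsAsFactors = FALSE)
```

The caveat from the comment applies to the question's own data: for the journal "EDUCAÇÃO (SANTA MARIA. ONLINE)", step 3 would keep only "ONLINE)" because the journal name itself contains a period.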
Marcus, thank you. Do you know how I could extract only the last 4 digits?

– itamar

2020/08/26 at 13:37
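
A minimal sketch of extracting the year for every row. The warning in the question comes from passing the whole data frame `ref` to `str_sub()`; the function expects a character vector, so the column `ref$artigo` has to be passed instead. The two rows below are copied from the question, and `ano_regex` is a hypothetical name used only for illustration.

```r
library(stringr)

# Two rows copied from the question's data frame.
ref <- data.frame(
  artigo = c(
    "AZEVEDO, L. S. ; NASCIMENTO, E. F. ; CANDEIAS, A. L. B. . Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 8, p. 95-105, 2019.",
    "BERGER, R. ; SILVA, J. A. A. ; FERREIRA, R. L. C. ; CANDEIAS, A. L. B. ; RUBILAR, R. . Índices de vegetação para a estimativa do Índice de Área Foliar em plantios clonais de Eucalyptus saligna Smith. CIÊNCIA FLORESTAL (ONLINE), v. 29, p. 885, 2019."
  ),
  stringsAsFactors = FALSE
)

# Positional: characters -5 to -2 are the four digits before the final period.
ref$ano <- as.integer(str_sub(ref$artigo, start = -5, end = -2))

# Pattern-based alternative: four digits immediately before the trailing period.
ano_regex <- as.integer(str_extract(ref$artigo, "\\d{4}(?=\\.\\s*$)"))
```

The positional version assumes every reference ends in exactly "YYYY."; the pattern-based version also tolerates trailing whitespace and returns NA when no year is found.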

1 answer

by Carlos Eduardo Lagosta • **5,497** points · Answer 1 · 2020-08-26T21:50:01+00:00

The references do follow a pattern; only their length varies. The general format is:

SURNAME, FIRST NAME; SURNAME, FIRST NAME . Title of the article. JOURNAL TITLE, v. X, p. X-X, YEAR.

Since the fields are delimited by combinations of periods and commas, you can split them sequentially:

library(stringr)

# Split the authors from the rest (the author list ends at " . "):
spli1 <- str_split(ref$artigo, " \\. ", simplify = TRUE)

# Split the article title from the remainder (at the first ". " only):
spli2 <- str_split_fixed(spli1[, 2], "\\. ", 2)

# Split journal title, volume, pages and year:
spli3 <- str_split(spli2[, 2], ", ", simplify = TRUE)

# Assemble the result, stripping the "v. " and "p. " prefixes literally:
referencias.df <- data.frame(
  autores = str_to_upper(gsub(" ;", ";", spli1[, 1])),
  titulo = spli2[, 1],
  periodico = str_to_title(spli3[, 1]),
  volume = sub("v. ", "", spli3[, 2], fixed = TRUE),
  pagina = sub("p. ", "", spli3[, 3], fixed = TRUE),
  ano = as.integer(sub("\\.", "", spli3[, 4]))
)


> referencias.df[1:2,]
                                                                          autores
1                            AZEVEDO, L. S.; NASCIMENTO, E. F.; CANDEIAS, A. L. B.
2 BERGER, R.; SILVA, J. A. A.; FERREIRA, R. L. C.; CANDEIAS, A. L. B.; RUBILAR, R.
                                                                                                          titulo
1                  Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens
2 Índices de vegetação para a estimativa do Índice de Área Foliar em plantios clonais de Eucalyptus saligna Smith
                                periodico volume pagina  ano
1 Journal Of Hyperspectral Remote Sensing      8 95-105 2019
2              Ciência Florestal (Online)     29    885 2019
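
As a complement to the sequential splits, the same general format could also be written as one regular expression, so that references that do not fit it come back as NA and are easy to find. This is only a sketch, not part of the original answer: `padrao`, `campos` and `nao_casou` are hypothetical names, the first two strings are copied from the question, and the third is a deliberately malformed string added for illustration.

```r
library(stringr)

artigo <- c(
  "AZEVEDO, L. S. ; NASCIMENTO, E. F. ; CANDEIAS, A. L. B. . Estudo da extração de bordas de reservatório utilizando múltiplas técnicas de fusão de imagens. JOURNAL OF HYPERSPECTRAL REMOTE SENSING, v. 8, p. 95-105, 2019.",
  "BERGER, R. ; SILVA, J. A. A. ; FERREIRA, R. L. C. ; CANDEIAS, A. L. B. ; RUBILAR, R. . Índices de vegetação para a estimativa do Índice de Área Foliar em plantios clonais de Eucalyptus saligna Smith. CIÊNCIA FLORESTAL (ONLINE), v. 29, p. 885, 2019.",
  "This string is not a reference"  # hypothetical malformed entry
)

# authors " . " title ". " journal ", v. N, p. N-N, YEAR."
padrao <- "^(.*?) \\. (.*?)\\. (.*), v\\. (\\d+), p\\. ([0-9-]+), (\\d{4})\\.$"

# str_match() returns one row per input; rows that do not match are all NA.
m <- str_match(artigo, padrao)
colnames(m) <- c("completo", "autores", "titulo", "periodico",
                 "volume", "pagina", "ano")

campos <- as.data.frame(m[, -1], stringsAsFactors = FALSE)
campos$ano <- as.integer(campos$ano)

# Indices of the references that did not fit the pattern.
nao_casou <- which(is.na(m[, "completo"]))
```

Checking `nao_casou` before using `campos` shows which references need manual attention, which addresses the concern raised in the comments about entries that deviate from the pattern.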