How to web scrape a site that uses the POST method?

I’m having trouble web scraping sites that use the POST method. For example, I need to extract all news related to political parties from the website http://www.diariodemarilia.com.br.

Below is a script I wrote for a newspaper that uses the GET method, to show what my goal is with this program.

# load libraries
library(XML)
library(xlsx)

# real URL = http://www.imparcial.com.br/site/page/2?s=%28PSDB%29
url_base <- "http://www.imparcial.com.br/site/page/koxa?s=%28quatro%29"
url_base <- gsub("quatro", "PSD", url_base)

link_imparcial <- c()

# collect the article links from the first 4 result pages
for (i in 1:4) {
  print(i)
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//h1[@class='cat-titulo']/a", xmlGetAttr, name = "href")
  link_imparcial <- c(link_imparcial, links)
}

# visit each link and extract title, date/time and body text
dados <- data.frame()
for (links in link_imparcial) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)
  titulo <- xpathSApply(pag1, "//div[@class='titulo']/h1", xmlValue)
  data_hora <- xpathSApply(pag1, "//span[@class='data-post']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@class='conteudo']/p", xmlValue)
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}

# collapse the paragraphs of each article into a single text field
agregar <- aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = " ")

# set working directory
setwd("C:\\Users\\8601314\\Documents")

# save as xlsx
write.xlsx(agregar, "PSDB.xlsx", col.names = TRUE, row.names = FALSE)

If it is not possible to solve my problem, I would appreciate pointers to where I can find programming examples using the POST method.

  • Welcome to Sopt. Add the code to the question, not an image of it; an image makes it difficult to analyze and reproduce the problem. I suggest reading [Ask].

1 answer



In that case, you can do it with the httr package:

library(httr)
library(rvest)
library(purrr)
library(stringr)

# POST to the search results page; "Busca" is the form field that carries the search term
url <- "http://www.diariodemarilia.com.br/resultado/"
res <- POST(url, body = list("Busca" = "PT"))

After that, you can extract the data in the usual way or with rvest:

noticias <- content(res, as = "text", encoding = "latin1") %>%
  read_html() %>%
  html_nodes("td")

# extract titles
noticias %>%
  html_nodes("strong") %>%
  html_text()

# extract links
noticias %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  keep(~str_detect(.x, fixed("/noticia/")))

# extract dates
noticias %>%
  html_nodes("em") %>%
  html_text()
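
If you want to end up with a table like in your GET script, here is a minimal sketch that assembles the three extractions above into a data frame and saves it. It assumes the titles, links and dates come back with matching lengths, which you should verify on the real page:

# sketch: assemble the extracted pieces into a data frame and save to xlsx,
# assuming titles, links and dates line up one-to-one (check this on the real page)
library(xlsx)

titulos <- noticias %>% html_nodes("strong") %>% html_text()
links   <- noticias %>% html_nodes("a") %>% html_attr("href") %>%
  keep(~str_detect(.x, fixed("/noticia/")))
datas   <- noticias %>% html_nodes("em") %>% html_text()

dados <- data.frame(titulo = titulos, link = links, data = datas,
                    stringsAsFactors = FALSE)
write.xlsx(dados, "PT.xlsx", col.names = TRUE, row.names = FALSE)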

The idea for extracting information when the site receives POST forms is to find out what information the site sends to the server.

I always open the site in Chrome, press F12 to open the developer tools, and go to the Network tab.

Then I submit the form normally through the site, go back to the Network tab, and click on the first item in the list, which in this case is /resultado/.

Now look for the Form Data section in the image below; that is the information you need to send to the server through the body parameter of httr's POST function.

[image: Chrome developer tools, Network tab, showing the Form Data of the /resultado/ request]
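
As a rough illustration of how those Form Data fields translate into the body argument: the only field confirmed for this site is "Busca"; the party names and the idea of looping over several searches are just examples to adapt to whatever you actually see in the Network tab.

# sketch: each Form Data field becomes one element of the body list;
# here we repeat the POST for several search terms (example party names)
library(httr)
library(rvest)

partidos <- c("PT", "PSDB", "PSD")

paginas <- lapply(partidos, function(p) {
  res <- POST("http://www.diariodemarilia.com.br/resultado/",
              body = list("Busca" = p))
  content(res, as = "text", encoding = "latin1") %>% read_html()
})
names(paginas) <- partidos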

  • Great tip about looking at the Form Data in Chrome; I didn't know about it and always worked only from the HTML.
