I'm having trouble web scraping sites that use the POST method. For example, I need to extract all news related to political parties from the website http://www.diariodemarilia.com.br. Below is a script I wrote for a newspaper site that uses the GET method, to show what my goal is with this program.
# load libraries
library(XML)
library(xlsx)
# actual URL: http://www.imparcial.com.br/site/page/2?s=%28PSDB%29
# "koxa" and "quatro" are placeholders for the page number and the search term
url_base <- "http://www.imparcial.com.br/site/page/koxa?s=%28quatro%29"
url_base <- gsub("quatro", "PSDB", url_base)
# collect article links from the first four result pages
link_imparcial <- c()
for (i in 1:4) {
  print(i)
  url1 <- gsub("koxa", i, url_base)
  pag <- readLines(url1)
  pag <- htmlParse(pag)
  pag <- xmlRoot(pag)
  links <- xpathSApply(pag, "//h1[@class='cat-titulo']/a", xmlGetAttr, name = "href")
  link_imparcial <- c(link_imparcial, links)
}
# visit each article link and extract title, date/time, and body paragraphs
dados <- data.frame()
for (links in link_imparcial) {
  pag1 <- readLines(links)
  pag1 <- htmlParse(pag1)
  pag1 <- xmlRoot(pag1)
  titulo <- xpathSApply(pag1, "//div[@class='titulo']/h1", xmlValue)
  data_hora <- xpathSApply(pag1, "//span[@class='data-post']", xmlValue)
  texto <- xpathSApply(pag1, "//div[@class='conteudo']/p", xmlValue)
  dados <- rbind(dados, data.frame(titulo, data_hora, texto))
}
# collapse the paragraphs of each article into a single text field
agregar <- aggregate(dados$texto, list(dados$titulo, dados$data_hora), paste, collapse = " ")

# set working directory
setwd("C:\\Users\\8601314\\Documents")

# save as xlsx
write.xlsx(agregar, "PSDB.xlsx", col.names = TRUE, row.names = FALSE)
If it is not possible to solve my problem here, I would appreciate pointers to where I can find examples of scraping with the POST method.
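For reference, a POST request can be sent from R with the httr package and the response parsed with the same XML functions used above. The sketch below is only illustrative: the endpoint URL and the form field name (busca) are assumptions, so inspect the site's search form (for example, with the browser's developer tools, Network tab) to find the real action URL and field names.

# minimal sketch of a POST-based scrape with httr; endpoint and field name are assumed
library(httr)
library(XML)

# hypothetical search endpoint and form field; check the real form before using
resposta <- POST("http://www.diariodemarilia.com.br/busca",
                 body = list(busca = "PSDB"),
                 encode = "form")

# parse the returned HTML exactly as in the GET version
pag <- htmlParse(content(resposta, as = "text", encoding = "UTF-8"),
                 asText = TRUE, encoding = "UTF-8")
links <- xpathSApply(pag, "//h1/a", xmlGetAttr, name = "href")  # adjust the XPath to the site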
Welcome to SOpt. Add the code to the question as text, not as an image; an image makes it difficult to analyze and reproduce the problem. I suggest reading [Ask].
– user28595