How do I insert NA, or otherwise handle missing data, when web scraping a page with rvest in R?

I’m downloading hotel rates for Natal, RN from the Booking.com website.

For each check-in date, the code downloads the hotel name (nomes_i), the room name (quarto_i) and the rate for the day (precos_i), for the current day and the next 502 days (it is scheduled to download 731 days, but only gets 502).

The code stores these variables in a dataframe (banco_precos_i), which feeds the page dataframe (banco_precos) and later the day dataframe (banco_precos_dia). At the end, the data for every day is combined into a single dataframe (banco_precos_final).

When it reaches 10/09/2020, on page 6, and on a few other days in September, the number of elements in quarto_i differs from nomes_i and precos_i, which always have the same size. This makes it impossible to build the dataframe banco_precos_i and, consequently, the others.

Error reported:

Error in data.frame(nomes_i, quarto_i, precos_i, stringsAsFactors = F) : arguments imply differing number of rows: 18, 17

Some value is missing, and I am not able to capture it or to insert NA in place of the missing information.

As a workaround, I am setting all of quarto_i to NA for the month of September, but this information is very important for my analysis and I would like to try another solution.

I tried the two suggestions that are here, to download only the missing data, but they didn’t work. Does anyone have another suggestion?

P.S.: I am only including the parts of the code that are related to the problem:

library(lubridate) 
library(rvest) 
library(devtools)
library(tidyverse) 
library(rlang) 
library(curl)

inicio <- today()
dias <- 0
banco_precos_dia <- c()
banco_precos_final <- c()

for (i in 0:731) {

  # check-in date
  diacheckin <- as.Date("2020-09-10")+ddays(i) # as.Date is hard-coded here so the code starts downloading from where the error occurs; normally this variable is inicio
  diain <- as.numeric(day(diacheckin))
  mesin <- as.numeric(month(diacheckin))
  anoin <- as.numeric(year(diacheckin))

  # check-out date
  diacheckout <- diacheckin+ddays(1)
  diaout <- as.numeric(day(diacheckout))
  mesout <- as.numeric(month(diacheckout))
  anoout <- as.numeric(year(diacheckout))

  qtd <- 250 # there is code that computes the exact quantity here, but it is not relevant to this problem

  banco_precos <- c()
  banco_precos_i <- c()

  for(j in seq(0,qtd,25)){
    url_number <- j

    # fetching the page

    url <- curl(paste0('https://www.booking.com/searchresults.pt-br.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaCCIAQGYAS24ARfIAQzYAQHoAQH4AQuIAgGoAgO4AvCp5vIFwAIB&sid=b0ea1003a80543236a20e94559c4ed28&tmpl=searchresults&checkin_month=',mesin,'&checkin_monthday=',diain,'&checkin_year=',anoin,'&checkout_month=',mesout,'&checkout_monthday=',diaout,'&checkout_year=',anoout,'&city=-656888&class_interval=1&dest_id=-656888&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&nflt=ht_id%3D203%3Bht_id%3D204%3Bht_id%3D206%3Bht_id%3D216%3B&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=searchresults&src_elem=sb&srpvid=3de2a5cdcd850113&ss=Natal&ss_all=0&ssb=empty&sshis=0&ssne=Natal&ssne_untouched=Natal&top_ufis=1&rows=25&offset=',url_number), "rb")

    # reading the page
    page <- read_html(url)

    # hotel names
    nomes_i <- page %>%
      html_nodes(".sr-hotel__name") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}

    # room name
    quarto_i <- page %>%
      html_nodes(".room_link strong ,  .sold_out_property") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}

    # *when it reaches 10/09 and the error occurs, I stop the code, replace the code above with quarto_i <- NA, wait for September to download, and then go back to downloading October onwards with the code above, all manually*

    # prices
    precos_i <- page %>%
      html_nodes(".bui-price-display__value , .sold_out_property ") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}

    # building the data frame for each page
    banco_precos_i <- data.frame(nomes_i, quarto_i, precos_i, stringsAsFactors = FALSE)

    # appending to the data frame for one day
    banco_precos <- rbind(banco_precos, banco_precos_i)

    # per-day price data frame
    banco_precos_dia <- cbind(inicio, banco_precos, diacheckin)

    # pause execution in R for 3 seconds
    Sys.sleep(3)
  }
  banco_precos_final <- rbind(banco_precos_final, banco_precos_dia)
}

  • Since someone managed to run this code, the error may be related to the response R receives from the server. It can help to verify that you received the appropriate response from the server (the httr package can help).

  • The error message, however, suggests that some of the information you collected was incomplete when banco_precos_i was built: one vector has 18 elements and another has 17. In that case you could add a check before creating banco_precos_i and think of a strategy to make the vectors the same size.

  • Hello @Tomásbarcellos, yes, that’s what I’m trying to solve. Did you manage to run the code? I’m using the rvest package for scraping; is there a function in httr that you would suggest? I have not seen features in that package that differ from rvest for scraping. As I explained to @Tusca, while following his suggestions I discovered that on page 6 there is a specific hotel (Hostel Margo) whose values come through differently. There is no room information in .room_link strong for this hotel, and there are two rooms in .room_link. Do you have any suggestions for solving this?

  • One solution would be to "force" the code to merge the room information into a single element. Ex: paste(texto_quartos, collapse = " "). Once the scraping is done, you could then think about how to treat it.

  • Another possibility would be to create a variable quarto2 in the table, which is almost always empty but would be filled in these cases. Or even add up the size of all the rooms and keep only that information.

  • Or check how many rooms there are, and so on. But only you can determine the best strategy for your specific case.
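
The length check suggested in the comments above could look roughly like the sketch below. The helper pad_to_longest and the sample values are hypothetical, made up for illustration. Note that padding with NA at the end only makes data.frame() succeed; it does not guarantee that each room name still lines up with the right hotel.

```r
# Hypothetical helper (not part of rvest): pad every vector with NA
# so that all of them have the length of the longest one.
pad_to_longest <- function(...) {
  cols <- list(...)
  n <- max(lengths(cols))
  lapply(cols, function(x) {
    length(x) <- n  # extending a vector this way fills the new slots with NA
    x
  })
}

# Made-up values standing in for one scraped page
nomes_i  <- c("Hotel A", "Hostel B", "Hotel C")
quarto_i <- c("Quarto Duplo", "Suite")  # one room name is missing
precos_i <- c("R$ 200", "R$ 80", "R$ 350")

# Check the sizes before building the data frame
sizes <- lengths(list(nomes_i = nomes_i, quarto_i = quarto_i, precos_i = precos_i))
if (length(unique(sizes)) > 1) {
  message("Vector sizes differ: ",
          paste(names(sizes), sizes, sep = "=", collapse = ", "))
}

padded <- pad_to_longest(nomes_i = nomes_i, quarto_i = quarto_i, precos_i = precos_i)
banco_precos_i <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Logging the page and date whenever the sizes differ would make it easier to go back and inspect exactly which hotel caused the mismatch.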

1 answer

Your code ran without problems here. I deliberately tried to force an error in the vector of rooms and it still worked, putting NAs where nothing was found.

Maybe what is happening is that the server is blocking your IP for generating too many queries on the site. This can return HTML pages with errors. Tip: when you hit this error in your code, save the "page" variable with the .html extension and open it in your browser to see what the page looks like and whether anything looks strange in its elements.
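
A minimal sketch of that tip, assuming the xml2 package (which rvest builds on) is installed. The small inline HTML below only stands in for the real `page` object returned by read_html(url), and the file name is an arbitrary choice.

```r
library(xml2)

# Stand-in for the `page` object from the question's loop
page <- read_html("<html><body><h1 class='sr-hotel__name'>Hotel A</h1></body></html>")

# Write the parsed document to disk so it can be inspected later
arquivo <- file.path(tempdir(), "pagina_booking.html")
write_html(page, arquivo)

# In an interactive session, open the saved file in the default browser
if (interactive()) browseURL(arquivo)
```

To check the server response itself, as mentioned in the comments, something like httr::status_code(httr::GET(url)) could be called on the same URL string before parsing; a status other than 200 would point to a block or an error page.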

  • Hi @tusca. I tried downloading starting from 10/09 to test the possibility of a server block, but it failed on page 6. Opening exactly that page, I saw that for Hostel Margo two dormitories appear instead of a room name (and they do not show up in .room_link strong). I tried scraping .roomname or .room_link instead of .room_link strong, but the error persists, because now two names come back for this hotel instead of just one. Do you know how I can group these dorms into one? I included the page link in the question, since it exceeds the comment character limit.
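
One way to group the rooms per hotel, sketched under assumptions: the HTML below is made up, and the container class .sr_item is a guess at how each hotel card is wrapped, so the selectors would need to be checked against the real page. The idea is to select one node per hotel first and then look for the name, rooms and price inside each card, so every hotel always yields exactly one row.

```r
library(rvest)

# Made-up HTML standing in for a results page; class names are assumptions
html <- '
<div class="sr_item">
  <span class="sr-hotel__name">Hotel A</span>
  <a class="room_link"><strong>Quarto Duplo</strong></a>
  <div class="bui-price-display__value">R$ 200</div>
</div>
<div class="sr_item">
  <span class="sr-hotel__name">Hostel B</span>
  <a class="room_link">Dormitorio Misto</a>
  <a class="room_link">Dormitorio Feminino</a>
  <div class="bui-price-display__value">R$ 80</div>
</div>
<div class="sr_item">
  <span class="sr-hotel__name">Hotel C</span>
  <div class="bui-price-display__value">R$ 350</div>
</div>'

page  <- read_html(html)
cards <- html_nodes(page, ".sr_item")  # one node per hotel

# html_node() (singular) returns one result per card, missing when nothing
# matches, and html_text() turns a missing node into NA
nomes_i  <- html_text(html_node(cards, ".sr-hotel__name"), trim = TRUE)
precos_i <- html_text(html_node(cards, ".bui-price-display__value"), trim = TRUE)

# Collapse all room names found inside one card into a single string
quarto_i <- vapply(seq_along(cards), function(k) {
  txt <- html_text(html_nodes(cards[[k]], ".room_link"), trim = TRUE)
  if (length(txt) == 0) NA_character_ else paste(txt, collapse = " / ")
}, character(1))

banco_precos_i <- data.frame(nomes_i, quarto_i, precos_i, stringsAsFactors = FALSE)
```

Because all three vectors are derived from the same set of cards, they have the same length by construction, and a hotel with no room name gets NA instead of shifting the rows below it.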
