I'm downloading hotel rates for Natal-RN from the Booking.com website.
For each check-in date, the code downloads the hotel name (nomes_i), the room name (quarto_i) and the rate for that day (precos_i), for the current day and the following days (the loop is scheduled for 731 days, but only 502 are downloaded).
These variables are stored in a dataframe (banco_precos_i), which feeds the page dataframe (banco_precos) and later the day dataframe (banco_precos_dia). At the end, the data for every day are combined into a single dataframe (banco_precos_final).
When it reaches the check-in date 10/09/2020, on page 6, and for a few other days in September, the length of quarto_i differs from that of nomes_i and precos_i, which always match each other. This makes it impossible to build the dataframe banco_precos_i and, consequently, the others.
The reported error:
Error in data.frame(nomes_i, quarto_i, precos_i, stringsAsFactors = F) : arguments imply differing number of rows: 18, 17
Some value is missing, and I am not able to detect it or insert NA in its place.
As a workaround, I am setting all of quarto_i to NA for the month of September, but this information is very important for my analysis and I would like to try another solution.
I tried the two suggestions given here, to download only the missing data, but they didn't work. Does anyone have another suggestion?
P.S.: I am only showing the parts of the code that are related to the problem:
library(lubridate)
library(rvest)
library(devtools)
library(tidyverse)
library(rlang)
library(curl)
inicio <- today()
dias <- 0
banco_precos_dia <- c()
banco_precos_final <- c()
for (i in 0:731) {
  # check-in date
  diacheckin <- as.Date("2020-09-10") + ddays(i) # as.Date is used here so the code starts downloading from where the error occurs; normally the variable used here is inicio
  diain <- as.numeric(day(diacheckin))
  mesin <- as.numeric(month(diacheckin))
  anoin <- as.numeric(year(diacheckin))
  # check-out date
  diacheckout <- diacheckin + ddays(1)
  diaout <- as.numeric(day(diacheckout))
  mesout <- as.numeric(month(diacheckout))
  anoout <- as.numeric(year(diacheckout))
  qtd <- 250 # there is code that computes the exact quantity here, but it is not relevant to this problem
  banco_precos <- c()
  banco_precos_i <- c()
  for (j in seq(0, qtd, 25)) {
    url_number <- j
    # fetch the page
url <- curl(paste0('https://www.booking.com/searchresults.pt-br.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaCCIAQGYAS24ARfIAQzYAQHoAQH4AQuIAgGoAgO4AvCp5vIFwAIB&sid=b0ea1003a80543236a20e94559c4ed28&tmpl=searchresults&checkin_month=',mesin,'&checkin_monthday=',diain,'&checkin_year=',anoin,'&checkout_month=',mesout,'&checkout_monthday=',diaout,'&checkout_year=',anoout,'&city=-656888&class_interval=1&dest_id=-656888&dest_type=city&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&nflt=ht_id%3D203%3Bht_id%3D204%3Bht_id%3D206%3Bht_id%3D216%3B&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=searchresults&src_elem=sb&srpvid=3de2a5cdcd850113&ss=Natal&ss_all=0&ssb=empty&sshis=0&ssne=Natal&ssne_untouched=Natal&top_ufis=1&rows=25&offset=',url_number), "rb")
    # read the page
    page <- read_html(url)
    # hotel names
    nomes_i <- page %>%
      html_nodes(".sr-hotel__name") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}
    # room name
    quarto_i <- page %>%
      html_nodes(".room_link strong , .sold_out_property") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}
    # *when it reaches 10/09 and the error occurs, I stop the code, replace the code above with quarto_i <- NA, wait for September to download, and then resume from October onward with the code above, all manually*
    # prices
    precos_i <- page %>%
      html_nodes(".bui-price-display__value , .sold_out_property ") %>%
      html_text() %>%
      {if(length(.) == 0) NA else .}
    # build the dataframe for each page
    banco_precos_i <- data.frame(nomes_i, quarto_i, precos_i, stringsAsFactors = FALSE)
    # append to the dataframe for one day
    banco_precos <- rbind(banco_precos, banco_precos_i)
    # day price dataframe
    banco_precos_dia <- cbind(inicio, banco_precos, diacheckin)
    # pause execution in R for 3 seconds
    Sys.sleep(3)
  }
  banco_precos_final <- rbind(banco_precos_final, banco_precos_dia)
}
As someone managed to run this code, the error may be related to the response R received from the server. You can try checking that you received the appropriate response from the server (the package httr can help). – Tomás Barcellos
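A minimal sketch of the check suggested in the comment above, assuming httr is installed. The helper name buscar_pagina and its tentativas argument are hypothetical, not part of the original code; the idea is only to confirm the HTTP status before handing the body to read_html().

```r
library(httr)
library(rvest)

# Hypothetical helper (not in the original code): request a URL, retrying a
# few times, and return the parsed page, or NULL when the server answers
# with an HTTP error status (4xx or 5xx).
buscar_pagina <- function(url, tentativas = 3) {
  resp <- RETRY("GET", url, times = tentativas)
  if (http_error(resp)) {
    warning("HTTP ", status_code(resp), " for ", url)
    return(NULL)
  }
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}
```

Inside the loop, page <- buscar_pagina(url_string) could stand in for the curl() plus read_html() pair, with a skip when the result is NULL. Note that a successful status does not guarantee every hotel card is complete, so this only rules out one possible cause.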
The error message, however, suggests that some of the information you scraped was incomplete when banco_precos_i was being built. It says that one vector has 18 elements while another has 17. In this case you could add a check before creating banco_precos_i and think of a strategy to make the vectors the same length. – Tomás Barcellos
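One possible strategy for keeping the vectors the same length, sketched here on a toy page rather than on Booking.com itself: select each hotel card first, then use html_node() (singular) inside every card, which yields NA when a card lacks the element. The container class .sr_item is an assumption about the page markup and the toy HTML is invented for illustration; the other selectors come from the question.

```r
library(rvest)

# Toy page (invented): the second card has no <strong> inside .room_link,
# mimicking the hotel that breaks the original code.
html <- '
<div class="sr_item">
  <span class="sr-hotel__name">Hotel A</span>
  <a class="room_link"><strong>Double Room</strong></a>
  <div class="bui-price-display__value">R$ 200</div>
</div>
<div class="sr_item">
  <span class="sr-hotel__name">Hostel B</span>
  <a class="room_link">Dorm bed</a>
  <a class="room_link">Private room</a>
  <div class="bui-price-display__value">R$ 80</div>
</div>'
page <- read_html(html)

# One node per hotel card (".sr_item" is an assumed container class)
cards <- html_nodes(page, ".sr_item")

# html_node() returns exactly one result per card, missing or not,
# so the three columns always have length(cards) rows.
banco_precos_i <- data.frame(
  nomes_i  = html_text(html_node(cards, ".sr-hotel__name"), trim = TRUE),
  quarto_i = html_text(html_node(cards, ".room_link strong"), trim = TRUE),
  precos_i = html_text(html_node(cards, ".bui-price-display__value"), trim = TRUE),
  stringsAsFactors = FALSE
)
```

Because each row is built from a single card, a missing room name becomes NA in that row instead of shifting every later value up by one.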
Hello @Tomásbarcellos, yes, that's what I'm trying to solve. Did you manage to run the code? I'm using the package rvest for scraping; can you suggest some function of httr that could help me? I have not seen features in that package that differ from rvest for scraping. As I explained to @Tusca, while exploring his suggestions I discovered that on page 6 there is a specific hotel (Hostel Margo) whose values come through differently: it has no room information in .room_link strong, and it has two rooms in .room_link. Do you have any suggestions for solving this? – Jaylhane Veloso Nunes
One solution would be to "force" the code to merge the room information into a single element, e.g. paste(texto_quartos, collapse = " "). Once the scraping is done, you could think about how to handle it. – Tomás Barcellos
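The paste(..., collapse = ) idea from the comment above could look roughly like this when applied per hotel card. This is a sketch on invented HTML; the .sr_item container class and the " | " separator are assumptions, and texto_quartos is the name used in the comment.

```r
library(rvest)

# Toy page (invented): the second card lists two rooms.
html <- '
<div class="sr_item">
  <a class="room_link"><strong>Double Room</strong></a>
</div>
<div class="sr_item">
  <a class="room_link">Dorm bed</a>
  <a class="room_link">Private room</a>
</div>'
cards <- html_nodes(read_html(html), ".sr_item")  # assumed container class

# For each card, collapse every room name into a single string so that
# quarto_i has exactly one element per hotel.
quarto_i <- vapply(cards, function(card) {
  texto_quartos <- html_text(html_nodes(card, ".room_link"), trim = TRUE)
  if (length(texto_quartos) == 0) NA_character_
  else paste(texto_quartos, collapse = " | ")
}, character(1))
```

A separator that does not occur in room names makes it easier to split the string again after scraping.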
Another possibility would be to create a variable quarto2 in the table, which is almost always empty but would be filled in these cases. Or even add up the size of all the rooms and keep just that information. – Tomás Barcellos
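A rough sketch of the quarto2 idea, again on invented HTML with an assumed .sr_item container class. The names quarto2_i and n_quartos are hypothetical; n_quartos simply counts the rooms listed per card.

```r
library(rvest)

# Toy page (invented): the second card lists two rooms.
html <- '
<div class="sr_item">
  <a class="room_link"><strong>Double Room</strong></a>
</div>
<div class="sr_item">
  <a class="room_link">Dorm bed</a>
  <a class="room_link">Private room</a>
</div>'
cards <- html_nodes(read_html(html), ".sr_item")  # assumed container class

# List with one character vector of room names per card
quartos <- lapply(cards, function(card) {
  html_text(html_nodes(card, ".room_link"), trim = TRUE)
})

# Indexing past the end of a vector gives NA, so cards with a single room
# (or none) get NA in the corresponding column.
quarto_i  <- vapply(quartos, function(q) q[1], character(1))
quarto2_i <- vapply(quartos, function(q) q[2], character(1))
n_quartos <- lengths(quartos)  # number of rooms listed in each card
```

All three vectors have one element per card, so they can go into data.frame() alongside nomes_i and precos_i, provided those are also extracted per card.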
Or check how many rooms there are, and so on. But only you can determine the best strategy for your specific case. – Tomás Barcellos