How to use the RSelenium remote driver in R on a proxy-protected computer?


Well, I need to access a site from my work network, but the network is protected by a proxy.

Some sites work with the httr and rvest packages, others do not. Logging in to a site, for example, is something I cannot do. Example:

library(httr)
library(rvest)

# Placeholder values: "minha.proxy" = my proxy host, porta = port,
# "meuusuario"/"minhasenha" = my username/password
pro        <- use_proxy("minha.proxy", porta, "meuusuario", "minhasenha")
my_session <- html_session(url, pro)

I usually use this proxy configuration to access the URL I want and get through the proxy.

But on certain sites, such as those that require a login, this approach does not work: I simply cannot log in.

The alternative I found was to use a remote driver via rsDriver(browser = c("chrome")), for example. On my personal PC I can run all the code through the RSelenium remote driver, but on my work network I cannot. The best options I found while researching were:

1)

cprof  <- list(chromeOptions = list(
  args = c('--proxy-server=http://minha.proxy:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)

2)

cprof  <- list(chromeOptions = list(
  args = c('--proxy-server=http://ip:porta',
           '--proxy-auth=usuario:senha')))
driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)

Both attempts are meant to pass the proxy settings to the browser, but they all return:

checking Selenium Server versions:
BEGIN: PREDOWNLOAD
Error in open.connection(con, "rb") : 
  Timeout was reached: Connection timed out after 10000 milliseconds

This is the error that usually appears when the proxy is not passed (I think!).
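One assumption worth testing: the PREDOWNLOAD timeout occurs while R itself tries to download the Selenium server binaries, before the browser (and therefore chromeOptions) is even involved. A minimal sketch, reusing the placeholder values from the question ("usuario", "senha", "minha.proxy", "porta"), is to point the R session at the proxy before calling rsDriver():

```r
# Sketch, not a confirmed fix: make the R session itself use the proxy,
# so that the Selenium binary pre-download can reach the internet.
# "usuario", "senha", "minha.proxy" and "porta" are placeholders from the
# question and must be replaced with real values.
Sys.setenv(
  http_proxy  = "http://usuario:senha@minha.proxy:porta",
  https_proxy = "http://usuario:senha@minha.proxy:porta"
)

# With the session behind the proxy, retry the call from the question:
# driver <- rsDriver(browser = c("chrome"), extraCapabilities = cprof)
```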

So, is there any way to get through the proxy and open my remote driver? If you have anything to contribute, I'd be grateful!

  • Have you already tried with another browser?

  • Yes Daniel, Firefox and PhantomJS too, but they fail with the same error.

  • Have you tried something like this: https://stackoverflow.com/a/29663818/3297472

  • This kind of thing is very hard to debug. I don't have any proxy server to test with!

  • I tried that too; the problem is that the phantom function no longer seems to work: Error: phantom is now defunct. Users can drive PhantomJS via selenium using the RSelenium::rsDriver function or directly using wdman::phantomjs. I tried switching to wdman::phantomjs, but I was unsuccessful because I didn't fully understand it.

  • I agree that it is very difficult; I think each proxy server has its own particularities, which makes it even harder.

  • Try using the Node library puppeteer.js with R; maybe you can manage it with that API. I wrote something about getting started with puppeteer.js here. The puppeteer.js API is here.

  • Thanks @Jdemello. With this method I managed it, but before running it I had to change the environment variables. Would it be appropriate to post the solution I found as an answer to my own question?

  • Yes, do; I find it a pertinent question. Thank you.

  • Can you use Docker on the work server? I think the containerized version of Selenium would solve the problem.


1 answer


I found a solution to my problem!

Since I'm on an institutional network, I need a proxy to access the Internet. For RStudio to use the proxy, you need to define it within the IDE (in the function you are going to use, as in the question) or change the environment variables (as in Reference 1).

That's what I did; I inserted the environment variables (Reference 2):

variable name: http_proxy
variable value: https://user_id:password@your_proxy:your_port/

variable name: https_proxy
variable value: https://user_id:password@your_proxy:your_port
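As an alternative to the Windows settings, the same variables can be set per session from within R (the values below are the same placeholders as above). This only affects the current session, and processes started from R with system() inherit the variables, so Node/puppeteer also sees the proxy:

```r
# Per-session equivalent of the Windows environment variables above;
# "user_id", "password", "your_proxy" and "your_port" are placeholders.
Sys.setenv(
  http_proxy  = "http://user_id:password@your_proxy:your_port/",
  https_proxy = "https://user_id:password@your_proxy:your_port"
)

# Child processes inherit these, which can be checked with, e.g.:
# system2("node", c("-e", "console.log(process.env.http_proxy)"))
```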

This was the first step. Then I followed the steps described by @Jdemello in Reference 3. Basically, what I did was download and install Node.js, then install puppeteer.js, create a file in Notepad named scrape_mustard.js (see the contents of the file in Reference 3), and run scrape_mustard.js with Node via the system() function in RStudio to create the page.

The script follows:

setwd("C:\\Program Files\\nodejs")
# NOTE: I had to change the working directory to the folder on the C: drive
# where Node.js was installed.

## system("npm i puppeteer")  ## this command installed Puppeteer

library(magrittr)
system("node scrape_mustard.js")  ## run scrape_mustard.js and create the page I need

html <- xml2::read_html("~/PAGINA/page.html")  ## read the html

html %>% 
  rvest::html_nodes("h1")  ## capture what is inside the h1 tags

Difficulties:

  • Since I installed Node on the C: drive, the working directory in RStudio had to be changed to that folder;
  • scrape_mustard.js (the name can be changed) also had to be moved to the nodejs folder on the C: drive;
  • The page definition must be done inside scrape_mustard.js, i.e., the file has to be edited before every run (which can be done with writeLines()), but if it is in the nodejs folder on the C: drive (as in my case), you will need administrator permission.
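The editing-and-permissions problem in the last bullet can be sidestepped by generating scrape_mustard.js from R with writeLines() into a user-writable folder. This is only a sketch: the puppeteer calls follow the usual launch/goto/content pattern, and the target URL is a placeholder:

```r
# Sketch: write scrape_mustard.js from R so the target page can be changed
# without hand-editing a file under C:\Program Files (which needs admin rights).
js_path    <- file.path(tempdir(), "scrape_mustard.js")  # user-writable location
target_url <- "https://example.com"                      # placeholder page

writeLines(sprintf('
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page    = await browser.newPage();
  await page.goto("%s");
  require("fs").writeFileSync("page.html", await page.content());
  await browser.close();
})();', target_url), js_path)

# system2("node", js_path)  # then: xml2::read_html("page.html")
```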

NOTE: I haven't worked on submitting the login page yet, but I've managed to get the page I wanted, which wasn't possible before. Maybe I carried out the steps from the references in the wrong way, but the first step has been taken, so I thought it fair to share.

Alternatively, I will try to use Docker, as suggested by @José; I'm still studying it. I hope I've been clear! Thank you, guys!
