How to get the HTML of a page protected by Cloudflare?

Viewed 464 times

-3

I’m trying to get the HTML of a page with Jsoup.

The page is protected by Cloudflare. Instead of the HTML of the site I’m interested in, I get the HTML of the Cloudflare page (see image below) that is displayed before the redirect to the target site. I need the HTML of the site that Cloudflare redirects to after that page.

Cloudflare page


Example of the Cloudflare page (not from the site I’m targeting, just an illustration).


My code is like this:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String...args) throws IOException {
        Document document = Jsoup.connect("http://site.com")
                                 .userAgent("Mozilla/5.0")
                                 .timeout(10000)
                                 .get();

        System.out.println(document.html());
    }
}

The output is similar to this:

<html>
 <head>
  <title>You are being redirected...</title> 
  <script> <!-- huge JS code --> </script>
 </head>
 <body></body>
</html>

I considered enabling redirect following (followRedirects(true)), but the documentation says that is already the default. I found a question with the same title on Stack Overflow, but the problem there is a different one.

I also tried making two requests, the second one using the cookies from the first, with the same result: I still land on the same page.

import java.io.IOException;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String...args) throws IOException{

        final String URL = "http://site.com/";

        // Executing the first request.
        Connection.Response response =
            Jsoup.connect(URL)
                 .timeout(10000)
                 .method(Connection.Method.GET)
                 .execute();

        // Getting the cookies from the response
        Map<String, String> cookies = response.cookies();

        Document doc = Jsoup.connect(URL)
                            .cookies(cookies) // Using the cookies in the 2nd call
                            .get();

        System.out.println(doc.html()); // Fail! Cloudflare blocks me.
    }
}

I will also accept an answer that does not use Jsoup, as long as it solves this problem. I don’t need anything complex, only that the HTML is returned as a String.

  • Can you share the link to the site?

2 answers

1

Analyzing Cloudflare’s HTML for the page http://lubbo-zone.nl, I came to the following conclusion: the JavaScript you mentioned in the question is an algorithm that uses some data from the page to perform a calculation and submits the result for validation. If the result is correct, you are redirected to the real page; otherwise you are sent back to the Cloudflare page in a loop. The calculation is obfuscated with jjencode and looks something like this:

+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]))

Since this obfuscated code changes with every request, you cannot get past Cloudflare without decoding it each time. I believe it can be done, but it is not trivial. If you are still interested in bypassing Cloudflare this way, here are some useful links about jjencode:

Encoding tool: http://utf-8.jp/public/jjencode.html

Explanation of how jjencode works: https://blog.korelogic.com/blog/2015/01/12/javascript_deobfuscation
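To make the sample expression less mysterious: it relies on JavaScript type coercion. Both `!+[]` and `!![]` evaluate to true, adding them counts them, and a trailing `+[]` turns the count into a string, so the parenthesized groups are concatenated as digits. The sketch below decodes an expression under that assumption only (groups made of `!+[]`, `!![]` and `+[]` terms, as in the sample). The class and method names are hypothetical, and this is not the full challenge, which by the description above also mixes in data from the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObfuscatedNumber {

    // Matches an innermost parenthesized group (one with no nested parentheses).
    private static final Pattern GROUP = Pattern.compile("\\(([^()]*)\\)");

    // Each of these tokens evaluates to true in JavaScript and counts as 1 when added.
    private static final Pattern TRUE_TOKEN = Pattern.compile("!\\+\\[\\]|!!\\[\\]");

    /**
     * Decodes an expression whose groups each encode one digit.
     * Assumption: the input only uses the token shapes seen in the sample expression.
     */
    static int decode(String expression) {
        StringBuilder digits = new StringBuilder();
        Matcher group = GROUP.matcher(expression);
        while (group.find()) {
            // Count the "true" tokens inside this group; the count is the digit.
            Matcher token = TRUE_TOKEN.matcher(group.group(1));
            int digit = 0;
            while (token.find()) {
                digit++;
            }
            // Groups are concatenated as text, mirroring the JavaScript string coercion.
            digits.append(digit);
        }
        return Integer.parseInt(digits.toString());
    }

    public static void main(String[] args) {
        System.out.println(decode("+((!+[]+!![]+!![]+!![]+[])+(!+[]+!![]))"));
    }
}
```

For the sample expression the first group counts to 4 and the second to 2, so the decoder returns 42, the same value a JavaScript engine produces for it.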

  • It does this on the first access and sets a cookie containing some information about my access. That is why I tried making two requests, using the cookies from the first in the second.

1


I ended up dropping Jsoup and using a headless browser instead. I chose HtmlUnit, and this is the code that solved the problem I was having:

import java.io.IOException;
import com.gargoylesoftware.htmlunit.*;

public class Main {
    public static void main(String...args) throws IOException {

        final String URL = "http://site.com/o/clouflare/bloqueando";

        Page page = new WebClient(BrowserVersion.BEST_SUPPORTED).getPage(URL);
        System.out.println(page.getWebResponse().getContentAsString()); // Done!
    }
}

One remark about HtmlUnit: it logs every validation error it finds in the document (HTML, CSS and JavaScript) through a Logger. To disable this, I followed that answer and added one line to my code:

Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
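For completeness, here is a small sketch of where that line can live, assuming `Logger` and `Level` are the `java.util.logging` classes (the answer does not show the imports). The class name `QuietHtmlUnitLogs` and the `silence()` helper are made up for illustration. The logger is kept in a static field because the JDK's LogManager only holds loggers weakly, so a logger configured without a lasting reference may be garbage collected and lose its level.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogs {

    // Strong reference so the configured logger is not garbage collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Turns off everything logged under the com.gargoylesoftware.htmlunit namespace. */
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        // Call this before creating the WebClient so no warnings are printed.
        silence();
        System.out.println(HTMLUNIT_LOGGER.getLevel());
    }
}
```

Calling `QuietHtmlUnitLogs.silence()` as the first statement of `main` in the HtmlUnit example should have the same effect as the one-liner above.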
