I'm trying to get the HTML of a page with Jsoup.
The site is protected by Cloudflare and, instead of the HTML of the page I'm interested in, I get the HTML of the Cloudflare page (see image below) that is displayed before the redirect to the target site. I need the HTML of the site that Cloudflare redirects to after that page.
Example of the Cloudflare page (not the site I'm after, just an illustration).
My code looks like this:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String... args) throws IOException {
        Document document = Jsoup.connect("http://site.com")
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();
        System.out.println(document.html());
    }
}
The output is similar to this:
<html>
<head>
<title>You are being redirected...</title>
<script> <!-- huge JS code --> </script>
</head>
<body></body>
</html>
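To tell in code whether a response is this intermediate page or the real site, a check along the following lines could be used. This is only a sketch under the assumption that the intermediate page always carries the title shown in the output above; the class name `InterstitialCheck` and the method `isInterstitial` are hypothetical names chosen for illustration, and it uses only the JDK so it works on any HTML held in a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InterstitialCheck {
    // Captures the text of the first <title> element, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumption: the intermediate page always uses this title, as in the output above.
    private static final String INTERSTITIAL_TITLE = "You are being redirected";

    /** Returns true if the HTML looks like the intermediate page rather than the target site. */
    public static boolean isInterstitial(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() && m.group(1).trim().startsWith(INTERSTITIAL_TITLE);
    }

    public static void main(String[] args) {
        String blocked = "<html><head><title>You are being redirected...</title></head><body></body></html>";
        String real = "<html><head><title>My site</title></head><body><p>Hello</p></body></html>";
        System.out.println(isInterstitial(blocked)); // prints "true"
        System.out.println(isInterstitial(real));    // prints "false"
    }
}
```

Passing `document.html()` from the code above to `isInterstitial` would show whether a given attempt actually reached the target site.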
I thought of setting followRedirects to true, but the documentation says that is already the default value. I found a question with the same title on Stack Overflow, but the problem there is a different one.
I also tried making two requests, the second one reusing the cookies from the first, but the result was the same and I land on the same page:
import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String... args) throws IOException {
        final String URL = "http://site.com/";

        // Execute the first request.
        Connection.Response response =
                Jsoup.connect(URL)
                        .timeout(10000)
                        .method(Connection.Method.GET)
                        .execute();

        // Grab the cookies from the response.
        Map<String, String> cookies = response.cookies();

        Document doc = Jsoup.connect(URL)
                .cookies(cookies) // Reuse the cookies in the second call.
                .get();
        System.out.println(doc.html()); // Fail! Cloudflare blocks me.
    }
}
I'd also accept an answer that doesn't use Jsoup, as long as it solves this problem. I don't need anything complex, only that the HTML is returned as a String.
Please post the link to the site.
– Brumazzi DB