I need a crawler that handles pagination on a website.
At the moment I read the page's source code and write it to a .txt file, like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Paths;

public class CodFonte {
    public static void crawler(String str) throws IOException {
        URL url = new URL(str);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setReadTimeout(15 * 1000); // give up if the server stalls for 15 s
        connection.connect();

        // Output file on the user's desktop
        String fileName = "Fonte Code.txt"; // file name
        String path = Paths.get(System.getProperty("user.home"), "Desktop", fileName).toString();

        // try-with-resources closes (and flushes) both streams, even on error
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(connection.getInputStream()));
             PrintWriter gravarArq = new PrintWriter(path)) {
            gravarArq.println("SITE -------- " + url);
            String linha;
            // Copy the server response to the file line by line
            while ((linha = reader.readLine()) != null) {
                gravarArq.println(linha);
            }
        }
    }
}
But I need to go to the next page. The site uses friendly URLs, so the URL does not change from page to page; the paging form is submitted via POST.
What is the structure of the paging URL? Is it a number?
– Felipe Douradinho
It is a number. The input is inside a table: <input type="text" maxlength="4" maxsize="4" value="3" name="pag"></input>
– roque
And can you already read this input with your Java code above?
– Felipe Douradinho
Yes, I can get to it using Document doc = Jsoup.parse(html);
– roque
Well, I once built a crawler like this in PHP, and the idea was the same. The difference is that I already had a link to the next page at the end of the page, so I followed it and the next page became the current page. You will need a sleep call (to wait between one page and the next) and a loop to move on to the next page, plus a stop condition (e.g. maxPaginasToCraw). First, you have to find out how many pages there are, by requesting each of them and checking the HTTP status (code 200), and save the total in a variable to use in the loop.
– Felipe Douradinho
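The loop described in the comment above could look roughly like the sketch below. It is only an illustration under assumptions: the class name PaginationLoop, the method crawlAll, and the parameter maxPagesToCrawl (playing the role of the stop condition) are made up for this example, and the fetcher is a placeholder for whatever code actually downloads a page. The main method uses a stub fetcher so nothing touches the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Illustrative sketch only: names here are hypothetical, not from the question.
class PaginationLoop {

    /**
     * Visits pages 1..N in order, sleeping between requests.
     * "fetcher" is a placeholder that returns the HTML for a page number.
     */
    static List<String> crawlAll(IntFunction<String> fetcher, int totalPages,
                                 int maxPagesToCrawl, long sleepMillis) {
        List<String> pages = new ArrayList<>();
        // Stop condition: never go past the site's total or our own cap.
        int lastPage = Math.min(totalPages, maxPagesToCrawl);
        for (int page = 1; page <= lastPage; page++) {
            pages.add(fetcher.apply(page));
            if (page < lastPage) {
                try {
                    Thread.sleep(sleepMillis); // wait between one page and the next
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag, stop early
                    break;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stub fetcher: the site reports 5 pages, but we cap the crawl at 3.
        List<String> pages = crawlAll(n -> "<html>page " + n + "</html>", 5, 3, 10);
        System.out.println(pages.size()); // prints 3
    }
}
```

In a real crawler the fetcher would be the code that requests page n from the site; the sleep interval and the cap are values to tune for the target server.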
I already have the total number of pages: there is a div in which the page itself states the total, so I wrote a condition to extract it. Now, how do I submit the POST with the parameters?
– roque
So your problem is just getting to the next page? You can already fetch the source of the current page?
– Felipe Douradinho
Yes, I can even generate a .txt of the current page. What I need to know is how to submit the form via POST from this crawler; once I have the next page, I can scan it.
– roque
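One possible way to submit the paging form with the same HttpURLConnection used in the question is sketched below. It rests on assumptions the thread does not confirm: that the form posts to the URL passed in, that pag (the name of the input shown above) is the only field the server needs, and that no cookies or hidden fields are required. The class FormPost and its methods are hypothetical names for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical.
class FormPost {

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Encodes parameters as application/x-www-form-urlencoded, e.g. "pag=4". */
    static String encodeForm(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    /** Sends the parameters as a POST body and returns the response HTML. */
    static String postForm(String pageUrl, Map<String, String> params) throws IOException {
        byte[] body = encodeForm(params).getBytes(StandardCharsets.UTF_8);
        HttpURLConnection connection = (HttpURLConnection) new URL(pageUrl).openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true); // we are sending a request body
        connection.setReadTimeout(15 * 1000);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body);
        }
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: FormPost <url> <page>");
            return;
        }
        Map<String, String> params = new LinkedHashMap<>();
        params.put("pag", args[1]); // field name taken from <input name="pag">
        System.out.println(postForm(args[0], params));
    }
}
```

If the real form carries other inputs (hidden fields, for instance) or depends on a session cookie, those would have to be added to the parameter map or the request headers. Since Jsoup is already in use for parsing, its own connection API may be a shorter route to the same request.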