Crawler to handle pagination

I need a crawler that paginates through a website.

I'm reading the page source and writing it to a txt file like this:

import java.io.BufferedReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public class CodFonte {

    public static void crawler(String str) throws IOException {

        URL url = new URL(str);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setReadTimeout(15 * 1000);
        connection.connect();

        // read the response from the server
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                connection.getInputStream()));

        String linha;

        String path = System.getProperty("user.home") + "\\Desktop\\";
        String fileName = "Fonte Code.txt"; // output file name

        FileWriter file = new FileWriter(path + fileName);
        PrintWriter gravarArq = new PrintWriter(file);
        gravarArq.println("SITE -------- " + url);

        while ((linha = reader.readLine()) != null) {
            gravarArq.println(linha);
        }
        // close the PrintWriter (not the FileWriter directly) so buffered
        // output is flushed before the file is closed
        gravarArq.close();
        reader.close();
    }
}
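
For reference, a minimal sketch of how the method can be invoked (the URL here is just a placeholder):

// Minimal usage sketch; the URL is a placeholder, any page works.
public class Main {
    public static void main(String[] args) throws java.io.IOException {
        CodFonte.crawler("http://www.example.com/");
    }
}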

But I need to move on to the next page, and the URL is friendly: it does not change, because the request is submitted through a form via POST.

  • What is the structure of the paging URL? Is it a number?

  • It is a number; the input sits inside a table: <input type="text" maxlength="4" maxsize="4" value="3" name="pag">

  • And can you already read this input with your Java code above?

  • Yes, I can get at it using Document doc = Jsoup.parse(html);

  • Well, I once wrote a crawler like this in PHP, and the idea is the same. The difference was that the page had a link to the next page at the end, so I "clicked" it and the current page became the next one. You will have to use a sleep (to wait between one page and the next) and a loop to move to the next page, plus a stop condition (e.g. maxPaginasToCrawl). First, find out how many pages there are by accessing them all and checking the HTTP status (code 200), and save the total number of pages in a variable to use in the loop. There is a sketch of this loop right after this comment thread.

  • The part about getting the total number of pages I have already solved: there is a div in which the page itself states the total, so I check for it and recover the number. Now, how do I submit the POST with the parameters?

  • So your problem is only moving to the next page? You can already get the source of the current page?

  • Yes, I can even generate a txt of the current page. What I need to know is how to submit the form via POST with this crawler; once I have the next page I can scan it.
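
Putting the ideas from these comments together, a minimal sketch of the loop might look like this. The div selector ("div.total") and the fetchPage() helper are hypothetical placeholders; the actual fetching is the POST request covered in the answer below.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PaginationLoop {

    public static void main(String[] args) throws Exception {
        // Read the total number of pages from the div the site exposes.
        // "div.total" is a hypothetical selector; adapt it to the real page.
        Document doc = Jsoup.parse(fetchPage(1));
        int totalPaginas = Integer.parseInt(doc.selectFirst("div.total").text().trim());

        int maxPaginasToCrawl = Math.min(totalPaginas, 50); // stop condition

        for (int pag = 2; pag <= maxPaginasToCrawl; pag++) {
            Thread.sleep(2000);           // wait between one page and the next
            String html = fetchPage(pag); // POST request; see the answer below
            // ... parse/save html here ...
        }
    }

    // Placeholder: fetch page "pag" and return its HTML (to be implemented
    // with the POST code from the answer below).
    private static String fetchPage(int pag) throws Exception {
        return "";
    }
}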

1 answer

Does making a POST request and reading the response back help?

HttpURLConnectionExample.java

package com.meupacote.app;

import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpURLConnectionExample {

    private final String USER_AGENT = "Mozilla/5.0";

    public static void main(String[] args) throws Exception {

        HttpURLConnectionExample http = new HttpURLConnectionExample();

        System.out.println("\nTesting 1 - send request via POST");
        http.sendPost();

    }

    // HTTP POST request
    private void sendPost() throws Exception {

        String url = "http://www.url.com/";
        URL obj = new URL(url);
        // the example URL is plain HTTP, so cast to HttpURLConnection
        // (an HttpsURLConnection cast would fail on an http:// URL)
        HttpURLConnection con = (HttpURLConnection) obj.openConnection();

        // add request headers
        con.setRequestMethod("POST");
        con.setRequestProperty("User-Agent", USER_AGENT);
        con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");

        String urlParameters = "param1=valor1&param2=valor2";

        // send the POST request
        con.setDoOutput(true);
        DataOutputStream wr = new DataOutputStream(con.getOutputStream());
        wr.writeBytes(urlParameters);
        wr.flush();
        wr.close();

        int responseCode = con.getResponseCode();
        System.out.println("\nSending 'POST' request to URL: " + url);
        System.out.println("POST parameters: " + urlParameters);
        System.out.println("Response Code: " + responseCode);

        BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()));
        String inputLine;
        StringBuffer response = new StringBuffer();

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();

        // print the result
        System.out.println(response.toString());

    }

}
  • Right, but the URL is friendly and does not change, so to get to the next page I need to send the request via POST and retrieve the new page that way.

  • I updated the answer. See if I got it right this time.

  • Good answer! But is there any special reason to use StringBuffer instead of StringBuilder? I also think you could use try-with-resources syntax to simplify the code and make it more robust.

  • Hold on, I believe this code will help a lot, since it shows how the parameters must be passed in the request. I think I'm passing the wrong parameters.

  • So you think the parameters are wrong, and that is why the response doesn't help you?

  • Yes, because when I execute my code it always returns the same page. I made a loop calling the POST request for the total number of pages. The site I need to crawl is this one: http://www.apinfo.com/apinfo/inc/list4.cfm. I need to store the job listings from this site, but I can't get past the first page.

  • Got it! So the POST URL is http://www.apinfo.com/apinfo/inc/list4.cfm and the parameters are:
    <input type="hidden" name="estado" value="99">
    <input type="hidden" name="tv" value="351">
    <input type="hidden" name="ddmmaa1" value="27/06/15">
    <input type="hidden" name="ddmmaa2" value="">
    <input type="hidden" name="onde" value="1">
    <input type="hidden" name="andor" value="1">
    <input type="hidden" name="keyw" value="">
    <input type="hidden" name="pag" value="4">

  • Just put those parameters into String urlParameters = "param1=valor1&param2=valor2"; and see what comes back... it will probably be the whole HTML of the page, ready to parse using part of your crawler's code (there is a sketch of this wiring after these comments).

  • Now it's clear, I'm going to implement it here. Thanks, man.

  • Don't mention it! I think I'll build an Apinfo crawler in C# as well! Cheers.

  • Man, I tested it here and it worked. I was doing something very complicated with Jsoup; the way you did it is simple and direct. Thanks.

  • Hey! Haha, perfect! That's great!

  • Can you give me a hand with another crawler? http://answall.com/questions/72459/crawler-para-fazer-login-no-site-da-nota-fiscal-paulista
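
Putting everything together, the final wiring might look like the sketch below. The parameter values come from the hidden inputs quoted above; the page total of 10 is a made-up stand-in for the value read from the page, and the helper uses StringBuilder and try-with-resources, as suggested earlier in the comments.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ApinfoCrawler {

    public static void main(String[] args) throws Exception {
        int totalPaginas = 10; // hypothetical; recover the real total from the page's div

        for (int pag = 1; pag <= totalPaginas; pag++) {
            // Hidden-input values quoted in the comments above; only "pag" changes.
            String urlParameters = "estado=99&tv=351&ddmmaa1=27/06/15&ddmmaa2="
                    + "&onde=1&andor=1&keyw=&pag=" + pag;
            String html = post("http://www.apinfo.com/apinfo/inc/list4.cfm", urlParameters);
            // ... parse the job listings out of html (e.g. with Jsoup) and store them ...
            Thread.sleep(2000); // pause between requests
        }
    }

    // Same idea as the answer's sendPost(), but returning the response body;
    // StringBuilder and try-with-resources per the comments above.
    private static String post(String url, String params) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(params.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        return response.toString();
    }
}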
