Data scraping with jsoup and saving to a txt file

Hi everyone. I'm trying to learn data scraping on my own, and since my English isn't much help, I'm getting by as best I can. Here's the situation: when I run my code, it lists the athletes of the International Judo Federation, one below the other. I noticed that each iteration fetches all the athletes of a country at once, so the String becomes a single block containing every athlete from that country. I'd like to split it up and get one athlete at a time, but I couldn't manage it. When printing to the console, the names appear one below the other, but when writing to the txt file they don't: everything ends up on the same line, and a line break only occurs when the country changes. I also noticed that the end of one athlete's name gets glued to the start of the next. For example, this is how the code currently saves the txt:

0_ Afghanistan ABDUL HADI Gada Khilafghan Zergulahmadi Ahmad Shabiralipoor Abdul Hadiarman Khaledaryan Mod Reshadassadi Yahyaassadi Rohullahbakhshi Mohammad tawfiqBAREKZAI Ahmad Hamedbayat Habibafaiz ZADA Ajmalfaizzada Ajmal FAZLI Abdul Fahimhussaini Atefahussaini Sayed Hussain

I'd like it to look like this:

0_ Afghanistan

ABDUL HADI Gada Khil

AFGHAN Zergul

AHMADI Ahmad Shabir

ALIPOOR Abdul Hadi

ARMAN Khaled

ARYAN Mod Reshad

ASSADI Yahya

ASSADI Rohullah

BAKHSHI Mohammad Tawfiq

BAREKZAI Ahmad Hamed

BAYAT Habiba

Here is my code.

package pack;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import org.jsoup.Jsoup;
public class Main {

    @SuppressWarnings("null")
    public static void main(String[] args) throws IOException{
        // In this block I fetch the countries and their codes and insert them into bd, to use in the athlete-scraping URL.
        org.jsoup.nodes.Document doc = Jsoup.connect("https://www.ijf.org/judoka?name=&nation=all&gender=both&category=all").get();
        String paises = doc.select("option").text().replace(")", ")\n").replaceAll("All","").toString();
        int pos = paises.indexOf(")")+1;
        int quebra = paises.indexOf("\n")+1;
        int i=0;    

        ArrayList<Nacoes>bd = new ArrayList<>();
        String pais = "a", sigla ;
        while (pais.length()>0) {
            Nacoes n = new Nacoes();
            pais = paises.substring(0,pos);
            sigla = pais.substring(pais.length()-4, pais.length()-1);
            pais = pais.substring(0,pais.length()-5);            
            paises = paises.substring(quebra,paises.length());
            quebra = paises.indexOf("\n")+1;
            pos = paises.indexOf(")")+1;
            n.setPais(pais);
            n.setSigla(sigla);
            bd.add(n);      
            i++;
            pais = paises.substring(0,pos);
        }


        File arquivo = new File("C:\\ifjAtletas.txt");   
        FileWriter grava = new FileWriter(arquivo);
        PrintWriter escreve = new PrintWriter(grava);

        org.jsoup.nodes.Document doc2 = null;
        String inHtml = ("https://www.ijf.org/judoka?name=&nation=");
        String fimHtml = ("&gender=both&category=all");
        i=0;

        while(i<5) {
            doc2 =  Jsoup.connect(inHtml+bd.get(i).sigla+fimHtml).get();
            String atletas = doc2.select("a").text().toString();
            atletas = atletas.substring(1247, (atletas.length())-54).replace(" "+bd.get(i).sigla+" ","\n");
            escreve.println(i+"_ "+bd.get(i).pais +"\n "+atletas+"\n");

            System.out.println((i+"_ "+bd.get(i).pais +"\n "+atletas+"\n"));
            i++;
        }


        escreve.close();
        grava.close();
    }
}

The Nacoes class has only two strings (pais and sigla) and their getters/setters.

Here is a piece of the site's HTML, in case anyone wants to suggest an easier way.

<div class="results container-narrow">
    <a href="/judoka/33416" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/33416.jpg" alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">ADRIANO</div>
            <div class="given_name">Gabriel</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png" alt="">
                BRA
            </div>
        </div>
    </a>
    <a href="/judoka/1039" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/1039.jpg" alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">AGUIAR</div>
            <div class="given_name">Mayra</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png" alt="">
                BRA
            </div>
        </div>
    </a>
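
Given markup like the excerpt above, one possibly easier route is to select each a.judoka element and read its family_name and given_name children separately, instead of flattening every link into one string and cutting it by position. The following is only a sketch under assumptions: it needs jsoup on the classpath, it parses a trimmed inline copy of the excerpt rather than the live page, and the class name JudokaSelectorSketch and the method athleteNames are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original code.
public class JudokaSelectorSketch {

    // Returns one "FAMILY Given" string per <a class="judoka"> element.
    static List<String> athleteNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element judoka : doc.select("a.judoka")) {
            // select(...).text() yields an empty string if the child is missing
            String family = judoka.select(".family_name").text();
            String given = judoka.select(".given_name").text();
            names.add((family + " " + given).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        // Inline HTML trimmed from the excerpt above, so no network access is needed.
        String html = "<div class=\"results container-narrow\">"
                + "<a href=\"/judoka/33416\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">ADRIANO</div>"
                + "<div class=\"given_name\">Gabriel</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "<a href=\"/judoka/1039\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">AGUIAR</div>"
                + "<div class=\"given_name\">Mayra</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "</div>";

        Document doc = Jsoup.parse(html);
        for (String name : athleteNames(doc)) {
            System.out.println(name); // one athlete per line
        }
    }
}
```

With a Document obtained from Jsoup.connect(...).get(), as in the code above, the same method could be called on doc2, and each athlete would arrive as its own list entry, so no substring offsets are needed.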

Thanks, guys. Cheers.

1 answer

From what I found while researching, the problem is on this line:

escreve.println(i+"_ "+bd.get(i).pais +"\n "+atletas+"\n");

When using PrintWriter on Windows, the correct approach is to use "\r\n" (not just "\n") to break lines, together with print instead of println, so the line would become:

escreve.print(i+"_ "+bd.get(i).pais +"\r\n"+atletas+"\r\n");

Another way to do this line break is by using System.getProperty("line.separator");
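
As a rough illustration of the separator idea, the sketch below replaces the country code between athletes with System.lineSeparator() (which returns the same value as System.getProperty("line.separator")) before writing with a PrintWriter. It is a sketch under assumptions, not the asker's code: the class name LineSeparatorSketch, the method onePerLine, the sample string, and the temporary output file are all hypothetical, and it assumes the scraped text has the shape "NAME CODE NAME CODE".

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original code.
public class LineSeparatorSketch {

    // Turns "NAME1 CODE NAME2 CODE" into one name per line,
    // using the platform line separator instead of a hard-coded "\n".
    static String onePerLine(String flatText, String countryCode) {
        String sep = System.lineSeparator();
        String text = flatText.trim();
        // Drop the country code that trails the last athlete, if present
        if (text.endsWith(" " + countryCode)) {
            text = text.substring(0, text.length() - countryCode.length() - 1);
        }
        return text.replace(" " + countryCode + " ", sep);
    }

    public static void main(String[] args) throws IOException {
        // Sample data shaped like the flattened text of two athlete links (assumption)
        String flat = "ADRIANO Gabriel BRA AGUIAR Mayra BRA";

        // Temporary file used only for this demonstration
        Path out = Files.createTempFile("atletas", ".txt");
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            writer.println("0_ Brazil");              // println appends the platform separator
            writer.println(onePerLine(flat, "BRA"));
        }

        // Read the file back to check that each athlete is on its own line
        for (String line : Files.readAllLines(out)) {
            System.out.println(line);
        }
    }
}
```

Because every break in the written text comes from the platform separator, the file should show one athlete per line in editors that do not treat a bare "\n" as a line break.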

  • The getProperty option is better because it is platform-agnostic, i.e. independent of the operating system.

  • I'm not familiar with getProperty; where would I put it? Adding the \r didn't work.

  • You can concatenate it the same way as \r\n. For example: escreve.println(i+"_ "+bd.get(i).pais+System.getProperty("line.separator")+atletas+System.getProperty("line.separator"));

  • Man, thanks for the help. It didn't work, but I think I may need to do more specific scraping. Cheers.

  • What does the atletas string look like when it reaches you? Something like player1 player2 player3?

  • When printed to the console, the names appear one below the other; in the txt file, everything is on the same line, without the \n. It only breaks the line when the country changes. The way I'm scraping, it doesn't grab athlete by athlete; it grabs country by country.

  • Ah, found it. In addition to using \r\n, you also have to replace println with just print. Please test it; I tried it here with a few things and it worked.

  • Haha, it worked, dude! Thank you very much, bro. I was getting discouraged. Excellent.

  • Great, thanks. I've updated the answer with the final solution; mark it as accepted when you can, to help the community.
