Hey, how’s it going, guys? I’m trying to learn data scraping on my own, and since my English doesn’t help much, I’m muddling through. Here’s the situation: when I run my code, it lists the athletes of the International Judo Federation, one below the other. I found that on each iteration it fetches all the athletes of a country at once, so the String becomes a single block containing every athlete from that country. I’d like to split it up and get one athlete at a time, but I couldn’t manage it. When printing to the console, each athlete appears on its own line, but when writing to the txt file it doesn’t: everything goes on the same line, and it only breaks when the country changes. Another thing I noticed is that it glues the end of one athlete’s name to the start of the next. For example, this is what the code saves to the txt file:
0_ Afghanistan ABDUL HADI Gada Khilafghan Zergulahmadi Ahmad Shabiralipoor Abdul Hadiarman Khaledaryan Mod Reshadassadi Yahyaassadi Rohullahbakhshi Mohammad tawfiqBAREKZAI Ahmad Hamedbayat Habibafaiz ZADA Ajmalfaizzada Ajmal FAZLI Abdul Fahimhussaini Atefahussaini Sayed Hussain
I’d like it to look like this:
0_ Afghanistan
ABDUL HADI Gada Khil
AFGHAN Zergul
AHMADI Ahmad Shabir
ALIPOOR Abdul Hadi
ARMAN Khaled
ARYAN Mod Reshad
ASSADI Yahya
ASSADI Rohullah
BAKHSHI Mohammad Tawfiq
BAREKZAI Ahmad Hamed
BAYAT Habiba
Here is my code.
package pack;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import org.jsoup.Jsoup;

public class Main {
    @SuppressWarnings("null")
    public static void main(String[] args) throws IOException{
        // In this block I fetch the countries and their codes and store them in bd, to use in the athlete-scraping URL.
        org.jsoup.nodes.Document doc = Jsoup.connect("https://www.ijf.org/judoka?name=&nation=all&gender=both&category=all").get();
        String paises = doc.select("option").text().replace(")", ")\n").replaceAll("All","").toString();
        int pos = paises.indexOf(")")+1;
        int quebra = paises.indexOf("\n")+1;
        int i=0;
        ArrayList<Nacoes>bd = new ArrayList<>();
        String pais = "a", sigla ;
        while (pais.length()>0) {
            Nacoes n = new Nacoes();
            pais = paises.substring(0,pos);
            sigla = pais.substring(pais.length()-4, pais.length()-1);
            pais = pais.substring(0,pais.length()-5);
            paises = paises.substring(quebra,paises.length());
            quebra = paises.indexOf("\n")+1;
            pos = paises.indexOf(")")+1;
            n.setPais(pais);
            n.setSigla(sigla);
            bd.add(n);
            i++;
            pais = paises.substring(0,pos);
        }
        File arquivo = new File("C:\\ifjAtletas.txt");
        FileWriter grava = new FileWriter(arquivo);
        PrintWriter escreve = new PrintWriter(grava);
        org.jsoup.nodes.Document doc2 = null;
        String inHtml = ("https://www.ijf.org/judoka?name=&nation=");
        String fimHtml = ("&gender=both&category=all");
        i=0;
        while(i<5) {
            doc2 = Jsoup.connect(inHtml+bd.get(i).sigla+fimHtml).get();
            String atletas = doc2.select("a").text().toString();
            atletas = atletas.substring(1247, (atletas.length())-54).replace(" "+bd.get(i).sigla+" ","\n");
            escreve.println(i+"_ "+bd.get(i).pais +"\n "+atletas+"\n");
            System.out.println((i+"_ "+bd.get(i).pais +"\n "+atletas+"\n"));
            i++;
        }
        escreve.close();
        grava.close();
    }
}
The Nacoes class has only two strings (pais and sigla) and their getters/setters.
Here is a piece of the site’s HTML, in case anyone wants to suggest an easier way.
<div class="results container-narrow">
    <a href="/judoka/33416" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/33416.jpg" alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">ADRIANO</div>
            <div class="given_name">Gabriel</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png" alt="">
                BRA
            </div>
        </div>
    </a>
    <a href="/judoka/1039" class="judoka">
        <div class="judoka__profile_image">
            <img class="" src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/profiles/200/1039.jpg" alt="">
        </div>
        <div class="judoka__info">
            <div class="family_name">AGUIAR</div>
            <div class="given_name">Mayra</div>
            <div class="country">
                <img src="https://78884ca60822a34fb0e6-082b8fd5551e97bc65e327988b444396.ssl.cf3.rackcdn.com/flags/20x15/bra.png" alt="">
                BRA
            </div>
        </div>
    </a>
Thank you guys, hugs.
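Given the markup pasted in the question, one possibly easier route is to let Jsoup select each athlete element instead of slicing the page text by fixed offsets. The sketch below is only an illustration: the class name JudokaNames and the method extractNames are made up for this example, the selectors a.judoka, .family_name and .given_name are taken from the HTML snippet in the question, and the inline HTML is a trimmed copy of that snippet (images omitted) so the example runs without a network call. It assumes the live page still uses the same class names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JudokaNames {
    // Hypothetical helper: returns one "FAMILY Given" string per <a class="judoka"> element.
    public static List<String> extractNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element judoka : doc.select("a.judoka")) {
            // Read each field from its own element, so names are never glued together
            String family = judoka.select(".family_name").text();
            String given = judoka.select(".given_name").text();
            names.add(family + " " + given);
        }
        return names;
    }

    public static void main(String[] args) {
        // Trimmed copy of the snippet from the question (assumption: the real page has the same structure)
        String html = "<div class=\"results container-narrow\">"
                + "<a href=\"/judoka/33416\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">ADRIANO</div><div class=\"given_name\">Gabriel</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "<a href=\"/judoka/1039\" class=\"judoka\"><div class=\"judoka__info\">"
                + "<div class=\"family_name\">AGUIAR</div><div class=\"given_name\">Mayra</div>"
                + "<div class=\"country\">BRA</div></div></a>"
                + "</div>";
        for (String name : extractNames(Jsoup.parse(html))) {
            System.out.println(name);
        }
    }
}
```

In the original program, the same loop could presumably run over doc2 for each country, writing one name per iteration, which would also remove the need for the hard-coded substring(1247, ...) offsets.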
The getProperty option is better because it is agnostic, i.e. independent of the operating system.
– StatelessDev
I’m not familiar with getProperty; where would it go? Adding the \r didn’t work.
– Joe Reis
You can concatenate it the same way as \r\n, for example: escreve.println(i+"_ "+bd.get(i).pais+System.getProperty("line.separator")+atletas+System.getProperty("line.separator"));
– Lucas Miranda
Man, thanks for the help. It didn’t work, but I think I may need to do more specific scraping. Hugs.
– Joe Reis
What does the atletas string look like when you get it? Something like player1 player2 player3?
– Lucas Miranda
In the console output the names appear one below the other; in the txt file everything is on the same line, without the \n. It only breaks the line when the country changes. The way I’m scraping, it doesn’t go athlete by athlete, it goes country by country.
– Joe Reis
Ah, found it: in addition to the \r\n, you also have to replace println with plain print. Please test it; I tried it here with a few things and it worked.
– Lucas Miranda
Haaaaa moleque, it worked, dude. Thank you very much, bro. I was getting discouraged. Excellent.
– Joe Reis
Great, Joe, thanks. I’ve updated the answer with the final solution; please mark it as accepted when you can, to help the community.
– Lucas Miranda
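For anyone landing here later, the line-break fix discussed in the comments can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code from the accepted answer: the class LineBreakDemo, the helper onePerLine, the temporary file and the sample string are all made up for the example (the athlete names are borrowed from the question). It replaces the " SIGLA " separator with the platform line separator instead of a bare "\n", and writes with print so that only the separators added explicitly end up in the file.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class LineBreakDemo {
    // Hypothetical helper: turns "A AFG B AFG C" into A, B and C on separate lines,
    // using the platform line separator ("\r\n" on Windows, "\n" on Linux/macOS).
    public static String onePerLine(String atletas, String sigla) {
        return atletas.replace(" " + sigla + " ", System.lineSeparator());
    }

    public static void main(String[] args) throws IOException {
        // Sample data shaped like the scraped text (assumption: names separated by the country code)
        String atletas = "ARMAN Khaled AFG ARYAN Mod Reshad AFG ASSADI Yahya";
        Path arquivo = Files.createTempFile("ijfAtletas", ".txt");

        try (PrintWriter escreve = new PrintWriter(
                Files.newBufferedWriter(arquivo, StandardCharsets.UTF_8))) {
            // print (not println): every line break written is one we add ourselves
            escreve.print("0_ Afghanistan" + System.lineSeparator());
            escreve.print(onePerLine(atletas, "AFG") + System.lineSeparator());
        }

        // Read the file back to check that each athlete landed on its own line
        List<String> linhas = Files.readAllLines(arquivo, StandardCharsets.UTF_8);
        for (String linha : linhas) {
            System.out.println(linha);
        }
    }
}
```

System.lineSeparator() returns the same value as System.getProperty("line.separator") mentioned in the comments, so either form should work here.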