Return list of files contained in a web page


I need to pass the link of a page to a Node.js program and have it return the names of all the .bz2 files listed on that page. The page is the one shown in the image.

Image showing the files I need

It would be something like:

    request('https://dumps.wikimedia.org/ptwiki/20190801/').then((data) => {
        console.log(data)
    });

and the result I need would be:

    [
        "ptwiki-20190801-pages-articles-multistream1.xml-p220p95098.bz2",
        "ptwiki-20190801-pages-articles-multistream-index1.txt-p220p95098.bz2",
        "ptwiki-20190801-pages-articles-multistream2.xml-p95101p442463.bz2",
        "ptwiki-20190801-pages-articles-multistream-index2.txt-p95101p442463.bz2",
        "ptwiki-20190801-pages-articles-multistream3.xml-p442475p1428483.bz2",
        "ptwiki-20190801-pages-articles-multistream-index3.txt-p442475p1428483.bz2",
        "ptwiki-20190801-pages-articles-multistream4.xml-p1428492p2522162.bz2",
        "ptwiki-20190801-pages-articles-multistream-index4.txt-p1428492p2522162.bz2",
        "ptwiki-20190801-pages-articles-multistream5.xml-p2522163p4022163.bz2",
        "ptwiki-20190801-pages-articles-multistream-index5.txt-p2522163p4022163.bz2",
        "ptwiki-20190801-pages-articles-multistream5.xml-p4022163p4362684.bz2",
        "ptwiki-20190801-pages-articles-multistream-index5.txt-p4022163p4362684.bz2",
        "ptwiki-20190801-pages-articles-multistream6.xml-p4362698p5862698.bz2",
        "ptwiki-20190801-pages-articles-multistream-index6.txt-p4362698p5862698.bz2",
        "ptwiki-20190801-pages-articles-multistream6.xml-p5862698p6052937.bz2",
        "ptwiki-20190801-pages-articles-multistream-index6.txt-p5862698p6052937.bz2"
    ]

A plain array would do, or it could be JSON. I am using Node.js. I searched the internet but found nothing about this.

1 answer


Hello. Since you listed only the .bz2 files, I added a filter that shows only those. I noticed that you did not list all of them, but I did not filter any further, since you did not specify whether you need only a certain date or all of them (that would not be hard to add, though).

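    (() => {
        const cheerio = require('cheerio')
        const request = require('request')

        // Download the HTML of the dump index page
        request('https://dumps.wikimedia.org/ptwiki/20190801/', (error, response, body) => {
            if (error) {
                console.error(error)
                return
            }

            // Parse the HTML so the links can be queried with jQuery-style selectors
            const $ = cheerio.load(body)

            // Print the text of every link that mentions a .bz2 file
            $('a').each(function () {
                if ($(this).text().includes('.bz2')) {
                    console.log($(this).text())
                }
            })
        })
    })()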

Just to explain:

    (() => {
        // This is a self-executing function, so it runs as soon as the script is run.
        // Adapt the code inside it however you need.
    })()

request is needed to fetch the HTML of the page.

cheerio is needed to manipulate the HTML more easily, with an API similar to jQuery's.
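Since the question asks for an array rather than console output, here is a minimal sketch of one way to return the names as a list. It deliberately avoids both libraries: it assumes Node 18 or later, where `fetch` is available globally, and it uses a regular expression on `href` attributes instead of cheerio, which is cruder than a real HTML parser but is enough for a simple index page. The function names `listBz2Files` and `listBz2FilesAt` are made up for this example.

```javascript
// Hypothetical helper: collect the names of all .bz2 files linked from an HTML string.
function listBz2Files(html) {
  const names = new Set();
  // Match the value of every href attribute (double or single quoted).
  const hrefPattern = /href\s*=\s*(["'])(.*?)\1/gi;
  for (const match of html.matchAll(hrefPattern)) {
    // Drop any query string or fragment, then keep only the last path segment.
    const path = match[2].split(/[?#]/)[0];
    const name = path.substring(path.lastIndexOf('/') + 1);
    if (name.endsWith('.bz2')) {
      names.add(name); // the Set removes duplicate links to the same file
    }
  }
  return [...names];
}

// Hypothetical wrapper: download a page and list its .bz2 files.
// Assumes Node 18+, where fetch is available without extra packages.
async function listBz2FilesAt(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return listBz2Files(await response.text());
}
```

It could then be called as `listBz2FilesAt('https://dumps.wikimedia.org/ptwiki/20190801/').then(console.log)`. To keep only some of the files, apply an ordinary `Array.prototype.filter` to the returned array.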
