How to get values from a column of multiple tables displayed on a web page?

Viewed 806 times

6

On a web page, there are one or more tables with information that I need to get in the form of a list.

Specifically, I need a list of the values in the 2nd column of a table on a web page whose address I provide. The pages below are examples.

Part of the Diário Oficial de MT publication of 17/12/2013
Link: https://www.iomat.mt.gov.br/do/navegadorhtml/mostrar.htm?id=630236&edi_id=3580

Part of the Diário Oficial de MT publication of 17/12/2013
Link: https://www.iomat.mt.gov.br/do/navegadorhtml/mostrar.htm?id=630237&edi_id=3580

Currently, I copy the table into a spreadsheet and then filter the list. A script (Python, Ruby or Perl) or program (Java or C#) that takes the link and returns the list would be a great help.

Pages with this type of content always follow one of the patterns above.

  • 1

    Which language, after all? Each language you mentioned has different tools for web crawling.

  • Did you try to make this script? What problem happened while you were trying? What code gave you trouble? What is the question exactly?

  • As I specified in the question, any script or program will do. Perhaps someone already has a similar solution.

  • 3

    Open the Chrome console (Tools > JavaScript Console) and type $x("//tr[position() > 1]/td[2]/p/span/text()").

  • Exactly, @rodrigorgs. You could post that as an answer. Thank you for your time.

  • I can’t post an answer while the question is on hold.

  • How could I rephrase the question to get it approved? I don’t see how I can be any clearer about what I need. I posted in the question the solution that I use (even if rough, I know), but wanted something more robust.

  • 1

    @ricidleiv, I thought your question was well formulated. You made it clear that any solution (with any language) is useful to you.

  • 3

    I believe that to improve the question you should make it more "generic", so that it is useful to a larger number of developers. For example, the title could be: "How to get values from a column of multiple tables displayed on a web page?" Then, within the question itself, you can be a little more specific. That way you broaden the scope of the question and still get your problem solved.

  • 2

    @Brunogasparotto's suggestion is great! That way everyone wins.

  • 1

    Thank you, @Brunogasparotto. I adopted the title you suggested and tried to rewrite the introduction of the question.

1 answer

9


Open the Chrome console (Tools > JavaScript Console; "Ferramentas > Console JavaScript" in the Portuguese interface) and type the following:

$x("//tr[position() > 1]/td[2]/p/span/text()")

This calls the JavaScript function $x (defined in the Chrome console), which returns the result of the XPath expression passed as its argument.

Explanation of the XPath expression

  • //tr[position() > 1]: selects every tr element on the page except the first tr within each parent, which skips the first (header) row of each table.
  • td[2]: selects only the second td element of the row (i.e., the second column); you can change the column number, or use td[position()=1 or position()=2] to select the first two columns, for example.
  • p/span: selects the span element inside the p element (this applies to the pages in your example; for other pages, check which elements are inside the td).
  • text(): selects the text content of the element.
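
To see how the parts of the expression fit together, here is a minimal sketch that runs it against a small hand-written table. It assumes REXML (distributed with Ruby) rather than a browser or Nokogiri, so that it runs offline; the sample markup, the constant SAMPLE, and the helper column_values are invented for illustration. REXML only accepts well-formed markup, so for real-world HTML an HTML parser is the safer choice.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup that mimics the structure assumed by
# the expression: a header row followed by data rows, with each cell's text
# wrapped in <p><span>...</span></p>.
SAMPLE = <<~HTML
  <html><body>
    <table>
      <tr><td><p><span>No.</span></p></td><td><p><span>Name</span></p></td></tr>
      <tr><td><p><span>1</span></p></td><td><p><span>Alice</span></p></td></tr>
      <tr><td><p><span>2</span></p></td><td><p><span>Bob</span></p></td></tr>
    </table>
    <table>
      <tr><td><p><span>No.</span></p></td><td><p><span>Name</span></p></td></tr>
      <tr><td><p><span>3</span></p></td><td><p><span>Carol</span></p></td></tr>
    </table>
  </body></html>
HTML

# Returns the text of the given column for every row except the first row of
# each table. `column` is 1-based, as in XPath.
def column_values(markup, column = 2)
  doc = REXML::Document.new(markup)
  xpath = "//tr[position() > 1]/td[#{column}]/p/span/text()"
  REXML::XPath.match(doc, xpath).map { |node| node.to_s.strip }
end

puts column_values(SAMPLE)
```

Changing the second argument, as in column_values(SAMPLE, 1), selects a different column, which is the only part of the expression that normally needs adjusting for tables with the same cell structure.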

Use in programming languages

The most practical solution in any programming language is to evaluate the XPath expression with an XML/HTML parsing library. Example in Ruby:

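require 'nokogiri'
require 'open-uri'
require 'openssl'

# Certificate verification is disabled for this request only (insecure);
# remove the ssl_verify_mode option if the site's certificate is valid.
html = URI.open(ARGV[0], ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE).read
doc = Nokogiri::HTML(html)
# Print the text of the second column of every row except the header row.
doc.xpath("//tr[position() > 1]/td[2]/p/span/text()").each { |x| puts x }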

Example of use

ruby script.rb 'https://www.iomat.mt.gov.br/do/navegadorhtml/mostrar.htm?id=630237&edi_id=3580'
  • 2

    I thought the solution was great, Rodrigo, but it would be even better if you could explain in detail what the command does (for example, which part represents the column to be searched). That would make it more "adaptable": it is a good solution for many people, but not everyone knows the syntax.
