Extract information from Lattes


Introduction

Since 1999, Brazilian researchers have had a website where they can post information about their academic careers. This information is known as Currículos Lattes (Lattes resumes). I want to download a few thousand of these resumes and, together with some collaborators, write an article about them.

This link goes to the resume of researcher Suzana Carvalho Herculano Houzel. Note that clicking the link takes the browser to a page with a captcha. This is my first problem: how do I get past it? I tried two different approaches, one using python and one using R.

python

Apparently there is a well-known python module called scriptLattes. In theory, it can download a batch of Lattes resumes, provided it is given a list of resume IDs (for example, the ID of the resume I mentioned above is 4706332670277273).

However, the module has not been updated since 2015, and since then Lattes has implemented a captcha on its pages. I think this breaks the module, because I tried to run one of its examples on my Ubuntu machine and got the following result:

$ ./scriptLattes.py ./exemplo/teste-01.config
[ROTULO]  [Sem rotulo]

[LENDO REGISTRO LATTES: 1o. DA LISTA]
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>

The command only stopped after I cancelled it manually with Ctrl+C. I imagine the problem is precisely the captcha, implemented after the last version of this module was published.

I have some experience with web scraping in python. I know the scrapy and beautifulsoup modules, but I am not an expert in either.
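One practical first step, whatever library ends up being used: after fetching a resume page, check whether the response is the actual resume or the captcha interstitial. A minimal sketch, using only the standard library; the marker strings are assumptions and should be replaced after inspecting the real captcha page.

```python
# Hypothetical check: decide whether a fetched page is the captcha
# interstitial rather than the resume itself. The marker strings below are
# guesses -- inspect the real captcha page to pick reliable ones.
CAPTCHA_MARKERS = ("captcha", "digite os caracteres")

def looks_like_captcha(html: str) -> bool:
    """Return True if the page content appears to be a captcha challenge."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

A crawler can call this after every download and pause, retry, or hand the page to a captcha-solving step instead of parsing garbage.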

R

R has a package called GetLattesData. However, the following notice is posted in its repository:

**ATTENTION: The package is not working as of 2017-11-26. The Lattes website, where the XML files were available, is offline.**

In fact, the server with the XML files has been offline since November of last year and never came back. I tested the package today and it still doesn't work.

I found other R packages that work with Lattes data, such as Cocholattes. The problem is that they require me to download the data manually, one resume at a time.

I have experience with web scraping in R using the rvest package.

Digger

The Digger site itself scrapes Lattes resumes. I contacted the site's team, and the data is not available for download. However, they sell access to their API through a credit system. I am against paying for information that is free, but if nothing else works, I may have to.

Conclusion

Note that my problem is not with organizing or scraping the data itself; it comes before that: how do I access the pages with the researchers' resumes in the first place? I have experience with data scraping, but I have never faced a problem like this one, with a captcha.

Also, I don't have a list of the IDs of all the resumes I want. Each resume has two unique IDs; in the case of this resume, the IDs are 4706332670277273 and K4727050Y3, each accessed through a different URL. Although the IDs are different, both pages have the same content.

What can be done in this case? I think getting the list of resumes I want is not difficult. This link has the addresses of more than 5 million Lattes resumes; I could crawl and scrape it to get the IDs I need.
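Once the IDs are collected, distinguishing the two formats and building the public URL is mechanical. A small sketch, assuming the `http://lattes.cnpq.br/<16-digit-id>` URL format quoted below; the pattern for the K-style ID is inferred from the single example K4727050Y3, and its URL format is not shown here, so the helper leaves it unresolved.

```python
import re

# Patterns inferred from the two example IDs in the question:
# 4706332670277273 (16 digits) and K4727050Y3 (K + 7 digits + letter + digit).
NUMERIC_ID = re.compile(r"^\d{16}$")
K_ID = re.compile(r"^K\d{7}[A-Z]\d$")

def resume_url(lattes_id: str):
    """Return the public resume URL for a numeric ID, or None for the
    K-style ID (whose URL format is not given in the question)."""
    if NUMERIC_ID.match(lattes_id):
        return "http://lattes.cnpq.br/" + lattes_id
    if K_ID.match(lattes_id):
        return None  # would need the alternate URL scheme for this form
    raise ValueError("unrecognized Lattes ID: %r" % lattes_id)
```

Since both IDs point at the same content, a crawler only needs to resolve whichever form the index page exposes.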

Therefore, my problem is downloading the resume data (i.e., pages like http://lattes.cnpq.br/4706332670277273) automatically, without having to fill in the captcha. How could I do this? R or python, either works for me.

  • Have you read about the data extractor?

  • Hello Marcus. My understanding is that the captcha exists precisely to prevent what you want to do (i.e., automated access by a machine). I suggest reading this page on CNPq's site and contacting them to request access. That can avoid problems (ethical or even legal) for you, and might even (or probably will) let you do what you need more easily. :)

  • I didn't know about this CNPq page on data extraction. As I am an employee of a higher-education institution, I will look into the procedures to request access to this data. Thank you both for the tip!

  • @Luizvieira I believe your comment should be an answer.

  • @hdiogenes Thanks for the remark, but I think the comment does not fit as an answer because it does not really answer what the OP wants (however mistaken I believe that goal is). :)

  • I broke the Lattes captcha a while ago. I believe it would not be hard to integrate with any of the crawlers you already have.

  • I managed to solve my problem with the help of my university's Superintendence of Informatics. They have an agreement with CNPq that gives them access to all the data that interests me. It is not the solution I wanted, since I will depend on an intermediary whenever I want to update the data, but it was the solution that could be obtained.

  • The Lattes platform has not been using a captcha for some time now. I don't know whether this information is still relevant, but I mention it in case someone is reading this post today.

  • At least as of January 2019, this is true for searches made to view researchers' resumes, but accessing the XML of those resumes still requires filling in a captcha.


1 answer


I don't know Lattes' specific captcha system, but I will try to give a "broad" solution.

In general, the ideal is to scrape just the HTML with requests and BeautifulSoup, as you mentioned (or with my new favorite library for this, requests-html). This method is preferable because it consumes little processing power and little bandwidth, since it consists only of HTML requests and parsing, without loading images, scripts, etc.
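To make the "HTML-only" approach concrete, here is a sketch using just the standard library (urllib + html.parser) in place of requests/BeautifulSoup, so it runs with no extra dependencies. The URL in the usage comment is the public resume address quoted in the question; the page encoding is an assumption.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(html: str) -> str:
    """Parse an HTML document and return its <title> text, stripped."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

# Usage (requires network access, and the captcha may intervene;
# "latin-1" is a guess at the page encoding):
# from urllib.request import urlopen
# html = urlopen("http://lattes.cnpq.br/4706332670277273").read().decode("latin-1")
# print(page_title(html))
```

The same pattern extends to any tag of interest; this is exactly the cheap request-and-parse loop the captcha is designed to block.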

Unfortunately, the captcha is made to prevent exactly this type of scraping, and it is effective. Getting around it requires a bit more technology. Selenium is a browser driver; that is, it provides you with a "zombie" browser and an API to control it programmatically (click this button, go to that address, etc.).

Even so, it does not get through captchas by itself. The solution is to find where the captcha is on the page, take a screenshot of the browser in that area, and then either use a computer-vision/OCR algorithm, if the captcha is weak, or use a captcha-breaking service (you send the image to the service's API and get back the text it contains).
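A sketch of the screenshot step: selenium exposes each element's `location` and `size` as dictionaries, from which you can compute the pixel box to crop the captcha out of a full-page screenshot (the box format below is the one PIL's `Image.crop` expects). The CSS selector in the usage comment is hypothetical.

```python
def crop_box(location: dict, size: dict) -> tuple:
    """Turn selenium's element.location ({'x', 'y'}) and element.size
    ({'width', 'height'}) into a (left, upper, right, lower) pixel box."""
    left, top = location["x"], location["y"]
    return (left, top, left + size["width"], top + size["height"])

# Typical selenium usage (untested sketch; the selector is a guess):
# from selenium import webdriver
# from PIL import Image
# driver = webdriver.Firefox()
# driver.get("http://lattes.cnpq.br/4706332670277273")
# elem = driver.find_element("css selector", "img.captcha")
# driver.save_screenshot("page.png")
# Image.open("page.png").crop(crop_box(elem.location, elem.size)).save("captcha.png")
```

The cropped image is then what you feed to the OCR step or to the captcha-breaking service's API.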

These options are obviously not ideal: running a browser uses far more resources, both on your machine and on the Lattes server, by downloading images, CSS and scripts; and captcha-breaking services, although cheap, are not free. It is worth examining the site to see if there is any way around the captcha.
