Introduction
Since 1999, Brazilian researchers have had a website where they can post information about their academic careers. These records are known as Currículos Lattes (Lattes resumes). I want to download a few thousand of these resumes and, together with some collaborators, write an article about them.
This link goes to the resume of researcher Suzana Carvalho Herculano Houzel. Note that clicking on the link directs the browser to a page with a captcha. This is my first problem: how do I get past it? I tried two different approaches, one in Python and one in R.
Python
There is a well-known Python module called scriptLattes. In theory, it can download a set of Lattes resumes, provided it is given a list of resume IDs (for example, the ID of the resume I linked above is 4706332670277273).
However, the module has not been updated since 2015, and since then Lattes has added a captcha to its pages. I think this breaks the module, because I tried to run one of its examples on my Ubuntu machine and got the following result:
$ ./scriptLattes.py ./exemplo/teste-01.config
[ROTULO] [Sem rotulo]
[LENDO REGISTRO LATTES: 1o. DA LISTA]
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
<urlopen error [Errno 110] Connection timed out>
<urlopen error [Errno -2] Name or service not known>
The command only stopped after I canceled it manually with Ctrl+C. I imagine the problem is precisely the captcha, which was introduced after the last version of this module was published.
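Before debugging the module itself, it may help to confirm what the server is actually returning. A minimal stdlib-only sketch that fetches a resume page and applies a simple heuristic for detecting a captcha challenge — the `looks_like_captcha` check (searching for the word "captcha" in the markup) is my assumption, not a documented behavior of the site:

```python
import urllib.request

def looks_like_captcha(html):
    """Heuristic: treat the page as a captcha challenge if the word
    'captcha' appears anywhere in the markup (an assumption, not a
    documented marker -- verify against a real response)."""
    return "captcha" in html.lower()

def fetch_resume(resume_id, timeout=30):
    """Download the public resume page for a 16-digit Lattes ID and
    return the decoded HTML."""
    url = "http://lattes.cnpq.br/{}".format(resume_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

For example, `looks_like_captcha(fetch_resume("4706332670277273"))` would tell you whether the request is being challenged at all, or whether the timeouts above are a network problem instead.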
I have some experience with web scraping in Python. I know the scrapy and beautifulsoup modules, but I am not an expert in either.
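For what it's worth, the parsing step (separate from the captcha problem) is routine. A stdlib-only sketch, assuming the researcher's name appears in the page's `<title>` element — an assumption about the page layout that should be checked against a real resume page:

```python
import re

def extract_title(html):
    """Return the text of the first <title> element, or None if absent.
    On Lattes resume pages the title is assumed (not guaranteed) to
    carry the researcher's name."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

The same pattern generalizes to other fields once the real page structure is known; beautifulsoup would be the natural upgrade for anything more involved than this.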
R
R has a package called GetLattesData. However, the following notice is posted in its repository:
**ATTENTION: The package is not working as of 2017-11-26. The Lattes website, where the XML files were available, is offline.**
Indeed, the server with the XML files has been offline since November of last year and never came back. I tested the package today and it still does not work.
I found other R packages that work with Lattes data, such as Cocholattes. The problem is that they require me to download the data manually, logging in one resume at a time.
I have experience with web scraping in R, using the rvest package.
Escavador
The Escavador website itself scrapes the Lattes resumes. I contacted the site's team, and the data are not available; however, they sell access to their API through a credit system. I am against paying for information that is free, but if nothing else works, I may have to.
Conclusion
Note that my problem is not even the organization and scraping of the data itself; it comes before that: how do I access the pages with the researchers' resumes? I have experience with data scraping, but I have never faced a problem like this, involving a captcha.
Also, I do not have a list of the IDs of all the resumes I want. Each resume has two unique IDs. In the case of this resume, the IDs are 4706332670277273 and K4727050Y3, each accessed through a different URL:
Although the IDs are different, both pages have the same content.
What can I do in this case? I think getting the list of resumes I want is not hard. This link lists the addresses of more than 5 million Lattes resumes. I could crawl and scrape it to get the IDs I need.
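That crawl could feed a simple ID extractor. A sketch, assuming the crawled pages contain plain links of the form http://lattes.cnpq.br/ followed by a 16-digit ID (the K-type IDs, like K4727050Y3, would need a separate pattern):

```python
import re

# Matches the numeric resume ID in links like
# http://lattes.cnpq.br/4706332670277273
ID_PATTERN = re.compile(r"lattes\.cnpq\.br/(\d{16})")

def extract_ids(html):
    """Collect the unique 16-digit resume IDs found in a crawled page,
    preserving the order of first appearance."""
    seen = set()
    ids = []
    for resume_id in ID_PATTERN.findall(html):
        if resume_id not in seen:
            seen.add(resume_id)
            ids.append(resume_id)
    return ids
```

Running this over each listing page and accumulating the results would produce the ID list that tools like scriptLattes expect as input.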
So my problem is downloading the resume data (i.e., pages like http://lattes.cnpq.br/4706332670277273) automatically, without having to fill in the captcha. How could I do this? R or Python, either works for me.
Have you read about the data extractor?
– Woss
Hello Marcus. My understanding is that the captcha exists precisely to prevent what you want to do (i.e., automated access by a machine). I suggest reading this page on CNPq's site and contacting them to request access. That can spare you problems (ethical or even legal), and might even (or probably will) let you do what you need more easily. :)
– Luiz Vieira
I did not know about this CNPq page on data extraction. Since I am an employee of a higher-education institution, I will look into the procedure for requesting access to this data. Thank you both for the tip!
– Marcus Nunes
@Luizvieira I believe your comment should be an answer.
– hdiogenes
@hdiogenes Thanks for the remark, but I think the comment does not work as an answer, because it does not really answer what the OP wants (however mistaken I believe it is to do what he wants). :)
– Luiz Vieira
I broke the Lattes captcha a while ago. I believe it would not be difficult to integrate it with any of the crawlers you already have.
– Begnini
I managed to solve my problem with the help of my university's Informatics Superintendency. They have an agreement with CNPq that gives them access to all the data I am interested in. It is not the solution I wanted, because I will depend on an intermediary whenever I want to update the data, but it is the solution that was possible.
– Marcus Nunes
The Lattes platform has not used a captcha for some time now. I do not know whether this information is still relevant, but I am leaving it here in case someone reads this post today.
– Cadu
At least as of January 2019, this is true for searches made to view researchers' resumes, but access to the XML of those resumes still requires filling in a captcha.
– Marcus Nunes