I’m trying to download all the PDFs from the Official Gazette of the DF (DODF) for academic research. The site is in the public domain, so there is nothing wrong with doing this.
The problem is that downloading them click by click would drive me crazy (there are more than 700 DODFs per year), so I’m using wget.
After a lot of research I ended up with this command line, using --post-data, because the site is dynamic (it uses a form and ASP):
wget -r --no-parent --post-data "ano=2005&mes=11_Novembro" http://www.buriti.df.gov.br/ftp/default.asp
Of course, this way I’ll have to run wget once for each month of each year. It’s still tedious, but much less work than downloading them one by one.
The command ended up like this because the site is in ASP and uses a form to pass the month and the year and return the corresponding gazettes. The run above reaches the right page (the structure wget receives is that of the correct month, and wget finds PDF files in subdirectories).
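To avoid typing the command once per month, the runs could be scripted. A minimal sketch in Python, assuming wget is on the PATH and that the other months follow the same "11_Novembro" naming pattern seen above; the exact values the form expects are an assumption and may need checking against the site:

import subprocess

# Assumed month values, following the "11_Novembro" pattern from the command above.
MESES = ["01_Janeiro", "02_Fevereiro", "03_Marco", "04_Abril", "05_Maio", "06_Junho",
         "07_Julho", "08_Agosto", "09_Setembro", "10_Outubro", "11_Novembro", "12_Dezembro"]

for ano in range(2005, 2007):  # adjust the range of years as needed
    for mes in MESES:
        # Same wget invocation as above, just parameterized per month/year.
        subprocess.run([
            "wget", "-r", "--no-parent",
            "--post-data", f"ano={ano}&mes={mes}",
            "http://www.buriti.df.gov.br/ftp/default.asp",
        ], check=False)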
The problem is that when it tries to download the PDFs, the server returns the error:
405 - Method Not Allowed.
If anyone can help me, I’d be grateful.
Please translate your question; this is the Stack Overflow in Portuguese.
– user28595
Sorry, I didn’t know. I registered on the site in English and it seems I was automatically transferred to the one in Portuguese. I will translate it right away. I apologize again.
– shk19
It would be worth adding to the question which operating system you are using, or which programming languages could be used to help with the task. A shell or console script could help. Although, looking more closely at the address, it seems to me that a small crawler would be the way to go, since there is the POST to filter the main list of links, and then the "sublinks" to the files themselves are fetched.
– Bacco
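To illustrate the crawler idea, a minimal sketch in Python: it POSTs the form with the ano and mes fields seen in the question, then downloads the PDFs with plain GET requests. How the PDF links appear in the returned HTML is an assumption, so the regex may need adjusting:

import re
import urllib.parse
import urllib.request

BASE = "http://www.buriti.df.gov.br/ftp/"

def baixar_mes(ano, mes):
    # Step 1: POST the form to get the listing page for the chosen month/year.
    dados = urllib.parse.urlencode({"ano": ano, "mes": mes}).encode()
    with urllib.request.urlopen(BASE + "default.asp", data=dados) as resp:
        html = resp.read().decode("latin-1", errors="replace")

    # Step 2: extract the links to PDF files (assumed to appear as href="...pdf").
    links = re.findall(r'href="([^"]+\.pdf)"', html, flags=re.IGNORECASE)

    # Step 3: download each PDF with a plain GET (no POST, so no 405).
    for link in links:
        url = urllib.parse.urljoin(BASE, link)
        nome = url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, nome)

baixar_mes("2005", "11_Novembro")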
Thanks for the quick response! I am using Windows and a standalone build of wget (the commands are being run via CMD). I know it’s very amateurish, but my programming is a little rusty. If there is no alternative command line for wget, I will look into how to build a crawler. I found the suggestion excellent. Thank you very much.
– shk19
Yes, I found a way! In any case, I came back to thank you for the promptness and the suggestions! It turned out that when you mentioned a crawler, I realized that wget, although it did not return the PDFs, did save a file with the links. I used a console command to concatenate all the files into one and converted it to TXT. Then I just dropped it into Excel, cleaned it up, and extracted the direct links to the PDFs (which do not give the HTTP 405 error). I put the links into an HTML file and went back to wget. That way it worked! Big hug to everyone and thank you so much!
– shk19
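For the record, the Excel step could also be scripted. A minimal sketch of the same workflow, assuming the pages wget saved are under a www.buriti.df.gov.br directory (the default layout for wget -r) and that the PDF links appear as href attributes: it collects the .pdf links from those files, writes them to links.txt, and hands the list to wget -i, which downloads them with plain GET requests and therefore avoids the 405 error.

import pathlib
import re
import subprocess
import urllib.parse

BASE = "http://www.buriti.df.gov.br/ftp/"
links = set()

# Scan every page wget saved and collect links ending in .pdf.
for pagina in pathlib.Path("www.buriti.df.gov.br").rglob("*"):
    if pagina.is_file():
        texto = pagina.read_text(encoding="latin-1", errors="replace")
        for link in re.findall(r'href="([^"]+\.pdf)"', texto, flags=re.IGNORECASE):
            links.add(urllib.parse.urljoin(BASE, link))

# Write the list and let wget fetch everything with plain GET requests.
pathlib.Path("links.txt").write_text("\n".join(sorted(links)), encoding="utf-8")
subprocess.run(["wget", "-i", "links.txt"], check=False)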