How can a web crawler access pages that require authentication?


I need to develop a web crawler that accesses a page (which requires a login, and I have the credentials) and the "robot" would find all the links on that page and list them somewhere, which could be a memo or even a txt file. It would be a process similar to the Firefox DownThemAll plugin. The site's authentication is simple, done via HTTPS. But I also have the option of typing a captcha to reach the page with the files.
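To illustrate, what I want in the end is something like the sketch below (in PHP just as an example; I have not settled on a language): it fetches one page and writes every link it finds to a txt file. The part I am missing is the authentication; the URL and the file name here are placeholders.

<?php
// Sketch of the "list all links" step: fetch one page and write every href to links.txt.
// The page URL and the output file name are placeholders.
$html = file_get_contents('https://exemplo.com/pagina-com-arquivos');

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely perfectly well-formed
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '') {
        $links[] = $href;
    }
}

file_put_contents('links.txt', implode(PHP_EOL, $links));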

  • This question is very broad; I do not think it can be answered as it stands. Almost any language can be used to implement a crawler, and there are several ready-made libraries to assist in the process. As for accessing pages that require a login, I think it is possible (assuming you have the login credentials), but it will depend a lot on how the site authenticates and how it maintains sessions. Please edit your question to add more context, so maybe we can help you better.

  • Hello @mgibsonbr, I made an edit; I hope I was clear. Thanks.

  • It is better now; I would answer, but I don't think I know enough about the subject. One thing I don't quite understand: when you say "authentication via HTTPS", do you mean with client-side certificates? And does the captcha have to be typed every time, or is there an alternative form of authentication?

  • I have two options: one would be logging in so I can access the page with the information, and the other would be the captcha. For example, I access the IRS website with a CPF; if I have credentials for the site I do not need to type the captcha to get the information back, but if I do not have the credentials it is necessary. In my case I have the credentials.

  • In that case, it comes back to what I said in my first comment: you need to know how the site authenticates (probably through a POST request) and how it maintains the session (probably through a cookie). I don't know many crawling libraries, but there is probably one that supports that sort of thing.

  • Does this program have any language requirement?

  • @Miguelangelo no, the language is not a problem.

  • The site keeps a credential in my browser and the session expires in 30 minutes. I believe I should keep the cookies.

  • What language will be used?

  • @GT8 None yet; I am researching which one best fits my needs. I am thinking of Perl, Python, or even Pascal (for which I found great tools).

  • Why so many downvotes? Could whoever thinks the question needs to be improved please leave a comment?

  • @mgibsonbr Although I did not give the -1, I think you yourself said what was wrong in your first comment: (almost) any language could be used, each site may have a different method of receiving the login data, and "it is simple" gives no detail about how it is actually done.


1 answer

I have some PHP crawlers that access pages that require credentials. It depends on each case, since each site has its own form of authentication; in my case, I know the forms involved. For example, I access a website whose login page contains the following form:

<form class="onclick-submit card grid-3" accept-charset="utf-8" method="post" action="https://painel2.oculto.net/conectorPainel.php" id="frmLogin" >
    <input class="hidden" type="text" name="email" id="txtUserName" value="[email protected]" />
    <input class="hidden" type="password" name="senha" id="txtPassword" value="senha" />
    <input class="hidden" type="checkbox" name="permanecerlogado" tabindex="6" id="chkRemember" checked="checked" />
    <input class="hidden" type="hidden" value="login" name="acao" />
    ...
</form>

In this case, my PHP crawler authenticates with the site before processing the content:

// cURL here is a small wrapper class around PHP's curl_* functions (not shown);
// post() sends the given fields in the body of an HTTP POST request.
$curl = new cURL();
$curl->post('https://painel2.oculto.net/conectorPainel.php', '[email protected]&senha=senha&permanecerlogado=1&acao=login');

The site will create a session for my subsequent visits and the program will have privileged access. I don't even check the site's response, since the chances of a login failure are minimal; if access is denied for some other reason (a connection failure, the server being down, etc.), the program simply stops and tries again later.

Most websites therefore require only three basic pieces of information:

  • login
  • password
  • URL
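
A minimal sketch of that idea using only PHP's built-in curl_* functions instead of my wrapper class (the credential values, the cookie file path and the protected-page URL below are placeholders; the field names follow the form shown above):

<?php
// Sketch: log in with a POST request and reuse the session cookie on later requests.
// The field names (email, senha, permanecerlogado, acao) mirror the form above;
// the credential values, the cookie file path and the second URL are placeholders.
$cookieJar = '/tmp/crawler-cookies.txt';

$ch = curl_init('https://painel2.oculto.net/conectorPainel.php');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'email'            => 'usuario@exemplo.com',
        'senha'            => 'senha',
        'permanecerlogado' => 1,
        'acao'             => 'login',
    ]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_COOKIEJAR      => $cookieJar,   // write the session cookie here
    CURLOPT_COOKIEFILE     => $cookieJar,   // and send it back on later requests
]);
curl_exec($ch);
curl_close($ch);

// Any subsequent request that reuses the same cookie jar gets the "logged in" pages.
$ch = curl_init('https://painel2.oculto.net/pagina-protegida-de-exemplo.php');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $cookieJar,
]);
$paginaLogada = curl_exec($ch);
curl_close($ch);

The CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE pair plays the role of the browser here: it stores whatever session cookie the site sets on login and presents it again on the next visits, which is exactly the "keep the cookies" point raised in the comments.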

But it certainly does not work for every site, since some create tokens for each session (icloud.com, for example) or use some algorithm that makes automation difficult. Those cases require manual, site-specific programming.
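When the obstacle is just a per-session token embedded in the login form, one common workaround is to request the login page first, pull the hidden field out of the HTML and send it along with the credentials. A sketch, assuming the token sits in a hidden input (the field name "token" and the URLs here are hypothetical):

<?php
// Sketch: fetch the login page, extract a hidden per-session token and include it
// in the login POST. The field name "token" and the URLs are hypothetical.
$cookieJar = '/tmp/crawler-cookies.txt';

$ch = curl_init('https://exemplo.com/login');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $cookieJar,
    CURLOPT_COOKIEFILE     => $cookieJar,
]);
$loginPage = curl_exec($ch);
curl_close($ch);

// Find the hidden input that carries the token.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($loginPage);
libxml_clear_errors();

$token = '';
foreach ($dom->getElementsByTagName('input') as $input) {
    if ($input->getAttribute('name') === 'token') {
        $token = $input->getAttribute('value');
    }
}

// POST the credentials together with the token, reusing the same cookie jar.
$ch = curl_init('https://exemplo.com/login');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'email' => 'usuario@exemplo.com',
        'senha' => 'senha',
        'token' => $token,
    ]),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => $cookieJar,
    CURLOPT_COOKIEFILE     => $cookieJar,
]);
curl_exec($ch);
curl_close($ch);

If the token is generated by JavaScript instead of being present in the HTML, this approach is not enough; that is the kind of case that needs the manual work mentioned above.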
