I have some PHP crawlers that access pages that require credentials. The details vary from case to case, since each site has its own authentication form. In my case, I know the required forms in advance. For example, to access a site whose login page contains the following form:
<form class="onclick-submit card grid-3" accept-charset="utf-8" method="post" action="https://painel2.oculto.net/conectorPainel.php" id="frmLogin" >
<input class="hidden" type="text" name="email" id="txtUserName" value="[email protected]" />
<input class="hidden" type="password" name="senha" id="txtPassword" value="senha" />
<input class="hidden" type="checkbox" name="permanecerlogado" tabindex="6" id="chkRemember" checked="checked" />
<input class="hidden" type="hidden" value="login" name="acao" />
...
</form>
In this case, my PHP crawler authenticates with the site before processing the content:
$curl = new cURL(); // third-party cURL wrapper class
$curl->post('https://painel2.oculto.net/conectorPainel.php',
            'email=[email protected]&senha=senha&permanecerlogado=1&acao=login');
The site will create a session for my subsequent visits, and the program will have privileged access. I don't even check the site's response, since the chances of a login failure are minimal; if the request is denied for some other reason (connection failure, server down, etc.), the program interrupts execution and tries again later.
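For reference, the same login can be sketched with PHP's built-in cURL extension instead of a wrapper class. This is a minimal sketch under my assumptions: the site sets a session cookie on a successful login, and the cookie-jar path is illustrative.

```php
<?php
// Build the POST body from the form's field names (the same fields
// as in the HTML form above).
function buildLoginBody(string $email, string $senha): string
{
    return http_build_query([
        'email'            => $email,
        'senha'            => $senha,
        'permanecerlogado' => 1,
        'acao'             => 'login',
    ]);
}

// Illustrative only: performs the login and stores the session cookie
// in a cookie jar, so subsequent requests in the same run reuse it.
function login(string $email, string $senha): void
{
    $ch = curl_init('https://painel2.oculto.net/conectorPainel.php');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildLoginBody($email, $senha),
        CURLOPT_COOKIEJAR      => '/tmp/crawler-cookies.txt', // save cookies here
        CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt', // send them back
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    curl_exec($ch);
    curl_close($ch);
}
```

Using `CURLOPT_COOKIEJAR`/`CURLOPT_COOKIEFILE` pointed at the same file is what keeps the session alive across requests.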
Most websites therefore require only three basic pieces of information: the URL the form posts to, the names of the form fields, and valid credentials.
But it certainly doesn't work for every site, as some create tokens for each session (e.g. icloud.com) or use some algorithm that makes automation difficult. These cases require custom programming.
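For the simpler token case, where the login page embeds a one-time value in a hidden input, one common approach is to fetch the page first and extract the value before posting. A hypothetical sketch (the field name `token` is an assumption, not from any site above):

```php
<?php
// Pull the value of a hidden input out of a login page's HTML.
// Assumes the name attribute appears before value, as in the form above.
function extractHiddenField(string $html, string $field): ?string
{
    $pattern = '/<input[^>]*name="' . preg_quote($field, '/')
             . '"[^>]*value="([^"]*)"/i';
    return preg_match($pattern, $html, $m) ? $m[1] : null;
}
```

The extracted value would then be appended to the POST body alongside the other fields. For heavily scripted sites, a regex like this is fragile and a real HTML parser (e.g. `DOMDocument`) is the safer choice.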
This question is very broad; I don't think it can be answered as it stands. Any language can be used to implement a crawler, and there are several ready-made libraries to assist in the process. As for accessing pages that require login, I think it is possible (assuming you have the login credentials), but it will depend a lot on how the site authenticates and maintains its sessions. Please edit your question to add more context, so maybe we can help you better.
– mgibsonbr
Hello @mgibsonbr, I made an edit; I hope I was clear. Thanks.
– MDomingues
It's better now; I'd answer, but I don't think I know enough about it. And I don't quite understand: when you say "authentication via https", do you mean with certificates on the client side? And does the captcha have to be used every time, or is there an alternative form of authentication?
– mgibsonbr
I have 2 options: one would be logging in so I can access the page with the information, and the other would be the captcha. For example, I access the IRS website with a CPF; if I have credentials for the site I don't need to type the captcha to get the information back, but if I don't have the credentials it is required. In my case I have the credentials.
– MDomingues
In that case, it goes back to what I said in my first comment: you need to know how the site authenticates (probably through a POST request) and how it maintains the session (probably through a cookie). I don't know many crawling libraries, but there is probably one that supports that sort of thing.
– mgibsonbr
Does this program have any language requirement?
– Miguel Angelo
@Miguelangelo no, the language is not a problem.
– MDomingues
The site keeps a credential in my browser and the session expires in 30 minutes. I believe I should keep the cookies.
– MDomingues
What language will be used?
– user3813
@GT8 None yet; I'm researching which best fits my needs. I'm thinking of Perl, Python, or even Pascal (for which I found great tools).
– MDomingues
Why so many downvotes? Could someone leave a comment if you think the question needs to be improved, please?
– mgibsonbr
@mgibsonbr although I didn't give the -1, I think you yourself said what was wrong in your first comment: (almost) any language could be used, each site may have a different method of submitting login data, and "it's simple" doesn't give details of how it is actually done.
– woliveirajr