How to get the HTML of a page after JavaScript has loaded, using GuzzleHttp


Good morning,

I’m creating a crawler to access a specific page and then extract some specific data from it, but I’m running into problems.

Right now I’m running a test against Instagram; my code is as follows:

use GuzzleHttp\Client;

$client = new Client();

$request = $client->request('GET', 'https://www.instagram.com/user/');

return response()->json( $request->getBody() );

However, when I print getBody() it returns an empty {}. I also tried calling getContents() on it to get the data, as follows:

return response()->json( $request->getBody()->getContents() );

Using getContents() gives me back only a little HTML and the rest is JavaScript, so I believe the error may be in the way I am making the call.

1 answer



When you use Guzzle, the value returned by getBody() is an object.

With $client->request() (Guzzle 6+), that object is a GuzzleHttp\Psr7\Stream, which implements Psr\Http\Message\StreamInterface.

By default, json_encode() gives confusing results when you try to serialize an object that does not implement the JsonSerializable interface; since the stream has no public properties, all you get is an empty {}.

However, the stream class implements the magic method __toString(), which gives it a special behaviour: when cast, the object is treated as a string.
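To illustrate (a minimal sketch, assuming Guzzle 6+ and a hypothetical example URL), this is why response()->json( $request->getBody() ) came out as {}:

use GuzzleHttp\Client;

$client = new Client();
$body = $client->request('GET', 'https://example.com/')->getBody();

// The stream object has no public properties and does not implement
// JsonSerializable, so json_encode() produces nothing but: {}
echo json_encode($body), PHP_EOL;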

What can be done?

Cast the result to string to see the value returned by your request, like this:

use GuzzleHttp\Client;

$client = new Client();

// Small correction: 'request' is what you send, 'response' is what you get back
$response = $client->request('GET', 'https://www.instagram.com/user/');

var_dump((string) $response->getBody());

It is also important to remember that the purpose of Laravel’s response()->json() is to serialize values into JSON format, while your response body will most likely be HTML.

Be sure about what you intend to do with what Guzzle returns.

Depending on the situation, all you might need is this:

 return response((string) $response->getBody());
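One note on the stream behaviour (this is how Guzzle’s PSR-7 streams work in general, not something specific to Instagram): getContents() reads only from the stream’s current position, while the string cast rewinds to the beginning first, which is worth keeping in mind when the two seem to return different amounts of data:

use GuzzleHttp\Client;

$client = new Client();
$body = $client->request('GET', 'https://example.com/')->getBody();

$firstChunk = $body->read(100);     // consumes the first 100 bytes of the stream
$remainder  = $body->getContents(); // only what is left after that read
$everything = (string) $body;       // __toString() rewinds (when seekable) and returns the whole body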

"Using getContents() gives me back only a little HTML and the rest is JavaScript, so I believe the error may be in the way I am making the call."

I don’t know exactly what you’re trying to do, but if you want Instagram to give you JSON back, you might need to use the Instagram API.

Take a look at it here

  • Thanks for the reply, Wallace. Let me explain my situation a little: I would like to follow several specific profiles and, whenever they post an image, pull it into my system. I used the cast, but the result was not what I expected. I also noticed that it is possible to get the HTML after the JavaScript has loaded by using the "timeout" setting, but since Instagram uses React, the returned HTML does not contain the published images, and that is the only information I need. From a quick read of the API, I noticed that it is necessary to pass a token to get the data.

  • Yes, you need to pass the token. That would be the easiest way to get proper access to Instagram. The problem with writing a crawler is that they can change the page at any time and your code breaks, whereas the API is versioned. But if you still insist on building a crawler, maybe you can use DOMDocument to read the src attribute of the <img> tags (see the sketch after these comments).

  • Got it, Wallace. I’m aware of the problem of the page changing. About the DOMDocument approach: what happens is that Guzzle, even with a 5-second timeout set, is not getting the images; it returns some unknown numbering in place of the <img> attributes. Would you know how to solve this?

  • @Henriquesouzagoncalves That is because Instagram must be returning HTML that uses JavaScript to render the page. There is no easy solution to this. I’ve heard that Ghost JS solves it, but I can’t say for sure, because I’ve never used it.

  • All right, Wallace, I’ll look into that. If it doesn’t work I’ll have to go with the API, but I’ll keep waiting for other answers. Thanks for your help!
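A rough sketch of the DOMDocument idea from the comments above (the profile URL is just a placeholder, and this only finds images that are actually present in the HTML Guzzle receives, which is not the case for JavaScript-rendered pages like Instagram’s):

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'https://example.com/some-profile/');
$html = (string) $response->getBody();

$dom = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings about imperfect real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();

$sources = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $sources[] = $img->getAttribute('src');
}

var_dump($sources); // every src found in the markup that was actually served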
