How to get the Gmail page source using Python 3

I am accessing my email with this code, which I found and adapted:

import requests
from bs4 import BeautifulSoup

# Placeholder credentials; replace with your own.
form_data = {'Email': '[email protected]', 'Passwd': 'senhaexemplo'}
post = "https://accounts.google.com/signin/challenge/sl/password"

def login():
    with requests.Session() as s:
        # Load the login page and copy its hidden form fields into form_data.
        soup = BeautifulSoup(s.get("https://mail.google.com").text, "html.parser")
        for inp in soup.select("#gaia_loginform input[name]"):
            if inp["name"] not in form_data:
                # Some inputs have no value attribute; default to "".
                form_data[inp["name"]] = inp.get("value", "")
        s.post(post, data=form_data)
        # The "#inbox" fragment is never sent to the server.
        html = s.get("https://mail.google.com/mail/u/0/#inbox").text
        print(html)

My goal is to fetch the emails and print them on screen, with subject and content, and I know how to do that by selecting certain HTML tags. But for that I need the page's source, and the result of print(html) contains no tags at all; everything comes back compressed, something like this:

{\"1\":\"be_35\",\"53908043\":0},{\"1\":\"be_36\",\"53908043\":0},{\"1\":\"be_30\",\"53908043\":0},{\"1\":\"be_31\",\"53908043\":0},{\"1\":\"be_169\",\"53908043\":0},{\"1\":\"su_ltz\"},{\"1\":\"ic_sspvcd\"},{\"1\":\"bu_wdtfsm\"},{\"1\":\"be_26\",\"53908043\":0},{\"1\":\"be_29\",\"53908043\":0},{\"1\":\"be_280\",\"53908043\":0},{\"1\":\"be_281\",\"53908043\":0},{\"1\":\"30\",\"53908046\":0},{\"1\":\"31\",\"53908043\":0},{\"1\":\"32\",\"53908046\":0},{\"1\":\"33\",\"53908046\":0},{\"1\":\"be_277\",\"53908043\":0},{\"1\":\"34\",\"53908045\":\"\"},{\"1\":\"be_278\",\"53908043\":0},{\"1\":\"35\",\"53908046\":0},{\"1\":\"be_275\",\"53908043\":0},{\"1\":\"be_276\",\"53908043\":0},{\"1\":\"be_273\",\"53908043\":1},{\"1\":\"38\",\"83947487\":{}},{\"1\":\"se_192\",\"53908045\":\"en,es,pt,ja,fr\"},{\"1\":\"be_274\",\"53908043\":0},{\"1\":\"39\",\"53908046\":0}

How can I get the right source code?

  • Wouldn't it be easier to use IMAP or an API instead of building a crawler?

  • I'm using this method for study purposes, looking for ways to solve the problem without taking the easiest route. I think the problem in my case is the JSON encryption, but I might be wrong.
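For comparison, the IMAP route suggested in the comment above could look roughly like the sketch below, which uses only the standard library. The function names `extract_subject_and_body` and `fetch_latest` are hypothetical, and the host `imap.gmail.com` plus the assumption that IMAP access is enabled on the account (Gmail typically expects an app password or OAuth rather than the normal password) are not taken from the question.

```python
import email
import imaplib
from email import policy


def extract_subject_and_body(raw_bytes):
    """Parse one raw RFC 822 message and return (subject, body_text)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    subject = msg["Subject"] or ""
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    body = part.get_content() if part is not None else ""
    return str(subject), body


def fetch_latest(user, password, host="imap.gmail.com", count=5):
    """Yield (subject, body) for the newest `count` messages in INBOX.

    The host and the credential style are assumptions; adjust them to
    whatever your mail provider actually requires.
    """
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select("INBOX", readonly=True)  # read-only: do not mark as seen
        _status, data = conn.search(None, "ALL")
        for num in data[0].split()[-count:]:
            _status, parts = conn.fetch(num, "(RFC822)")
            yield extract_subject_and_body(parts[0][1])
    finally:
        conn.logout()


# Usage (needs network access and real credentials):
# for subject, body in fetch_latest("user@example.com", "app-password"):
#     print(subject)
#     print(body)
```

Splitting parsing from fetching keeps the part that prints subject and content independent of how the message was obtained.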

1 answer

Not to rain on your parade, but sites that rely on AJAX do not return their content as HTML in the initial response; they generate it dynamically after the page loads, using JavaScript. You would need a radically different approach, such as PhantomJS, which loads all of the page's auxiliary files and executes the JavaScript, so that you can then inspect the DOM and extract the content.
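To make that concrete, here is a minimal sketch of what a plain HTTP client sees on such a page. The `STATIC_HTML` sample, the `DATA` variable name, and the helper names are hypothetical; only the escaped fragment is modelled on the output quoted in the question. It suggests two things: the static response carries no message markup to select, and the "compressed" text is JSON escaped inside a JavaScript string rather than encrypted data.

```python
import json
from html.parser import HTMLParser

# Hypothetical response body: an empty container plus a script that
# carries escaped JSON inside a JavaScript string literal.
STATIC_HTML = r'''<html><body>
<div id="inbox"></div>
<script>var DATA = "[{\"1\":\"be_35\",\"53908043\":0},{\"1\":\"su_ltz\"}]";</script>
</body></html>'''


class ScriptCollector(HTMLParser):
    """Record every start tag and the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self._in_script = tag == "script"

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_chunks.append(data)


def extract_embedded_json(html):
    """Return (tag names seen, decoded data from the script's string)."""
    collector = ScriptCollector()
    collector.feed(html)
    script = "".join(collector.script_chunks)
    # Take the string literal between the first and the last double quote.
    literal = script[script.index('"'): script.rindex('"') + 1]
    # The first loads() undoes the string escaping (\" becomes "), the
    # second parses the JSON text. This works here only because the sample
    # uses escapes that JSON and JavaScript share.
    return collector.tags, json.loads(json.loads(literal))


tags, data = extract_embedded_json(STATIC_HTML)
print(tags)     # the only elements present before any JavaScript runs
print(data[0])  # an ordinary dict once the escaping is removed
```

Decoding the blob this way does not produce the rendered inbox; that markup only exists after a browser engine has executed the page's scripts, which is the job of the tool named in the answer.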

  • Thank you! That clears it up; I have little experience with HTTPS and the web.
