Web scraping of a Microsoft Forms form returns None [python]

Viewed 330 times

Hello, I’m having trouble scraping a form built with Microsoft Forms. (Note: I created the form myself.)

I have the following code:

from bs4 import BeautifulSoup
import requests

linkForms01 = 'https://forms.office.com/Pages/AnalysisPage.aspx?id=vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u&AnalyzerToken=qTmVTXSAWoyMXQcd56doC9W6W20G51UR'

page03 = requests.get(linkForms01) 
page03.encoding = page03.apparent_encoding

soup03 = BeautifulSoup(page03.text, 'html.parser')
texto03 = soup03.get_text('\n')
xxxx = soup03.find(class_="analyze-view-detail-text-lines")
print(xxxx)

In general, I can extract a lot of information from this page, but not the answers to the questionnaire. I thought about pulling the information from the Getaggregatesurveydata request, which can be seen under Inspect → Network → XHR, but I’m not sure whether that is possible.

Anyone who can help, I’d be grateful :)

2 answers

The information cannot be retrieved the way you are currently doing it.

When you call requests.get, you retrieve from the server only the HTML of that URL, and that’s it: none of the page’s external resources are fetched, whether images or data that the page obtains by running JavaScript code in the browser.

It is easy to check that the information is not present in the page itself: open the URL above and use the browser’s "view page source" option. You will see that the <body> of the page is minimal, despite the dozens of kilobytes of JavaScript in the <head>. It could still be that the form data is embedded in the JavaScript itself, without relying on further requests to the server; but searching, for example, for the name "Ademir", which appears filled in on the form, finds no occurrence in the raw text of the page.
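
The check above can be illustrated with a self-contained snippet. The HTML below is a made-up miniature of what a JavaScript-rendered page looks like to requests: the container element exists, but the data a browser would inject is absent (the markup and class/id names here are illustrative, not taken from the real Forms page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML as the server delivers it: an almost-empty <body>,
# with the data meant to be filled in later by the scripts in <head>.
html_from_server = """
<html>
  <head><script src="app.js"></script></head>
  <body><div id="analyze-view"></div></body>
</html>
"""

soup = BeautifulSoup(html_from_server, "html.parser")

# The container is there, but empty; the answers never reach requests.
container = soup.find(id="analyze-view")
print(container)                      # <div id="analyze-view"></div>
print("Ademir" in html_from_server)  # False
```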

From here you have two options. The first is to reverse engineer the client-side code and see which requests it makes to the server via JavaScript, then replicate those requests with the Python requests library. The difficulty of this varies from medium to virtually impossible (when the page author, or the programs running on the page, deliberately hide this data; that is not the case here, so it should be closer to "medium difficulty").

The other way is to use Selenium instead of requests. Selenium is a tool that drives a "real" browser from a Python library: when the page is opened via Selenium, the JavaScript inside it runs, the data is fetched from the server and populated in the attached browser (which may or may not be visible on screen, depending on how you configure Selenium), and only then do you access the page’s DOM, after the server data has been filled in.
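
A rough sketch of that Selenium approach could look like the following. The CSS class is the one targeted in the question; the headless flag and the wait time are assumptions and may need adjusting for the real page:

```python
def fetch_rendered_html(url):
    """Open the page in a real (headless) browser, let its JavaScript run,
    and return the text of the element after the data has been filled in.

    The imports are kept inside the function so the sketch can be read
    (and imported) without Selenium installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # keep the browser window invisible
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.implicitly_wait(10)  # give the page's XHR calls time to finish
        element = driver.find_element(
            "class name", "analyze-view-detail-text-lines")
        return element.text
    finally:
        driver.quit()
```

Calling `fetch_rendered_html(linkForms01)` should then return the text that requests alone never sees, at the cost of running a full browser (which is why it feels less performant).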

  • Hello, thanks for your help! I will study a bit about how to reverse engineer the requests; maybe that will help. As for Selenium, it helped a lot, but it wasn’t very performant.


Microsoft Forms has an undocumented REST API, and it is from this service that the page gets its information, as JSON.

First, let’s find the GET request that returns the information:

https://forms.office.com/formapi/api/f149d0bc-0eb5-4f9a-9e82-24a76eacf8de/users/709c42d2-9f33-4bdd-ab6a-94c9c2dd4e5e/light/analysisForms('vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u')?$expand=questions($expand=choices)
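
The endpoint follows a recognizable pattern: a tenant id, a user id, and the form id from the sharing link, with an OData-style `$expand` pulling each question together with its choices. As a sketch (the parameter names are my own labels, inferred from the URL above, not official ones):

```python
def build_analysis_url(tenant_id, user_id, form_id):
    # Assemble the undocumented analysisForms endpoint from its three parts,
    # expanding each question together with its answer choices.
    return (
        f"https://forms.office.com/formapi/api/{tenant_id}"
        f"/users/{user_id}"
        f"/light/analysisForms('{form_id}')"
        "?$expand=questions($expand=choices)"
    )

url = build_analysis_url(
    "f149d0bc-0eb5-4f9a-9e82-24a76eacf8de",
    "709c42d2-9f33-4bdd-ab6a-94c9c2dd4e5e",
    "vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u",
)
print(url)
```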

OK, now we need to see what has to be sent along with this request. If you look at the link you posted, it carries a token for accessing the form:

AnalyzerToken=qTmVTXSAWoyMXQcd56doC9W6W20G51UR

This token is the permission the site requires in order to release the data from the server.

With this you can now build your bot. As a reminder, it is always good to send a 'User-Agent' header, since it is common for sites to block scrapers and crawlers.

import requests

url = "https://forms.office.com/formapi/api/f149d0bc-0eb5-4f9a-9e82-24a76eacf8de/users/709c42d2-9f33-4bdd-ab6a-94c9c2dd4e5e/light/analysisForms('vNBJ8bUOmk-egiSnbqz43tJCnHAzn91Lq2qUycLdTl5UOFFCQ0lXME85UlFKT1dBTFJPSllFUkkzVy4u')?$expand=questions($expand=choices)"
# A browser-like User-Agent plus the AnalyzerToken from the sharing link.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
           'AnalyzerToken': 'qTmVTXSAWoyMXQcd56doC9W6W20G51UR'}
print(requests.get(url, headers=headers).text)
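
Since the response is JSON, you can parse it with .json() instead of printing the raw text and then walk the questions and their choices. The payload shape below is a guess based on the `$expand=questions($expand=choices)` query, not a documented schema, so the field names may differ in practice:

```python
import json

# Hypothetical excerpt of the kind of JSON the endpoint might return;
# in your bot this would be requests.get(url, headers=headers).json().
sample = json.loads("""
{
  "title": "My survey",
  "questions": [
    {"title": "Favorite color?",
     "choices": [{"description": "Red"}, {"description": "Blue"}]}
  ]
}
""")

for question in sample["questions"]:
    options = [choice["description"] for choice in question["choices"]]
    print(question["title"], "->", options)
# Favorite color? -> ['Red', 'Blue']
```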

I recommend studying the HTTP protocol to better understand how to do web crawling/scraping, because most sites nowadays use APIs, and in most cases it is more productive to work out how the site’s API works than to brute-force scrape the HTML, which is much less performant and much more tedious.

  • Hello, thank you so much for your help! I found your solution very interesting; what was missing in my scraping was knowing how to use requests with headers. I will take your advice and study more about HTTP.
