First I imported the packages and created a class and its settings:
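import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

class Scraper:
    def __init__(self):
        self.visited = set()  # URLs that have already been processed
        self.session = requests.Session()
        self.session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36"}
        # Requests below use verify=False, so silence the resulting TLS warnings
        requests.packages.urllib3.disable_warnings()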
The visit function:
    def visit_url(self, url, level):
        print(url)
        if url in self.visited:
            return
        self.visited.add(url)
        content = self.session.get(url, verify=False).content
        soup = BeautifulSoup(content, "lxml")
        # Download every image on the page, skipping inline data/javascript sources
        for img in soup.select("img[src]"):
            image_url = img["src"]
            if not image_url.startswith(("data:image", "javascript")):
                self.download_image(urljoin(url, image_url))
        # Follow links only while there is depth left
        if level > 0:
            # select() takes CSS selectors, not XPath, so plain <a href> links are followed
            for link in soup.select("a[href]"):
                self.visit_url(urljoin(url, link["href"]), level - 1)
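To make the image-collection step easier to try on its own, here is a minimal sketch of the same filtering logic. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; the names ImgSrcCollector and collect_image_urls are hypothetical and not part of the code above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects absolute URLs of <img src=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip images without a src and inline data/javascript sources,
        # mirroring the startswith() filter used in visit_url
        if src and not src.startswith(("data:image", "javascript")):
            self.image_urls.append(urljoin(self.base_url, src))


def collect_image_urls(html, base_url):
    """Return the absolute image URLs found in an HTML string."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

Relative paths are resolved against the page URL with urljoin, which is the same call visit_url uses before handing the address to download_image.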
The download:
    def download_image(self, image_url):
        # Use the last path segment, without the query string, as the file name
        local_filename = image_url.split('/')[-1].split("?")[0]
        r = self.session.get(image_url, stream=True, verify=False)
        # Stream the response to disk in 1 KiB chunks
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
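The file name above comes from splitting the URL by hand. As a rough alternative, the sketch below derives it with urllib.parse and adds a small check on the Content-Type header, since saving an HTML page under an image name produces a file that will not open. Both function names and the "image" fallback are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def local_filename_from_url(image_url, default="image"):
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(image_url).path
    name = posixpath.basename(unquote(path))
    # URLs that end in "/" have no file name, so fall back to a default
    return name or default


def is_image_content_type(content_type):
    """True when a Content-Type header value describes an image."""
    return content_type.strip().lower().startswith("image/")
```

Inside download_image, the check could be applied to r.headers.get("Content-Type", "") before opening the output file, skipping any response that is not an image.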
The link:
if __name__ == '__main__':
    scraper = Scraper()
    scraper.visit_url('https://mbasic.facebook.com/story.php?story_fbid=2498834290232454&id=198123623636877&refid=17&_ft_=mf_story_key.2498834290232454%3Atop_level_post_id.2498834290232454%3Atl_objid.2498834290232454%3Acontent_owner_id_new.198123623636877%3Athrowback_story_fbid.2498834290232454%3Apage_id.198123623636877%3Aphoto_attachments_list.%5B2498828320233051%2C2498828993566317%2C2498829400232943%5D%3Astory_location.4%3Astory_attachment_style.album%3Apage_insights.%7B%22198123623636877%22%3A%7B%22page_id%22%3A198123623636877%2C%22actor_id%22%3A198123623636877%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1574031070%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B2498834290232454%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A198123623636877%2C%22page_id%22%3A198123623636877%2C%22post_id%22%3A2498834290232454%2C%22role%22%3A1%2C%...', -1)
But I would like to pass a .txt list with multiple links instead of a single link, using a loop. The code above downloads the images from that one link.
It looks like you have made this harder than it needs to be. If you want to read a list of URLs from a file, wouldn't it be enough to write a loop, for linha in arquivo, and pass linha as the argument to download_image? What is your difficulty?
– fernandosavio
Because I wrote another piece of code, but when I put list.txt in place of the URL in scraper.visit_url(site here), it downloads the images, but they come out corrupted. It downloads directly from each link, whereas each link first needs to be visited to find the image path, and only then can the image be downloaded.
– Hudson Souza
@Hudsonsouza, do not put the answer in the question. You can answer your own question, like any other user.
– Augusto Vasques
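For the loop over a .txt list discussed in this thread, one possible shape is sketched below. It assumes the file holds one URL per line and that each URL should go through a page-visiting function such as visit_url, not straight to download_image, so the image paths are found before anything is saved. The names read_urls and visit_all are hypothetical.

```python
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def visit_all(urls, visit, level=0):
    """Call visit(url, level) for every URL and return the ones that failed."""
    failed = []
    for url in urls:
        try:
            visit(url, level)
        except Exception as exc:
            # Keep going when one page fails; report it to the caller instead
            failed.append((url, exc))
    return failed
```

With the class from the question, this might be called as visit_all(read_urls("list.txt"), Scraper().visit_url). Stripping each line matters because a trailing newline left on the URL is a common reason for a request to fail.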