Find and extract a line or words from a page's source code with Python 3

1

I need to write a script in Python that extracts a certain line from a page's source code. I was able to read the source code, but I can't filter it. I've read the documentation and got a little lost.

This is what I have so far:

import urllib.request
url = "https://www.youtube.com/watch?v=2MpUj-Aua48"
f = urllib.request.urlopen(url)
print(f.read().decode('utf-8'))
keyword = f.search(r'<meta name="keywords"(.*)">')

I want to extract the information inside this line:

<meta name="keywords" content="4k video test, 4k video demo, ultra tv video, video 4k for shop mode, ultra video tv demo play, 2160p video test, hd sourround video test, samsung tv demo, s...">

and capture only the keywords from the source code.

2 answers

2

You can use the Beautiful Soup library to parse the HTML.

Install it with:

pip install beautifulsoup4

Then, in your code, fetch the HTML as you already did:

import urllib.request

url = "https://www.youtube.com/watch?v=2MpUj-Aua48"
f = urllib.request.urlopen(url)
html = f.read().decode('utf-8')

Now Beautiful Soup does the most complex part, which is to parse the HTML and find the tags you need:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
meta_tag = soup.head.find('meta', attrs={'name': 'keywords'})
keywords = [keyword.strip() for keyword in meta_tag['content'].split(',')]

Explanation:

  • Create the BeautifulSoup object

    soup = BeautifulSoup(html, 'html.parser')
    
  • Search for the first <meta> tag within <head> whose name attribute has the value keywords

    meta_tag = soup.head.find('meta', attrs={'name': 'keywords'})
    

    The find() method returns the first tag found, or None if no tag matches the given filters. In the example above, I am asking Beautiful Soup to return an element whose tag is <meta> and whose name attribute has the value keywords. If no such element exists in the HTML passed in, meta_tag will be None.

  • Split the string into a list of keywords (I use str.split() and str.strip() to split the string and remove the extra spaces)

    keywords = [keyword.strip() for keyword in meta_tag['content'].split(',')]
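
Because find() can return None, the splitting step is easier to reuse if it tolerates a missing value. The following is only a minimal sketch of that idea: split_keywords is a hypothetical helper name, and the sample string is made up for illustration.

```python
def split_keywords(content):
    """Turn a comma-separated keywords string into a clean list.

    Hypothetical helper: returns an empty list when there is nothing to split.
    """
    if content is None:
        return []
    # Split on commas, trim surrounding spaces, and drop empty pieces
    return [keyword.strip() for keyword in content.split(',') if keyword.strip()]


# Made-up sample, not the real page content
sample = "4k video test, 4k video demo , ultra tv video,"
print(split_keywords(sample))
print(split_keywords(None))
```

With the code above, it could be called as split_keywords(meta_tag['content']) after first checking that meta_tag is not None.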
    

Result:

[
    "4k video test",
    "4k video demo",
    "ultra tv video",
    "video 4k for shop mode",
    "ultra video tv demo play",
    "2160p video test",
    "hd sourround video test",
    "samsung tv demo",
    "s...",
]

Full example:

from bs4 import BeautifulSoup
import urllib.request

url = "https://www.youtube.com/watch?v=2MpUj-Aua48"
f = urllib.request.urlopen(url)
html = f.read().decode('utf-8')

soup = BeautifulSoup(html, 'html.parser')
meta_tag = soup.head.find('meta', attrs={'name': 'keywords'})
keywords = [keyword.strip() for keyword in meta_tag['content'].split(',')]

print('=== Keywords ===')
for k in keywords:
    print(f' - {k}')

Code working on Repl.it

Result:

=== Keywords ===
 - 4k video test
 - 4k video demo
 - ultra tv video
 - video 4k for shop mode
 - ultra video tv demo play
 - 2160p video test
 - hd sourround video test
 - samsung tv demo
 - s...
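
If installing a third-party package is not an option, roughly the same lookup can be sketched with the standard library's html.parser module instead of Beautiful Soup. This is an alternative technique, not the approach used above; the class name KeywordsParser and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collects the keywords of the first <meta name="keywords"> tag."""

    def __init__(self):
        super().__init__()
        self.keywords = None  # stays None if the tag is never seen

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        is_keywords_meta = tag == 'meta' and attributes.get('name') == 'keywords'
        if is_keywords_meta and self.keywords is None:
            content = attributes.get('content') or ''
            self.keywords = [k.strip() for k in content.split(',') if k.strip()]


# Hypothetical sample standing in for the downloaded page
sample_html = (
    '<html><head>'
    '<meta name="keywords" content="4k video test, 4k video demo">'
    '</head></html>'
)

parser = KeywordsParser()
parser.feed(sample_html)
print(parser.keywords)
```

The keywords attribute stays None when the page has no such tag, which mirrors what find() returns in the Beautiful Soup version.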

0

One way to get what you want:

import urllib.request
url = "https://www.youtube.com/watch?v=2MpUj-Aua48"
f = urllib.request.urlopen(url)

alvo = None
lines = f.readlines()
for line in lines:
   if '<meta name="keywords"' in str(line):
       alvo = line
       break

print('Alvo: ',alvo, sep='\n')

Output:

b'      <meta name="keywords" content="4k video test, 4k video demo, ultra tv video, video 4k for shop mode, ultra video tv demo play, 2160p video test, hd sourround video test, samsung tv demo, s...">\n'

Convert alvo to str and extract what you need in whatever way you find most convenient.

Using the regex from your attempt:

import re
# alvo is the bytes line found by the loop above; decode it before matching
result = re.search(r'<meta name="keywords"(.*)">', alvo.decode('utf-8'))
print(result.group(0))

Output:

<meta name="keywords" content="4k video test, 4k video demo, ultra tv video, video 4k for shop mode, ultra video tv demo play, 2160p video test, hd sourround video test, samsung tv demo, s...">

See it working on repl.it.
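
Since the question asks for only the keywords, the regex can also capture just the value of the content attribute rather than the whole tag. A minimal sketch, assuming the attributes appear in this order and use double quotes; the alvo value here is a shortened, made-up stand-in for the line found by the loop above.

```python
import re

# Hypothetical, shortened sample of the bytes line found by the loop
alvo = b'      <meta name="keywords" content="4k video test, 4k video demo, ultra tv video">\n'

# Decode the bytes, then capture only what sits between the content quotes
line = alvo.decode('utf-8')
match = re.search(r'<meta name="keywords" content="([^"]*)"', line)

# Fall back to an empty list if the pattern does not match
keywords = [k.strip() for k in match.group(1).split(',')] if match else []
print(keywords)
```

A regex like this breaks as soon as the attribute order or quoting changes, which is why an HTML parser is usually the safer choice.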
