urllib2, exception handling

Viewed 437 times

5

I’m a beginner in the art of programming. I’m learning to code in Python through a book,

Learning to Program: The art of teaching the computer (Cesar Brod - Novatec Editora)

In one of the exercises, I have to use the urllib2 library to fetch a particular web page and check whether a certain word or expression appears in it (the idea is to use this in a verb conjugator, checking in an online dictionary whether the verb typed by the user is regular).

Basically, this is what should happen:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> verbo = 'amar'
>>> pagina = urllib2.urlopen('http://pt.wiktionary.org/wiki/' + verbo)
>>> pagina = pagina.read()
>>> "Verbo regular" in pagina
True
>>>

So far, so good. However, if there is no page corresponding to the word typed by the user, the following error appears:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> verbo = '123ar'
>>> pagina = urllib2.urlopen('http://pt.wiktionary.org/wiki/' + verbo)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>> pagina = pagina.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pagina' is not defined
>>>

Since the page in question (http://pt.wiktionary.org/wiki/123ar) does have source code, I assumed the program would read it and run the check in the same way, but that does not happen. Could someone suggest a solution?

P.S.: In case I haven’t been very clear, or have left out any important information, please let me know. Also, I usually program on Linux but am using Windows at the moment; the error occurs on both systems.

P.S.2: Forgive any conceptual errors; as I said earlier, I am still a beginner in the art of programming. Speaking of which, I’m open to tips too =)

  • Thanks for the edit! =)

  • Anything to improve the community :D

1 answer

4


You should handle this error in a try...except block. When urlopen cannot open a page, it raises an HTTPError exception (a subclass of URLError), so handle it like this:

import urllib2

try:
    verbo = '123ar'
    pagina = urllib2.urlopen('http://pt.wiktionary.org/wiki/{0}'.format(verbo)).read()
    print ("Verbo regular" in pagina)
except urllib2.HTTPError as e:
    print ("Nao foi possivel abrir a pagina. Erro {0}".format(e.code))
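
Because HTTPError is a subclass of URLError, the order of the except clauses matters if you also want to catch connection failures. The following is only a sketch under a few assumptions: the function name check_verb and its opener parameter are made up for illustration, and the import fallback is added so the same code also runs on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
# Import fallback (an assumption, not part of the answer above) so the sketch
# runs on Python 2 (urllib2) and Python 3 (urllib.request / urllib.error).
try:
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

BASE_URL = 'http://pt.wiktionary.org/wiki/'


def check_verb(verbo, opener=urlopen):
    """Return True/False for the "Verbo regular" check, or None on failure.

    `opener` is a hypothetical hook: any callable that takes a URL and
    returns an object with read(). It defaults to the real urlopen.
    """
    try:
        pagina = opener(BASE_URL + verbo).read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # This clause must come first: HTTPError is also a URLError.
        print('Nao foi possivel abrir a pagina. Erro {0}'.format(e.code))
        return None
    except URLError as e:
        # No HTTP answer at all (DNS failure, refused connection, ...).
        print('Falha de conexao: {0}'.format(e.reason))
        return None
    # read() returns bytes on Python 3, so compare against a bytes literal.
    return b'Verbo regular' in pagina
```

Calling check_verb('amar') would perform the real request; passing a different opener makes it possible to try the function without a network connection.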

Since the page in question (http://pt.wiktionary.org/wiki/123ar) does have source code, I assumed the program would read it and run the check in the same way, but that does not happen. Could someone suggest a solution?

This is because urllib2 works differently from urllib. The documentation says the following:

For non-200 error codes, this simply passes the job on to the protocol_error_code handler methods, via OpenerDirector.error(). Eventually, urllib2.HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.

There are two ways around this. The first is to read the page source in the except block:

import urllib2

verbo = '123ar'
try:
    pagina = urllib2.urlopen('http://pt.wiktionary.org/wiki/{0}'.format(verbo)).read()
except urllib2.HTTPError as e:
    # The exception keeps the server's response; read the error page from it
    pagina = e.fp.read()
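
To see offline why this works, one can build an HTTPError by hand and read from it. This is a sketch, not the real Wiktionary response: fetch_missing_page is a hypothetical stand-in for urlopen, the page body is made up, and the import fallback is an assumption added so the code also runs on Python 3.

```python
import io

try:
    from urllib2 import HTTPError          # Python 2, as in the answer
except ImportError:
    from urllib.error import HTTPError     # Python 3 equivalent (assumption)


def fetch_missing_page():
    """Hypothetical stand-in for urlopen(): always fails with a hand-built 404."""
    corpo = io.BytesIO(b'<html>made-up error page</html>')
    raise HTTPError('http://pt.wiktionary.org/wiki/123ar', 404, 'Not Found', {}, corpo)


try:
    pagina = fetch_missing_page()
except HTTPError as e:
    status = e.code          # numeric HTTP status carried by the exception
    pagina = e.fp.read()     # body of the error page, same call as in the answer

print(status)
print(b'Verbo regular' in pagina)
```

The exception carries both the status code and the body, so the "Verbo regular" check can still run on the error page if that is what you want.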

The second is to use urllib:

import urllib

verbo = '123ar'
pagina = urllib.urlopen('http://pt.wiktionary.org/wiki/{0}'.format(verbo)).read()
print ("Verbo regular" in pagina)
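
One caveat with the urllib route: since no exception is raised, a missing page and a page that simply lacks the phrase both print False. In Python 2 the object returned by urllib.urlopen has a getcode() method, so the two cases can be told apart. The sketch below is only an illustration: verb_status and FakeResponse are hypothetical names, and FakeResponse stands in for the real response so the example runs without a network.

```python
class FakeResponse(object):
    """Hypothetical stand-in for the object returned by urlopen()."""

    def __init__(self, code, body):
        self._code = code
        self._body = body

    def getcode(self):
        return self._code

    def read(self):
        return self._body


def verb_status(resposta):
    """Classify a response-like object that offers getcode() and read()."""
    if resposta.getcode() != 200:
        # The server returned an error page (for example 404).
        return 'page not found'
    if b'Verbo regular' in resposta.read():
        return 'regular'
    return 'not marked as regular'


print(verb_status(FakeResponse(200, b'<p>Verbo regular</p>')))
print(verb_status(FakeResponse(404, b'<p>error page</p>')))
```

With the real library, the argument would be the result of urllib.urlopen(...) instead of a FakeResponse.
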

  • Solved my problem. Thank you very much, @qmechanik

  • @Pedroa. You're welcome. I updated the answer; I had not seen the second part of your question.

  • 1

    Ah, that’s fine. The other answer has already helped me a lot, but the new one has made my job even easier, as well as teaching me even more things. Again, thank you.
