
I have problems with my code.

    #!/usr/bin/env python3.1
    import urllib.request;

    # Disguise as a Mozilla browser on a Windows OS
    userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';
    URL = "www.example.com/img";
    req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});

    # Counter for the filename.
    i = 0;
    while True:
        fname = str(i).zfill(3) + '.png';
        req.full_url = URL + fname;
        f = open(fname, 'wb');
        try:
            response = urllib.request.urlopen(req);
        except:
            break;
        else:
            f.write(response.read());
            i += 1;
            response.close();
        finally:
            f.close();

The problem seems to occur when I create the urllib.request.Request object (called req). I create it with a non-existent URL, and later change the URL to what it should be. I'm doing this so that I can reuse the same urllib.request.Request object instead of creating a new one on each iteration. There is probably a supported mechanism for doing exactly that in Python, but I'm not sure what it is.
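For what it's worth, Request objects are cheap to construct, so a minimal sketch (still using the hypothetical example.com URL) would simply build a fresh one per iteration rather than mutating full_url; note that urlopen also needs an explicit http:// scheme, which the code above omits:

```python
import urllib.request

userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
# Hypothetical base URL; note the explicit http:// scheme,
# which urlopen requires.
BASE = 'http://www.example.com/img'

def make_request(i):
    # Build a fresh Request for each file instead of reusing one
    # and poking at its full_url attribute.
    fname = str(i).zfill(3) + '.png'
    return urllib.request.Request(BASE + fname,
                                  headers={'User-Agent': userAgent})

req = make_request(0)
```

Each Request carries its own URL and headers, so nothing is shared between iterations.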

EDIT: The error message is:

    >>> response = urllib.request.urlopen(req);
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.1/urllib/request.py", line 121, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.1/urllib/request.py", line 356, in open
        response = meth(req, response)
      File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.1/urllib/request.py", line 394, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.1/urllib/request.py", line 328, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.1/urllib/request.py", line 476, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden

EDIT 2: My solution is the following. Probably should have done this at the start as I knew it would work:

    import urllib.request;

    # Disguise as a Mozilla browser on a Windows OS
    userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';

    # Counter for the filename.
    i = 0;
    while True:
        fname = str(i).zfill(3) + '.png';
        URL = "www.example.com/img" + fname;
        f = open(fname, 'wb');
        try:
            req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});
            response = urllib.request.urlopen(req);
        except:
            break;
        else:
            f.write(response.read());
            i += 1;
            response.close();
        finally:
            f.close();
  • And what is the error message? Also, Python doesn't need the semicolon to end a line. Commented Mar 28, 2012 at 2:37
  • I've added the error message. I know that I don't need semicolons but I prefer to add them. The url and file exist. The only problem is that I'm creating the req object with an invalid url and then before I use req I correct the url. That seems to be causing the error. Commented Mar 28, 2012 at 2:41
  • It is. The URL is valid; it's how it's set that's causing the problem. I can also access the URL, wget it, and download it with Python when there's no loop, i.e. when I set the URL in the req object correctly at creation time. Commented Mar 28, 2012 at 2:44
  • why would anyone prefer to add spurious semicolons everywhere? Commented Mar 28, 2012 at 2:49

3 Answers


urllib.request is fine for small scripts that only need to do one or two network interactions, but if you are doing a lot more work, you will likely find that either urllib3 or requests (which, not coincidentally, is built on the former) suits your needs better. Your particular example might look like:

    from itertools import count

    import requests

    HEADERS = {'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    URL = "http://www.example.com/img%03d.png"

    # with a session, we get keep-alive
    session = requests.session()

    for n in count():
        full_url = URL % n
        ignored, filename = URL.rsplit('/', 1)
        with open(filename, 'wb') as outfile:
            response = session.get(full_url, headers=HEADERS)
            if not response.ok:
                break
            outfile.write(response.content)

Edit: If you can use regular HTTP authentication (which the 403 Forbidden response strongly suggests), then you can pass it to session.get with the auth parameter, as in:

    response = session.get(full_url, headers=HEADERS, auth=('username', 'password'))

2 Comments

I like this answer: instead of just fixing a bug for the OP, you actually demonstrate a much better way of doing it, thus solving his and maybe other people's problems.
I know it has been a long time since the original post, but the filename should read ignored, filename = full_url.rsplit('/', 1) instead of ignored, filename = URL.rsplit('/', 1). Otherwise the filename will be img%03d.png.

If you want to use the custom user agent with every request, you can subclass urllib.request.FancyURLopener.

Here's an example: http://wolfprojects.altervista.org/changeua.php
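Since the linked page may move, here is a minimal sketch of the idea in today's terms. FancyURLopener has been deprecated in modern Python, so this uses the equivalent OpenerDirector route, which attaches a header to every request made through the installed opener:

```python
import urllib.request

userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# build_opener returns an OpenerDirector; its addheaders list holds
# the headers it attaches to every request it opens.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', userAgent)]

# install_opener makes plain urlopen() calls use this opener globally.
urllib.request.install_opener(opener)
```

After this, every urllib.request.urlopen call in the process sends the custom User-Agent without touching individual Request objects.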

Comments


Don't break when you receive an exception. Change

    except:
        break

to

    except:
        # Probably should log some debug information here.
        pass

This will skip all problematic requests, so that one failure doesn't bring down the whole process.
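As the comments below point out, a bare pass here would loop forever. A middle ground, sketched with a hypothetical helper, is to treat a 404 as the end of the sequence while logging and skipping other errors:

```python
import urllib.error

def should_stop(exc):
    # Hypothetical policy: a 404 means we've run past the last image,
    # so end the loop; anything else is logged and skipped.
    if isinstance(exc, urllib.error.HTTPError) and exc.code == 404:
        return True
    print('skipping: %s' % exc)
    return False
```

In the loop this would replace the bare except with except urllib.error.URLError as e, breaking only when should_stop(e) is true.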

4 Comments

That will change the logic considerably. He most likely does not wish to loop forever.
I'm using the exception as a way to terminate the loop. A pass will result in an infinite loop. I don't know how many files there are so I'm downloading till I run into an exception.
Won't prevent the server from throttling though.
I don't think that would be a problem. I'm trying to resolve this problem first.
