
I have problems with my code.

    #!/usr/bin/env python3.1
    import urllib.request;

    # Disguise as a Mozilla browser on a Windows OS
    userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';
    URL = "www.example.com/img";
    req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});

    # Counter for the filename.
    i = 0;
    while True:
        fname = str(i).zfill(3) + '.png';
        req.full_url = URL + fname;
        f = open(fname, 'wb');
        try:
            response = urllib.request.urlopen(req);
        except:
            break;
        else:
            f.write(response.read());
            i += 1;
            response.close();
        finally:
            f.close();

The problem seems to occur when I create the urllib.request.Request object (called req). I create it with a non-existent URL, and later change the URL to what it should be. I'm doing this so that I can reuse the same urllib.request.Request object instead of creating a new one on each iteration. There is probably a supported mechanism for doing exactly that in Python, but I'm not sure what it is.
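For what it's worth, Request objects are cheap to construct, so a minimal sketch (still using the hypothetical example.com URL) would simply build a fresh one per iteration rather than mutating full_url; note that urlopen also needs an explicit http:// scheme, which the code above omits:

```python
import urllib.request

userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
# Hypothetical base URL; note the explicit http:// scheme,
# which urlopen requires.
BASE = 'http://www.example.com/img'

def make_request(i):
    # Build a fresh Request for each file instead of reusing one
    # and poking at its full_url attribute.
    fname = str(i).zfill(3) + '.png'
    return urllib.request.Request(BASE + fname,
                                  headers={'User-Agent': userAgent})

req = make_request(0)
```

Each Request carries its own URL and headers, so nothing is shared between iterations.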

EDIT: The error message is:

    >>> response = urllib.request.urlopen(req);
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.1/urllib/request.py", line 121, in urlopen
        return _opener.open(url, data, timeout)
      File "/usr/lib/python3.1/urllib/request.py", line 356, in open
        response = meth(req, response)
      File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.1/urllib/request.py", line 394, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.1/urllib/request.py", line 328, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.1/urllib/request.py", line 476, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden

EDIT 2: My solution is the following. Probably should have done this at the start as I knew it would work:

    import urllib.request;

    # Disguise as a Mozilla browser on a Windows OS
    userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';

    # Counter for the filename.
    i = 0;
    while True:
        fname = str(i).zfill(3) + '.png';
        URL = "www.example.com/img" + fname;
        f = open(fname, 'wb');
        try:
            req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});
            response = urllib.request.urlopen(req);
        except:
            break;
        else:
            f.write(response.read());
            i += 1;
            response.close();
        finally:
            f.close();
  • And what is the error message? Also, Python doesn't need the semicolon to end a line. Commented Mar 28, 2012 at 2:37
  • I've added the error message. I know that I don't need semicolons but I prefer to add them. The url and file exist. The only problem is that I'm creating the req object with an invalid url and then before I use req I correct the url. That seems to be causing the error. Commented Mar 28, 2012 at 2:41
  • It is. The URL is valid; it's how it's set that's causing the problem. I can also access the URL, wget it, and download it with Python when there's no loop, i.e. when I set the URL in the req object correctly at creation time. Commented Mar 28, 2012 at 2:44
  • why would anyone prefer to add spurious semicolons everywhere? Commented Mar 28, 2012 at 2:49

3 Answers


urllib.request is fine for small scripts that only need to do one or two network interactions, but if you are doing a lot more work, you will likely find that either urllib3 or requests (which, not coincidentally, is built on the former) suits your needs better. Your particular example might look like:

    from itertools import count

    import requests

    HEADERS = {'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    URL = "http://www.example.com/img%03d.png"

    # with a session, we get keep-alive
    session = requests.session()

    for n in count():
        full_url = URL % n
        ignored, filename = URL.rsplit('/', 1)
        with open(filename, 'wb') as outfile:
            response = session.get(full_url, headers=HEADERS)
            if not response.ok:
                break
            outfile.write(response.content)

Edit: If you can use regular HTTP authentication (which the 403 Forbidden response strongly suggests), then you can pass it to session.get with the auth parameter, as in:

    response = session.get(full_url, headers=HEADERS, auth=('username', 'password'))

2 Comments

I like this answer: instead of just fixing a bug for the OP, you actually demonstrate a much better way of doing it, thus solving his and maybe other people's problems.
I know it has been a long time since the original post, but the filename should read ignored, filename = full_url.rsplit('/', 1) instead of ignored, filename = URL.rsplit('/', 1). Otherwise the filename will be img%03d.png.

If you want to use the custom user agent with every request, you can subclass urllib.request.FancyURLopener.

Here's an example: http://wolfprojects.altervista.org/changeua.php
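Since the linked page may move, here is a minimal sketch of the idea in today's terms. FancyURLopener has been deprecated in modern Python, so this uses the equivalent OpenerDirector route, which attaches a header to every request made through the installed opener:

```python
import urllib.request

userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# build_opener returns an OpenerDirector; its addheaders list holds
# the headers it attaches to every request it opens.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', userAgent)]

# install_opener makes plain urlopen() calls use this opener globally.
urllib.request.install_opener(opener)
```

After this, every urllib.request.urlopen call in the process sends the custom User-Agent without touching individual Request objects.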

Comments


Don't break when you receive an exception. Change

    except:
        break

to

    except:
        # Probably should log some debug information here.
        pass

This will skip all problematic requests, so that one failure doesn't bring down the whole process.
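As the comments below point out, a bare pass here would loop forever. A middle ground, sketched with a hypothetical helper, is to treat a 404 as the end of the sequence while logging and skipping other errors:

```python
import urllib.error

def should_stop(exc):
    # Hypothetical policy: a 404 means we've run past the last image,
    # so end the loop; anything else is logged and skipped.
    if isinstance(exc, urllib.error.HTTPError) and exc.code == 404:
        return True
    print('skipping: %s' % exc)
    return False
```

In the loop this would replace the bare except with except urllib.error.URLError as e, breaking only when should_stop(e) is true.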

4 Comments

That will change the logic considerably. He most likely does not wish to loop forever.
I'm using the exception as a way to terminate the loop. A pass will result in an infinite loop. I don't know how many files there are so I'm downloading till I run into an exception.
Won't prevent the server from throttling though.
I don't think that would be a problem. I'm trying to resolve this problem first.
