How to download a file using python in a 'smarter' way?

Question

I need to download several files via http in Python.

The most obvious way to do it is just using urllib2:

    import urllib2

    u = urllib2.urlopen('http://server.com/file.html')
    localFile = open('file.html', 'w')
    localFile.write(u.read())
    localFile.close()

But I'll have to deal with URLs that are nasty in some way, like this: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded via the browser, the file has a human-readable name, i.e. accounts.pdf.

Is there any way to handle that in python, so I don't need to know the file names and hardcode them into my script?

Is the filename on the server relevant? Presumably these files have some meaning to you, so you ought to be able to name them yourself. If the names don't have meaning, come up with a random unique name yourself (uuids perhaps?)
– Dominic Rodger, Commented May 14, 2009 at 8:33
I'd love to have file names readable and meaningful. The issue is, the script will take the URLs to download from a text file, and the URLs will be added and removed by a non-technical person.
– kender, Commented May 14, 2009 at 12:26

Community · Accepted Answer · 2017-05-23 11:46:50Z

Download scripts like that tend to push a header telling the user-agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext"

If you can grab that header, you can get the proper filename.

There's another thread that has a little bit of code to offer up for Content-Disposition-grabbing.

    remotefile = urllib2.urlopen('http://example.com/somefile.zip')
    remotefile.info()['Content-Disposition']
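Once you have the header value, the filename still has to be pulled out of it. A minimal Python 3 sketch, assuming the standard library's email.message.Message parser is good enough for your servers (the helper name filename_from_content_disposition is made up for illustration):

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = header_value
    # get_filename() reads the 'filename' parameter and drops surrounding quotes
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="the filename.ext"'))
```

If the header carries no filename parameter the helper returns None, so the caller still needs a fallback such as the URL path.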

No, they might be redirecting to a plain file. But if it's like most download scripts, they're pushing the content-disposition. By all means check.
If it redirects me to a plain file it's easy too, I can access actual url via remotefile.url, can't I?

kender · Accepted Answer · 2013-03-28 07:59:14Z

Based on comments and @Oli's answer, I made a solution like this:

    import urllib2
    from os.path import basename
    from urlparse import urlsplit

    def url2name(url):
        return basename(urlsplit(url)[2])

    def download(url, localFileName = None):
        localName = url2name(url)
        req = urllib2.Request(url)
        r = urllib2.urlopen(req)
        if r.info().has_key('Content-Disposition'):
            # If the response has Content-Disposition, we take file name from it
            localName = r.info()['Content-Disposition'].split('filename=')[1]
            if localName[0] == '"' or localName[0] == "'":
                localName = localName[1:-1]
        elif r.url != url:
            # if we were redirected, the real file name we take from the final URL
            localName = url2name(r.url)
        if localFileName:
            # we can force to save the file as specified name
            localName = localFileName
        f = open(localName, 'wb')
        f.write(r.read())
        f.close()

It takes the file name from Content-Disposition; if that header is not present, it uses the file name from the URL (if a redirection happened, the final URL is taken into account).

I found this useful. But to download bigger files without holding their full content in memory, I ended up copying your 'r' to 'f' with shutil: import shutil; shutil.copyfileobj(r, f)
Worked very well, but I would wrap urlsplit(url)[2] with a call to urllib.unquote, otherwise the filenames would be percent-encoded. Here is how I do it: return basename(urllib.unquote(urlsplit(url)[2]))
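For anyone on Python 3, the two suggestions from the comments above could be sketched roughly like this. It assumes urllib.parse.unquote as the counterpart of urllib.unquote; save_stream and its chunk_size parameter are hypothetical names, not part of the answer:

```python
import shutil
from os.path import basename
from urllib.parse import unquote, urlsplit


def url2name(url):
    """Take the last path segment of the URL and undo percent-encoding."""
    return basename(unquote(urlsplit(url).path))


def save_stream(source, local_name, chunk_size=64 * 1024):
    """Copy a file-like object to disk in chunks instead of read()-ing it whole."""
    with open(local_name, 'wb') as f:
        shutil.copyfileobj(source, f, chunk_size)
```

In download() you would then call save_stream(r, localName) in place of the open/write/close lines.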

Michael Waterfall · Accepted Answer · 2012-04-23 14:48:00Z

Combining much of the above, here is a more pythonic solution:

    import urllib2
    import shutil
    import urlparse
    import os

    def download(url, fileName=None):
        def getFileName(url,openUrl):
            if 'Content-Disposition' in openUrl.info():
                # If the response has Content-Disposition, try to get filename from it
                cd = dict(map(
                    lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),
                    openUrl.info()['Content-Disposition'].split(';')))
                if 'filename' in cd:
                    filename = cd['filename'].strip("\"'")
                    if filename: return filename
            # if no filename was found above, parse it out of the final URL.
            return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

        r = urllib2.urlopen(urllib2.Request(url))
        try:
            fileName = fileName or getFileName(url,r)
            with open(fileName, 'wb') as f:
                shutil.copyfileobj(r,f)
        finally:
            r.close()

Denis Barmenkov · Accepted Answer · 2010-03-23 21:12:58Z

To kender:

    if localName[0] == '"' or localName[0] == "'":
        localName = localName[1:-1]

This is not safe: the web server can send a badly formatted name such as ["file.ext] or [file.ext'], or even an empty one, in which case localName[0] raises an exception. Correct code could look like this:

    localName = localName.replace('"', '').replace("'", "")
    if localName == '':
        localName = SOME_DEFAULT_FILE_NAME

Even better: local_name.strip('\'"') -- that will only strip from the beginning and end and is also more succinct.
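Putting the strip() suggestion and the empty-name fallback together, a small sketch might look like this. DEFAULT_FILE_NAME and clean_name are hypothetical names chosen for the example:

```python
DEFAULT_FILE_NAME = 'download.bin'  # hypothetical fallback, pick whatever suits you


def clean_name(raw_name, default=DEFAULT_FILE_NAME):
    """Strip stray quotes and whitespace from both ends; fall back if nothing is left."""
    name = raw_name.strip().strip('\'"')
    return name or default
```

Unlike the replace() version, this leaves quotes inside the name alone, and it never indexes into the string, so an empty value cannot raise an exception.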

Jaydev · Accepted Answer · 2016-09-19 12:37:58Z

Using wget:

    import wget  # third-party package

    custom_file_name = "/custom/path/custom_name.ext"
    wget.download(url, custom_file_name)

Using urlretrieve:

urllib.urlretrieve(url, custom_file_name)

Note that urlretrieve does not create missing directories; the target directory has to exist already.
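If the target directory may not exist yet, it is safer to create it yourself before downloading. A minimal Python 3 sketch, assuming urllib.request.urlretrieve (the Python 3 location of urlretrieve); retrieve_to is a made-up helper name:

```python
import os
import urllib.request


def retrieve_to(url, target_path):
    """Create the parent directory of target_path if needed, then download url into it."""
    parent = os.path.dirname(target_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    urllib.request.urlretrieve(url, target_path)
    return target_path
```

Usage would be something like retrieve_to(url, "/custom/path/custom_name.ext").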

Flair · Accepted Answer · 2022-01-03 22:38:30Z

You need to look into the 'Content-Disposition' header, see the solution by kender.

Posting his solution modified with a capability to specify an output folder:

    from os.path import basename
    import os
    from urllib.parse import urlsplit
    import urllib.request

    def url2name(url):
        return basename(urlsplit(url)[2])

    def download(url, out_path):
        localName = url2name(url)
        req = urllib.request.Request(url)
        r = urllib.request.urlopen(req)
        # has_key() does not exist on Python 3 header objects; use "in" instead
        if 'Content-Disposition' in r.info():
            # If the response has Content-Disposition, we take file name from it
            localName = r.info()['Content-Disposition'].split('filename=')[1]
            if localName[0] == '"' or localName[0] == "'":
                localName = localName[1:-1]
        elif r.url != url:
            # if we were redirected, the real file name we take from the final URL
            localName = url2name(r.url)

        localName = os.path.join(out_path, localName)
        f = open(localName, 'wb')
        f.write(r.read())
        f.close()

    download("https://example.com/demofile", '/home/username/tmp')

I have just updated the answer of kender for python3

Instead of parsing r.info() yourself, you can probably use r.info().get_filename() or r.headers.get_filename()
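A rough Python 3 sketch of what that could look like, with the header parsing left to get_filename(). The helper pick_filename and the 'download.bin' default are assumptions made for the example, not part of the answer above:

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def pick_filename(headers, final_url, default='download.bin'):
    """Prefer Content-Disposition, then the final URL's last path segment, then a default."""
    name = headers.get_filename()
    if not name:
        name = unquote(urlsplit(final_url).path)
    # basename() keeps a server-supplied name from pointing outside out_path
    return os.path.basename(name) or default


def download(url, out_path):
    with urllib.request.urlopen(url) as r:
        # r.url is the final URL after any redirects
        local_name = os.path.join(out_path, pick_filename(r.headers, r.url))
        with open(local_name, 'wb') as f:
            shutil.copyfileobj(r, f)
    return local_name
```

Keeping the name selection in its own function means it can be checked against hand-built headers without touching the network.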
