urllib2 file name

Question

If I open a file using urllib2, like so:

remotefile = urllib2.urlopen('http://example.com/somefile.zip')

Is there an easy way to get the file name other than parsing the original URL?

EDIT: changed openfile to urlopen... not sure how that happened.

EDIT2: I ended up using:

filename = url.split('/')[-1].split('#')[0].split('?')[0]

Unless I'm mistaken, this should strip out all potential queries as well.

Do make sure you know what you want in these two cases: a trailing slash (http://example.com/somefile/) and no path (http://example.com). Your example will fail on the latter for sure (returning "example.com"). So will @insin's final answer. That's another reason why using urlsplit is good advice.
– nealmcb, Commented Feb 8, 2012 at 23:53
From the response headers: stackoverflow.com/questions/11783269/…
– jozxyqk, Commented Nov 1, 2015 at 12:24
Lots of answers here miss the fact that there are two places to look for a file name: the URL and the Content-Disposition header field. All the current answers that mention the header neglect to mention that cgi.parse_header() will parse it correctly. There is a better answer here: stackoverflow.com/a/11783319/205212
– ʇsәɹoɈ, Commented Oct 11, 2016 at 17:09
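
To make the edge cases raised in these comments concrete, here is a minimal sketch that compares the one-liner from EDIT2 with a urlsplit-based version. It assumes Python 3, where the urlparse module lives in urllib.parse; the helper names naive_name and urlsplit_name are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def naive_name(url):
    # The one-liner from the question's EDIT2.
    return url.split('/')[-1].split('#')[0].split('?')[0]


def urlsplit_name(url):
    # Only the path component is inspected, so the host, query and
    # fragment can never leak into the result.
    return posixpath.basename(urlsplit(url).path)


for url in ('http://example.com/somefile.zip?foo=bar#frag',
            'http://example.com/somefile.zip?next=/home',
            'http://example.com/somefile/',
            'http://example.com'):
    print(repr(naive_name(url)), repr(urlsplit_name(url)))
```

Both helpers return an empty string for the trailing-slash URL, so the caller still has to decide what to do in that case. They differ on the bare host (the one-liner returns 'example.com', the urlsplit version returns '') and on a query that contains a slash (the one-liner returns 'home').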

Jonny Buchanan · Accepted Answer · 2008-10-02 15:43:12Z

Did you mean urllib2.urlopen?

You could potentially lift the intended filename if the server was sending a Content-Disposition header by checking remotefile.info()['Content-Disposition'], but as it is I think you'll just have to parse the url.

You could use urlparse.urlsplit, but if you have any URLs like the second example, you'll end up having to pull the file name out yourself anyway:

>>> urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')
>>> urlparse.urlsplit('http://example.com/somedir/somefile.zip')
('http', 'example.com', '/somedir/somefile.zip', '', '')

Might as well just do this:

>>> 'http://example.com/somefile.zip'.split('/')[-1]
'somefile.zip'
>>> 'http://example.com/somedir/somefile.zip'.split('/')[-1]
'somefile.zip'

Use posixpath.basename() instead of manually splitting on '/'.
I would always use urlsplit() and never straight string splitting. The latter will choke if you have a URL that has a fragment or query appended, say example.com/filename.html?cookie=55#Section_3.
What about escaped characters? Should those be decoded first?
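
On the question of escaped characters, one possible approach is to split first and decode afterwards, so that an encoded slash (%2F) is not mistaken for a path separator. This is only a sketch under that assumption, written for Python 3 (urllib.parse); decoded_name is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def decoded_name(url):
    # Take the last path segment while it is still percent-encoded...
    raw = posixpath.basename(urlsplit(url).path)
    # ...and only then turn escapes such as %20 back into characters.
    return unquote(raw)


print(decoded_name('http://example.com/somedir/some%20file.zip?foo=bar'))
print(decoded_name('http://example.com/files/my%20report%2Fv2.zip'))
```

The second call returns a name that contains a slash, so the result should be sanitised before it is used as a local file name.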

Jay · Accepted Answer · 2008-10-02 16:06:16Z

If you only want the file name itself, assuming that there are no query variables at the end like http://example.com/somedir/somefile.zip?foo=bar, then you can use os.path.basename for this:

[user@host]$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.path.basename("http://example.com/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip")
'somefile.zip'
>>> os.path.basename("http://example.com/somedir/somefile.zip?foo=bar")
'somefile.zip?foo=bar'

Some other posters mentioned using urlparse, which will work, but you'd still need to strip the leading directory from the file name. If you use os.path.basename() then you don't have to worry about that, since it returns only the final part of the URL or file path.

Using os.path to parse URLs seems to rely on the current OS splitting paths in the same way as URLs are split. I don't think that's guaranteed for every OS.
This won't work on Windows. Use import posixpath; posixpath.basename instead.
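
One way to see the portability concern is to call the two implementations behind os.path directly: os.path is posixpath on POSIX systems and ntpath on Windows. The sketch below uses a made-up URL whose last segment contains a backslash. For a plain URL such as http://example.com/somedir/somefile.zip the two agree, because ntpath also treats '/' as a separator.

```python
import ntpath
import posixpath

# Hypothetical URL whose last path segment contains a backslash.
url = 'http://example.com/somedir/some\\file.zip'

# posixpath only splits on '/', so the backslash stays in the name.
posix_name = posixpath.basename(url)

# ntpath splits on both '/' and '\\', so the name is cut short.
nt_name = ntpath.basename(url)

print(repr(posix_name), repr(nt_name))
```

Calling posixpath.basename explicitly gives the same answer on every platform, which is the point of the suggestion above.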

Rafał Dowgird · Accepted Answer · 2008-10-02 15:43:10Z

I think that "the file name" isn't a very well defined concept when it comes to HTTP transfers. The server might (but is not required to) provide one in a Content-Disposition header; you can try to get that with remotefile.headers['Content-Disposition']. If this fails, you probably have to parse the URI yourself.
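
The header value is not the file name itself; it usually looks like attachment; filename="somefile.zip", so the filename parameter still has to be extracted. A comment on the question suggests cgi.parse_header() for this. That module was removed in Python 3.13, so the sketch below swaps in email.message.Message from the standard library and assumes Python 3. The helper name filename_from_content_disposition is made up for illustration.

```python
from email.message import Message


def filename_from_content_disposition(value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg['Content-Disposition'] = value
    # get_filename() looks up the 'filename' parameter of this header.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="somefile.zip"'))
print(filename_from_content_disposition('inline'))
```

The second call prints None because the server did not supply a name, which is the case where falling back to the URL is needed.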

TMF Wolfman · Accepted Answer · 2015-03-20 18:38:47Z

Just saw this. I normally do:

filename = url.split("?")[0].split("/")[-1]

Filipe Correia · Accepted Answer · 2013-03-31 20:05:36Z

Using urlsplit is the safest option:

import urlparse

url = 'http://example.com/somefile.zip'
urlparse.urlsplit(url).path.split('/')[-1]

Dan Lenski · Accepted Answer · 2008-10-02 15:42:59Z

Do you mean urllib2.urlopen? There is no function called openfile in the urllib2 module.

Anyway, use the urllib2.urlparse functions:

>>> from urllib2 import urlparse
>>> print urlparse.urlsplit('http://example.com/somefile.zip')
('http', 'example.com', '/somefile.zip', '', '')

Voila.

Yth · Accepted Answer · 2016-04-28 14:52:52Z

You could also combine the two best-rated answers: use urllib2.urlparse.urlsplit() to get the path part of the URL, and then os.path.basename for the actual file name.

Full code would be:

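import cgi, os, urllib2

remotefile = urllib2.urlopen(url)
try:
    # The header looks like 'attachment; filename="somefile.zip"',
    # so pull out the filename parameter rather than the whole value.
    _, params = cgi.parse_header(remotefile.info()['Content-Disposition'])
    filename = params['filename']
except KeyError:
    # No header or no filename parameter: fall back to the URL path.
    filename = os.path.basename(urllib2.urlparse.urlsplit(url).path)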

Régis B. · Accepted Answer · 2016-06-26 20:52:07Z

The os.path.basename function works not only for file paths, but also for URLs, so you don't have to manually parse the URL yourself. Also, it's important to note that you should use result.url instead of the original url, so that the name comes from the final URL after any redirects have been followed:

import os
import urllib2

result = urllib2.urlopen(url)
real_url = urllib2.urlparse.urlparse(result.url)
filename = os.path.basename(real_url.path)
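
For readers on Python 3, the same idea can be sketched with urllib.parse, checking the Content-Disposition header first and the post-redirect URL second. This is a sketch, not a definitive implementation: filename_from_response and FakeResponse are hypothetical names, and FakeResponse only exists so the example runs without network traffic. With a real urllib.request.urlopen() result, response.headers is an email.message.Message subclass, which is what makes get_filename() available.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_response(response):
    """Prefer the Content-Disposition filename, else the final URL's last segment."""
    name = response.headers.get_filename()
    if not name:
        # geturl() is the URL after any redirects were followed.
        name = unquote(urlsplit(response.geturl()).path)
    # Keep only the last segment so no directory part survives.
    return posixpath.basename(name)


class FakeResponse:
    """Hypothetical stand-in for an urlopen() result (no network needed)."""

    def __init__(self, url, content_disposition=None):
        self.headers = Message()
        if content_disposition is not None:
            self.headers['Content-Disposition'] = content_disposition
        self._url = url

    def geturl(self):
        return self._url


print(filename_from_response(FakeResponse(
    'http://example.com/download?id=7', 'attachment; filename="somefile.zip"')))
print(filename_from_response(FakeResponse(
    'http://example.com/somedir/somefile.zip?foo=bar')))
```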

miracle2k · Accepted Answer · 2008-10-02 15:45:47Z

I guess it depends what you mean by parsing. There is no way to get the filename without parsing the URL, i.e. the remote server doesn't give you a filename. However, you don't have to do much yourself, there's the urlparse module:

In [9]: urlparse.urlparse('http://example.com/somefile.zip')
Out[9]: ('http', 'example.com', '/somefile.zip', '', '', '')

Corey Goldberg · Accepted Answer · 2008-10-02 15:46:49Z

Not that I know of.

But you can parse it easily enough like this:

url = 'http://example.com/somefile.zip'
print url.split('/')[-1]

tshepang · Accepted Answer · 2014-07-11 12:45:58Z

This uses requests, but you can do it just as easily with urllib(2):

import requests
from urllib import unquote
from urlparse import urlparse

# Assumes url is already defined and filename was initialised to False by the caller.
sample = requests.get(url)

if sample.status_code == 200:
    # has_key does not work here, and this helps avoid problems with names
    if filename == False:
        if 'content-disposition' in sample.headers.keys():
            filename = sample.headers['content-disposition'].split('filename=')[-1].replace('"','').replace(';','')
        else:
            filename = urlparse(sample.url).query.split('/')[-1].split('=')[-1].split('&')[-1]
            if not filename:
                if url.split('/')[-1] != '':
                    filename = sample.url.split('/')[-1].split('=')[-1].split('&')[-1]
                    filename = unquote(filename)

Vovan Kuznetsov · Accepted Answer · 2015-09-10 22:31:37Z

You can probably use a simple regular expression here. Something like:

In [26]: import re
In [27]: pat = re.compile('.+[\/\?#=]([\w-]+\.[\w-]+(?:\.[\w-]+)?$)')
In [28]: test_set
['http://www.google.com/a341.tar.gz',
 'http://www.google.com/a341.gz',
 'http://www.google.com/asdasd/aadssd.gz',
 'http://www.google.com/asdasd?aadssd.gz',
 'http://www.google.com/asdasd#blah.gz',
 'http://www.google.com/asdasd?filename=xxxbl.gz']
In [30]: for url in test_set:
   ....:     match = pat.match(url)
   ....:     if match and match.groups():
   ....:         print(match.groups()[0])
   ....:
a341.tar.gz
a341.gz
aadssd.gz
aadssd.gz
blah.gz
xxxbl.gz

Adam Nelson · Accepted Answer · 2016-04-11 19:28:35Z

Using PurePosixPath, which is not operating-system dependent and handles URLs gracefully, is the Pythonic solution:

>>> from pathlib import PurePosixPath
>>> path = PurePosixPath('http://example.com/somefile.zip')
>>> path.name
'somefile.zip'
>>> path = PurePosixPath('http://example.com/nested/somefile.zip')
>>> path.name
'somefile.zip'

Notice how there is no network traffic here or anything (i.e. those urls don't go anywhere) - just using standard parsing rules.
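
One caveat: PurePosixPath knows nothing about URL syntax, so a query string or fragment stays attached to the name. A possible refinement, sketched here for Python 3, is to hand it only the path component from urlsplit. The helper name name_from_url and the sample URL are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def name_from_url(url):
    # Strip scheme, host, query and fragment before treating it as a path.
    return PurePosixPath(urlsplit(url).path).name


url = 'http://example.com/nested/somefile.tar.gz?foo=bar'

print(PurePosixPath(url).name)  # the query string is still attached here
print(name_from_url(url))       # query string removed

# The usual pathlib helpers then work on the clean path as well.
path = PurePosixPath(urlsplit(url).path)
print(path.stem, path.suffix)
```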

Nick Blexrud · Accepted Answer · 2016-04-29 16:45:44Z

import os, urllib2

resp = urllib2.urlopen('http://www.example.com/index.html')
my_url = resp.geturl()
os.path.split(my_url)[1]  # 'index.html'

This is not openfile, but maybe still helps :)
