
Downloading and Uncompressing Archive (.zip) Files and Folders:

Often data is downloaded from the web in compressed form, or data folders are joined into one zip file. Before processing, the files must be uncompressed. If you're already working in a Python environment, it's useful to be able to download the .zip file and unzip it in the same piece of code.

The most commonly used URL-retrieving library is requests, and I prefer this package for all my scraping needs. But the requests package is not suggested for downloading large non-HTML files (source). Instead, urllib2.urlopen can be used. If a .zip file contains multiple files or folders, it's best to unzip it into a new folder. The following code is OS-agnostic (and adapted from here):

# set remote and local file locations
url = 'http://www.colorado.edu/conflict/peace/download/peace_essay.ZIP'
filename = 'data.zip'
folderpath = 'data'

def download(url, filename):
    # fetch the remote file and save it to disk in binary mode
    import urllib2
    page = urllib2.urlopen(url)
    open(filename, 'wb').write(page.read())

def unzip(source_filename, dest_dir):
    # rebuild each member's directory path under dest_dir, skipping
    # drive letters and any '.', '..' or empty components
    import zipfile, os.path
    with zipfile.ZipFile(source_filename) as zf:
        for member in zf.infolist():
            words = member.filename.split('/')
            path = dest_dir
            for word in words[:-1]:
                drive, word = os.path.splitdrive(word)
                head, word = os.path.split(word)
                if word in (os.curdir, os.pardir, ''):
                    continue
                path = os.path.join(path, word)
            zf.extract(member, path)

download(url, filename)
unzip(filename, folderpath)

If you want to check if the file exists on the remote server (more details):

import urllib2

page = urllib2.urlopen(url)
if page.code == 200:
    print "Exists!"

If you want to check whether the .zip file has been updated since the last download, you can first read the headers (more details). The 'last-modified' date and 'content-length' can tell you if it has changed:

print page.headers.items()

>> [('content-length', '39600'), ('set-cookie', 'f5_persistence=1729145024.20480.0000; path=/'), ('accept-ranges', 'bytes'), ('server', 'Apache'), ('last-modified', 'Fri, 18 Dec 1998 23:27:52 GMT'), ('connection', 'close'), ('etag', '"a0a66ce5-9ab0-33f4cb8492e00"'), ('date', 'Tue, 15 Apr 2014 07:27:54 GMT'), ('content-type', 'application/zip')]
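A minimal sketch of that comparison, assuming you kept the headers from the previous download (the is_updated helper and the saved header values below are illustrative, not part of the original code):

import urllib2

def is_updated(url, old_headers):
    # read only the headers of the remote file and compare the two
    # fields that signal a change
    page = urllib2.urlopen(url)
    new_headers = dict(page.headers.items())
    for key in ('last-modified', 'content-length'):
        if new_headers.get(key) != old_headers.get(key):
            return True
    return False

# headers saved from the previous download (illustrative values)
old_headers = {'last-modified': 'Fri, 18 Dec 1998 23:27:52 GMT',
               'content-length': '39600'}
if is_updated(url, old_headers):
    download(url, filename)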

For a simple .zip file that contains just a single .txt file, a much simpler piece of code can be used:

import zipfile

# extract everything in the archive into the current directory
with zipfile.ZipFile('spam.zip') as myzip:
    myzip.extractall()

For .gz (gzip) files, the process is similar:

import gzip

# gzip compresses a single file, so just read the decompressed contents
with gzip.open('file.txt.gz', 'rb') as f:
    file_content = f.read()

Other formats: .tar.gz, .7z and .bz2 files can also be processed.
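As a rough sketch of how two of those look with the standard library (the filenames 'archive.tar.gz' and 'file.txt.bz2' are placeholders; .7z is not supported by the standard library and needs an external tool or third-party package):

import tarfile
import bz2

# .tar.gz: tarfile handles both the gzip layer and the tar archive
with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    tar.extractall('data')

# .bz2: like gzip, it compresses a single file
f = bz2.BZ2File('file.txt.bz2')
file_content = f.read()
f.close()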