Speeding up process speed of file downloads from the web

Question

I'm writing a program that has to download a bunch of files from the web before it can even run, so I created a function that will download all the files and "initialize" the program called init_program, how it works is it runs through a couple dicts that have urls to a gistfiles on github. It pulls the urls and uses urllib2 to download them. I won't be able to add all the files but you can try it out by cloning the repo here. Here's the function that will create the files from the gists:

def init_program(): """ Initialize the program and allow all the files to be downloaded This will take awhile to process, but I'm working on the processing speed """ downloaded_wordlists = [] # Used to count the amount of items downloaded downloaded_rainbow_tables = [] print("\n") banner("Initializing program and downloading files, this may take awhile..") print("\n") # INIT_FILE is a file that will contain "false" if the program is not initialized # And "true" if the program is initialized with open(INIT_FILE) as data: if data.read() == "false": for item in GIST_DICT_LINKS.keys(): sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(len(downloaded_wordlists) + 1, len(GIST_DICT_LINKS.keys()))) sys.stdout.flush() new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+") # Download the wordlists and save them into a file wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item]) new_wordlist.write(wordlist_data.read()) downloaded_wordlists.append(item + ".txt") new_wordlist.close() print("\n") banner("Done with wordlists, moving to rainbow tables..") print("\n") for table in GIST_RAINBOW_LINKS.keys(): sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(len(downloaded_rainbow_tables) + 1, len(GIST_RAINBOW_LINKS.keys()))) new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table)) # Download the rainbow tables and save them into a file rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table]) new_rainbowtable.write(rainbow_data.read()) downloaded_rainbow_tables.append(table + ".rtc") new_rainbowtable.close() open(data, "w").write("true").close() # Will never be initialized again else: pass return downloaded_wordlists, downloaded_rainbow_tables

This works, yes, however it's extremely slow, due to the size of the files, each file has at least 100,000 lines in it. How can I speed up this function to make it faster and more user friendly?

Hmmm, It depends on your wifi connection. There is almost no way you can speed this up except improve your wifi. Sorry to say. — Qwerty
– Qwerty, Commented Dec 7, 2016 at 2:26
@Qwerty even with threading? I mean this is slow, yeah it will be worth it in the end, but it's a slow initialization process.. — papasmurf
– papasmurf, Commented Dec 7, 2016 at 2:29

GustavoIP · Accepted Answer · 2016-12-07 03:34:00Z

Some weeks ago I faced a similar situation where it was needed to download many huge files but all simple pure Python solutions that I found was not good enough in terms of download optimization. So I found Axel — Light command line download accelerator for Linux and Unix

What is Axel?

Axel tries to accelerate the downloading process by using multiple connections for one file, similar to DownThemAll and other famous programs. It can also use multiple mirrors for one download.

Using Axel, you will get files faster from Internet. So, Axel can speed up a download up to 60% (approximately, according to some tests).

Usage: axel [options] url1 [url2] [url...] --max-speed=x -s x Specify maximum speed (bytes per second) --num-connections=x -n x Specify maximum number of connections --output=f -o f Specify local output file --search[=x] -S [x] Search for mirrors and download from x servers --header=x -H x Add header string --user-agent=x -U x Set user agent --no-proxy -N Just don't use any proxy server --quiet -q Leave stdout alone --verbose -v More status information --alternate -a Alternate progress indicator --help -h This information --version -V Version information

As axel is written in C and there's no C extension for Python, so I used the subprocess module to execute him externally and works perfectly for me.

You can do something like this:

cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o', "{0}".format(filename), url] process = subprocess.Popen(cmd,stdin=subprocess.PIPE, stdout=subprocess.PIPE)

You can also parse the progress of each download parsing the output of the stdout.

 while True: line = process.stdout.readline() progress = YOUR_GREAT_REGEX.match(line).groups() ...

This only works if the hosting site supports parallel downloads
It's true, but can be useful in 'the most' cases. But it's not a silver bullet, unfortunately.
@GustavolP I'm also working on a Windows machine.. This is a genius work around though so +1

Paul Rudin · Accepted Answer · 2016-12-07 08:20:08Z

You're blocking whilst you wait for each download. So the total time is the sum of the round trip time for each download. Your code will likely spend a lot of time waiting for the network traffic. One way to improve this is not to block whilst you wait for each response. You can do this in several ways. For example by handing off each request to a separate thread (or process), or by using an event loop and coroutines. Read up on the threading and asyncio modules.

Elaborate what you mean by blocking while waiting for each download?
urlopen() followed by read() means that you're waiting for a connection to be opened, the request to be sent and the response to arrive. This network traffic is likely to a significant amount of time, and most of the time taken by your code is waiting for the network traffic. When you've got lots of requests to make you don't want to wait for the response to the first, before you initiate the next.
So how do you propose I do that? Create a queue of threads, and just pull them when I need them?

Collectives™ on Stack Overflow

Speeding up process speed of file downloads from the web

2 Answers 2

3 Comments

3 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

3 Comments

Linked

Related