
I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:

    import os
    import shutil
    import socket
    import urllib.request

    socket.setdefaulttimeout(5)

    file_read = open(my_file, "r")
    lines = file_read.readlines()
    for line in lines:
        try:
            sl = line.strip().split(";")
            url = sl[0]
            newname = str(sl[1]) + ".html"
            urllib.request.urlretrieve(url, newname)
        except:
            pass
    file_read.close()

This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?

  • Does this answer your question? Commented Mar 3, 2022 at 10:08

1 Answer


Q :
" I regularly have to ...
What would be the simplest and best way to speed it up ? "

A :
The SIMPLEST ( what the commented approach is not ) &
the BEST way
is to at least :
(a)
minimise all overheads ( 50k times Thread-instantiation costs being one such class of costs ),
(b)
harness the embarrassing independence of the individual fetches ( yet not being a True-[PARALLEL] process-flow ),
(c)
go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow

Given
both the simplicity & performance seem to be the measure of "best"-ness:

Any costs that do not first justify themselves by a sufficiently increased performance, and that, second, do not create an additional positive net-effect on performance ( speed-up ), are performance ANTI-patterns & unforgivable Computer Science sins.

Therefore
I could not promote using GIL-lock (by-design even a just-[CONCURRENT]-processing prevented) bound & performance-suffocated step-by-step round-robin stepping of any amount of Python-threads in a one-after-another-after-another-...-re-[SERIAL]-ised chain of about 100 [ms]-quanta of code-interpretation time-blocks a one and only one such Python-thread is being let to run ( where all others are blocked-waiting ... being rather a performance ANTI-pattern, isn't it? ),
so
rather go in for process-based concurrency of the work-flow ( performance gains a lot here, for ~ 50k url-fetches, where the many hundreds to thousands of [ms] of latency ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol-encapsulation + remote-to-local network-flows + local protocol-decode + ... ) can be masked by the concurrent flow of fetches ).

Sketched process-flow framework :

    import os
    from joblib import Parallel, delayed

    MAX_WORKERs = ( os.cpu_count() - 1 )       # i.e. n_CPU_cores - 1

    def main( my_file ):
        """ __doc__
        .INIT worker-processes, each with a split-scope of tasks
        """
        IDs = range( max( 1, MAX_WORKERs ) )
        RES_if_need = Parallel( n_jobs = MAX_WORKERs
                                )( delayed( block_processor_FUN #-- fun CALLABLE
                                            )( my_file,         #---------- fun PAR1
                                               wPROC            #---------- fun PAR2
                                               )
                                       for wPROC in IDs
                                   )

    def block_processor_FUN( file_with_URLs = None,
                             file_from_PART = 0 ):
        """ __doc__
        .OPEN file_with_URLs
        .READ file_from_PART, row-wise - till next part starts - ref. global MAX_WORKERs
        """
        ...

This is the initial Python-interpreter __main__-side trick to spawn just enough worker-processes, which start crawling the my_file-"list" of URL-s independently, AND an indeed just-[CONCURRENT] flow of work starts, each worker being independent of any other.

The block_processor_FUN(), passed by reference to the workers, does simply open the file and starts fetching/processing only its "own" fraction of it, being from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.

That simple.
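A minimal sketch of what such a block_processor_FUN() body might look like, assuming the url;name line format from the question. The fixed MAX_WORKERS constant, the slice arithmetic, and the exception handling here are illustrative choices, not part of the original sketch:

```python
# Illustrative sketch only: each worker re-opens the file and processes
# just its own fraction of the lines, from ( wPROC / MAX_WORKERS ) to
# ( ( wPROC + 1 ) / MAX_WORKERS ), so no inter-process coordination is needed.
import urllib.request

MAX_WORKERS = 4                      # assumed; plays the role of MAX_WORKERs above

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0 ):
    with open( file_with_URLs ) as f:
        lines = f.readlines()
    n  = len( lines )
    lo = (   file_from_PART       * n ) // MAX_WORKERS   # first own line
    hi = ( ( file_from_PART + 1 ) * n ) // MAX_WORKERS   # one past the last own line
    for line in lines[lo:hi]:
        try:
            url, name = line.strip().split( ";" )[:2]
            urllib.request.urlretrieve( url, name + ".html" )
        except ( ValueError, OSError ):
            pass                     # skip malformed lines / failed fetches
```

The integer floor-division bounds guarantee the per-worker slices partition the file exactly, with no line fetched twice and none skipped.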

If willing to tune up the corner-cases, where some URL may take and takes longer than others, one may improve the form of load-balancing towards fair-queueing, yet at the cost of a more complex design ( many process-to-process messaging queues are available ): having a { __main__ | main() }-side FQ/LB-feeder and making the worker-processes retrieve their next task from such a job-request FQ/LB-facility.

This is more complex & more robust to an uneven distribution of URL-serving durations "across" the my_file-ordered list of URL-s to serve.
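A minimal sketch of such an FQ/LB variant, here assuming Python's standard multiprocessing.Queue as the process-to-process messaging facility. The worker/main names, the None-sentinel shutdown protocol, and the n_workers default are illustrative choices, not prescribed above:

```python
# Illustrative sketch only: a { __main__ | main() }-side feeder pushes URL-jobs
# into one shared queue; worker-processes pull their next job the moment they
# finish the previous one, which load-balances uneven per-URL latencies.
import multiprocessing as mp
import urllib.request

def worker( job_q ):
    while True:
        job = job_q.get()
        if job is None:                  # sentinel: no more work for this worker
            break
        url, newname = job
        try:
            urllib.request.urlretrieve( url, newname + ".html" )
        except OSError:
            pass                         # skip failed fetches

def main( my_file, n_workers = 4 ):
    job_q = mp.Queue()
    procs = [ mp.Process( target = worker, args = ( job_q, ) )
              for _ in range( n_workers ) ]
    for p in procs:
        p.start()
    with open( my_file ) as f:           # feed jobs as fast as the file reads
        for line in f:
            url, name = line.strip().split( ";" )[:2]
            job_q.put( ( url, name ) )
    for _ in procs:
        job_q.put( None )                # one sentinel per worker
    for p in procs:
        p.join()
```

Putting exactly one sentinel per worker lets every process drain the queue and exit cleanly, so join() never blocks forever.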

The choices of levels of simplicity / complexity compromises, that impact the resulting performance / robustness are yours.

For more details you may like to read this and the code from this, and the there-directed examples or tips for further performance-boosting.
