
I regularly have to download and rename HTML pages in bulk and wrote this simple code for it a while ago:

    import os
    import shutil
    import socket
    import urllib.request

    socket.setdefaulttimeout(5)

    file_read = open(my_file, "r")
    lines = file_read.readlines()
    for line in lines:
        try:
            sl = line.strip().split(";")
            url = sl[0]
            newname = str(sl[1]) + ".html"
            urllib.request.urlretrieve(url, newname)
        except:
            pass
    file_read.close()

This works well enough for a few hundred websites, but takes waaaaay too long for a larger number of downloads (20-50k). What would be the simplest and best way to speed it up?

  • Does this answer your question? Commented Mar 3, 2022 at 10:08

1 Answer


Q :
" I regularly have to ...
What would be the simplest and best way to speed it up ? "

A :
The SIMPLEST ( what the commented approach is not ) &
the BEST way
is to at least :
(a)
minimise all overheads ( 50k times Thread-instantiation costs being one such class of costs ),
(b)
harness the embarrassing independence of the individual fetches ( yet not being a True-[PARALLEL] process-flow ),
(c)
go as close as possible to the bleeding edge of a just-[CONCURRENT], latency-masked process-flow

Given
both the simplicity & performance seem to be the measure of "best"-ness:

Any costs that do not first justify themselves by a sufficiently increased performance, and that, second, do not create an additional positive net-effect on performance ( speed-up ), are performance ANTI-patterns & unforgivable Computer Science sins.

Therefore
I could not promote using GIL-lock (by-design even a just-[CONCURRENT]-processing prevented) bound & performance-suffocated step-by-step round-robin stepping of any amount of Python-threads in a one-after-another-after-another-...-re-[SERIAL]-ised chain of about 100 [ms]-quanta of code-interpretation time-blocks a one and only one such Python-thread is being let to run ( where all others are blocked-waiting ... being rather a performance ANTI-pattern, isn't it? ),
so
rather go in for process-based concurrency of the work-flow ( performance gains a lot here, for ~ 50k url-fetches, where the many hundreds to thousands of [ms] of latency ( protocol-and-security handshaking setup + remote url-decode + remote content-assembly + remote content-into-protocol-encapsulation + remote-to-local network-flows + local protocol-decode + ... ) can be masked by the concurrent flow of fetches ).

Sketched process-flow framework :

    import os
    from joblib import Parallel, delayed

    MAX_WORKERs = ( os.cpu_count() - 1 )       # i.e. n_CPU_cores - 1

    def main( my_file ):
        """ __doc__
        .INIT worker-processes, each with a split-scope of tasks
        """
        IDs = range( max( 1, MAX_WORKERs ) )
        RES_if_need = Parallel( n_jobs = MAX_WORKERs
                                )( delayed( block_processor_FUN #-- fun CALLABLE
                                            )( my_file,         #---------- fun PAR1
                                               wPROC            #---------- fun PAR2
                                               )
                                       for wPROC in IDs
                                   )

    def block_processor_FUN( file_with_URLs = None,
                             file_from_PART = 0 ):
        """ __doc__
        .OPEN file_with_URLs
        .READ file_from_PART, row-wise - till next part starts - ref. global MAX_WORKERs
        """
        ...

This is the initial Python-interpreter __main__-side trick to spawn just enough worker-processes, which start crawling the my_file-"list" of URL-s independently, AND an indeed just-[CONCURRENT] flow of work starts, each worker being independent of any other.

The block_processor_FUN(), passed by reference to the workers, does simply open the file and starts fetching/processing only its "own" fraction of it, being from ( wPROC / MAX_WORKERs ) to ( ( wPROC + 1 ) / MAX_WORKERs ) of its number of lines.

That simple.
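A minimal sketch of what such a block_processor_FUN() body might look like, assuming the url;name line format from the question. The fixed MAX_WORKERS constant, the slice arithmetic, and the exception handling here are illustrative choices, not part of the original sketch:

```python
# Illustrative sketch only: each worker re-opens the file and processes
# just its own fraction of the lines, from ( wPROC / MAX_WORKERS ) to
# ( ( wPROC + 1 ) / MAX_WORKERS ), so no inter-process coordination is needed.
import urllib.request

MAX_WORKERS = 4                      # assumed; plays the role of MAX_WORKERs above

def block_processor_FUN( file_with_URLs = None,
                         file_from_PART = 0 ):
    with open( file_with_URLs ) as f:
        lines = f.readlines()
    n  = len( lines )
    lo = (   file_from_PART       * n ) // MAX_WORKERS   # first own line
    hi = ( ( file_from_PART + 1 ) * n ) // MAX_WORKERS   # one past the last own line
    for line in lines[lo:hi]:
        try:
            url, name = line.strip().split( ";" )[:2]
            urllib.request.urlretrieve( url, name + ".html" )
        except ( ValueError, OSError ):
            pass                     # skip malformed lines / failed fetches
```

The integer floor-division bounds guarantee the per-worker slices partition the file exactly, with no line fetched twice and none skipped.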

If willing to tune up the corner-cases, where some URL may take and takes longer than others, one may improve the form of load-balancing towards fair-queueing, yet at the cost of a more complex design ( many process-to-process messaging queues are available ): having a { __main__ | main() }-side FQ/LB-feeder and making the worker-processes retrieve their next task from such a job-request FQ/LB-facility.

This is more complex & more robust to an uneven distribution of URL-serving durations "across" the my_file-ordered list of URL-s to serve.
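A minimal sketch of such an FQ/LB variant, here assuming Python's standard multiprocessing.Queue as the process-to-process messaging facility. The worker/main names, the None-sentinel shutdown protocol, and the n_workers default are illustrative choices, not prescribed above:

```python
# Illustrative sketch only: a { __main__ | main() }-side feeder pushes URL-jobs
# into one shared queue; worker-processes pull their next job the moment they
# finish the previous one, which load-balances uneven per-URL latencies.
import multiprocessing as mp
import urllib.request

def worker( job_q ):
    while True:
        job = job_q.get()
        if job is None:                  # sentinel: no more work for this worker
            break
        url, newname = job
        try:
            urllib.request.urlretrieve( url, newname + ".html" )
        except OSError:
            pass                         # skip failed fetches

def main( my_file, n_workers = 4 ):
    job_q = mp.Queue()
    procs = [ mp.Process( target = worker, args = ( job_q, ) )
              for _ in range( n_workers ) ]
    for p in procs:
        p.start()
    with open( my_file ) as f:           # feed jobs as fast as the file reads
        for line in f:
            url, name = line.strip().split( ";" )[:2]
            job_q.put( ( url, name ) )
    for _ in procs:
        job_q.put( None )                # one sentinel per worker
    for p in procs:
        p.join()
```

Putting exactly one sentinel per worker lets every process drain the queue and exit cleanly, so join() never blocks forever.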

The choices of levels of simplicity / complexity compromises, that impact the resulting performance / robustness are yours.

For more details you may like to read this and the code from this, and the there-directed examples or tips for further performance-boosting.
