
I'm trying to create a multithreaded downloader in Python. Let's say I have a link to a 100 MB video and I want to download it using 5 threads, with each thread downloading 20 MB simultaneously. For that to happen I have to split the initial request into 5 parts that cover different parts of the file (like this: 0-20 MB, 20-40 MB, 40-60 MB, 60-80 MB, 80-100 MB). I searched and found that HTTP Range headers might help. Here's the sample code:

from urllib.request import urlopen, Request

url = some_video_url
# Trying to capture all the bytes between the 5000th and 10000th byte.
header = {'Range': 'bytes=%d-%d' % (5000, 10000)}
req = Request(url, headers=header)
res = urlopen(req)
r = res.read()

But the above code is reading the whole video instead of the byte range I wanted, so it clearly isn't working. Is there any way to read a specified range of bytes from any part of the video instead of reading from the start? Please try to explain in simple words.
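One way to tell whether a Range request was honoured is to check the response status: a server that supports byte serving answers 206 Partial Content, while a server that ignores the header answers 200 and sends the whole file. A minimal sketch (the url is a placeholder, and the helper names are illustrative):

```python
from urllib.request import Request, urlopen

def make_range_request(url, start, end):
    # Byte ranges are inclusive: bytes=5000-10000 asks for 5001 bytes.
    return Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})

def fetch_range(url, start, end):
    res = urlopen(make_range_request(url, start, end))
    # 206 means the server served only the requested range;
    # 200 means it ignored the Range header and sent everything.
    if res.status != 206:
        raise RuntimeError('server ignored the Range header')
    return res.read()
```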

  • Multithreading your downloader may not make things faster if the bottleneck is the bandwidth of the connection. Commented Sep 22, 2016 at 17:03
  • See the Wikipedia article titled Byte serving about the subject. The Content-Range header of the response will tell you what bytes are being delivered. Commented Sep 22, 2016 at 17:08
  • Yes, I agree with you. I just wanna give it a try. Commented Sep 22, 2016 at 17:16
  • You've got the article explaining how it works in simple words...go for it! Commented Sep 22, 2016 at 17:19
  • 2
    The server many not support what you want to do. "Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header." Commented Sep 22, 2016 at 18:24

1 Answer


But the above code is reading the whole video instead of the bytes I wanted and it clearly isn't working.

The core problem is that the default request uses the HTTP GET method, which pulls down the entire file all at once.

This can be fixed by adding request.get_method = lambda : 'HEAD'. This uses the HTTP HEAD method to fetch the Content-Length and to verify that range requests are supported.

Here is a working example of chunked requests. Just change the url to your url of interest:

from urllib.request import urlopen, Request

url = 'http://www.jython.org'   # This is an example. Use your own url here.
n = 5

request = Request(url)
request.get_method = lambda: 'HEAD'
r = urlopen(request)

# Verify that the server supports Range requests
assert r.headers.get('Accept-Ranges', '') == 'bytes', 'Range requests not supported'

# Compute chunk size using a double negation for ceiling division
total_size = int(r.headers.get('Content-Length'))
chunk_size = -(-total_size // n)

# Showing chunked downloads. This should be run in multiple threads.
chunks = []
for i in range(n):
    start = i * chunk_size
    end = start + chunk_size - 1    # Byte ranges are inclusive
    headers = dict(Range='bytes=%d-%d' % (start, end))
    request = Request(url, headers=headers)
    chunk = urlopen(request).read()
    chunks.append(chunk)

The separate requests in the for-loop can be done in parallel using threads or processes. This will give a nice speed-up when run in an environment with multiple physical connections to the internet. But if you only have one physical connection, that is likely to be the bottleneck, so parallel requests won't help as much as expected.
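The parallel version described above can be sketched with the standard library's concurrent.futures. Here url and total_size are whatever the initial HEAD request reported; the helper names split_ranges, download_chunk, and parallel_download are illustrative, not from the original answer:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

def split_ranges(total_size, n):
    """Return n inclusive (start, end) byte ranges covering total_size bytes."""
    chunk_size = -(-total_size // n)    # ceiling division, as in the answer
    return [(i * chunk_size, min((i + 1) * chunk_size, total_size) - 1)
            for i in range(n)]

def download_chunk(url, start, end):
    req = Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    return urlopen(req).read()

def parallel_download(url, total_size, n=5):
    ranges = split_ranges(total_size, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # submit() preserves chunk order via the futures list,
        # so the pieces join back in the right sequence.
        futures = [pool.submit(download_chunk, url, s, e) for s, e in ranges]
        return b''.join(f.result() for f in futures)
```

Note that the last range is clamped to total_size - 1, since ceiling division can otherwise overshoot the end of the file on the final chunk.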


2 Comments

Off-topic: The comment for chunk_size = -(-total_size // n) is pretty worthless. Why is it being done this way?
@martineau That is explained in the comment above it. The double negation computes "ceiling division". If the total_size were 1602 and n were 4, you would want the chunk_size to be 401 (ceiling division) rather than 400 (floor division) which wouldn't cover the whole dataset (400 * 4 < 1602 <= 401 * 4).
