
Say I am retrieving a list of URLs from a server using Python's urllib2 library. I noticed that it takes about 5 seconds to get one page, so it would take a long time to fetch all the pages I want to collect.

I am thinking that, out of those 5 seconds, most of the time is spent on the server side, so I am wondering whether I could use the threading library. With, say, 5 threads, the average time per page might drop dramatically, maybe to 1 or 2 seconds (though it might make the server a bit busy). How could I tune the number of threads so that I get a decent speed without pushing the server too hard?
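A minimal sketch of the idea, assuming Python 3 (where urllib2 became urllib.request); the `fetch`/`fetch_parallel` names and the one-thread-per-URL layout are just for illustration:

```python
import threading
import urllib.request

def fetch(url, results, index):
    # Download one page and store its body in the shared results list.
    with urllib.request.urlopen(url, timeout=10) as resp:
        results[index] = resp.read()

def fetch_parallel(urls, worker=fetch):
    # One thread per URL; fine for a small batch, use a pool for many URLs.
    results = [None] * len(urls)
    threads = [threading.Thread(target=worker, args=(url, results, i))
               for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each download mostly waits on the network, the threads overlap their waiting instead of adding their times together.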

Thanks!

Update: I increased the number of threads one by one and monitored the total time (in minutes) spent scraping 100 URLs. It turned out that the total time dropped dramatically when I changed the number of threads to 2, and kept decreasing as I added more threads, but the improvement from each additional thread became less and less obvious. (The total time even bounced back when I created too many threads.) I know this is only a specific case for the web server I harvest, but I decided to share it to show the power of threading, and I hope it will be helpful to somebody one day.

[Figure: total scraping time (minutes) for 100 URLs versus number of threads]

2 Answers


There are a few things you can do. If the URLs are on different domains, you might just fan the work out to threads, each downloading a page from a different domain.

If your URLs all point to the same server and you do not want to stress it, you can retrieve them sequentially. If the server is happy with a couple of parallel requests, you can look into pools of workers: start, say, a pool of four workers and add all your URLs to a queue, from which the workers pull new URLs.
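A sketch of such a worker pool, assuming Python 3 (queue.Queue and urllib.request); the four-worker count comes from the answer, everything else is illustrative:

```python
import queue
import threading
import urllib.request

def fetch(url):
    # Download one page body; swap in your own parsing/saving logic.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def run_pool(urls, n_workers=4, fetch_one=fetch):
    # Workers pull URLs from a shared queue until it is empty.
    todo = queue.Queue()
    for url in urls:
        todo.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return
            body = fetch_one(url)
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The fixed pool size caps how many requests hit the server at once, which is exactly the knob the question asks about.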

Since you tagged the question with "screen-scraping" as well, scrapy is a dedicated scraping framework, which can work in parallel.

Python 3 comes with a set of new built-in concurrency primitives under concurrent.futures.
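For instance, a hedged sketch using concurrent.futures.ThreadPoolExecutor (available since Python 3.2); `fetch` and the worker count are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # Download one page body.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def download_all(urls, workers=4, fetch_one=fetch):
    # map() preserves input order and reuses a fixed pool of threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

This does the same job as a hand-rolled queue-plus-workers setup, with the pool management handled for you.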


3 Comments

Actually, they all point to the same server, and I am not quite sure what the real difference between the threading and multiprocessing packages is in this case. "The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine". Does that mean Python is actually only using one processor, or ...
For IO-bound tasks, you can use either. For CPU-bound tasks, multiprocessing will utilize all available cores, while threading will run on a single core due to the GIL.
I did a small experiment and recorded the total time to scrape 100 URLs with different numbers of threads. The result is pretty interesting; I will try the multiprocessing library sometime and update my post. Thanks a lot for your explanation.

Here is a caveat. I have encountered a number of servers powered by somewhat "elderly" releases of IIS. They often will not service a request unless there is at least a one-second delay between requests.
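If you hit such a server, a simple throttle helps. This sketch (my own illustration, not from the answer) enforces a minimum delay between sequential requests:

```python
import time
import urllib.request

def fetch(url):
    # Download one page body.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def fetch_politely(urls, delay=1.0, fetch_one=fetch):
    # Sleep between requests so fragile servers are not overwhelmed.
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)
        results.append(fetch_one(url))
    return results
```

With parallel workers, you would instead rate-limit per worker or share a throttle across the pool.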

