
I am writing a web-scraping application in Python. The website I am scraping has URLs of the form www.someurl.com/getPage?id=x, where x is a number identifying the page. At the moment I am downloading all the pages with urlretrieve.

Here is the basic form of my script:

from urllib import urlretrieve  # Python 2; urllib.request.urlretrieve in Python 3

for i in range(1, 1001):
    urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')

Now, my question: is it possible to download the pages simultaneously? As it stands, the script blocks while waiting for each page to download. Can I ask Python to open more than one connection to the server?

  • You could use threads: docs.python.org/3.4/library/threading.html Commented May 18, 2015 at 9:47
  • @Paco, how many should I use? Commented May 18, 2015 at 9:48
  • Look at these libraries: requests, requests-futures. Commented May 18, 2015 at 9:49
  • @Paco, can you provide a small example? Commented May 18, 2015 at 9:49
  • You should probably be aware that some servers don't "take too kindly" to you hammering them with requests... Commented May 18, 2015 at 9:50
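Along the lines the comments suggest, here is a minimal sketch using the standard library's concurrent.futures (Python 3). The URL pattern comes from the question; fetch_page, fetch_all, and the injectable fetch parameter are illustrative names, the last one added purely to make the helper easy to exercise without network access:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def fetch_page(i):
    # Download page i and save it as i.html (URL pattern from the question).
    urlretrieve('http://someurl.com/getPage?id={}'.format(i), '{}.html'.format(i))
    return i

def fetch_all(ids, fetch=fetch_page, workers=10):
    # A bounded pool keeps at most `workers` connections open at once,
    # which avoids hammering the server with 1000 simultaneous requests.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, ids))

# fetch_all(range(1, 1001))  # uncomment to download pages 1..1000
```

pool.map returns results in input order even though the downloads overlap, so the saved files line up with the ids you passed in.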

3 Answers


Getting some Google searches concurrently in Python 2:

from multiprocessing.pool import ThreadPool
from urllib import urlretrieve

def loadpage(x):
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))

p = ThreadPool(10)  # the max number of webpages to get at once
p.map(loadpage, range(50))

You could just as easily use Pool instead of ThreadPool, which would run the work on multiple processes/CPU cores. But since this is IO-bound, I think the concurrency that threading offers is enough.
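In Python 3 the same pattern still works; the only change is that urlretrieve now lives in urllib.request. A sketch of the port, with the actual download call left commented out so nothing hits the network until you run it deliberately:

```python
from multiprocessing.pool import ThreadPool
from urllib.request import urlretrieve  # was `from urllib import urlretrieve` in Python 2

def loadpage(x):
    # Same download function as the Python 2 version above.
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))

# To run for real:
#     p = ThreadPool(10)           # at most 10 pages in flight at once
#     p.map(loadpage, range(50))   # blocks until all 50 downloads finish
```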


5 Comments

Erm, sorry, I'm a bit of a newbie - what does p.map do?
The given function is applied to each element in the iterable. So it is like calling loadpage(0), loadpage(1), ..., loadpage(49). It is called on separate threads, up to 10 at once, since that is the size of the thread pool.
So p.map automatically takes values from range and supplies them to loadpage?
Ah, this is perfect, exactly what I wanted.
No problem, and you're right with your understanding of p.map.
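The behaviour described in these comments can be seen without any network traffic; a hypothetical squaring function stands in for loadpage:

```python
from multiprocessing.pool import ThreadPool

p = ThreadPool(10)
# map() applies the function to every element of the iterable, running up to
# 10 calls at a time, and returns the results in the original input order.
squares = p.map(lambda x: x * x, range(5))
p.close()
p.join()
print(squares)  # [0, 1, 4, 9, 16]
```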

No, you cannot simply ask Python to open more than one connection - you have to either use a framework that does this for you or write a threaded application yourself.

scrapy is a framework for downloading multiple pages at the same time.

twisted is an event-driven networking framework that handles many protocols. It is a lot simpler to just use scrapy, but if you insist on building things yourself, this is probably what you want to use.

Comments


You could use multi-threading for web scraping, as described in the threading documentation linked above.

OR

you could look at a simple threading example.
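A bare-bones version of that approach with the standard threading module might look like this. The URL pattern comes from the question; fetch, fetch_all, and the target parameter are illustrative names (target is injectable only so the helper can be exercised without network access):

```python
import threading
from urllib.request import urlretrieve

def fetch(i):
    # Download page i and save it as i.html (URL pattern from the question).
    urlretrieve('http://someurl.com/getPage?id={}'.format(i), '{}.html'.format(i))

def fetch_all(ids, target=fetch):
    # One thread per page id; join() blocks until every download has finished.
    threads = [threading.Thread(target=target, args=(i,)) for i in ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# fetch_all(range(1, 11))  # one thread per page; prefer a pool for large batches
```

Spawning one thread per page is fine for a handful of downloads, but for 1000 pages a bounded pool (as in the accepted answer) is kinder to both your machine and the server.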

Comments
