
I am writing a web-scraping application in Python. The website I am scraping has URLs of the form www.someurl.com/getPage?id=x, where x is a number identifying the page. At the moment I am downloading all the pages with urlretrieve.

Here is the basic form of my script:

from urllib import urlretrieve  # Python 2; urllib.request.urlretrieve in Python 3

for i in range(1, 1001):
    urlretrieve('http://someurl.com/getPage?id=' + str(i), str(i) + '.html')

Now, my question: is it possible to download the pages simultaneously? As it stands, the script blocks while waiting for each page to download. Can I ask Python to open more than one connection to the server?

  • You could use threads: docs.python.org/3.4/library/threading.html Commented May 18, 2015 at 9:47
  • @Paco, how many should I use? Commented May 18, 2015 at 9:48
  • Look at these libraries: requests, requests-futures. Commented May 18, 2015 at 9:49
  • @Paco, can you provide a small example? Commented May 18, 2015 at 9:49
  • You should probably be aware that some servers don't "take too kindly" to you hammering them with requests... Commented May 18, 2015 at 9:50
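Along the lines the comments suggest, here is a minimal sketch using the standard library's concurrent.futures (Python 3). The URL pattern comes from the question; fetch_page, fetch_all, and the injectable fetch parameter are illustrative names, the last one added purely to make the helper easy to exercise without network access:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def fetch_page(i):
    # Download page i and save it as i.html (URL pattern from the question).
    urlretrieve('http://someurl.com/getPage?id={}'.format(i), '{}.html'.format(i))
    return i

def fetch_all(ids, fetch=fetch_page, workers=10):
    # A bounded pool keeps at most `workers` connections open at once,
    # which avoids hammering the server with 1000 simultaneous requests.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, ids))

# fetch_all(range(1, 1001))  # uncomment to download pages 1..1000
```

pool.map returns results in input order even though the downloads overlap, so the saved files line up with the ids you passed in.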

3 Answers


Getting some Google searches concurrently in Python 2:

from multiprocessing.pool import ThreadPool
from urllib import urlretrieve

def loadpage(x):
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))

p = ThreadPool(10)  # the max number of webpages to get at once
p.map(loadpage, range(50))

You could just as easily use Pool instead of ThreadPool, which would run the work on multiple processes/CPU cores. But since this is IO-bound, I think the concurrency that threading offers is enough.
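In Python 3 the same pattern still works; the only change is that urlretrieve now lives in urllib.request. A sketch of the port, with the actual download call left commented out so nothing hits the network until you run it deliberately:

```python
from multiprocessing.pool import ThreadPool
from urllib.request import urlretrieve  # was `from urllib import urlretrieve` in Python 2

def loadpage(x):
    # Same download function as the Python 2 version above.
    urlretrieve('http://google.com/search?q={}'.format(x), '{}.html'.format(x))

# To run for real:
#     p = ThreadPool(10)           # at most 10 pages in flight at once
#     p.map(loadpage, range(50))   # blocks until all 50 downloads finish
```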


5 Comments

Erm, sorry, I'm a bit of a newbie - what does p.map do?
The given function is applied to each element in the iterable. So it is like calling loadpage(0), loadpage(1), ..., loadpage(49). It is called on separate threads, up to 10 at once, since that is the size of the thread pool.
So p.map automatically takes values from range and supplies them to loadpage?
Ah, this is perfect, exactly what I wanted.
No problem, and you're right with your understanding of p.map.
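The behaviour described in these comments can be seen without any network traffic; a hypothetical squaring function stands in for loadpage:

```python
from multiprocessing.pool import ThreadPool

p = ThreadPool(10)
# map() applies the function to every element of the iterable, running up to
# 10 calls at a time, and returns the results in the original input order.
squares = p.map(lambda x: x * x, range(5))
p.close()
p.join()
print(squares)  # [0, 1, 4, 9, 16]
```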

No, you cannot simply ask Python to open more than one connection - you have to either use a framework that does this for you or write a threaded application yourself.

scrapy is a framework for downloading multiple pages at the same time.

twisted is an event-driven networking framework that handles many protocols. It is a lot simpler to just use scrapy, but if you insist on building things yourself, this is probably what you want to use.

Comments


You could use multi-threading for web scraping, as described in the threading documentation linked above.

OR

you could look at a simple threading example.
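A bare-bones version of that approach with the standard threading module might look like this. The URL pattern comes from the question; fetch, fetch_all, and the target parameter are illustrative names (target is injectable only so the helper can be exercised without network access):

```python
import threading
from urllib.request import urlretrieve

def fetch(i):
    # Download page i and save it as i.html (URL pattern from the question).
    urlretrieve('http://someurl.com/getPage?id={}'.format(i), '{}.html'.format(i))

def fetch_all(ids, target=fetch):
    # One thread per page id; join() blocks until every download has finished.
    threads = [threading.Thread(target=target, args=(i,)) for i in ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# fetch_all(range(1, 11))  # one thread per page; prefer a pool for large batches
```

Spawning one thread per page is fine for a handful of downloads, but for 1000 pages a bounded pool (as in the accepted answer) is kinder to both your machine and the server.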

Comments
