
Can anyone tell me why this code populates the queue after starting the threads? The queue is filled after the for loop, but the ThreadUrl class already calls queue.get() in its run() method. How does this work? How can it get values from a queue that has not been populated yet?

    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # This is what confuses me! Shouldn't it be above the for loop??
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

Here is the full source

    import Queue
    import threading
    import urllib2
    import time
    from BeautifulSoup import BeautifulSoup

    hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
             "http://ibm.com", "http://apple.com"]

    queue = Queue.Queue()
    out_queue = Queue.Queue()

    class ThreadUrl(threading.Thread):
        """Threaded Url Grab"""
        def __init__(self, queue, out_queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.out_queue = out_queue

        def run(self):
            while True:
                #grabs host from queue
                host = self.queue.get()

                #grabs urls of hosts and then grabs chunk of webpage
                url = urllib2.urlopen(host)
                chunk = url.read()

                #place chunk into out queue
                self.out_queue.put(chunk)

                #signals to queue job is done
                self.queue.task_done()

    class DatamineThread(threading.Thread):
        """Threaded page parser"""
        def __init__(self, out_queue):
            threading.Thread.__init__(self)
            self.out_queue = out_queue

        def run(self):
            while True:
                #grabs chunk from out queue
                chunk = self.out_queue.get()

                #parse the chunk
                soup = BeautifulSoup(chunk)
                print soup.findAll(['title'])

                #signals to queue job is done
                self.out_queue.task_done()

    start = time.time()

    def main():
        #spawn a pool of threads, and pass them queue instance
        for i in range(5):
            t = ThreadUrl(queue, out_queue)
            t.setDaemon(True)
            t.start()

        #populate queue with data
        for host in hosts:
            queue.put(host)

        for i in range(5):
            dt = DatamineThread(out_queue)
            dt.setDaemon(True)
            dt.start()

        #wait on the queue until everything has been processed
        queue.join()
        out_queue.join()

    main()
    print "Elapsed Time: %s" % (time.time() - start)

1 Answer


The line host = self.queue.get() blocks the executing thread until an element appears in the queue.

So

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

creates 5 threads, each of which blocks waiting for an element in the queue.

    #populate queue with data
    for host in hosts:
        queue.put(host)

fills the queue. As soon as items become available, the waiting threads wake up and start processing them.
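The blocking behaviour can be seen in a small sketch. It is written for Python 3, where the Queue module is named queue, and it replaces the URL fetching with a stand-in computation; the names work, results, and worker are hypothetical and not from the question's code.

```python
import queue
import threading
import time

work = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = work.get()       # blocks while the queue is empty
        results.put(item * 2)   # stand-in for fetching a URL
        work.task_done()        # tell the queue this item is finished

# Start the workers first: the queue is still empty, so each one blocks in get().
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

time.sleep(0.1)  # give the workers time to reach get(); nothing has been processed yet

# Populate the queue afterwards: blocked workers wake up as items arrive.
for n in range(5):
    work.put(n)

work.join()  # returns once task_done() has been called for every item
processed = sorted(results.get() for _ in range(5))
print(processed)
```

The workers never see a "missing" queue: the queue object exists from the start, it is just empty, and get() simply waits.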


4 Comments

Thanks! Is there any difference between populating the queue before the loop and after the loop?
After your first loop (that creates the ThreadUrls), you have 6 threads: the main thread plus the five workers. Your main thread feeds the queue; the other threads consume from that queue, and if the queue is empty, they block until something appears in the queue. With populating the queue BEFORE creating the threads, the first thread will see 5 jobs, the second will see 4, etc. So each thread can immediately consume from the queue. With populating the queue AFTER starting the threads, all threads initially block, since the queue is empty. Only after you add an element does one thread obtain it from the queue.
@Kui Tang, thank you for the explanation! So basically there is no difference between populating the queue at the beginning and at the end, since the threads will always wait for any item that may appear in the queue while the program is running, right?
@Shaokan You are correct; both forms are functionally the same in your program.
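As a rough check of that claim, here is a minimal Python 3 sketch that runs the same worker pool with both orderings and compares the results. The run function, its parameters, and the upper-casing step (a stand-in for the real download) are illustrative assumptions, not part of the original program.

```python
import queue
import threading

def run(populate_first, items, n_workers=3):
    """Process items with a worker pool; return the sorted results."""
    work = queue.Queue()
    done = queue.Queue()

    def worker():
        while True:
            item = work.get()        # blocks if nothing is queued yet
            done.put(item.upper())   # stand-in for the real work
            work.task_done()

    def start_workers():
        for _ in range(n_workers):
            threading.Thread(target=worker, daemon=True).start()

    def populate():
        for item in items:
            work.put(item)

    if populate_first:
        populate()
        start_workers()
    else:
        start_workers()
        populate()

    work.join()  # must come after populate(), in either ordering
    return sorted(done.get() for _ in items)

hosts = ["yahoo", "google", "amazon"]
before = run(True, hosts)
after = run(False, hosts)
print(before == after)
```

What does matter is that join() is called only after the items have been put: join() returns immediately on a queue that has no unfinished items.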
