
I'm new to Python and multithreading, so please bear with me.

I'm writing a script that runs a list of domains through Web of Trust, a service that scores websites from 1-100 on "trustworthiness", and writes the results to a CSV. Unfortunately, Web of Trust's servers can take quite a while to respond, and processing 100k domains can take hours.

My attempts at multithreading so far have been disappointing: modifying the script from this answer gave threading errors, I believe because some threads took too long to resolve.

Here's my unmodified script. Can someone help me multithread it, or point me to a good multithreading resource? Thanks in advance.

    import urllib
    import re

    # Read the domain list, one domain per line.
    text = open("top100k", "r")
    text = text.read()
    text = re.split("\n+", text)

    out = open('output.csv', 'w')

    for element in text:
        try:
            # Query the WoT public API for this domain.
            content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element)
            content = content.read()
            # Slice the trustworthiness rating out of the XML response.
            content = content[content.index('<application name="0" r="'):content.index('" c')]
            content = element + "," + content[25] + content[26] + "\n"
            out.write(content)
        except:
            pass
  • Threading in Python is often a wash unless you work around the GIL (e.g. by writing a Python C extension); in the case above it may work out fine, since most of the time is spent blocking on IO... anyway, have you considered using a (single-threaded) event framework like Twisted instead? (A thread-pool sketch follows these comments.) Commented Jun 25, 2010 at 18:29
  • As this isn't running on my server, I would prefer doing this without having to install 3rd-party frameworks. Commented Jun 25, 2010 at 18:35
  • As might be expected, WOT isn't fond of having their database copied this way, and may start blocking your requests (mywot.com/pl/terms/api). Maybe you should use their commercial service? Commented Jun 25, 2010 at 18:41
  • Also, if you use Jython or IronPython, the problems with the GIL don't apply (the GIL is a CPython implementation detail). Commented Jun 25, 2010 at 19:03
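
As the first comment notes, threading can still pay off here despite the GIL, because each worker spends nearly all of its time blocked on network I/O, during which the GIL is released. Here is a minimal thread-pool sketch using only the standard library (multiprocessing.dummy, available since Python 2.6, matching the Python 2 style of the original script); the pool size of 20 is an arbitrary starting point:

    import urllib
    from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

    def lookup(domain):
        # One WoT request per domain; returns a CSV line, or None on failure.
        try:
            response = urllib.urlopen(
                "http://api.mywot.com/0.4/public_query2?target=" + domain)
            content = response.read()
            # Same string slicing as the original script.
            content = content[content.index('<application name="0" r="'):content.index('" c')]
            return domain + "," + content[25] + content[26] + "\n"
        except Exception:
            return None

    if __name__ == "__main__":
        domains = [d for d in open("top100k").read().split("\n") if d]
        pool = Pool(20)  # 20 concurrent requests; tune to what the server tolerates
        out = open("output.csv", "w")
        for line in pool.imap_unordered(lookup, domains):
            if line is not None:
                out.write(line)
        pool.close()
        pool.join()
        out.close()

Note that imap_unordered writes results as they complete, so the output rows will not be in input order.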

1 Answer

A quick scan through the WoT API documentation shows that, as well as the public_query2 request you are using, there is a public_query_json request that lets you fetch the data in batches of up to 100 domains. I would suggest using that before you start flooding their server with lots of parallel requests. A sketch of the batching approach follows.
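
To illustrate what batching buys you, the sketch below issues one request per 100 domains instead of one per domain. Be warned that the parameter name (hosts), the "/"-separated batch format, and the JSON response shape are all assumptions, not taken from the docs; adjust them to whatever public_query_json actually expects.

    import json
    import urllib

    def query_batch(domains):
        # Hypothetical batched call -- the "hosts" parameter and its format
        # are assumptions; check the WoT API docs for the real interface.
        url = ("http://api.mywot.com/0.4/public_query_json?hosts=" +
               "/".join(domains) + "/")
        return json.loads(urllib.urlopen(url).read())

    domains = [d for d in open("top100k").read().split("\n") if d]
    out = open("output.csv", "w")
    for i in range(0, len(domains), 100):
        batch = domains[i:i + 100]
        results = query_batch(batch)
        for domain in batch:
            # Assumed response shape: {"example.com": {"0": [reputation, confidence]}}
            rating = results.get(domain, {}).get("0")
            if rating:
                out.write("%s,%s\n" % (domain, rating[0]))
    out.close()

Even single-threaded, this cuts 100k requests down to 1k, which may matter more than parallelism.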
