
I'm new to Python and multithreading, so please bear with me.

I'm writing a script that runs a list of domains through Web of Trust, a service that scores websites from 1-100 on "trustworthiness", and writes the results to a CSV. Unfortunately, Web of Trust's servers can take quite a while to respond, and processing 100k domains can take hours.

My attempts at multithreading so far have been disappointing: modifying the script from this answer gave threading errors, I believe because some threads took too long to resolve.

Here's my unmodified script. Can someone help me multithread it, or point me to a good multithreading resource? Thanks in advance.

    import urllib
    import re

    # Read the domain list, one domain per line.
    text = open("top100k", "r")
    text = text.read()
    text = re.split("\n+", text)

    out = open('output.csv', 'w')

    for element in text:
        try:
            # Query the WoT public API for this domain.
            content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element)
            content = content.read()
            # Slice the trustworthiness rating out of the XML response.
            content = content[content.index('<application name="0" r="'):content.index('" c')]
            content = element + "," + content[25] + content[26] + "\n"
            out.write(content)
        except:
            pass
  • Threading in Python is often a wash unless you work around the GIL (e.g. by writing a Python C extension); in the case above it may work out fine, since most of the time is spent blocking on IO... anyway, have you considered using a (single-threaded) event framework like Twisted instead? (A thread-pool sketch follows these comments.) Commented Jun 25, 2010 at 18:29
  • As this isn't running on my server, I would prefer doing this without having to install 3rd-party frameworks. Commented Jun 25, 2010 at 18:35
  • As might be expected, WOT isn't fond of having their database copied this way, and may start blocking your requests (mywot.com/pl/terms/api). Maybe you should use their commercial service? Commented Jun 25, 2010 at 18:41
  • Also, if you use Jython or IronPython, the problems with the GIL don't apply (the GIL is a CPython implementation detail). Commented Jun 25, 2010 at 19:03
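
As the first comment notes, threading can still pay off here despite the GIL, because each worker spends nearly all of its time blocked on network I/O, during which the GIL is released. Here is a minimal thread-pool sketch using only the standard library (multiprocessing.dummy, available since Python 2.6, matching the Python 2 style of the original script); the pool size of 20 is an arbitrary starting point:

    import urllib
    from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

    def lookup(domain):
        # One WoT request per domain; returns a CSV line, or None on failure.
        try:
            response = urllib.urlopen(
                "http://api.mywot.com/0.4/public_query2?target=" + domain)
            content = response.read()
            # Same string slicing as the original script.
            content = content[content.index('<application name="0" r="'):content.index('" c')]
            return domain + "," + content[25] + content[26] + "\n"
        except Exception:
            return None

    if __name__ == "__main__":
        domains = [d for d in open("top100k").read().split("\n") if d]
        pool = Pool(20)  # 20 concurrent requests; tune to what the server tolerates
        out = open("output.csv", "w")
        for line in pool.imap_unordered(lookup, domains):
            if line is not None:
                out.write(line)
        pool.close()
        pool.join()
        out.close()

Note that imap_unordered writes results as they complete, so the output rows will not be in input order.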

1 Answer

A quick scan through the WoT API documentation shows that, as well as the public_query2 request you are using, there is a public_query_json request that lets you fetch the data in batches of up to 100 domains. I would suggest using that before you start flooding their server with lots of parallel requests. A sketch of the batching approach follows.
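
To illustrate what batching buys you, the sketch below issues one request per 100 domains instead of one per domain. Be warned that the parameter name (hosts), the "/"-separated batch format, and the JSON response shape are all assumptions, not taken from the docs; adjust them to whatever public_query_json actually expects.

    import json
    import urllib

    def query_batch(domains):
        # Hypothetical batched call -- the "hosts" parameter and its format
        # are assumptions; check the WoT API docs for the real interface.
        url = ("http://api.mywot.com/0.4/public_query_json?hosts=" +
               "/".join(domains) + "/")
        return json.loads(urllib.urlopen(url).read())

    domains = [d for d in open("top100k").read().split("\n") if d]
    out = open("output.csv", "w")
    for i in range(0, len(domains), 100):
        batch = domains[i:i + 100]
        results = query_batch(batch)
        for domain in batch:
            # Assumed response shape: {"example.com": {"0": [reputation, confidence]}}
            rating = results.get(domain, {}).get("0")
            if rating:
                out.write("%s,%s\n" % (domain, rating[0]))
    out.close()

Even single-threaded, this cuts 100k requests down to 1k, which may matter more than parallelism.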
