
I'm trying to make my program run faster using threads, but it still takes too much time. The code must compute two kinds of matrices: word_level, where I compare every pair of words from the query and a document, and sequence_level, where I compare the query to different sequences of the document. Here are the main functions:

import numpy
import threading
from threading import Thread
from os.path import join

def sim_QxD_word(query, document, model, alpha, outOfVocab, lock):  # word_level
    sim_w = {}
    for q in set(query.split()):
        sim_w[q] = {}
        qE = []
        if q in model.vocab:
            qE = model[q]
        elif q in outOfVocab:
            qE = outOfVocab[q]
        else:
            qE = numpy.random.rand(model.layer1_size)  # random vector
            lock.acquire()
            outOfVocab[q] = qE
            lock.release()
        for d in set(document.split()):
            dE = []
            if d in model.vocab:
                dE = model[d]
            elif d in outOfVocab:
                dE = outOfVocab[d]
            else:
                dE = numpy.random.rand(model.layer1_size)  # random vector
                lock.acquire()
                outOfVocab[d] = dE
                lock.release()
            sim_w[q][d] = sim(qE, dE, alpha)
    return (sim_w, outOfVocab)

def sim_QxD_sequences(query, document, model, outOfVocab, alpha, lock):  # sequence_level
    # 1. extract document sequences
    document_sequences = []
    for i in range(len(document.split()) - len(query.split())):
        document_sequences.append(" ".join(document.split()[i:i + len(query.split())]))
    # 2. compute similarities with the query sentence
    lock.acquire()
    query_vec, outOfVocab = avg_sequenceToVec(query, model, outOfVocab, lock)
    lock.release()
    sim_QxD = {}
    for s in document_sequences:
        lock.acquire()
        s_vec, outOfVocab = avg_sequenceToVec(s, model, outOfVocab, lock)
        lock.release()
        sim_QxD[s] = sim(query_vec, s_vec, alpha)
    return (sim_QxD, outOfVocab)

def word_level(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, lock):
    #print("in word_level")
    sim_w, outOfVocab = sim_QxD_word(q_clean, d_text, model, alpha, outOfVocab, lock)
    numpy.save(join(out_w, str(q) + ext_id + "word_interactions.npy"), sim_w)

def sequence_level(q_clean, d_text, model, outOfVocab, alpha, out_s, q, ext_id, lock):
    #print("in sequence_level")
    sim_s, outOfVocab = sim_QxD_sequences(q_clean, d_text, model, outOfVocab, alpha, lock)
    numpy.save(join(out_s, str(q) + ext_id + "sequence_interactions.npy"), sim_s)

def extract_AllFeatures_parall(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, out_s, lock):
    #print("in extract_AllFeatures")
    thW = Thread(target=word_level, args=(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, lock))
    thW.start()
    thS = Thread(target=sequence_level, args=(q_clean, d_text, model, outOfVocab, alpha, out_s, q, ext_id, lock))
    thS.start()
    thW.join()
    thS.join()

def process_documents(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q):
    #print("in process_documents")
    q_clean = clean(queries[q], stemming, stoplist)
    lock = threading.Lock()
    for d in documents:
        ext_id, d_text = reaDoc(d, index)
        extract_AllFeatures_parall(q_clean, d_text, model, alpha, outOfVocab, out_w, q, ext_id, out_s, lock)

outOfVocab = {}  # shared variable over all threads
queries = {"1": "first query", ...}  # can contain 200 elements
....
threadsList = []
for q in queries.keys():
    thread = Thread(target=process_documents, args=(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q))
    thread.start()
    threadsList.append(thread)
for th in threadsList:
    th.join()

How can I optimize these functions to make the program run faster? Thanks in advance.

  • Don't use threads, use processes; see the proposed duplicate. Commented Dec 13, 2017 at 14:51
  • Use a lambda to avoid calling the function when passing parameters, as the proposed answer states. Commented Dec 13, 2017 at 15:08

1 Answer


I'm just going to focus on these lines of code in this answer:

thread = Thread(target = process_documents(documents, index, model, alpha, outOfVocab, out_w, out_s, queries, stemming, stoplist, q))
thread.start()

From the documentation (https://docs.python.org/2/library/threading.html):

target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.

target should be a callable. In your code you are passing in the result of a call to process_documents. What you want is target=process_documents (i.e., pass in the function itself, which is callable) and then pass its arguments separately via args/kwargs as needed.

At the moment your code is running sequentially: every call to process_documents happens on the main thread, before the Thread object is even created. You need to give the thread the job you want it to do, not the result of the job.
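A minimal sketch of the difference, using a hypothetical work function as a stand-in for process_documents:

```python
from threading import Thread

def work(name):
    return "processed " + name

# Wrong: work("doc1") is called immediately on the main thread, and its
# return value (a string, not a callable) is what gets passed as target.
bad = Thread(target=work("doc1"))

# Right: pass the function itself as target, and its arguments via args;
# the new thread calls work("doc1") only once it is started.
good = Thread(target=work, args=("doc1",))
good.start()
good.join()
```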


3 Comments

You're right, that's a classic issue. But even like this it isn't going to be much faster, because of Python's GIL: all the threads are executing pure Python code.
OK, I'll try to take your comments into account. I'm sorry if my question is stupid; I'm just getting started with parallel programming in Python.
I've just corrected the use of the Thread() class. I notice that only one core on my computer is used (up to 100%), and the program is slower than before. What could the problem be?
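Following up on the GIL point in the comments above: moving the per-query work to processes sidesteps the GIL for CPU-bound code. A minimal sketch with multiprocessing.Pool, where process_one_query is a hypothetical stand-in for process_documents (a real worker must take picklable arguments and should return results rather than mutate shared state, since each process has its own memory):

```python
from multiprocessing import Pool

def process_one_query(item):
    # Hypothetical stand-in for process_documents: receives one
    # (query_id, query_text) pair and returns a result, instead of
    # writing into a shared dict like outOfVocab.
    q_id, q_text = item
    return q_id, len(q_text.split())

if __name__ == "__main__":
    queries = {"1": "first query", "2": "second longer query"}
    with Pool(processes=2) as pool:
        # Each (id, text) pair is handled in a separate worker process.
        results = dict(pool.map(process_one_query, queries.items()))
    print(results)  # prints {'1': 2, '2': 3}
```

Shared read-mostly data (like the model) is best loaded once per worker via a Pool initializer rather than passed with every task.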
