I have written a program that can be summarized as follows:
```python
import multiprocessing

def loadHugeData():
    # load it
    return data

def processHugeData(data, res_queue):
    for item in data:
        # process it
        res_queue.put(result)
    res_queue.put("END")

def writeOutput(outFile, res_queue):
    with open(outFile, 'w') as f:
        res = res_queue.get()
        while res != 'END':
            f.write(res)
            res = res_queue.get()

res_queue = multiprocessing.Queue()

if __name__ == '__main__':
    data = loadHugeData()
    p = multiprocessing.Process(target=writeOutput, args=(outFile, res_queue))
    p.start()
    processHugeData(data, res_queue)
    p.join()
```

The real code (especially writeOutput()) is a lot more complicated. writeOutput() only uses the values it takes as arguments (i.e. it never references data).
Basically it loads a huge dataset into memory and processes it. Writing the output is delegated to a sub-process (it actually writes into multiple files, and this takes a lot of time). So each time one data item gets processed, it is sent to the sub-process through res_queue, which in turn writes the result into files as needed.
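Since the multi-file part is hard to show in the summary above, here is a minimal sketch of the spirit of it; the (key, text) result format and the file naming are hypothetical placeholders, not my real code:

```python
def writeOutput(outDir, res_queue):
    # Minimal sketch of the multi-file idea: 'key' decides which
    # output file a result belongs to (hypothetical result format).
    files = {}
    res = res_queue.get()
    while res != 'END':
        key, text = res  # assume results arrive as (key, text) pairs
        if key not in files:
            files[key] = open('%s/%s.out' % (outDir, key), 'w')
        files[key].write(text)
        res = res_queue.get()
    for f in files.values():
        f.close()
```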
The sub-process does not need to access, read or modify the data loaded by loadHugeData() in any way. It only needs what the main process sends it through res_queue. And this leads me to my problem and question.
It seems to me that the sub-process gets its own copy of the huge dataset (when checking memory usage with top). Is this true? And if so, how can I avoid it (essentially using double the memory)?
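For reference, besides top, the per-process resident memory can be read straight from /proc on Linux; a small helper along these lines can compare the parent and the child (the rss_kb helper is my own sketch, not a library function):

```python
import os

def rss_kb(pid):
    # Hypothetical helper: read VmRSS (resident set size, in kB)
    # from /proc/<pid>/status. Linux-specific.
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

# e.g. after p.start(), compare the two processes:
#   print rss_kb(os.getpid()), rss_kb(p.pid)
```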
I am using Python 2.6 and the program is running on Linux.