I have extremely large files, each almost 2 GB, so I would like to process multiple files in parallel. All of the files share a similar format, so the reading itself can be done independently per file. I know I should use the multiprocessing library, but I am confused about how to use it with my code.
My code for file reading is:
```python
import sys

def file_reading(file, num_of_sample, segsites, positions, snp_matrix):
    with open(file, buffering=2000009999) as f:
        ...  # I read the file here. I am not putting that code here.
    try:
        assert len(snp_matrix) == len(positions)
        return positions, snp_matrix  # return statement
    except AssertionError:
        print('Length of snp_matrix and length of positions are not the same.')
        sys.exit(1)
```
My main function is:
```python
import glob
import multiprocessing

if __name__ == "__main__":
    segsites = []
    positions = []
    snp_matrix = []
    path_to_directory = '/dataset/example/'
    extension = '*.msOut'
    num_of_samples = 162
    filename = glob.glob(path_to_directory + extension)

    # How can I use multiprocessing with the function file_reading?
    number_of_workers = 10
    x, y, z = [], [], []
    array_of_number_tuple = [(filename[file], segsites, positions, snp_matrix)
                             for file in range(len(filename))]
    with multiprocessing.Pool(number_of_workers) as p:
        pos, snp = p.map(file_reading, array_of_number_tuple)
        x.extend(pos)
        y.extend(snp)
```
So my input to the function is as follows:
- `file` - a filename (one entry from the `filename` list).
- `num_of_samples` - an int value.
- `segsites` - initially an empty list to which I want to append while reading each file.
- `positions` - initially an empty list to which I want to append while reading each file.
- `snp_matrix` - initially an empty list to which I want to append while reading each file.
The function returns the positions list and the snp_matrix list at the end. How can I use multiprocessing here, where my arguments are a mix of lists and an integer? The way I've used multiprocessing gives me the following error:
```
TypeError: file_reading() missing 3 required positional arguments: 'segsites', 'positions', and 'snp_matrix'
```
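For reference, here is a minimal toy sketch of what I understand so far (the `toy_reader` function below is hypothetical, standing in for my real `file_reading`): `Pool.map` passes each element of the iterable as a single argument, so my whole tuple lands in the first parameter, whereas `Pool.starmap` unpacks each tuple into separate positional arguments. Is `starmap` the right tool here?

```python
import multiprocessing

# Hypothetical stand-in for file_reading, with the same argument shape.
def toy_reader(file, num_of_samples, segsites, positions, snp_matrix):
    # Pretend to read the file and fill the lists.
    positions.append(0.5)
    snp_matrix.append([0, 1])
    return positions, snp_matrix

if __name__ == "__main__":
    # One tuple of arguments per file, mirroring array_of_number_tuple.
    args = [('a.msOut', 162, [], [], []),
            ('b.msOut', 162, [], [], [])]

    with multiprocessing.Pool(2) as p:
        # p.map would call toy_reader(('a.msOut', 162, [], [], [])),
        # i.e. the whole tuple lands in `file` and the remaining
        # parameters are reported as missing, like in my error above.
        # p.starmap unpacks each tuple into separate arguments instead:
        results = p.starmap(toy_reader, args)

    # results is a list of (positions, snp_matrix) pairs, one per file.
    for positions, snp_matrix in results:
        print(positions, snp_matrix)
```

I am also unsure whether appending to `segsites`, `positions`, and `snp_matrix` inside the workers will ever be visible in the parent process, since each worker process appears to get its own copy of the lists, so perhaps I should collect everything from the returned results instead?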