I am looking to optimize the performance of a big data parsing problem I have using Python. The example data I show are segments of whole genome DNA sequence alignments for six primate species.
Each file contains multiple segments with the following format:
```
<MULTI-LINE HEADER>   # number of header lines mirrors number of data columns
<DATA BEGIN FLAG>     # the word 'DATA'
<DATA COLUMNS>        # variable number of columns
<DATA END FLAG>       # the pattern '//'
<EMPTY LINE>
```

Example:
```
SEQ homo_sapiens       1 11388669  11532963   1 (chr_length=249250621)
SEQ pan_troglodytes    1 11517444  11668750   1 (chr_length=229974691)
SEQ gorilla_gorilla    1 11607412  11751006   1 (chr_length=229966203)
SEQ pongo_pygmaeus     1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta     1 14425463  14569832   1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850  46115230   1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
......   # continues for 10-100 thousand lines
//
SEQ homo_sapiens       1 11345717  11361846   1 (chr_length=249250621)
SEQ pan_troglodytes    1 11474525  11490638   1 (chr_length=229974691)
SEQ gorilla_gorilla    1 11562256  11579393   1 (chr_length=229966203)
SEQ pongo_pygmaeus     1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
....     # continues for 10-100 thousand lines
//
<ETC>
```

I will only process segments where the species homo_sapiens and macaca_mulatta are both present in the header and field 6 (a quality-control flag) equals '1' for both of these species. Since macaca_mulatta does not appear in the second example segment, I would ignore that segment completely and continue to the next header to check the species and flags.
Fields 4 and 5 are the start and end coordinates, which I record from the homo_sapiens line. I count positions upward from start for another step (described below).
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta at each position. The line on which a species appears in the header indexes the data column for that species, so in example 1 where both target species are present, the (0-based) indices for homo_sapiens and macaca_mulatta would be 0 and 4, respectively.
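To make the indexing concrete, here is a small sketch of how the header order maps to data columns (the species names are hard-coded from example 1 purely for illustration):

```python
# Header species, in the order their SEQ lines appear (example 1):
header_species = ['homo_sapiens', 'pan_troglodytes', 'gorilla_gorilla',
                  'pongo_pygmaeus', 'macaca_mulatta', 'callithrix_jacchus']

# The position of a species in the header is its column index in the data:
hom = header_species.index('homo_sapiens')    # 0
mac = header_species.index('macaca_mulatta')  # 4

# Each data line holds one letter per species, in header order:
data_line = 'GGGGGG'
human, macaque = data_line[hom], data_line[mac]  # 'G', 'G'
```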
Importantly, the number of species/columns is not always the same, hence the importance of processing the header for each segment.
For a segment containing info for homo_sapiens and macaca_mulatta, I scan for positions where the two DO NOT match and store those positions in a list.
Finally, some positions have "gaps" or lower-quality data, e.g. `aaa--A`, and these I skip: both species must have a letter from the set {'A', 'C', 'G', 'T'}. Whether they match or not, I count all positions where both species' letters are in that set with the counter `valid`, and include this count in the dictionary key for that segment.
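Putting the last three points together, the per-line check is essentially this (a minimal sketch with made-up data lines; the start coordinate is taken from example 1):

```python
bases = {'A', 'C', 'G', 'T'}
hom, mac = 0, 4            # column indices from the header
pos = 11388669             # homo_sapiens segment start
valid, mismatch = 0, []

for line in ['GGGGGG', 'CCCCTC', 'aa--AA', 'ACGTAC']:
    human, macaque = line[hom], line[mac]
    if {human, macaque}.issubset(bases):  # skip gaps / low-quality letters
        valid += 1
        if human != macaque:              # divergent site
            mismatch.append(pos)
    pos += 1
```

Here the second line is a mismatch (C vs T), the third is skipped entirely ('a' and 'A' vs the gap-free set), and the other two are valid matches.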
My final data structure for a given file is a dictionary which looks like this:
```
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
 (segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions),
 ...}
```

Here is the function I have written to carry this out using a for-loop:
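Filled in with concrete (purely illustrative) numbers, one such dictionary might look like:

```python
# keys are (segment_start, segment_end, valid_bases); values are lists of
# positions where the human and macaque bases differ (numbers invented)
div_dict = {
    (11388669, 11532963, 140873): [11388672, 11389001, 11401210],
    (14163, 24841, 9876): [14200, 15331],
}
```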
```python
import re
import gzip as gz

bases = {'A', 'C', 'G', 'T'}


def human_macaque_divergence(compara_file):
    """
    A function for finding the positions of human-macaque divergent sites
    within segments of species alignment tracts.

    :param compara_file: a gzipped, multi-species genomic alignment broken
        into segments with new headers for each segment
    :return div_dict: a dictionary with
        tuple(segment_start, segment_end, valid_bases_in_segment) for keys
        and list(divergent_sites) for values

    GUIDE TO HEADER FIELDS:
    header_flag species chromosome start end quality_flag chromosome_info
    SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)

    SEGMENT ORGANIZATION:
    < multi-line 'SEQ' header >
    < 'DATA' start data block flag >
    < multi-line data block >
    < '//' end data block flag >
    < '\\n' single blank line >
    """
    div_dict = {}
    ch = re.search(r'chr([1-9]\d?)', compara_file).group(1)  # extract the chromosome number for later use
    with gz.open(compara_file, 'rb') as f:
        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that we should parse a data block
        skip_block = False    # a flag signalling that a data block should be skipped
        for line in f:
            if species_data and '//' not in line:
                # for most lines, 'species_data' should be True
                assert skip_block is False
                assert pos > 0
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                    if human != macaque:
                        mismatch += [pos]
                pos += 1
            elif skip_block and '//' not in line:
                # second most common condition, 'skip_block' is True
                assert species_data is False
                continue
            elif 'SEQ' in line:  # 'SEQ' signifies a segment header
                assert species_data is False
                assert skip_block is False
                line = line.split()
                if line[2] == ch and line[5] == '1':
                    # chromosome is correct for the block and quality flag is '1'
                    species += [line[1]]      # collect species into a list
                    starts += [int(line[3])]  # collect starts and ends into lists
                    ends += [int(line[4])]
            elif {'homo_sapiens', 'macaca_mulatta'}.issubset(species) and 'DATA' in line:
                # both target species are present in the following data block
                assert species_data is False
                assert skip_block is False
                species_data = True
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]
            elif not {'homo_sapiens', 'macaca_mulatta'}.issubset(species) and 'DATA' in line:
                assert species_data is False
                assert skip_block is False
                skip_block = True
            elif '//' in line:  # '//' signifies a segment boundary
                # store segment results if data was collected from the segment:
                if species_data:
                    assert skip_block is False
                    div_dict[(starts[hom], ends[hom], valid)] = mismatch
                else:
                    assert skip_block
                # reset flags, containers, counters and indices for the next data block
                species = []
                starts = []
                ends = []
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
                skip_block = False
    return div_dict
```

The code works fine presently and processes a 300MB file in a little over a minute, which means the full data set I am working with will probably take an hour or two.
I am looking for suggested improvements to the current code, or perhaps a totally different approach. In particular, the for-loop bugs me for so many lines, and I wonder if there is a more "bulk" approach that would be more efficient, since the data is more or less uniformly formatted. I also don't think regular expressions will improve the simple string operations here, although maybe they would help for some kind of bulk extraction...
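For concreteness, the kind of restructuring I am imagining would split the loop: one pass that slices the file into header/data-block chunks, and a helper that scores a whole block at once. A hedged sketch of such a helper (the name `process_block` and its signature are just my illustration, not settled code):

```python
def process_block(block_lines, hom, mac, start, bases=frozenset('ACGT')):
    """Score one pre-collected data block in a single pass.

    block_lines: the lines between 'DATA' and '//' for one segment
    hom, mac:    column indices of homo_sapiens and macaca_mulatta
    start:       homo_sapiens segment start coordinate
    """
    valid = 0
    mismatch = []
    for pos, line in enumerate(block_lines, start):
        human, macaque = line[hom], line[mac]
        if human in bases and macaque in bases:
            valid += 1
            if human != macaque:
                mismatch.append(pos)
    return valid, mismatch
```

Whether handing a whole block to a helper like this actually beats the single flat loop is exactly the kind of thing I'd like feedback on.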
Thanks for any suggestions!