I am looking to optimize the performance of a big data parsing problem I have using Python. The example data I show are segments of whole genome DNA sequence alignments for six primate species.
Each file contains multiple segments with the following format:
```
<MULTI-LINE HEADER>   # number of header lines mirrors number of data columns
<DATA BEGIN FLAG>     # the word 'DATA'
<DATA COLUMNS>        # variable number of columns
<DATA END FLAG>       # the pattern '//'
<EMPTY LINE>
```

Example:
```
SEQ homo_sapiens       1 11388669  11532963   1 (chr_length=249250621)
SEQ pan_troglodytes    1 11517444  11668750   1 (chr_length=229974691)
SEQ gorilla_gorilla    1 11607412  11751006   1 (chr_length=229966203)
SEQ pongo_pygmaeus     1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta     1 14425463  14569832   1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850  46115230   1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
......   # continues for 10-100 thousand lines
//
SEQ homo_sapiens       1 11345717  11361846   1 (chr_length=249250621)
SEQ pan_troglodytes    1 11474525  11490638   1 (chr_length=229974691)
SEQ gorilla_gorilla    1 11562256  11579393   1 (chr_length=229966203)
SEQ pongo_pygmaeus     1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
....     # continues for 10-100 thousand lines
//
<ETC>
```

I will only process segments where the species homo_sapiens and macaca_mulatta are both present in the header and field 6 (a quality-control flag) equals '1' for both of these species. Since macaca_mulatta does not appear in the second example segment, I would ignore that segment completely and continue to the next header to check the species and flags.
Fields 4 and 5 are the start and end coordinates, which I record from the homo_sapiens line. I count positions upward from start for another step (described below).
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta at each position. The line on which a species appears in the header indexes the data column for that species, so in example 1 where both target species are present, the (0-based) indices for homo_sapiens and macaca_mulatta would be 0 and 4, respectively.
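To make the indexing concrete, here is a small sketch of how the header order maps to data columns (the species names are hard-coded from example 1 purely for illustration):

```python
# Header species, in the order their SEQ lines appear (example 1):
header_species = ['homo_sapiens', 'pan_troglodytes', 'gorilla_gorilla',
                  'pongo_pygmaeus', 'macaca_mulatta', 'callithrix_jacchus']

# The position of a species in the header is its column index in the data:
hom = header_species.index('homo_sapiens')    # 0
mac = header_species.index('macaca_mulatta')  # 4

# Each data line holds one letter per species, in header order:
data_line = 'GGGGGG'
human, macaque = data_line[hom], data_line[mac]  # 'G', 'G'
```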
Importantly, the number of species/columns is not always the same, hence the importance of processing the header for each segment.
For a segment containing info for homo_sapiens and macaca_mulatta, I scan for positions where the two DO NOT match and store those positions in a list.
Finally, some positions have "gaps" or lower-quality data, e.g. `aaa--A`, and these I skip: both species must have a letter from the set {'A', 'C', 'G', 'T'}. Whether they match or not, I count all positions where both species' letters are in that set with the counter `valid`, and include this count in the dictionary key for that segment.
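Putting the last three points together, the per-line check is essentially this (a minimal sketch with made-up data lines; the start coordinate is taken from example 1):

```python
bases = {'A', 'C', 'G', 'T'}
hom, mac = 0, 4            # column indices from the header
pos = 11388669             # homo_sapiens segment start
valid, mismatch = 0, []

for line in ['GGGGGG', 'CCCCTC', 'aa--AA', 'ACGTAC']:
    human, macaque = line[hom], line[mac]
    if {human, macaque}.issubset(bases):  # skip gaps / low-quality letters
        valid += 1
        if human != macaque:              # divergent site
            mismatch.append(pos)
    pos += 1
```

Here the second line is a mismatch (C vs T), the third is skipped entirely ('a' and 'A' vs the gap-free set), and the other two are valid matches.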
My final data structure for a given file is a dictionary which looks like this:
```
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
 (segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions),
 ...}
```

Here is the function I have written to carry this out using a for-loop:
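Filled in with concrete (purely illustrative) numbers, one such dictionary might look like:

```python
# keys are (segment_start, segment_end, valid_bases); values are lists of
# positions where the human and macaque bases differ (numbers invented)
div_dict = {
    (11388669, 11532963, 140873): [11388672, 11389001, 11401210],
    (14163, 24841, 9876): [14200, 15331],
}
```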
```python
import re
import gzip as gz

bases = {'A', 'C', 'G', 'T'}


def human_macaque_divergence(compara_file):
    """
    A function for finding the positions of human-macaque divergent sites
    within segments of species alignment tracts.

    :param compara_file: a gzipped, multi-species genomic alignment broken
        into segments with new headers for each segment
    :return div_dict: a dictionary with
        tuple(segment_start, segment_end, valid_bases_in_segment) for keys
        and list(divergent_sites) for values

    GUIDE TO HEADER FIELDS:
    header_flag species chromosome start end quality_flag chromosome_info
    SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)

    SEGMENT ORGANIZATION:
    < multi-line 'SEQ' header >
    < 'DATA' start data block flag >
    < multi-line data block >
    < '//' end data block flag >
    < '\\n' single blank line >
    """
    div_dict = {}
    ch = re.search(r'chr([1-9]\d?)', compara_file).group(1)  # extract the chromosome number for later use
    with gz.open(compara_file, 'rb') as f:
        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that we should parse a data block
        skip_block = False    # a flag signalling that a data block should be skipped
        for line in f:
            if species_data and '//' not in line:
                # for most lines, 'species_data' should be True
                assert skip_block is False
                assert pos > 0
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                    if human != macaque:
                        mismatch += [pos]
                pos += 1
            elif skip_block and '//' not in line:
                # second most common condition, 'skip_block' is True
                assert species_data is False
                continue
            elif 'SEQ' in line:  # 'SEQ' signifies a segment header
                assert species_data is False
                assert skip_block is False
                line = line.split()
                if line[2] == ch and line[5] == '1':
                    # chromosome is correct for the block and quality flag is '1'
                    species += [line[1]]      # collect species into a list
                    starts += [int(line[3])]  # collect starts and ends into lists
                    ends += [int(line[4])]
            elif {'homo_sapiens', 'macaca_mulatta'}.issubset(species) and 'DATA' in line:
                # both target species are present in the following data block
                assert species_data is False
                assert skip_block is False
                species_data = True
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]
            elif not {'homo_sapiens', 'macaca_mulatta'}.issubset(species) and 'DATA' in line:
                assert species_data is False
                assert skip_block is False
                skip_block = True
            elif '//' in line:  # '//' signifies a segment boundary
                # store segment results if data was collected from the segment:
                if species_data:
                    assert skip_block is False
                    div_dict[(starts[hom], ends[hom], valid)] = mismatch
                else:
                    assert skip_block
                # reset flags, containers, counters and indices for the next data block
                species = []
                starts = []
                ends = []
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
                skip_block = False
    return div_dict
```

The code works fine presently and processes a 300MB file in a little over a minute, which means the full data set I am working with will probably take an hour or two.
I am looking for suggested improvements to the current code, or perhaps a totally different approach. In particular, the for-loop bugs me for so many lines, and I wonder if there is a more "bulk" approach that would be more efficient, since the data is more or less uniformly formatted. I also don't think regular expressions will improve the simple string operations here, although maybe they would help for some kind of bulk extraction...
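For concreteness, the kind of restructuring I am imagining would split the loop: one pass that slices the file into header/data-block chunks, and a helper that scores a whole block at once. A hedged sketch of such a helper (the name `process_block` and its signature are just my illustration, not settled code):

```python
def process_block(block_lines, hom, mac, start, bases=frozenset('ACGT')):
    """Score one pre-collected data block in a single pass.

    block_lines: the lines between 'DATA' and '//' for one segment
    hom, mac:    column indices of homo_sapiens and macaca_mulatta
    start:       homo_sapiens segment start coordinate
    """
    valid = 0
    mismatch = []
    for pos, line in enumerate(block_lines, start):
        human, macaque = line[hom], line[mac]
        if human in bases and macaque in bases:
            valid += 1
            if human != macaque:
                mismatch.append(pos)
    return valid, mismatch
```

Whether handing a whole block to a helper like this actually beats the single flat loop is exactly the kind of thing I'd like feedback on.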
Thanks for any suggestions!