Count the occurence of nucleobases in DNA string

Question

Inspired by this meta question I decided to take a look at Rosalind. Their first challenge seemed easy enough:

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC 
Sample Output
20 12 17 21 

Since I'm still on my quest to learn both regex and Ruby, I decided to go that route:

def countACGT(str) list = [0,0,0,0] str.scan(/A|C|G|T/) do |sub| if sub == "A" list[0] += 1 end if sub == "C" list[1] += 1 end if sub == "G" list[2] += 1 end if sub == "T" list[3] += 1 end end return list end

I'm not a big fan of long if chains. Luckily, Ruby has a case statement as well:

def countACGT(str) list = [0,0,0,0] str.scan(/A|C|G|T/) do |sub| puts case sub when "A" list[0] += 1 when "C" list[1] += 1 when "G" list[2] += 1 when "T" list[3] += 1 end end return list end

Both can be invoked like:

p countACGT("TCCCACTTCAGGGTCAGGGAGCTCCAAACTCTCTTTCTAGAGATGACAATCGAGAGTGAGATAAGGTGGATAGCAATCGTTATGGGATGTAAGCGCCAAGCGTTCGGGTAGCCCACGTTGCGGGCTAATCGCTAGGCTAGAACCTCTAAGCTGTACTTCTGTCAAAACGGAAAGAATCATACCGCACACCAACACTCGATGTAATGTAAGGATATCCTGTGCAGATGAGGTGCTTGGTACGCTAGATACTAGTATTACTAACACACAACATTACCGCCCAAGCGTGTCAGCCACGGACCAGATGACTCTTGCCGATTGAATACCTATCATCCTTACGGTCCGGAATCAGTATATCGCGTGCACAGTTACAGTGGTTAACTTGAGCTAGAGCAAGATAATGTGCGATCTGCGCACTCGGTGGGCTTGGATCACCCTACTTCCAATTGCCCGCGTATGATAGTTCCACCACTCACAAGTCTGTCATAGTGATTATCAAGAGTAGGCGTAGTGGGCACCCAAGAAATTAATGAATCTCACAGTCGAGTGTATCTTCGGCCATATCCCTACGGCAAATGGTCGCTCAGCTTGTCTCCGAGAGTTCGTTGGTTCAGAACCTCCGAAGGGTTGGGTGATTGTTGCGGCGCGCATGCGAGCTATGGTGGCTGTGTGTGGAGGTATTATCAGGGGAAATTTATTCCGAGTACTTGCTTGACTGCTCTTTTGTAAGCCGTTTGGGGTGCGTCCTCTGTATAGTCGTCGCCGCGAAGCCGATTCCCTCTAATCAAACACGCCAGAGGATCACTGGTCTTCTTCAAATTCCTGTATACCTCTGGCTAAATGACCCGACGGTACGAGCGTTTACTTCGAAGTG")

Yes, the dataset I had to solve was that long.

However, my interpreter suddenly feels the need to print the updated list[x] every time a new value gets assigned. This leads to a significant decrease in performance.

So I downloaded the official Ruby interpreter via the downloader (version 2.2.3 (x64)) and the exact same thing happens.

It almost looks like case is not the preferred way of doing this, but that's not intuitive.

I'm mainly looking for a definitive answer on which version I should stick with (and why) and general maintainability improvements. I'm perfectly aware regex may not be the optimal solution here, but I'd like to stick with it anyway.

Barry · Accepted Answer · 2015-12-31 14:23:36Z

Your code is fine and readable from a C/Java perspective. I don't think it's particularly Ruby-ic to use return statements. Just put list at the end on its own.

Why your case is slow

You have this extra puts here:

str.scan(/A|C|G|T/) do |sub| puts case sub ^^^^ when "A" ...

You may want to get rid of that :) That's why you interpreter prints the list every time. You told it to...

Functionally better

But functionally, we can do way better. Ruby's enumerables have a group_by method:

def countACGT(str) str.chars.group_by(&:chr) end

That'll give you a hash from each key (nucleotide base) to the values in the collection (a list of each occurence of each nucleotide base). All you have to do then is map it to just give you the size:

def countACGT(str) str.chars.group_by(&:chr).map { |k, v| [k, v.size] } end

That'll give you, for your example, the list:

[["T", 228], ["C", 209], ["A", 214], ["G", 220]]

If you want to get it in ACGT order like your original, you can just sort it and map off the key:

def countACGT(str) str.chars.group_by(&:chr).sort.map{|k, v| v.size} end

I'm feeling quite stupid about putting the puts before the case. Thanks for catching that and the rest of the advice! — Mast
– Mast ♦, Commented Dec 31, 2015 at 14:27

Sergii K · Accepted Answer · 2016-01-04 17:13:18Z

I've make small research on your question. My first idea was:

def countACGT(str) str.chars.sort.join.scan(/A+|C+|G+|T+/).map(&:length) end

Then I've write few another methods (Benchmark used to asses their performance). Also include method proposed by Barry.

require 'benchmark' REP_NUM = 10_000 # number of tests runs. PATTERN = "..." # here is string from your example. def countACGT(str) list = [0,0,0,0] str.scan(/A|C|G|T/) do |sub| # code of original method return list end def countACGT_with_object(str) str.chars.each.with_object(Hash.new(0)) do |letter, hash| hash[letter] += 1 end.sort.map(&:last) end def countACGT_scan(str) str.chars.sort.join.scan(/A+|C+|G+|T+/).map(&:length) end def countACGT_pack(str) str.unpack("c*").sort.pack("c*").scan(/A+|C+|G+|T+/).map(&:length) end def countACGT_group_by(str) str.chars.group_by(&:chr).sort.map{|k, v| v.size} end def count_unpack_group_by(str) str.unpack("c*").group_by{|i| i}.values.sort.map(&:size) end Benchmark.bm do |x| x.report('countACGT') { REP_NUM.times { countACGT(PATTERN) } } x.report('countACGT_with_object') { REP_NUM.times { countACGT_with_object(PATTERN) } } x.report('countACGT_scan') { REP_NUM.times { countACGT_scan(PATTERN) } } x.report('countACGT_pack') { REP_NUM.times { countACGT_pack(PATTERN) } } x.report('countACGT_group_by') { REP_NUM.times { countACGT_group_by(PATTERN) } } x.report('count_unpack_group_by') { REP_NUM.times { count_unpack_group_by(PATTERN) } } end

Results:

Method user system total real countACGT 8.76 0.00 8.76 8.777035 countACGT_with_object 4.73 0.00 4.73 4.745783 countACGT_scan 4.56 0.00 4.56 4.563267 countACGT_pack 3.37 0.00 3.37 3.380988 countACGT_group_by 4.93 0.00 4.93 4.941314 count_unpack_group_by 2.67 0.01 2.68 2.706674

Last method is the fastest (for now). It uses unpack and there is no additional method call in group_by block.

Welcome to Code Review! Your post looks useful. But is it a review of the code in the question? Not really... Could you please kindly adjust your answer in a way so that looks more like a review of the code in the question? Thanks! — janos
– janos, Commented Dec 31, 2015 at 15:26
At some point you're right. I've not posted first part of my answer (with actual review) because Barry is few minutes ahead of me. I don't like when answers repeating each other(but with other words). And thanks to Mast, because of his question I've found some interesting use-cases of unpack/pack. — Sergii K
– Sergii K, Commented Dec 31, 2015 at 16:23
@janos, I think you're applying the rules too literally. This is fantastic post, and his rewrite of the OP is very helpful. The "review" is implicit. Sometimes over-explanation with words -- when the code speaks -- makes a post worse. — Jonah
– Jonah, Commented Jan 2, 2016 at 2:17

Stack Exchange Network

Count the occurence of nucleobases in DNA string

2 Answers 2

Why your case is slow

Functionally better

You must log in to answer this question.

Linked

Hot Network Questions

Count the occurence of nucleobases in DNA string

2 Answers 2

Why your case is slow

Functionally better

You must log in to answer this question.

Linked

Related

Hot Network Questions