Rust program to one hot encode genetic sequences from .fa files

Question

I wanted to write some code which reads in a FASTA file and one hot encodes the sequence which is consequentially saved to a file. A FASTA file is a text based file format commonly used in bioinformatics. Here's an example:

>chr1 ACGTCTCGTGC NNNNNNNNNNN ACGTTGGGGGG >chr2 AGGGTCTGGGT NNNNNNNNNNN

It has a header string with > and then some genetic sequence following. Every character in that sequence represents a nucleotide. For example "A" stands for Adenine and "N" for unknown. The number of possible characters in the sequence is limited, but there are some encodings allowing different representations. But this is not that important for me, since I will probably filter beforehand anyway.

The output should be a text file with the one hot encoded sequence and the header for identification. Depending on how one would choose the encoding this could look something like this for the first three nucleotides(ACG):

>chr1 0,0,0,1 0,1,0,0 0,0,1,0

I wrote some Rust code together to achieve this and it works, but it seems to be pretty slow. Maybe I should also mention at this point that the FASTA files can get really big (many GB) in no time. I tried my code (cargo build --release) with an ~60MB .fa file and took about ~1m.30s and some generic Python code I wrote together in 10 minutes took only ~40s.

Now I wanted to know why my Rust Code is slower then it's Python equivalent. Did I write it inn an idiomatic Rust way? I do not want to store the data in something like HDF5. I just want to understand Rust better. Especially I/O heavy text processing tasks, since these are common in bioinformatics.

Here's the code I wrote:

use std::fs::File; use std::io::{BufReader, Write}; use std::io::prelude::*; use std::env; use std::fs::OpenOptions; fn main() -> std::io::Result<()>{ // gather cmd arguments let args: Vec<String> = env::args().collect(); let filename: &String = &args[1]; let outfile: &String = &args[2]; // Open the file to read from let file = File::open(&filename).unwrap(); // Create a buffered reader -> instead of reading everything to ram let reader = BufReader::new(file); let lines = reader.lines(); // this allows to open a file in file append mode to write (or create it) let mut to_write_file = OpenOptions::new() .append(true) .create(true) // Optionally create .open(outfile) .expect("Unable to open file"); // iterate over every line for l in lines { // check if we get a string back (Result<String, Error>) if let Ok(line_string) = l { // check for header lines if line_string.contains("chr"){ println!("Processing: {}",line_string); // write to file so we can keep track of chroms writeln!(to_write_file, "{}",line_string).unwrap(); } else { // iterate over nucleotides for character in line_string.chars() { // one hot encode the nucleotide let nucleotide: String = one_hot_encode_string(String::from(character)); writeln!(to_write_file,"{}", nucleotide)?; } } } } Ok(()) } fn one_hot_encode_string(nucleotide: String) -> String { // One hot encodes a Nucleotide: String let encoding: String; // create return variable // OHE the nucleotide match nucleotide.as_str() { "A" => encoding = String::from("0,0,0,1"), "G" => encoding = String::from("0,0,1,0"), "C" => encoding = String::from("0,1,0,0"), "T" => encoding = String::from("1,0,0,0"), "N" => encoding = String::from("0,0,0,0"), _ => encoding = String::from("NA"), } encoding }

Her's also the "naive" python code:

import sys # one hot encode nulceotide def ohe(nucleotide: str) -> str: if nucleotide == "A": return("0,0,0,1") if nucleotide == "G": return("0,0,1,0") if nucleotide == "T": return("0,1,0,0") if nucleotide == "C": return("1,0,0,0") if nucleotide == "N": return("0,0,0,0") else: return("NA") def main() -> None: # get arguments input: str = sys.argv[1] output: str = sys.argv[2] # open in file with open(input) as in_f: # open out file with open(output, 'a') as out_f: for line in in_f: # get header if line.startswith(">"): out_f.write(f"{line.rstrip()}\n") else: # iterate over every nucleotide for character in line.rstrip(): # one hot encode character # print(ohe(character)) out_f.write(f"{ohe(character)}\n") if __name__ == "__main__": main()

Do you have a good link or summary to what the hot-encoding format is? — Davislor
– Davislor, Commented Dec 28, 2021 at 13:29
On a modern OS, you shouldn’t worry about “reading the file contents into RAM.” It doesn’t work that way any longer: when you memory-map a file, only the pages that you actually touch will be loaded into physical memory. This is typically faster than copying the data to a buffer. — Davislor
– Davislor, Commented Dec 28, 2021 at 13:31
@Davislor That is if you memory-map a file. mmap isn't in Rust's standard library. — AKX
– AKX, Commented Dec 28, 2021 at 13:47
@Davislor: Hre's a link to an explanation of one hot encoding — dry-leaf
– dry-leaf, Commented Dec 28, 2021 at 13:50

AKX · Accepted Answer · 2021-12-29 09:35:25Z

Your programs aren't quite equivalent; one looks at whether a line starts with >, the other looks for chr in each line. That doesn't make much of a difference in the big picture, but the play area should be level, shouldn't it?

Secondly, as I mentioned over on SO, you should probably avoid String::from calls wherever possible. Since those one-hot encodings are constant, you can just return a &'static str.

Thirdly, and this is the thing that really makes all the difference, your Python program is implicitly using buffered writes instead of directly writing to the output file every time.

Wrapping to_write_file in a BufReader::new makes quite the difference, see below. u60.fasta is an arbitrary 60 megabyte FASTA file.

original

(original) $ time cargo run --release -- u60.fasta /dev/null Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415) Finished release [optimized] target(s) in 0.35s Running `target/release/cr272415 u60.fasta /dev/null` Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly ________________________________________________________ Executed in 6.66 secs fish external usr time 4.74 secs 113.00 micros 4.74 secs sys time 2.44 secs 799.00 micros 2.44 secs

optimized

~/b/cr272415 (master) $ time cargo run --release -- u60.fasta /dev/null Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415) Finished release [optimized] target(s) in 0.38s Running `target/release/cr272415 u60.fasta /dev/null` Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly ________________________________________________________ Executed in 650.30 millis fish external usr time 1.21 secs 147.00 micros 1.21 secs sys time 0.17 secs 939.00 micros 0.16 secs

I'm not an experienced Rustacean, so someone can probably improve on this, but the optimized version is below. Some of the changes were suggested by cargo clippy (which you should also heed :-) ).

use std::env; use std::fs::File; use std::fs::OpenOptions; use std::io::{BufRead, BufReader, BufWriter, Write}; fn main() -> std::io::Result<()> { // gather cmd arguments let args: Vec<String> = env::args().collect(); let filename: &String = &args[1]; let outfile: &String = &args[2]; // Open the file to read from let file = File::open(&filename).unwrap(); // Create a buffered reader -> instead of reading everything to ram let reader = BufReader::new(file); let lines = reader.lines(); // this allows to open a file in file append mode to write (or create it) let mut to_write_file = BufWriter::new( OpenOptions::new() .append(true) .create(true) // Optionally create .open(outfile) .expect("Unable to open file"), ); // iterate over every line for line_string in lines.flatten() { // check for header lines if line_string.starts_with('>') { println!("Processing: {}", line_string); // write to file so we can keep track of chroms writeln!(to_write_file, "{}", line_string).unwrap(); } else { // iterate over nucleotides for character in line_string.chars() { // one hot encode the nucleotide let nucleotide = one_hot_encode_string(character); writeln!(to_write_file, "{}", nucleotide)?; } } } Ok(()) } fn one_hot_encode_string(nucleotide: char) -> &'static str { // One hot encodes a Nucleotide character match nucleotide { 'A' => "0,0,0,1", 'G' => "0,0,1,0", 'C' => "0,1,0,0", 'T' => "1,0,0,0", 'N' => "0,0,0,0", _ => "NA", } }

Edit ⚡️

Some background thread in my brain realized one more way to make this faster still: don't use writeln!. There's no need to spend time parsing format strings and allocating memory just to add a newline. If you precompose the \n into the output, you can just use BufWriter.write_all() – i.e. change the end of the program above to

 let nucleotide = one_hot_encode_string(character); to_write_file.write_all(nucleotide.as_ref())?; } } } Ok(()) } fn one_hot_encode_string(nucleotide: char) -> &'static str { // One hot encodes a Nucleotide character match nucleotide { 'A' => "0,0,0,1\n", 'G' => "0,0,1,0\n", 'C' => "0,1,0,0\n", 'T' => "1,0,0,0\n", 'N' => "0,0,0,0\n", _ => "NA\n", } }

A benchmark run with hyperfine speaks more than a thousand words, as they say:

~/b/cr272415 (master) $ hyperfine --warmup 3 './original/release/cr272415 u60.fasta /dev/null' './opt1/release/cr272415 u60.fasta /dev/null' './opt2/release/cr272415 u60.fasta /dev/null' Benchmark 1: ./original/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 6.401 s ± 0.441 s [User: 4.004 s, System: 2.385 s] Range (min … max): 5.978 s … 7.292 s 10 runs Benchmark 2: ./opt1/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 170.8 ms ± 4.2 ms [User: 167.0 ms, System: 2.8 ms] Range (min … max): 164.5 ms … 178.4 ms 17 runs Benchmark 3: ./opt2/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 59.2 ms ± 2.0 ms [User: 55.8 ms, System: 2.5 ms] Range (min … max): 56.5 ms … 65.6 ms 45 runs Summary './opt2/release/cr272415 u60.fasta /dev/null' ran 2.89 ± 0.12 times faster than './opt1/release/cr272415 u60.fasta /dev/null' 108.21 ± 8.29 times faster than './original/release/cr272415 u60.fasta /dev/null'

Edit ⚡️ x 2

One more thing I realized (though this could turn out to be CPU cache-inefficient in a larger program): inline the whole darned write call, and mark the constants as bytestrings to avoid .as_ref() altogether. Hyperfine says this is 1.32 ± 0.11 times faster than opt2 (the above edit).

macro_rules! wf { ($s:expr) => { to_write_file.write_all($s) }; } for character in line_string.chars() { match character { 'A' => wf!(b"0,0,0,1\n"), 'G' => wf!(b"0,0,1,0\n"), 'C' => wf!(b"0,1,0,0\n"), 'T' => wf!(b"1,0,0,0\n"), 'N' => wf!(b"0,0,0,0\n"), _ => wf!(b"NA\n"), } .unwrap(); }

~/b/cr272415 (vecw *) $ ./bench.sh Benchmark 1: ./targets/original/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 6.294 s ± 0.182 s [User: 3.942 s, System: 2.340 s] Range (min … max): 6.060 s … 6.592 s 10 runs Benchmark 2: ./targets/opt1/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 176.7 ms ± 6.9 ms [User: 172.1 ms, System: 3.2 ms] Range (min … max): 166.1 ms … 188.5 ms 16 runs Benchmark 3: ./targets/opt2/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 61.5 ms ± 2.9 ms [User: 57.8 ms, System: 2.6 ms] Range (min … max): 58.4 ms … 73.8 ms 43 runs Benchmark 4: ./targets/vecw/release/cr272415 u60.fasta /dev/null Time (mean ± σ): 46.0 ms ± 2.9 ms [User: 42.6 ms, System: 2.5 ms] Range (min … max): 42.2 ms … 55.7 ms 60 runs Summary './targets/vecw/release/cr272415 u60.fasta /dev/null' ran 1.33 ± 0.11 times faster than './targets/opt2/release/cr272415 u60.fasta /dev/null' 3.84 ± 0.29 times faster than './targets/opt1/release/cr272415 u60.fasta /dev/null' 136.71 ± 9.55 times faster than './targets/original/release/cr272415 u60.fasta /dev/null'

Wow! Thank you very much. I implemented your changes and it runs on my machine about 2s now. This is phenomenal. And thank you for the clippy suggestions. I did not know about it. My only question would be why do we have to make the return of the one_hot_encode_string function &'static str? The Rust documentation says it's reference with static lifetime. But wouldn't a normal str work too? Many thanks in advance! — dry-leaf
– dry-leaf, Commented Dec 28, 2021 at 13:47
&'static str is the type of a string literal in code. str has no compile-time-known size (since it's really a slice), so it can't be used by itself. (You could try – the compiler will complain that you simply can't.) — AKX
– AKX, Commented Dec 28, 2021 at 13:50
@dry-leaf Hey, I figured out one more way to make things even faster - see my edit :) — AKX
– AKX, Commented Dec 28, 2021 at 18:06
Tbh, I am really impressed. I did not expect to get such a serious performance boost. This is something I did not think about all, while beeing quite obvious. Thank you very much for this insight. This will prove useful not just here, but also in future projects! :) — dry-leaf
– dry-leaf, Commented Dec 29, 2021 at 6:23

Stack Exchange Network

Rust program to one hot encode genetic sequences from .fa files

1 Answer 1

original

optimized

Edit ⚡️

Edit ⚡️ x 2

You must log in to answer this question.

Hot Network Questions

Rust program to one hot encode genetic sequences from .fa files

1 Answer 1

original

optimized

Edit ⚡️

Edit ⚡️ x 2

You must log in to answer this question.

Related

Hot Network Questions