I wanted to write some code which reads in a FASTA file and one hot encodes the sequence which is consequentially saved to a file. A FASTA file is a text based file format commonly used in bioinformatics. Here's an example:
>chr1 ACGTCTCGTGC NNNNNNNNNNN ACGTTGGGGGG >chr2 AGGGTCTGGGT NNNNNNNNNNN It has a header string with > and then some genetic sequence following. Every character in that sequence represents a nucleotide. For example "A" stands for Adenine and "N" for unknown. The number of possible characters in the sequence is limited, but there are some encodings allowing different representations. But this is not that important for me, since I will probably filter beforehand anyway.
The output should be a text file with the one hot encoded sequence and the header for identification. Depending on how one would choose the encoding this could look something like this for the first three nucleotides(ACG):
>chr1 0,0,0,1 0,1,0,0 0,0,1,0 I wrote some Rust code together to achieve this and it works, but it seems to be pretty slow. Maybe I should also mention at this point that the FASTA files can get really big (many GB) in no time. I tried my code (cargo build --release) with an ~60MB .fa file and took about ~1m.30s and some generic Python code I wrote together in 10 minutes took only ~40s.
Now I wanted to know why my Rust Code is slower then it's Python equivalent. Did I write it inn an idiomatic Rust way? I do not want to store the data in something like HDF5. I just want to understand Rust better. Especially I/O heavy text processing tasks, since these are common in bioinformatics.
Here's the code I wrote:
use std::fs::File; use std::io::{BufReader, Write}; use std::io::prelude::*; use std::env; use std::fs::OpenOptions; fn main() -> std::io::Result<()>{ // gather cmd arguments let args: Vec<String> = env::args().collect(); let filename: &String = &args[1]; let outfile: &String = &args[2]; // Open the file to read from let file = File::open(&filename).unwrap(); // Create a buffered reader -> instead of reading everything to ram let reader = BufReader::new(file); let lines = reader.lines(); // this allows to open a file in file append mode to write (or create it) let mut to_write_file = OpenOptions::new() .append(true) .create(true) // Optionally create .open(outfile) .expect("Unable to open file"); // iterate over every line for l in lines { // check if we get a string back (Result<String, Error>) if let Ok(line_string) = l { // check for header lines if line_string.contains("chr"){ println!("Processing: {}",line_string); // write to file so we can keep track of chroms writeln!(to_write_file, "{}",line_string).unwrap(); } else { // iterate over nucleotides for character in line_string.chars() { // one hot encode the nucleotide let nucleotide: String = one_hot_encode_string(String::from(character)); writeln!(to_write_file,"{}", nucleotide)?; } } } } Ok(()) } fn one_hot_encode_string(nucleotide: String) -> String { // One hot encodes a Nucleotide: String let encoding: String; // create return variable // OHE the nucleotide match nucleotide.as_str() { "A" => encoding = String::from("0,0,0,1"), "G" => encoding = String::from("0,0,1,0"), "C" => encoding = String::from("0,1,0,0"), "T" => encoding = String::from("1,0,0,0"), "N" => encoding = String::from("0,0,0,0"), _ => encoding = String::from("NA"), } encoding } Her's also the "naive" python code:
import sys # one hot encode nulceotide def ohe(nucleotide: str) -> str: if nucleotide == "A": return("0,0,0,1") if nucleotide == "G": return("0,0,1,0") if nucleotide == "T": return("0,1,0,0") if nucleotide == "C": return("1,0,0,0") if nucleotide == "N": return("0,0,0,0") else: return("NA") def main() -> None: # get arguments input: str = sys.argv[1] output: str = sys.argv[2] # open in file with open(input) as in_f: # open out file with open(output, 'a') as out_f: for line in in_f: # get header if line.startswith(">"): out_f.write(f"{line.rstrip()}\n") else: # iterate over every nucleotide for character in line.rstrip(): # one hot encode character # print(ohe(character)) out_f.write(f"{ohe(character)}\n") if __name__ == "__main__": main()