A Rust implementation of the StarCode algorithm for efficient sequence clustering and barcode collapsing. This library provides fast, memory-efficient clustering of DNA/RNA sequences with configurable edit distance thresholds.
StarCode is an algorithm designed to cluster sequences (particularly barcodes or short DNA sequences) based on edit distance. This Rust implementation uses an optimized trie data structure with partial dynamic programming matrices to efficiently find similar sequences within a specified edit distance threshold.
- Efficient Trie-based Storage: Sequences sharing common prefixes are stored efficiently
- Partial Needleman-Wunsch: Optimized dynamic programming for edit distance calculation
- Message Passing Clustering: Advanced clustering algorithm that collapses low-frequency sequences into high-frequency ones
- Memory Optimized: Uses custom allocator (jemalloc) for better memory performance
- Shared Ownership: Built on `Rc<RefCell<T>>` interior-mutability patterns for trie and graph nodes (note: `Rc` is single-threaded, so these structures are not `Send`/`Sync`)
Sequences are inserted into a trie where each node represents a prefix. Each node maintains:
- Partial dynamic programming matrix: Contains row and column fragments of the Needleman-Wunsch alignment matrix
- Parent/child relationships: Allows efficient inheritance of alignment data
- Terminal status: Marks nodes representing complete sequences
- Visited tracking: Prevents redundant processing during searches
The trie structure enables sharing of computation for sequences with common prefixes, significantly reducing the overall complexity of edit distance calculations.
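That prefix sharing can be illustrated with a stripped-down node type (a hypothetical sketch; the library's actual `TrieNode` additionally carries the partial DP matrix and bookkeeping fields):

```rust
use std::collections::HashMap;

// Simplified trie node: shared prefixes are stored once.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    is_terminal: bool,
}

impl Node {
    fn insert(&mut self, seq: &[u8]) {
        let mut node = self;
        for &b in seq {
            node = node.children.entry(b).or_default();
        }
        node.is_terminal = true;
    }

    // Count nodes to show how much storage the shared prefix saves.
    fn size(&self) -> usize {
        1 + self.children.values().map(|c| c.size()).sum::<usize>()
    }
}

fn main() {
    let mut root = Node::default();
    root.insert(b"ACGTACGT");
    root.insert(b"ACGTACGA"); // shares a 7-byte prefix with the first
    // 1 root + 7 shared prefix nodes + 2 divergent leaves = 10 nodes,
    // versus 17 if each sequence were stored independently.
    assert_eq!(root.size(), 10);
}
```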
The implementation uses a space-optimized version of the Needleman-Wunsch algorithm:
- Matrix inheritance: Child nodes inherit partial DP matrices from parents
- Row/column computation: Only the necessary row and column fragments are computed
- Best value tracking: Maintains the minimum edit distance found so far
- Early termination: Stops exploration when edit distance exceeds threshold
This approach reduces space complexity from O(m×n) to O(max_edit_distance) per node while maintaining correctness.
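The space saving and early termination can be sketched with a standalone threshold-banded Levenshtein check in the same spirit (illustrative only; the function name and exact banding are not the library's internal API):

```rust
// Banded edit distance: only cells within `k` of the diagonal are
// computed, so each row costs O(k) work; cells outside the band are
// treated as "already above threshold".
fn within_distance(a: &[u8], b: &[u8], k: usize) -> Option<usize> {
    if a.len().abs_diff(b.len()) > k {
        return None; // length gap alone exceeds the threshold
    }
    let big = k + 1; // sentinel for out-of-band cells
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut cur = vec![big; b.len() + 1];
        cur[0] = i + 1;
        let lo = (i + 1).saturating_sub(k).max(1);
        let hi = (i + 1 + k).min(b.len());
        let mut row_min = cur[0].min(big);
        for j in lo..=hi {
            let cost = if ca == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j - 1] + cost)
                .min(prev[j] + 1)
                .min(cur[j - 1] + 1);
            row_min = row_min.min(cur[j]);
        }
        if row_min > k {
            return None; // early termination: the whole band exceeds k
        }
        prev = cur;
    }
    (prev[b.len()] <= k).then_some(prev[b.len()])
}

fn main() {
    assert_eq!(within_distance(b"ACGTACGT", b"ACGTACGA", 1), Some(1));
    assert_eq!(within_distance(b"ACGTACGT", b"TTTTAAAA", 1), None);
}
```

Out-of-band cells always have a true distance greater than `k`, so substituting the `k + 1` sentinel never changes whether a sequence passes the threshold.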
The `chained_search` method efficiently finds sequences within edit distance by:
- Leveraging shared prefixes: Reuses computation from common sequence prefixes
- Using partial DP matrices: Extends existing alignment data rather than recomputing from scratch
- Pruning search space: Abandons branches when minimum possible edit distance exceeds threshold
- Depth-based optimization: Uses prefix overlap information to start searches at optimal depths
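A simplified version of this kind of trie search, where each child extends its parent's DP row by one level and a branch is abandoned once the row minimum exceeds the threshold (names and structure are illustrative, not the library's `chained_search`):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    terminal: Option<Vec<u8>>, // full sequence stored at terminal nodes
}

impl Node {
    fn insert(&mut self, seq: &[u8]) {
        let mut n = self;
        for &b in seq {
            n = n.children.entry(b).or_default();
        }
        n.terminal = Some(seq.to_vec());
    }

    fn search(&self, query: &[u8], k: usize, row: &[usize], out: &mut Vec<(Vec<u8>, usize)>) {
        for (&b, child) in &self.children {
            // Extend the inherited DP row by one trie level.
            let mut next = vec![row[0] + 1];
            for j in 1..=query.len() {
                let cost = if query[j - 1] == b { 0 } else { 1 };
                next.push((row[j - 1] + cost).min(row[j] + 1).min(next[j - 1] + 1));
            }
            if let Some(seq) = &child.terminal {
                if next[query.len()] <= k {
                    out.push((seq.clone(), next[query.len()]));
                }
            }
            // Prune: row values never decrease down the trie, so stop
            // descending once the minimum exceeds the threshold.
            if *next.iter().min().unwrap() <= k {
                child.search(query, k, &next, out);
            }
        }
    }
}

fn main() {
    let mut trie = Node::default();
    trie.insert(b"ACGTACGT");
    trie.insert(b"ACGTACGA");
    trie.insert(b"TTTTAAAA");
    let row: Vec<usize> = (0..=8).collect();
    let mut hits = Vec::new();
    trie.search(b"ACGTACGC", 1, &row, &mut hits);
    assert_eq!(hits.len(), 2); // both ACGT-prefixed sequences match
}
```

Because the two `ACGTACG…` sequences share a prefix, the DP rows for the first seven levels are computed once and reused for both branches.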
The clustering algorithm implements a graph-based approach:
- Distance graph construction: Builds a graph where nodes are sequences and edges connect sequences within edit distance
- Message passing iteration: Repeatedly collapses low-count sequences into high-count neighbors
- Ratio-based filtering: Only collapses sequences when the count ratio exceeds the minimum threshold
- Validity tracking: Maintains which sequences remain as cluster representatives
This approach ensures that error sequences (low-count variants) are absorbed into their most likely correct sequences (high-count variants) while preventing over-clustering of legitimate sequence variants.
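A toy version of the collapse rule, using a hypothetical `collapse` helper restricted to Hamming distance 1 (not the library's `cluster_string_vector_list`):

```rust
// True when two equal-length sequences differ in at most one position.
fn edit1(a: &[u8], b: &[u8]) -> bool {
    a.len() == b.len() && a.iter().zip(b).filter(|(x, y)| x != y).count() <= 1
}

// Absorb low-count sequences into a higher-count neighbor when
// count_high / count_low >= min_ratio.
fn collapse(mut seqs: Vec<(Vec<u8>, usize)>, min_ratio: f64) -> Vec<(Vec<u8>, usize)> {
    seqs.sort_by_key(|(_, c)| *c); // process lowest counts first
    let mut valid = vec![true; seqs.len()];
    for i in 0..seqs.len() {
        // Best surviving higher-count neighbor within distance 1.
        let target = (i + 1..seqs.len())
            .filter(|&j| valid[j] && edit1(&seqs[i].0, &seqs[j].0))
            .max_by_key(|&j| seqs[j].1);
        if let Some(j) = target {
            if seqs[j].1 as f64 / seqs[i].1 as f64 >= min_ratio {
                let absorbed = seqs[i].1;
                seqs[j].1 += absorbed; // error counts flow "uphill"
                valid[i] = false;
            }
        }
    }
    seqs.into_iter()
        .zip(valid)
        .filter_map(|(s, v)| v.then_some(s))
        .collect()
}

fn main() {
    let clusters = collapse(
        vec![
            (b"ACGTACGT".to_vec(), 100),
            (b"ACGTACGA".to_vec(), 1), // absorbed: ratio 100/1 >= 5
            (b"TTTTAAAA".to_vec(), 50),
        ],
        5.0,
    );
    assert_eq!(clusters.len(), 2);
}
```

If the singleton instead had count 30, the 100/30 ratio would fall below 5.0 and both variants would survive as separate clusters, which is what prevents over-clustering of legitimate variants.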
Cluster sequences with the high-level API:

```rust
use rust_star::LinkedDistances;

// Prepare sequence data: (sequence, count) pairs
let sequences = vec![
    (b"ACGTACGT".to_vec(), 100),
    (b"ACGTACGA".to_vec(), 1), // 1 mismatch from first
    (b"ACGTACGC".to_vec(), 1), // 1 mismatch from first
    (b"TTTTAAAA".to_vec(), 50),
];

// Cluster sequences with edit distance ≤ 1 and minimum ratio 5.0
let max_length = 8;
let max_edit_distance = 1;
let min_ratio = 5.0;

let clusters = LinkedDistances::cluster_string_vector_list(
    &max_length,
    sequences,
    &max_edit_distance,
    &min_ratio,
);

// Process results
for (seq, node) in clusters {
    if node.borrow().valid {
        println!(
            "Cluster representative: {} (count: {})",
            String::from_utf8(seq).unwrap(),
            node.borrow().count
        );
        // Show absorbed sequences
        for (absorbed_seq, absorbed_count) in &node.borrow().swallowed_links {
            println!(
                "  └─ {}: {}",
                String::from_utf8(absorbed_seq.clone()).unwrap(),
                absorbed_count
            );
        }
    }
}
```

Or use the trie directly:

```rust
use rust_star::Trie;
use std::collections::HashSet;

let mut trie = Trie::new(10); // max sequence length = 10
let max_mismatches = 2;

// Insert sequences
trie.insert(b"ACGTACGT", None, &max_mismatches);
trie.insert(b"ACGTACGA", None, &max_mismatches);

// Search for similar sequences
let search_nodes = HashSet::new();
let (matches, _) = trie.chained_search(
    0,           // start depth
    None,        // future depth
    b"ACGTACGC", // search sequence
    &max_mismatches,
    &search_nodes,
);

for (seq, distance) in matches {
    println!(
        "Found: {} (distance: {})",
        String::from_utf8(seq).unwrap(),
        distance
    );
}
```

`Trie`: main trie data structure for sequence storage and search.
Methods:
- `new(max_height: usize) -> Self`: create a new trie
- `insert(&mut self, seq: &[u8], depth_to_return: Option<usize>, max_mismatch: &usize) -> Vec<Link<TrieNode>>`: insert a sequence
- `chained_search(&mut self, start_depth: usize, future_depth: Option<usize>, sequence: &[u8], max_mismatches: &usize, search_nodes: &HashSet<Link<TrieNode>>) -> (Vec<(Vec<u8>, usize)>, HashSet<Link<TrieNode>>)`: search for similar sequences
`LinkedDistances`: graph structure for sequence clustering.
Methods:
- `cluster_string_vector_list(max_string_length: &usize, strings: Vec<(Vec<u8>, usize)>, max_mismatch: &usize, minimum_ratio: &f64) -> Vec<(Vec<u8>, Link<DistanceGraphNode>)>`: main clustering function
`TrieNode`: individual node in the trie structure.
Fields:
- `sequence: Vec<u8>`: sequence represented by this node
- `is_terminal: bool`: whether this represents a complete sequence
- `children: HashMap<u8, Rc<RefCell<TrieNode>>>`: child nodes
Partial Needleman-Wunsch alignment data.
Fields:
- `column: Vec<usize>`: column values in the DP matrix
- `row: Vec<usize>`: row values in the DP matrix
- `best_value: usize`: best alignment score found so far
- Memory: Uses jemalloc for improved allocation performance
- Time Complexity: O(n×m×k) where n = number of sequences, m = sequence length, k = edit distance threshold
- Space Complexity: O(n×m) for trie storage plus O(k²) per node for DP matrices
Run the test suite:
```bash
cargo test
```

For performance testing with release optimizations:

```bash
cargo test --release
```

When contributing to this codebase:
- Ensure all tests pass
- Add tests for new functionality
- Follow existing code style and documentation patterns
- Consider performance implications of changes
See LICENSE file for licensing information.