Revision cf704c3d-e9ee-4251-ba41-d2bd9515e997

I needed to do transformations on a data stream, not on an element-by-element basis, but on subsequences. This is what I've come up with:
https://github.com/peterjoel/rust-iter-replace/blob/master/src/lib.rs

I am relatively new to Rust, and would love some feedback on any areas of the code: style, design, memory/performance no-nos. But performance (throughput) is particularly important in this application, as I'll be using it to "compress" very large (multi GB) pieces of data, in memory and on disc.

Design Overview
-------

The central struct `Replace`, keeps track of two buffers: `buffer_out` and `buffer_in`, a set of partial match `candidates` and some other members for keeping track of state between invocations of `next()`.

`buffer_out` is data that is fully processed and ready to pass to the next iterator adapter - this will either contain unmatched data, or the full replacement sequence. `buffer_in` contains data that may or may not match, and gets copied to `buffer_out` as soon as it can be shown that it doesn't match and gets erased when it does. I chose a `VecDeque` for `buffer_out` because it generally gets written to at the back and read from at the start. As I write this, I realise that `buffer_in` could have been just a `Vec`. Maybe I'll change that.

The `BTreeSet`, `candidates`, keeps track of the index when the first item of a partial match occurred. As soon as a partial match no longer matches then it is discarded. I chose a `BTreeSet` because I needed to access the smallest element to know when I can flush any part of `buffer_in`. But actually its elements are added in size order - which I'm not taking advantage of here - so there could be a better data structure which can exploit that invariant. 

The section of code where I remove elements from the `candidates` set isn't great. I originally wrote a trait and had it as a method of `BTreeSet`, but I had some errors (I forgot now) to do with the type signature, where I couldn't get it to match the expected type of the predicate used by `filter`. I may revisit that.


Source
-------


 use std::collections::{BTreeSet};
 use std::collections::VecDeque;
 
 pub struct Replace <'a, I, T: 'a + Ord > {
 iter: I,
 buffer_out: VecDeque<T>,
 buffer_in: VecDeque<T>,
 replace_from: &'a [T],
 replace_with: &'a [T],
 candidates: BTreeSet<usize>,
 index: usize,
 flushed_index: usize,
 }
 
 impl <'a, I, T> Replace <'a, I, T> where
 I: Iterator<Item = T>,
 T: Eq + Ord + Copy {
 pub fn adapt(iter: I, replace_from: &'a [T], replace_with: &'a [T]) -> Replace<'a, I, T> {
 Replace {
 iter: iter,
 buffer_out: VecDeque::new(),
 buffer_in: VecDeque::new(),
 replace_from: replace_from,
 replace_with: replace_with,
 candidates: BTreeSet::new(),
 index: 0,
 flushed_index: 0,
 }
 }
 
 fn fill_buffer(&mut self) {
 'consume: while let Some(item) = self.iter.next() {
 self.index += 1;
 // buffer all incoming items
 self.buffer_in.push_back(item);
 // Prune existing partial match candidates that don't match the next item
 let removes: Vec<_> = self.candidates.iter().cloned()
 .filter(|start_index| {
 self.replace_from[self.index - *start_index] != item
 }).collect();
 for r in removes {
 self.candidates.remove(&r);
 }
 // Keep track of new partial match candidates
 if self.replace_from[0] == item {
 self.candidates.insert(self.index);
 }
 // if the length of the first match is the length of the replace sequence then it's a complete match
 match self.candidates.iter().cloned()
 .next()
 .into_iter()
 .find(|x| self.index - x + 1 == self.replace_from.len()) {
 None => {
 // We can flush the inbound buffer up to the first partial match
 // (or the full buffer if there are no partial matches)
 let flush_index = self.candidates.iter().next().map(|x| x - 1).unwrap_or(self.index);
 if flush_index > self.flushed_index {
 let mut flush: VecDeque<_> = self.buffer_in.drain(0 .. flush_index - self.flushed_index).collect();
 self.buffer_out.append(&mut flush);
 self.flushed_index = flush_index;
 break 'consume;
 }
 },
 Some(_) => {
 // A match! So replace it and clear all the partial matches
 self.candidates.clear();
 for &x in self.replace_with.iter() {
 self.buffer_out.push_back(x);
 }
 self.buffer_in.clear();
 self.flushed_index = self.index;
 break 'consume;
 }
 }
 }
 }
 
 }
 
 
 pub trait ReplaceIter<'a, I, T> where
 I: Iterator<Item = T>,
 T: Ord {
 
 // fn replace_iter(self, map: BTreeMap<&'a [T], &'a [T]>) -> Replace<'a, I, T>;
 fn replace(self, from: &'a [T], to: &'a [T]) -> Replace<'a, I, T>;
 }
 
 impl <'a, I, T> ReplaceIter<'a, I, T> for I where
 I: Iterator<Item = T>,
 T: Eq + Ord + Copy {
 
 // fn replace_iter(self, map: BTreeMap<&'a [T], &'a [T]>) -> Replace<'a, I, T> {
 fn replace(self, from: &'a [T], to: &'a [T]) -> Replace<'a, I, T> {
 Replace::adapt(self, from, to)
 }
 }
 
 
 impl <'a, I, T> Iterator for Replace <'a, I, T> where
 I: Iterator<Item = T>,
 T: Eq + Ord + Copy {
 
 type Item = T;
 
 fn next(&mut self) -> Option<T> {
 if self.buffer_out.len() == 0 {
 self.fill_buffer();
 }
 self.buffer_out.pop_front()
 }
 
 }
 

 
And tests, which show the usage:
 
 #[cfg(test)]
 mod tests {
 use super::*;
 
 #[test]
 pub fn test_replace_simple() {
 let v: Vec<u32> = vec![1,2,3].into_iter().replace(&[2], &[10]).collect();
 assert_eq!(v, vec![1,10,3]);
 }

 #[test]
 pub fn test_replace_longer() {
 let v: Vec<u32> = vec![3,4,5,6,7,8,9].into_iter().replace(&[4,5], &[100]).collect();
 assert_eq!(v, vec![3,100,6,7,8,9]);
 }
 
 #[test]
 pub fn test_replace_multi() {
 let v: Vec<u32> = vec![3,4,5,6,4,5,9].into_iter().replace(&[4,5], &[100,200,300]).collect();
 assert_eq!(v, vec![3,100,200,300,6,100,200,300,9]);
 }
 
 #[test]
 pub fn test_nearly_match() {
 let v: Vec<u32> = vec![3,4,5,6].into_iter().replace(&[4,5,1], &[100,200]).collect();
 assert_eq!(v, vec![3,4,5,6]);
 }
 
 #[test]
 pub fn test_replace_overlapping() {
 let v: Vec<u32> = vec![3,4,5,4,5,4,9].into_iter().replace(&[4,5,4,5], &[100]).collect();
 assert_eq!(v, vec![3,100,4,9]);
 }
 }