Fastest way to implement Duplicate Character Removal in String (C#)

Question

In C#, what is the fastest way to detect duplicate characters in a String and remove them (removal including 1st instance of the duplicated character)?

Example Input: nbHHkRvrXbvkn

Example Output: RrX

dtb · Accepted Answer · 2009-08-27 22:43:23Z

Fastest as in fewest-lines-of-code:

var s = "nbHHkRvrXbvkn"; var duplicates = s.Where(ch => s.Count(c => c == ch) > 1); var result = new string(s.Except(duplicates).ToArray()); // = "RrX"

Fastest as in fastest-performance would probably be something like this (does not preserve order):

var h1 = new HashSet<char>(); var h2 = new HashSet<char>(); foreach (var ch in "nbHHkRvrXbvkn") { if (!h1.Add(ch)) { h2.Add(ch); } } h1.ExceptWith(h2); // remove duplicates var chars = new char[h1.Count]; h1.CopyTo(chars); var result = new string(chars); // = "RrX"

Performance test

When in doubt -- test it :)

 Yuriy Faktorovich's answer 00:00:00.2360900 Luke's answer 00:00:00.2225683 My 'few lines' answer 00:00:00.5318395 My 'fast' answer 00:00:00.1842144

Very nice. Great performance comparison too. Performance variation is probably even more visible with very large strings.
I've repeated the performance test in Release build with detached debugger (but same input string). I'm surpised by the performance of Yuriy's answer; it's pretty fast!
@dtb: The thing that slows down my answer compared to yours is that I'm preserving the original order in the output string, which necessitates an additional loop through the input string. The technique that you and I use for actually finding the dupes is exactly the same.
see my solution below ... you had the right idea but using an array for known data is 4 times faster than HashSet in my tests
c# has var? ... i've never seen the use of var in c# before... BTW great piece of code good job! +1

Yuriy Faktorovich · Accepted Answer · 2009-08-28 15:55:25Z

Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:

string s = "nbHHkRvrXbvkn"; Console.WriteLine( s.ToCharArray() .GroupBy(c => c) .Where(g => g.Count() == 1) .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key)));

Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order

private static string MyMethod(string s) { StringBuilder sb = new StringBuilder(s.Length); foreach (var g in s.ToCharArray().GroupBy(c => c)) if (g.Count() == 1) sb.Append(g.Key); return sb.ToString(); }

LukeH · Accepted Answer · 2009-08-27 22:28:05Z

4

This one should be pretty quick (and it preserves the original order):

public static string RemoveDuplicates(string source) { HashSet<char> found = new HashSet<char>(); HashSet<char> dupes = new HashSet<char>(); foreach (char c in source) { if (!found.Add(c)) { dupes.Add(c); } } StringBuilder sb = new StringBuilder(source.Length); foreach (char c in source) { if (!dupes.Contains(c)) { sb.Append(c); } } return sb.ToString(); }

answered Aug 27, 2009 at 22:28

LukeH

271k59 gold badges373 silver badges411 bronze badges

3 Comments

Yuriy Faktorovich Over a year ago

Why do you think creating a StringBuilder that is probably too large will take less time than allowing it to get space on the fly?

LukeH Over a year ago

@Yuri: I benchmarked! I tested with millions of random strings and pre-sizing the StringBuilder was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string.

LukeH Over a year ago

@Yuriy: I just benchmarked on a different machine (Vista64 vs XP32) and the results were much closer. On the 64-bit machine it appears to make no real difference whether the StringBuilder is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.)

gabe · Accepted Answer · 2009-08-28 16:13:10Z

This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).

 private static string RemoveDuplicates(string s) { int[] c = new int[255]; for (int i = 0; i < s.Length; i++) { c[s[i]]++; } StringBuilder sb = new StringBuilder(); for (int i = 0; i < s.Length; i++) { if (c[s[i]] == 1) sb.Append(s[i]); } return sb.ToString(); }

also, i don't know if the compiler will unroll those loops for you, but you can try that too en.wikipedia.org/wiki/Loop_unwinding
What is your test timing/stopwatch result with the sample string?
With the sample string, this method would be 1/4 or less compared to others.
@Yuriy: Is that with the array size correctly set to 65536? In my tests there's not really much difference between this method and using a HashSet once the array is sized correctly.
No, once the Array is set correctly on average working on a 100 character string it was 60% worse.

Diaa Sami · Accepted Answer · 2009-08-27 21:59:03Z

This algorithm is general, can be applied to any language

create a map(HashTable) char->int that holds the count of each character found, initially empty
scan the string once to populate the map.
create a new empty string that'll hold the output, may be you'll need to use a StringBuilder.
scan the string(orthe map, whichever is shorter) copying only characters which an occurrence of 1 to the output string/StringBuilder

Collectives™ on Stack Overflow

Fastest way to implement Duplicate Character Removal in String (C#)

5 Answers 5

6 Comments

Comments

3 Comments

5 Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

6 Comments

Comments

3 Comments

5 Comments

Comments

Linked

Related