8

In C#, what is the fastest way to detect duplicate characters in a String and remove them (removal including 1st instance of the duplicated character)?

Example Input: nbHHkRvrXbvkn

Example Output: RrX

5 Answers 5

21

Fastest as in fewest-lines-of-code:

var s = "nbHHkRvrXbvkn"; var duplicates = s.Where(ch => s.Count(c => c == ch) > 1); var result = new string(s.Except(duplicates).ToArray()); // = "RrX" 

Fastest as in fastest-performance would probably be something like this (does not preserve order):

var h1 = new HashSet<char>(); var h2 = new HashSet<char>(); foreach (var ch in "nbHHkRvrXbvkn") { if (!h1.Add(ch)) { h2.Add(ch); } } h1.ExceptWith(h2); // remove duplicates var chars = new char[h1.Count]; h1.CopyTo(chars); var result = new string(chars); // = "RrX" 

Performance test

When in doubt -- test it :)

 Yuriy Faktorovich's answer 00:00:00.2360900 Luke's answer 00:00:00.2225683 My 'few lines' answer 00:00:00.5318395 My 'fast' answer 00:00:00.1842144 
Sign up to request clarification or add additional context in comments.

6 Comments

Very nice. Great performance comparison too. Performance variation is probably even more visible with very large strings.
I've repeated the performance test in Release build with detached debugger (but same input string). I'm surpised by the performance of Yuriy's answer; it's pretty fast!
@dtb: The thing that slows down my answer compared to yours is that I'm preserving the original order in the output string, which necessitates an additional loop through the input string. The technique that you and I use for actually finding the dupes is exactly the same.
see my solution below ... you had the right idea but using an array for known data is 4 times faster than HashSet in my tests
c# has var? ... i've never seen the use of var in c# before... BTW great piece of code good job! +1
|
9

Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:

string s = "nbHHkRvrXbvkn"; Console.WriteLine( s.ToCharArray() .GroupBy(c => c) .Where(g => g.Count() == 1) .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key))); 

Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order

private static string MyMethod(string s) { StringBuilder sb = new StringBuilder(s.Length); foreach (var g in s.ToCharArray().GroupBy(c => c)) if (g.Count() == 1) sb.Append(g.Key); return sb.ToString(); } 

Comments

4

This one should be pretty quick (and it preserves the original order):

public static string RemoveDuplicates(string source) { HashSet<char> found = new HashSet<char>(); HashSet<char> dupes = new HashSet<char>(); foreach (char c in source) { if (!found.Add(c)) { dupes.Add(c); } } StringBuilder sb = new StringBuilder(source.Length); foreach (char c in source) { if (!dupes.Contains(c)) { sb.Append(c); } } return sb.ToString(); } 

3 Comments

Why do you think creating a StringBuilder that is probably too large will take less time than allowing it to get space on the fly?
@Yuri: I benchmarked! I tested with millions of random strings and pre-sizing the StringBuilder was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string.
@Yuriy: I just benchmarked on a different machine (Vista64 vs XP32) and the results were much closer. On the 64-bit machine it appears to make no real difference whether the StringBuilder is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.)
2

This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).

 private static string RemoveDuplicates(string s) { int[] c = new int[255]; for (int i = 0; i < s.Length; i++) { c[s[i]]++; } StringBuilder sb = new StringBuilder(); for (int i = 0; i < s.Length; i++) { if (c[s[i]] == 1) sb.Append(s[i]); } return sb.ToString(); } 

5 Comments

also, i don't know if the compiler will unroll those loops for you, but you can try that too en.wikipedia.org/wiki/Loop_unwinding
What is your test timing/stopwatch result with the sample string?
With the sample string, this method would be 1/4 or less compared to others.
@Yuriy: Is that with the array size correctly set to 65536? In my tests there's not really much difference between this method and using a HashSet once the array is sized correctly.
No, once the Array is set correctly on average working on a 100 character string it was 60% worse.
0

This algorithm is general, can be applied to any language

  1. create a map(HashTable) char->int that holds the count of each character found, initially empty
  2. scan the string once to populate the map.
  3. create a new empty string that'll hold the output, may be you'll need to use a StringBuilder.
  4. scan the string(orthe map, whichever is shorter) copying only characters which an occurrence of 1 to the output string/StringBuilder

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.