In C#, what is the fastest way to detect duplicate characters in a String and remove them (removal including 1st instance of the duplicated character)?
Example Input: nbHHkRvrXbvkn
Example Output: RrX
Fastest as in fewest-lines-of-code:
var s = "nbHHkRvrXbvkn"; var duplicates = s.Where(ch => s.Count(c => c == ch) > 1); var result = new string(s.Except(duplicates).ToArray()); // = "RrX" Fastest as in fastest-performance would probably be something like this (does not preserve order):
var h1 = new HashSet<char>(); var h2 = new HashSet<char>(); foreach (var ch in "nbHHkRvrXbvkn") { if (!h1.Add(ch)) { h2.Add(ch); } } h1.ExceptWith(h2); // remove duplicates var chars = new char[h1.Count]; h1.CopyTo(chars); var result = new string(chars); // = "RrX" Performance test
When in doubt -- test it :)
Yuriy Faktorovich's answer 00:00:00.2360900 Luke's answer 00:00:00.2225683 My 'few lines' answer 00:00:00.5318395 My 'fast' answer 00:00:00.1842144
Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:
string s = "nbHHkRvrXbvkn"; Console.WriteLine( s.ToCharArray() .GroupBy(c => c) .Where(g => g.Count() == 1) .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key))); Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order
private static string MyMethod(string s) { StringBuilder sb = new StringBuilder(s.Length); foreach (var g in s.ToCharArray().GroupBy(c => c)) if (g.Count() == 1) sb.Append(g.Key); return sb.ToString(); } This one should be pretty quick (and it preserves the original order):
public static string RemoveDuplicates(string source) { HashSet<char> found = new HashSet<char>(); HashSet<char> dupes = new HashSet<char>(); foreach (char c in source) { if (!found.Add(c)) { dupes.Add(c); } } StringBuilder sb = new StringBuilder(source.Length); foreach (char c in source) { if (!dupes.Contains(c)) { sb.Append(c); } } return sb.ToString(); } StringBuilder was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string.StringBuilder is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.)This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).
private static string RemoveDuplicates(string s) { int[] c = new int[255]; for (int i = 0; i < s.Length; i++) { c[s[i]]++; } StringBuilder sb = new StringBuilder(); for (int i = 0; i < s.Length; i++) { if (c[s[i]] == 1) sb.Append(s[i]); } return sb.ToString(); } HashSet once the array is sized correctly.This algorithm is general, can be applied to any language