I want to remove all characters from a string, except Unicode letters.

I am considering this code:

    public static string OnlyLetters(string text)
    {
        return new string(text.Where(c => Char.IsLetter(c)).ToArray());
    }

But maybe Regex will be faster?

    public static string OnlyLetters(string text)
    {
        // Verbatim string: "\p" is not a valid escape sequence in a regular C# string literal
        Regex rgx = new Regex(@"[^\p{L}]");
        return rgx.Replace(text, "");
    }

Could you verify this code and suggest which one I should choose?

  • Another option is return string.Concat(text.Where(c => Char.IsLetter(c))); Commented Aug 13, 2022 at 9:17
  • Please read ericlippert.com/2012/12/17/performance-rant Commented Aug 13, 2022 at 9:21
  • Additionally, I don't think Windows Forms is relevant to what you're asking - while it's entirely feasible that you're using this within a WinForms app, it's not really a WinForms-oriented question. Commented Aug 13, 2022 at 9:21
  • Regex will not be faster; on my workstation (.NET 6) the regex is about 2 times slower (440 ms vs. 180 ms) when processing a string of 10_000_000 characters. Commented Aug 13, 2022 at 9:32
  • @DmitryBychenko string.Concat(text.Where(c => Char.IsLetter(c))); will be much slower owing to allocating LINQ's enumerators and the double copy of the data caused by ToArray() and new String. Commented Aug 13, 2022 at 9:36

1 Answer

If you want to know which horse is faster, race them.

Hand-written loops often turn out to be fast, so let's add one to the field:

    private static string ManualReplace(string value)
    {
        // Let's allocate memory only once - value.Length characters
        StringBuilder sb = new StringBuilder(value.Length);

        foreach (char c in value)
            if (char.IsLetter(c))
                sb.Append(c);

        return sb.ToString();
    }

Races:

    // 123 - seed - in order for the text to be the same
    Random random = new Random(123);

    // Let's compile the regex
    Regex rgx = new Regex(@"[^\p{L}]", RegexOptions.Compiled);

    string result = null; // <- makes the compiler happy

    string text = string.Concat(Enumerable
        .Range(1, 10_000_000)
        .Select(_ => (char)random.Next(32, 128)));

    Stopwatch sw = new Stopwatch();

    // Warming: let .NET compile CIL, fill caches, allocate memory, etc.
    int warming = 5;

    for (int i = 0; i < warming; ++i)
    {
        // Only the last iteration is timed
        if (i == warming - 1)
            sw.Start();

        // result = new string(text.Where(c => char.IsLetter(c)).ToArray());
        result = rgx.Replace(text, "");
        // result = string.Concat(text.Where(c => char.IsLetter(c)));
        // result = ManualReplace(text);

        if (i == warming - 1)
            sw.Stop();
    }

    Console.WriteLine($"{sw.ElapsedMilliseconds}");

Run this several times, uncommenting one candidate at a time, and compare the timings. Mine (.NET 6, Release) are:

    new string    : 120 ms
    rgx.Replace   : 350 ms
    string.Concat : 150 ms
    Manual        :  80 ms

So we have a winner: the manual replace. Among the others, new string(text.Where(c => Char.IsLetter(c)).ToArray()) is the fastest, string.Concat is slightly slower, and Regex.Replace is the slowest by a wide margin.
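
Since the question also asks to verify the code, it may be worth checking that all four candidates return the same string before trusting the timings. The following is only a minimal sketch: the class name OnlyLettersCheck, the method AllAgree and the sample text are made up for illustration, while ManualReplace and the regex pattern are taken from the code above.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class OnlyLettersCheck
{
    // Same pattern as in the race above: everything that is not a Unicode letter
    private static readonly Regex NonLetters =
        new Regex(@"[^\p{L}]", RegexOptions.Compiled);

    // Same manual approach as in the race above
    public static string ManualReplace(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length);

        foreach (char c in value)
            if (char.IsLetter(c))
                sb.Append(c);

        return sb.ToString();
    }

    // Hypothetical helper: true when all four approaches produce the same result
    public static bool AllAgree(string text)
    {
        string viaNewString = new string(text.Where(c => char.IsLetter(c)).ToArray());
        string viaRegex = NonLetters.Replace(text, "");
        string viaConcat = string.Concat(text.Where(c => char.IsLetter(c)));
        string viaManual = ManualReplace(text);

        return viaNewString == viaRegex
            && viaRegex == viaConcat
            && viaConcat == viaManual;
    }

    public static void Main()
    {
        // Arbitrary sample mixing letters, digits, punctuation and non-ASCII letters
        string sample = "Hello, world! 123 Привет_мир";

        Console.WriteLine(ManualReplace(sample));
        Console.WriteLine(AllAgree(sample));
    }
}
```

All four candidates classify text one UTF-16 char at a time, so they are expected to agree; for the same reason, letters outside the Basic Multilingual Plane (stored as surrogate pairs) are dropped by every one of them.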

2 Comments

When I use a StringBuilder with a for loop it runs in 107ms, while your winner "new string(..." runs in 130ms - just sayin' (.NET 6 x64, v6.0.8, Release build)
@Dai: yes, StringBuilder, if we allocate memory up front via new StringBuilder(text.Length), may well be even faster. Initially, I raced only the approaches mentioned in the question and comments: new string, Regex.Replace and String.Concat.