Remove all problematic characters in an intelligent way in C#

Question

Is there any .NET library that removes all problematic characters from a string, leaving only alphanumerics, hyphens and underscores (or a similar subset), in an intelligent way? This is for use in URLs, file names, etc.

I'm looking for something similar to stringex which can do the following:

A simple prelude

"simple English".to_url => "simple-english"

"it's nothing at all".to_url => "its-nothing-at-all"

"rock & roll".to_url => "rock-and-roll"

Let's show off

"$12 worth of Ruby power".to_url => "12-dollars-worth-of-ruby-power"

"10% off if you act now".to_url => "10-percent-off-if-you-act-now"

You don't even wanna trust Iconv for this next part

"kick it en Français".to_url => "kick-it-en-francais"

"rock it Español style".to_url => "rock-it-espanol-style"

"tell your readers 你好".to_url => "tell-your-readers-ni-hao"

BillW, I'm not looking for exactly this; I was just pointing to an example of what I meant by intelligent replacement before someone posted a simple regex (which is the solution I'm already using). I don't particularly care about the translation part.
– Pablo Fernandez, Commented Jan 10, 2010 at 17:14
JPF, sorry to have missed your main intent; glad you got what you needed. I am amazed that the "stringex" library, in its "ActsAsUrl" component, can even turn one or two non-Roman glyphs (Chinese in your example) into English phonemes!
– BillW, Commented Jan 10, 2010 at 17:34

Luke101 · Accepted Answer · 2010-01-10 17:14:48Z

You can try this

    string str = phrase.ToLower(); // optional
    str = str.Trim();
    str = Regex.Replace(str, @"[^a-z0-9\s_]", ""); // invalid chars
    str = Regex.Replace(str, @"\s+", " ").Trim(); // convert multiple spaces into one space
    str = str.Substring(0, str.Length <= 400 ? str.Length : 400).Trim(); // cut and trim it
    str = Regex.Replace(str, @"\s", "-");

Community · Accepted Answer · 2017-05-23 12:13:31Z

Perhaps this question can help you on your way. It shows the code Stack Overflow uses to generate its URLs (more specifically, how question titles are turned into nice URLs).

Link to Question here, where Jeff Atwood shows their code

CraigTP · Accepted Answer · 2010-01-10 17:27:00Z

From your examples, the closest thing I've found (although I don't think it does everything that you're after) is:

My Favorite String Extension Methods in C#

and also:

ÜberUtils - Part 3 : Strings

Since neither of these solutions will give you exactly what you're after (going from the examples in your question) and assuming that the goal here is to make your string "safe", I'd second Hogan's advice and go with Microsoft's Anti Cross Site Scripting Library, or at least use that as a basis for something that you create yourself, perhaps deriving from the library.

Here's a link to a class that builds a number of string extension methods (like the first two examples) but leverages Microsoft's AntiXSS Library:

Extension Methods for AntiXss

Of course, you can always combine the algorithms (or similar ones) used within the AntiXSS library with the kind of algorithms that websites often use to generate "slug" URLs (much like Stack Overflow and many blog platforms do).

Here's an example of a good C# slug generator:

Improved C# Slug Generator
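
The linked generator is not reproduced here. As a rough sketch of the general slug idea only, the code below strips accents through Unicode normalization and then collapses everything else into dashes. The class and method names (SlugSketch, RemoveDiacritics, ToSlug) are made up for illustration and are not taken from any of the linked articles.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the linked articles.
public static class SlugSketch
{
    // Decompose accented letters (e.g. "é" -> "e" + combining accent),
    // then drop the combining marks.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static string ToSlug(string text)
    {
        string s = RemoveDiacritics(text).ToLowerInvariant();
        // Any run of characters outside a-z and 0-9 becomes a single dash.
        s = Regex.Replace(s, @"[^a-z0-9]+", "-");
        return s.Trim('-');
    }
}
```

With this sketch, SlugSketch.ToSlug("kick it en Français") should give "kick-it-en-francais". Letters that do not decompose into a base letter plus an accent (such as "ß" or "ø") are simply turned into dashes, so a replacement table is still needed for those.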

Dan McClain · Accepted Answer · 2010-01-10 17:07:25Z

You could use HttpUtility.UrlEncode, but that would encode everything rather than replace or remove problematic characters, so your spaces would become + and ' would be percent-encoded as well. Not a solution, but maybe a starting point.
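
To make the difference concrete, here is a small sketch of what URL encoding does to one of the question's examples. It uses System.Net.WebUtility.UrlEncode and Uri.EscapeDataString rather than HttpUtility, purely so that it needs no reference to System.Web; that substitution, and the names EncodingDemo, FormEncode and DataEscape, are assumptions made for this illustration.

```csharp
using System;
using System.Net;

// Hypothetical demo class; the method names are illustrative only.
public static class EncodingDemo
{
    // Form-style encoding: a space becomes '+', '&' becomes %26.
    public static string FormEncode(string text)
    {
        return WebUtility.UrlEncode(text);
    }

    // Percent-escaping of every reserved character: a space becomes %20.
    public static string DataEscape(string text)
    {
        return Uri.EscapeDataString(text);
    }
}
```

For "rock & roll", FormEncode returns "rock+%26+roll" and DataEscape returns "rock%20%26%20roll". Both are valid in a URL, but neither is the readable "rock-and-roll" the question asks for.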

Hogan · Accepted Answer · 2010-01-10 17:10:02Z

If the goal is to make the string "safe", I recommend Microsoft's Anti-XSS library.

1 Comment

Pablo Fernandez Over a year ago

The goal is not "safe" in the XSS sense, but safe in the sense that copying and pasting URLs just works, typing them is easy, they are readable, they form a single string for commands (requiring no escaping), and so on.

Adam Ralph · Accepted Answer · 2010-01-10 17:22:33Z

There will be no library capable of what you want, since you are stating specific rules that you want applied, e.g. $x => x-dollars, x% => x-percent. You will almost certainly have to write your own method to achieve this. It shouldn't be too difficult. A string extension method that uses one or more regexes to make the replacements would probably be a nice, concise way of doing it.

e.g.

    public static string ToUrl(this string text)
    {
        // Trim first, then apply the replacement rule(s).
        return Regex.Replace(text.Trim(), ..., ...);
    }
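
One way the rules named above ($x => x-dollars, x% => x-percent) might be filled in is sketched below. The specific patterns, the handling of "&" and apostrophes, and the class name UrlRules are assumptions chosen to reproduce the examples in the question; they are not this answer's own code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods must live in a static class.
public static class UrlRules
{
    public static string ToUrl(this string text)
    {
        string s = text.Trim().ToLowerInvariant();
        // "$12" -> "12 dollars" (optional decimal part is kept with the number).
        s = Regex.Replace(s, @"\$(\d+(?:\.\d+)?)", "$1 dollars");
        // "10%" -> "10 percent".
        s = Regex.Replace(s, @"(\d+(?:\.\d+)?)%", "$1 percent");
        // "rock & roll" -> "rock  and  roll"; the extra spaces collapse below.
        s = s.Replace("&", " and ");
        // "it's" -> "its" rather than "it-s".
        s = s.Replace("'", "");
        // Any remaining run of non-alphanumeric characters becomes one dash.
        s = Regex.Replace(s, @"[^a-z0-9]+", "-");
        return s.Trim('-');
    }
}
```

With these rules, "$12 worth of Ruby power".ToUrl() should give "12-dollars-worth-of-ruby-power" and "10% off if you act now".ToUrl() should give "10-percent-off-if-you-act-now". Accented and non-Latin characters are not handled in this sketch; they are replaced by dashes.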

Porculus · Accepted Answer · 2010-01-10 17:23:00Z

Something the Ruby version doesn't make clear (but the original Perl version does) is that the algorithm it's using to transliterate non-Roman characters is deliberately simplistic -- "better than nothing" in both senses. For example, while it does have a limited capability to transliterate Chinese characters, this is entirely context-insensitive -- so if you feed it Japanese text then you get gibberish out.

The advantage of this simplistic nature is that it's pretty trivial to implement. You just have a big table of Unicode characters and their corresponding ASCII "equivalents". You could pull this straight from the Perl (or Ruby) source code if you decide to implement this functionality yourself.
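
A minimal sketch of that table-driven approach, assuming a hand-written table with a handful of entries, could look like this. The entries and the names AsciiTable and Transliterate are illustrative only and are not copied from the Perl or Ruby sources.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical class; a real table would cover thousands of code points.
public static class AsciiTable
{
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'ç', "c" }, { 'ñ', "n" }, { 'ß', "ss" }, { 'æ', "ae" },
        // Context-insensitive, one reading per character, as described above.
        { '你', "ni " }, { '好', "hao " },
    };

    public static string Transliterate(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c < 128)
            {
                sb.Append(c); // ASCII passes through unchanged
            }
            else if (Map.TryGetValue(c, out string ascii))
            {
                sb.Append(ascii); // table hit: use the ASCII "equivalent"
            }
            else
            {
                sb.Append('?'); // not in the table: visible placeholder
            }
        }
        return sb.ToString();
    }
}
```

The lookup is per UTF-16 char, so characters outside the Basic Multilingual Plane (surrogate pairs) would need extra handling. The result would still have to go through a slug step that replaces spaces and punctuation with dashes.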

m3kh · Accepted Answer · 2010-01-10 18:24:47Z

I'm using something like this in my blog.

    public class Post
    {
        public string Subject { get; set; }

        public string ResolveSubjectForUrl()
        {
            return Regex.Replace(Regex.Replace(this.Subject.ToLower(), "[^\\w]", "-"), "[-]{2,}", "-");
        }
    }

Pablo Fernandez · Accepted Answer · 2010-05-22 12:52:56Z

I couldn't find any library that does this the way the Ruby one does, so I ended up writing my own method. Here it is, in case anyone cares:

    /// <summary>
    /// Turn a string into something that's URL and Google friendly.
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    public static string ForUrl(this string str)
    {
        return str.ForUrl(true);
    }

    public static string ForUrl(this string str, bool MakeLowerCase)
    {
        // Go to lowercase.
        if (MakeLowerCase)
        {
            str = str.ToLower();
        }

        // Replace accented characters for the closest ones:
        char[] from = "ÂÃÄÀÁÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ".ToCharArray();
        char[] to = "AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiidnoooooouuuuyy".ToCharArray();
        for (int i = 0; i < from.Length; i++)
        {
            str = str.Replace(from[i], to[i]);
        }

        // Thorn http://en.wikipedia.org/wiki/%C3%9E
        str = str.Replace("Þ", "TH");
        str = str.Replace("þ", "th");

        // Eszett http://en.wikipedia.org/wiki/%C3%9F
        str = str.Replace("ß", "ss");

        // AE http://en.wikipedia.org/wiki/%C3%86
        str = str.Replace("Æ", "AE");
        str = str.Replace("æ", "ae");

        // Esperanto http://en.wikipedia.org/wiki/Esperanto_orthography
        // Each source character maps to two target characters.
        // Args is an extension method not shown in this answer (presumably a string.Format wrapper).
        from = "ĈĜĤĴŜŬĉĝĥĵŝŭ".ToCharArray();
        to = "CXGXHXJXSXUXcxgxhxjxsxux".ToCharArray();
        for (int i = 0; i < from.Length; i++)
        {
            str = str.Replace(from[i].ToString(), "{0}{1}".Args(to[i*2], to[i*2+1]));
        }

        // Currencies.
        str = new Regex(@"([¢€£\$])([0-9\.,]+)").Replace(str, @"$2 $1");
        str = str.Replace("¢", "cents");
        str = str.Replace("€", "euros");
        str = str.Replace("£", "pounds");
        str = str.Replace("$", "dollars");

        // Ands
        str = str.Replace("&", " and ");

        // More aesthetically pleasing contractions
        str = str.Replace("'", "");
        str = str.Replace("’", "");

        // Except alphanumeric, everything else is a dash.
        str = new Regex(@"[^A-Za-z0-9-]").Replace(str, "-");

        // Remove dashes at the beginning or end.
        str = str.Trim("-".ToCharArray());

        // Compact duplicated dashes.
        str = new Regex("-+").Replace(str, "-");

        // Let's url-encode just in case.
        // UrlEncode is likewise an extension method not shown in this answer.
        return str.UrlEncode();
    }
