5

Looking for some black magic that will match any string with "weird" characters in it. Standard ASCII characters are fine. Everything else isn't.

This is for sanitizing various web forms.

5
  • 1
    Seriously? U+0001 START OF HEADING or U+0007 BELL is fine, but plain English isn't? Are you sure that ASCII is what you want to match? Commented Aug 24, 2010 at 23:47
  • Come on, why are you hating on \a? It's great. But yes, seriously. Last time I checked, none of those interferes with page rendering like the mirror char or some of the others. Commented Aug 24, 2010 at 23:52
  • 1
    é doesn't mess with a page either. If messing with page rendering is the issue, then maybe use \p{C}. new Regex(@"\p{C}").Replace(suspect, string.Empty) will clear out both ASCII and non-ASCII controls and formatters, without damaging normal text that a more naïve (or as you would have it, nave) approach would mangle. That matters particularly if you have names of people or places appearing anywhere (proper names being both places where non-ASCII letters crop up a lot in English, and places where users get particularly upset if you mangle them). Commented Aug 25, 2010 at 0:28
  • ï is ASCII, you know ;-) Commented Aug 25, 2010 at 16:20
  • I just stumbled into this very problem, and for some frameworks, such as ASP.NET MVC, the answer is not exactly a simple exclusion regex - see here for more: nimblegecko.com/… Commented Mar 11, 2016 at 7:55

2 Answers

7

This matches any character outside the ASCII range:

[^\x00-\x7F] 

The ASCII range still contains some "weird" characters, such as \x00 (NUL), but they are valid ASCII, so this pattern does not match them.
For reference, see the ASCII table
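
A minimal C# sketch of how that pattern might be used for form input. The class and method names (`AsciiCheck`, `HasNonAscii`, `StripNonAscii`) are made up for illustration and are not from the question:

```csharp
using System;
using System.Text.RegularExpressions;

static class AsciiCheck
{
    // Matches any single UTF-16 code unit outside U+0000..U+007F.
    static readonly Regex NonAscii = new Regex(@"[^\x00-\x7F]");

    // True when the input contains at least one non-ASCII character.
    public static bool HasNonAscii(string input) => NonAscii.IsMatch(input);

    // Removes every non-ASCII character.
    // ASCII control characters such as NUL are left in place.
    public static string StripNonAscii(string input) =>
        NonAscii.Replace(input, string.Empty);
}
```

With this, `AsciiCheck.HasNonAscii("na\u00EFve")` is true and `AsciiCheck.StripNonAscii("na\u00EFve")` gives `"nave"`, while a string containing only ASCII plus a NUL passes untouched.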


2 Comments

That "ASCII table" page is crap (pardon my French). It presents that second chart as "the most popular" of the "extended ASCII sets"--come again? It's Cp850! Nobody uses that on purpose; it just happens to be the default encoding of the Windows command line. Also, the tables are images, and they look like hell (pardon my Italian) on an LCD display. Send them to Wikipedia instead: en.wikipedia.org/wiki/ASCII
For "printable ASCII" (which is what I would argue almost everybody searching for this actually wants) I would use [^\x20-\x7E]. That cuts out the control characters 0x00 through 0x1F and the 0x7F control character. Alternatively, [^\x20-\x7E\r\n\t] adds back in the common line-ending characters and tabs, which may or may not be desirable.
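
To make the difference between those two patterns concrete, here is a small C# sketch. The names `PrintableAscii`, `FailsStrict` and `FailsLenient` are hypothetical, chosen only for this example:

```csharp
using System;
using System.Text.RegularExpressions;

static class PrintableAscii
{
    // Strict: anything outside space (0x20) through tilde (0x7E).
    static readonly Regex Strict = new Regex(@"[^\x20-\x7E]");

    // Lenient: same range, but CR, LF and tab are also accepted.
    static readonly Regex Lenient = new Regex(@"[^\x20-\x7E\r\n\t]");

    // True when the input contains a character the strict pattern rejects.
    public static bool FailsStrict(string input) => Strict.IsMatch(input);

    // True when the input contains a character the lenient pattern rejects.
    public static bool FailsLenient(string input) => Lenient.IsMatch(input);
}
```

A multi-line textarea value such as `"line one\nline two"` fails the strict check but passes the lenient one. A string containing BELL (`"ding\a"`) or a non-ASCII letter (`"caf\u00E9"`) fails both.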
2

[^\p{IsBasicLatin}] for what is asked for, [^\x00-\x7F] for concision over self-documentation, or \p{C} for clearing out formatters and controls without hurting other non-ASCIIs (and with greater concision yet).
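
A rough C# sketch contrasting the first and last options on the same input. The class and method names are invented for illustration. One caveat to verify before relying on it: .NET regexes work on UTF-16 code units, and \p{C} includes the surrogate category, so characters outside the Basic Multilingual Plane would likely be stripped by it as well.

```csharp
using System;
using System.Text.RegularExpressions;

static class FormCleanup
{
    // Everything outside the Basic Latin block (U+0000..U+007F).
    static readonly Regex NonBasicLatin = new Regex(@"[^\p{IsBasicLatin}]");

    // Unicode general category C: controls, format characters, and similar.
    static readonly Regex ControlsAndFormatters = new Regex(@"\p{C}");

    // Drops all non-ASCII characters, including ordinary accented letters.
    public static string KeepBasicLatinOnly(string s) =>
        NonBasicLatin.Replace(s, string.Empty);

    // Drops controls and formatters but keeps letters such as U+00EB.
    public static string StripControls(string s) =>
        ControlsAndFormatters.Replace(s, string.Empty);
}
```

For the input `"Zo\u00EB\u202E\a"` (a name followed by a right-to-left override and a BELL), `KeepBasicLatinOnly` returns `"Zo\a"`, losing the ë but keeping the BELL, while `StripControls` returns `"Zo\u00EB"`, keeping the name intact and removing both invisible characters.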
