How to use Unicode in Regex

Question

I am writing a regex to find the rows in a text file that contain certain Unicode characters:

!Regex.IsMatch(colCount.line, @"^[\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]+$")

Below is the full code I have written:

var _fileName = @"C:\text.txt";
BadLinesLst = File
    .ReadLines(_fileName, Encoding.UTF8)
    .Select((line, index) =>
    {
        var count = line.Count(c => Delimiter == c) + 1;
        if (NumberOfColumns < 0)
            NumberOfColumns = count;
        return new { line = line, count = count, index = index };
    })
    .Where(colCount => colCount.count != NumberOfColumns
        || Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]"))
    .Select(colCount => colCount.line)
    .ToList();

The file contains the following rows:

264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000

6420œ50-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000

4804¥75-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000

If a row of the file contains any character outside BasicLatin, LatinExtended-A, or LatinExtended-B, I need to get that row. The regex above is not working properly: it also returns rows that contain LatinExtended-A or LatinExtended-B characters.

The delimiter is , (comma), and if I do not pass the number of columns, it defaults to -1. My rows have comma-separated columns, so I check that all rows have the same number of columns, and I also use the regex to find rows that contain special or Chinese characters, that is, anything outside the blocks named in the regex.
– Rocky, Commented Jun 23, 2016 at 11:00
I also tried removing that line of code, but it still does not work.
– Rocky, Commented Jun 23, 2016 at 11:17
Well, I tried with a file containing the lines 480Œ475-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000, фыв, ыыыы and aaa, and got the result: фыв, ыыыы and aaa. Isn't it expected?
– Wiktor Stribiżew, Commented Jun 23, 2016 at 11:18
Just note that the encoding is always the tricky part. If the file is not ANSI, you just need to pass true to the StreamReader constructor (so that the encoding is detected from the byte order mark); if it is ANSI, you should always be aware that your default code page will be used with Encoding.Default.
– Wiktor Stribiżew, Commented Jun 23, 2016 at 12:11
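The encoding remark above can be illustrated with a minimal sketch. It assumes (this is not confirmed anywhere in the thread) that the file was saved in the Windows-1252 code page, where œ is the single byte 0x9C, and then read as UTF-8. The class and helper names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodingMismatchDemo
{
    // Hypothetical helper: decodes raw bytes as UTF-8, roughly what
    // File.ReadLines(path, Encoding.UTF8) does with the file contents.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // The negated block class used in the question and the answer.
    public static bool HasCharOutsideBlocks(string line)
    {
        return Regex.IsMatch(line,
            @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");
    }

    public static void Main()
    {
        // "6420œ50" as Windows-1252 bytes (assumed encoding): œ is 0x9C.
        byte[] ansiBytes = { 0x36, 0x34, 0x32, 0x30, 0x9C, 0x35, 0x30 };
        string decoded = DecodeAsUtf8(ansiBytes);

        // A lone 0x9C is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(decoded.IndexOf('\uFFFD') >= 0);

        // U+FFFD lies outside the three allowed blocks, so the row is flagged.
        Console.WriteLine(HasCharOutsideBlocks(decoded));
    }
}
```

If that assumption holds, both lines print True: the œ never reaches the regex as U+0153, which would be one possible reason a row with a LatinExtended-A character is still reported.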

Wiktor Stribiżew · Accepted Answer · 2016-06-23 10:49:38Z

You just need to put the Unicode block classes into a negated character class:

if (Regex.IsMatch(colCount.line, @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]")) { /* Do sth here */ }

This regex will find partial matches (since Regex.IsMatch finds pattern matches inside larger strings). The pattern matches any character outside the \p{IsBasicLatin}, \p{IsLatinExtended-A} and \p{IsLatinExtended-B} Unicode blocks.
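As a rough check, the negated class can be run against the three sample rows from the question. This is only a sketch: the class and helper names are hypothetical, and the non-ASCII characters are written as escapes so the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeBlockDemo
{
    // Matches any single character outside the three allowed Unicode blocks.
    private static readonly Regex OutsideAllowedBlocks = new Regex(
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]");

    // Hypothetical helper name, not part of the original code.
    public static bool HasCharOutsideBlocks(string line)
    {
        return OutsideAllowedBlocks.IsMatch(line);
    }

    public static void Main()
    {
        string[] rows =
        {
            // ASCII only (Basic Latin).
            "264162-03,66,JITK,2007,12,874.000 ,0.000 ,0.000",
            // U+0153 (œ) belongs to Latin Extended-A.
            "6420\u015350-00,67,JITK,2007,12,2292.000 ,0.000 ,0.000",
            // U+00A5 (¥) belongs to Latin-1 Supplement, which is not in the class.
            "4804\u00A575-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000"
        };

        foreach (string row in rows)
        {
            Console.WriteLine(HasCharOutsideBlocks(row));
        }
    }
}
```

The first two rows report False and the third reports True: ¥ sits in the Latin-1 Supplement block (U+0080 to U+00FF), which none of the three classes covers, so that row is flagged by design.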

You may also want to check the following code:

if (Regex.IsMatch(colCount.line, @"^[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]*$")) { /* Do sth here */ }

This will return true if the whole colCount.line string contains no character from the three Unicode blocks specified in the negated character class, or if the string is empty (to exclude empty strings, replace * with + at the end).
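The difference between the * and + quantifiers in the anchored pattern can be sketched as follows. The class and helper names are hypothetical, and the Cyrillic sample is written with escapes (it spells фыв).

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeLineDemo
{
    // Negated class: any character outside the three Unicode blocks.
    private const string Outside =
        @"[^\p{IsBasicLatin}\p{IsLatinExtended-A}\p{IsLatinExtended-B}]";

    // Hypothetical helper: whole string consists of such characters, or is empty.
    public static bool OnlyOutsideOrEmpty(string line)
    {
        return Regex.IsMatch(line, "^" + Outside + "*$");
    }

    // Hypothetical helper: same check, but at least one character is required.
    public static bool OnlyOutsideNonEmpty(string line)
    {
        return Regex.IsMatch(line, "^" + Outside + "+$");
    }

    public static void Main()
    {
        string cyrillic = "\u0444\u044B\u0432"; // Cyrillic letters only
        string[] samples = { "", cyrillic, "aaa", cyrillic + ",aaa" };

        foreach (string s in samples)
        {
            Console.WriteLine("{0} {1}", OnlyOutsideOrEmpty(s), OnlyOutsideNonEmpty(s));
        }
    }
}
```

Only the empty string separates the two variants (True with *, False with +). A line that mixes Cyrillic with a comma or ASCII letters fails both, because every character must fall outside the three blocks.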

After modifying the regex, I put a LatinExtended-A character in one row, but the regex still matches that row. It should not match a row that contains a LatinExtended-A character.
Please share the string that matches the regex but should not. Also, I suggest using some Unicode converter (like this one) to check which Unicode block a character belongs to.
480Œ475-00,67,JITK,2007,12,1810.000 ,0.000 ,0.000: after 480 I put a LatinExtended-A character. This row should not match the regex, since the regex is supposed to ignore LatinExtended-A. I am using the list at en.wikipedia.org/wiki/Latin_script_in_Unicode to choose characters.
The string you showed above does not match. Neither does it match here. Also see IDEONE demo at ideone.com/SXCD6O.
The links you shared show the correct result, but I don't know why it is not working for me. I have updated the question and added the full code I have written.
