How does pattern matching work in grep for accented characters (e.g. á, è, ò)

Question

Which characters match the following regex:

^[a-zA-Z]$

Specifically, should characters with accents (e.g. á, è, ò) match this regex, or just the 26 lowercase and 26 uppercase letters? I checked this in an online regex tester, and accented characters did not match.

When I tried it with grep on my Linux (Ubuntu) machine, it did match the accented characters, while rg (ripgrep) did not.

I can match accented characters with rg using [A-zÀ-ÿ]

But is there a way to exclude accented characters from matching in grep?

PS: This question talks about how accented chars can be included, not excluded.

Standard text utilities like grep use the current locale. If your locale's collation order puts accented characters with the plain letters, then they are included. C is a standard locale that does NOT collate accented letters together, e.g. assuming your data is in UTF-8 (likely nowadays, especially if it contains non-ASCII): (echo e; echo é) | LANG=C.UTF-8 grep '[a-z]'
– dave_thompson_085, Commented Jul 3, 2022 at 3:13
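
The locale effect described in this comment can be tried with a minimal sketch, assuming a POSIX shell and a standard grep. LC_ALL takes precedence over LANG, and the C locale compares ranges byte by byte, so accented letters fall outside [a-zA-Z]. The function name single_ascii_letter is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that consist of one ASCII letter.
# LC_ALL=C forces byte-wise ranges, so a-z and A-Z cover exactly 52 letters.
single_ascii_letter() {
    LC_ALL=C grep '^[a-zA-Z]$'
}

# "\303\251" is the UTF-8 encoding of é, written in octal so that the
# script itself stays pure ASCII.
printf 'e\n\303\251\nZ\n' | single_ascii_letter
```

For this sample the sketch should print e and Z but not é, because in the C locale é is seen as two bytes, neither of which is a letter.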

Ricardo · Accepted Answer · 2024-08-20 13:19:44Z

I think I found a solution for your problem. The issue is that there are many different flavours or dialects of regex: Basic (BRE), Extended (ERE), and Simple (SRE). grep also understands PCRE, which stands for Perl Compatible Regular Expressions. Regex is a rabbit hole.

Solution

My solution, working on Ubuntu 24.04 LTS (Noble Numbat), uses precisely the Perl-flavoured stuff. Due to the collation issues cleverly mentioned in the comment by @dave_thompson_085, any solution that loosely addresses character ranges will not be portable.

Here's the solution:

$ echo eá | grep -v -P "[\x80-\xff]"
$ echo á | grep -v -P "[\x80-\xff]"
$ echo e | grep -v -P "[\x80-\xff]"
e

How it works

-P: match Perl-style
"[\x80-\xff]": match characters with codes 128 to 255
-v: invert match

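With these options, grep rejects every line that contains a character in the \x80-\xff range, that is, anything beyond 7-bit ASCII up to code 255. In a UTF-8 locale the range is read as code points U+0080 to U+00FF, so characters beyond Latin-1 (for example ő) would still pass; running grep under LC_ALL=C makes the range byte-wise and catches every non-ASCII character. Note that this operates at line granularity. For character granularity, a sed option could be devised.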
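
For character granularity, one possible sketch follows. It swaps sed for tr, which deletes bytes directly, and it assumes UTF-8 input, where every non-ASCII character is encoded using only bytes in the 0x80-0xff range. The helper name strip_high_bytes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete every byte from 0x80 to 0xff (octal 200-377),
# leaving the ASCII part of each line untouched.
strip_high_bytes() {
    LC_ALL=C tr -d '\200-\377'
}

# "\303\251" is é in UTF-8; both of its bytes are removed.
printf 'caf\303\251 ol\303\251\n' | strip_high_bytes
```

This removes accented letters rather than replacing them with their unaccented forms, so "café olé" comes out as "caf ol".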

Other uses

Your use case was to filter on echo outputs. Here's an option for files. I'm using https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt, a file with 691 lines, one word per line. Being French, you'll find plenty of accents and cedillas. Please add a newline at the end of this file in order to make the following math consistent:

$ cat -n stopwords-fr.txt | wc -l
691
$ cat -n stopwords-fr.txt | grep -P "^[\x00-\x7e]*$" | wc -l
591
$ cat -n stopwords-fr.txt | grep -v -P "^[\x00-\x7e]*$" | wc -l
100
$ cat stopwords-fr.txt | grep -v -P "[\x7f-\xff]" | wc -l
591

Notice that commands #2 and #4 yield the same result, as they are two sides of the same operation. In #2, you're asking for all lines that are composed entirely ("^...$") of ASCII characters (a positive match), whereas in #4 you're asking for all lines that contain no character at or above \x7f (a negative match).

This second case requires that you leave out both anchors (^ and $) and the repetition (*). Please refer to De Morgan's laws for why, for a review, or just for fun.
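
The equivalence can be checked on a small sample with a sketch like the one below. It assumes GNU grep built with PCRE support and runs under LC_ALL=C so the ranges are byte-wise; the sample data and the function name sample are made up for illustration, and the split is placed at \x7f/\x80 (the ASCII boundary).

```shell
#!/bin/sh
# Hypothetical sample: two pure-ASCII lines and one line containing é
# (UTF-8 bytes \303\251, written in octal).
sample() {
    printf 'le\n\303\251t\303\251\nla\n'
}

# Positive form: every character on the line is ASCII.
positive=$(sample | LC_ALL=C grep -c -P '^[\x00-\x7f]*$')
# Negative form: drop lines containing any byte from 0x80 to 0xff.
negative=$(sample | LC_ALL=C grep -c -v -P '[\x80-\xff]')

echo "positive=$positive negative=$negative"
```

Both counts should agree (2 for this sample): "every character is ASCII" is the same statement as "no character is non-ASCII", which is why the negative form needs neither anchors nor repetition.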

Final notes

Control characters

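Characters below 32 (\x20 being the SPACE) are rare in plain text, with TAB (\x09) as the main exception; note that cat -n inserts one after each line number. With this in mind, for the positive matches, you will find that [\x00-\x7e] and [\x20-\x7e] are largely interchangeable on tab-free input.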
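
A minimal sketch of where the two ranges differ, assuming GNU grep with PCRE support. The helper name count_matching is hypothetical, and the input imitates one line of cat -n output (a number, a TAB, a word).

```shell
#!/bin/sh
# Hypothetical helper: count the lines made up only of characters in the
# PCRE bracket range given as $1 (for example '\x20-\x7e').
count_matching() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    LC_ALL=C grep -c -P "^[$1]*\$" || true
}

# The TAB (\t) is character 9, so only the first range accepts the line.
printf '1\tqui\n' | count_matching '\x00-\x7e'
printf '1\tqui\n' | count_matching '\x20-\x7e'
```

The first call should report 1 and the second 0, so the narrower range only behaves the same once tabs are out of the picture.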

Really only letters

If you're really interested in letters only, you could further constrain the regular expression by replacing [\x20-\x7e] with [\x41-\x5a\x61-\x7a]. This means:

[\x41-\x5a]: UPPERCASE
[\x61-\x7a]: lowercase

Do not separate the two segments with a |: inside square brackets it is not a logical OR but a literal character, so it would make the pattern accept | as well.

Here are some runs:

$ echo ASD | grep -P "^[\x41-\x5a\x61-\x7a]*$"
ASD
$ echo fgh | grep -P "^[\x41-\x5a\x61-\x7a]*$"
fgh
$ echo 123 | grep -P "^[\x41-\x5a\x61-\x7a]*$"
$ echo ASDfgh | grep -P "^[\x41-\x5a\x61-\x7a]*$"
ASDfgh
$ echo "ASDfgh " | grep -P "^[\x41-\x5a\x61-\x7a]*$"
$

Even a single SPACE will kill the line. If you want all letters plus SPACE, all you need to do is include the space inside the square brackets; placing it outside them does not work:

$ echo "ASDfgh " | grep -P "^ [\x41-\x5a\x61-\x7a]*$"
$ echo "ASDfgh " | grep -P "^[ \x41-\x5a\x61-\x7a]*$"
ASDfgh
$

In the second command the added space sits right after the opening bracket.
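
One caveat about | is worth checking: inside a bracket expression it is an ordinary character, not an alternation operator. The sketch below assumes GNU grep with PCRE support; the helper name line_matches and the test string are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the line in $2 matches the PCRE in $1.
line_matches() {
    printf '%s\n' "$2" | LC_ALL=C grep -q -P "$1"
}

with_pipe='^[\x41-\x5a|\x61-\x7a]*$'
without_pipe='^[\x41-\x5a\x61-\x7a]*$'

# The test string contains a literal | between the letters.
if line_matches "$with_pipe" 'AS|fg'; then
    echo "with |: accepted"
else
    echo "with |: rejected"
fi
if line_matches "$without_pipe" 'AS|fg'; then
    echo "without |: accepted"
else
    echo "without |: rejected"
fi
```

The bracket that contains | accepts the string, while the one without it rejects it, so the | should be left out when only letters are wanted.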

Happy grepping!
