
Which characters match the following regex:

^[a-zA-Z]$ 

Specifically, should characters with accents (e.g. á, è, ò) match this regex, or just the 26 lowercase and uppercase letters? I tried this in an online regex tester, and accented characters did not match.

When I tried it with grep on my Linux (Ubuntu) machine, it did match the accented characters, while rg (ripgrep) did not.


I can match accented characters with rg using [A-zÀ-ÿ]

But is there a way to exclude accented characters from matching in grep?

PS: This question talks about how accented chars can be included, not excluded.

    Standard text utilities like grep use the current locale. If your locale's collation order puts accented characters with the plain letters, then they are included. C is a standard locale that does NOT collate accented letters together, e.g. assuming your data is in UTF-8 (likely nowadays, especially if it contains non-ASCII):

        (echo e; echo é) | LANG=C.UTF-8 grep '[a-z]'

    Commented Jul 3, 2022 at 3:13

1 Answer

I think I found a solution for your problem. The issue is that there are many different flavours or dialects of regex: Basic (BRE), Extended (ERE), and Simple (SRE). grep also understands PCRE, which stands for Perl-Compatible Regular Expressions. Regex is a rabbit hole.

Solution

My solution, working on Ubuntu 24.04 LTS (Noble Numbat), uses precisely the Perl-flavoured stuff. Due to the collation issues cleverly mentioned in the comment by @dave_thompson_085, any solution that loosely addresses character ranges will not be portable.
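To illustrate the collation point, here is a minimal sketch of the locale-based route from the comment, applied to the regex from the question. It is an illustration under assumptions rather than the solution used below: it sets LC_ALL instead of LANG (LC_ALL takes precedence over the other locale variables), and the helper name ascii_letters_only is made up for this example.

```shell
# Hypothetical helper: keep only lines that consist of one ASCII letter.
# LC_ALL=C makes grep compare raw bytes, so locale collation cannot pull
# accented letters into the a-z or A-Z ranges.
ascii_letters_only() {
    LC_ALL=C grep '^[a-zA-Z]$'
}

# \303\241 is the UTF-8 encoding of "á", written in octal so that the
# script itself stays pure ASCII.
printf 'e\n\303\241\nE\n' | ascii_letters_only
```

On the three sample lines this should print only e and E, since neither byte of the encoded á falls inside the bracket ranges in the C locale.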

Here's the solution:

    $ echo eá | grep -v -P "[\x80-\xff]"
    $ echo á | grep -v -P "[\x80-\xff]"
    $ echo e | grep -v -P "[\x80-\xff]"
    e

How it works

  • -P: match Perl-style
  • "[\x80-\xff]": match any character whose code is above 127
  • -v: invert match

With these options, grep rejects every line that contains a character with a code from 128 to 255 (\x80-\xff). Control characters below 32 are not affected by this pattern. Note that this works at line granularity. For character granularity, a sed option could be devised.
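For the character-granularity case, a minimal sketch follows. It swaps in tr rather than sed, purely because the byte-range syntax is short; that choice and the helper name strip_non_ascii are assumptions of this example, not part of the solution above.

```shell
# Hypothetical helper: delete every byte in the range 0x80-0xFF
# (octal 200-377) and leave the rest of each line untouched.
# Every byte of a multi-byte UTF-8 character is >= 0x80, so an accented
# character is removed as a whole.
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377'
}

# "eá" in UTF-8, written with octal escapes; prints "e".
printf 'e\303\241\n' | strip_non_ascii
```

Unlike the grep -v approach, this keeps the line and drops only the offending characters.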

Other uses

Your use case was to filter echo output. Here's an option for files. I'm using https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt, a file with 691 lines, one word per line. Since it is French, you'll find plenty of accents and cedillas. Please add a newline at the end of this file in order to make the following counts consistent:

    $ cat -n stopwords-fr.txt | wc -l
    691
    $ cat -n stopwords-fr.txt | grep -P "^[\x00-\x7e]*$" | wc -l
    591
    $ cat -n stopwords-fr.txt | grep -v -P "^[\x00-\x7e]*$" | wc -l
    100
    $ cat stopwords-fr.txt | grep -v -P "[\x7f-\xff]" | wc -l
    591

Notice in this last run that commands #2 and #4 yield the same results, as they are opposite sides of the same operation. On #2, you're asking for all lines that are entirely ("^...$") composed of only ASCII (a positive match), whereas on #4 you're asking for all lines that have no character above or including \x7f (negative match).

This second case requires that you leave out both anchors (^ and $) and the repetition (*). Please refer to De Morgan's laws for why, for reviewing, or just for fun.
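The equivalence can be checked on a tiny sample without downloading anything. The sketch below is only an illustration: the four sample words and the variable names are made up, and it assumes a grep built with -P support. "Every character is ASCII" (positive form) should select the same lines as "no character is non-ASCII" (negative form).

```shell
# Build a small sample: two ASCII-only words and two accented ones
# ("été" and "ça", written with octal escapes for their UTF-8 bytes).
sample=$(mktemp)
printf 'le\n\303\251t\303\251\nou\n\303\247a\n' > "$sample"

# Positive form: the whole line consists of characters in \x00-\x7e.
positive=$(grep -c -P '^[\x00-\x7e]*$' "$sample")

# Negative form: discard any line holding a character in \x7f-\xff.
negative=$(grep -c -v -P '[\x7f-\xff]' "$sample")

echo "positive=$positive negative=$negative"
rm -f "$sample"
```

Both counts should come out as 2, matching the two ASCII-only words.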

Final notes

Control characters

You won't find many characters under 32 in ordinary text (32, or \x20, is the SPACE), the main exception being TAB (\x09). With this in mind, for the positive matches, you will find that [\x00-\x7e] and [\x20-\x7e] are largely interchangeable on tab-free input.

Really only letters

If you're really interested in letters only, you could further constrain the regular expression by replacing [\x20-\x7e] with [\x41-\x5a\x61-\x7a]. This means:

  • [\x41-\x5a]: UPPERCASE
  • [\x61-\x7a]: lowercase
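If you want to double-check those endpoints, the following sketch prints the hexadecimal code of each one. The helper name hex_of is made up for this example; it relies on the standard printf behaviour where an argument starting with a quote character is converted to the numeric code of the character that follows.

```shell
# Hypothetical helper: print the hexadecimal code of a single ASCII character.
hex_of() {
    printf '%x' "'$1"
}

# Show the four range endpoints in the same \xNN notation used in the pattern.
for ch in A Z a z; do
    printf '%s = \\x%s\n' "$ch" "$(hex_of "$ch")"
done
```

This should list A = \x41, Z = \x5a, a = \x61 and z = \x7a, one per line.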

Do not separate the two ranges with a | inside the square brackets: within a character class, | is a literal pipe rather than a logical OR, so [\x41-\x5a|\x61-\x7a] would also accept the | character. If you prefer explicit alternation for legibility, put it outside the brackets, as in ([\x41-\x5a]|[\x61-\x7a]).

Here are some runs:

    $ echo ASD | grep -P "^[\x41-\x5a\x61-\x7a]*$"
    ASD
    $ echo fgh | grep -P "^[\x41-\x5a\x61-\x7a]*$"
    fgh
    $ echo 123 | grep -P "^[\x41-\x5a\x61-\x7a]*$"
    $ echo ASDfgh | grep -P "^[\x41-\x5a\x61-\x7a]*$"
    ASDfgh
    $ echo "ASDfgh " | grep -P "^[\x41-\x5a\x61-\x7a]*$"
    $

Even SPACE will kill the line. If you want all letters plus SPACE, all you need to do is include the space inside the square brackets:

    $ echo "ASDfgh " | grep -P "^ [\x41-\x5a\x61-\x7a]*$"
    $ echo "ASDfgh " | grep -P "^[ \x41-\x5a\x61-\x7a]*$"
    ASDfgh
    $

Notice that the space has to go just inside the opening bracket, as in the second command. In the first command it sits before the bracket, so the pattern demands a leading space and the line is rejected.
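One caveat about | is worth checking: inside square brackets, PCRE treats it as a literal character, so a bracketed pattern that contains it also accepts lines containing a pipe. The sketch below compares the two spellings; the helper names are made up for this example, and it assumes a grep built with -P support.

```shell
# Hypothetical helpers wrapping the two spellings of the pattern.
# grep -c prints the number of matching lines; "|| true" hides grep's
# non-zero exit status when nothing matches, so a script run with
# "set -e" is not aborted.
letters_with_pipe() { grep -c -P '^[\x41-\x5a|\x61-\x7a]*$' || true; }
letters_only()      { grep -c -P '^[\x41-\x5a\x61-\x7a]*$' || true; }

echo 'ASD|fgh' | letters_with_pipe   # prints 1: "|" is accepted as a literal
echo 'ASD|fgh' | letters_only        # prints 0: "|" is rejected
```

For pure letters both spellings behave the same, which is why the difference is easy to miss.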

Happy grepping!
