39

Where can I find a reference for less regex search patterns?

I want to search a file with less using \d to find digits, but it does not seem to understand this wildcard. I tried to find a reference for less regex patterns but could not find anything, neither in the man pages nor on the Internet.


3 Answers

40

less's man page says:

 /pattern Search forward in the file for the N-th line containing the pattern. N defaults to 1. The pattern is a regular expression, as recognized by the regular expression library supplied by your system. 

So the accepted syntax may depend on your system. Off-hand, it seems to accept extended regular expressions on my Debian system; see regex(7) and "Why does my regular expression work in X but not in Y?"

\d is from Perl, and isn't supported by all regex engines. Use [0-9] or [[:digit:]] to match digits. (Their exact behaviour may depend on the locale.)

4
  • 2
    > as recognized by the regular expression library supplied by your system. < so… any way to direct less to libpcre? Commented Jan 28, 2021 at 20:54
  • I have Debian 10 according to lsb_release. If I run less on a file that contains a's and e's, and use the command /a|e/, less only highlights the a's. If I enter the command /a\|e/ I get pattern not found. This tends to support that less is accepting the basic syntax, according to man re_format. If I made a mistake in trying to invoke extended syntax, please let me know. Commented Jun 9, 2022 at 22:51
  • 2
    @cardiffspaceman, did you put the trailing / there? I don't think less expects that, so it'll look for the string e/ as the other alternative. It has to be extended regexes, since BRE doesn't have the alternation, and in BRE, a|e would look for that literal string Commented Jun 9, 2022 at 23:02
  • 2
    @likkachu leaving off the trailing / does make the command work as expected, extended RE. Commented Jun 15, 2022 at 1:09
20

The expressions supported by less are documented in the re_format(7) manual (man 7 re_format). That manual describes both the extended regular expressions and the basic regular expressions available on your system. The less utility understands extended regular expressions.

To match a digit, you would use [0-9] or [[:digit:]] (there's a slight difference as the former depends on the current locale). The \d pattern is a Perl-like regular expression (PCRE), not supported by less.

2
  • 1
    @DeeNewcum, with that edit, that answer becomes true in even narrower cases (less built with --with-regex=posix on OpenBSD, with the caveat that the [[:<:]], [[:>:]] won't work properly). See my answer for details. Commented Mar 3 at 7:06
  • @StéphaneChazelas I rolled it back, thanks. It passed beneath my radar. Commented Mar 3 at 7:07
1

If you run ./configure --help from within the source distribution of current versions of less, you'll see:

    [...]
    Optional Packages:
    [...]
      --with-regex=LIB  select regular expression library (LIB is one of
                        auto,none,gnu,pcre,pcre2,posix,regcmp,re_comp,regcomp,regcomp-local) [auto]
    [...]

With the default being auto, meaning it will use the first available on the system in this order:

What regexp flavour comes with that will depend on the API selected above and on the system and implementation of the API.

To find out what API less uses, less --version will tell you.
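As a rough sketch, the library name could be pulled out of the first line of that output with sed. The helper name less_regex_lib is made up for illustration, and it assumes the first line has the form `less 643 (GNU regular expressions)`; builds that word it differently would print nothing.

```shell
# Hypothetical helper (not part of less): print the regex library named on
# the first line of `less --version` output, read from standard input.
less_regex_lib() {
  # Look at line 1 only and keep the text between "(" and
  # " regular expressions)". Assumes the wording shown in the lead-in.
  sed -n '1s/.*(\(.*\) regular expressions).*/\1/p'
}
```

It would be used as `less --version | less_regex_lib`.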

Alternatively, when less is dynamically linked against the library that supplies the regex functions (that is, when their code is loaded from shared object files at run time rather than embedded in the less executable), you can use nm -D /path/to/less to list the external dynamic symbols less needs and look for the regexp functions among them. For instance:

    $ nm -D /bin/less | grep -E '\<re[g_]|pcre'
                     U re_compile_pattern@GLIBC_2.2.5
                     U regfree@GLIBC_2.2.5
                     U re_search@GLIBC_2.2.5
                     U re_set_syntax@GLIBC_2.2.5

This would indicate that this less executable was built with --with-regex=gnu (and indeed Debian builds less with --with-regex=gnu, see why).

That's confirmed by:

    $ less --version
    less 643 (GNU regular expressions)
    Copyright (C) 1984-2023 Mark Nudelman
    less comes with NO WARRANTY, to the extent permitted by law.
    For information about the terms of redistribution,
    see the file named README in the less distribution.
    Home page: https://greenwoodsoftware.com/less

Compare with a less built with the default options:

    $ nm -D less | grep -E '\<re[g_]|pcre'
                     U regcomp@GLIBC_2.2.5
                     U regexec@GLIBC_2.3.4
                     U regfree@GLIBC_2.2.5
    $ ./less --version
    less 643 (POSIX regular expressions)
    Copyright (C) 1984-2023 Mark Nudelman
    less comes with NO WARRANTY, to the extent permitted by law.
    For information about the terms of redistribution,
    see the file named README in the less distribution.
    Home page: https://greenwoodsoftware.com/less

Or with --with-regex=pcre2:

    $ nm -D less | grep -E '\<re[g_]|pcre'
                     U pcre2_code_free_8
                     U pcre2_compile_8
                     U pcre2_get_error_message_8
                     U pcre2_get_ovector_pointer_8
                     U pcre2_match_8
                     U pcre2_match_data_create_8
                     U pcre2_match_data_free_8
    $ ./less --version
    less 643 (PCRE2 regular expressions)
    Copyright (C) 1984-2023 Mark Nudelman
    less comes with NO WARRANTY, to the extent permitted by law.
    For information about the terms of redistribution,
    see the file named README in the less distribution.
    Home page: https://greenwoodsoftware.com/less
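The three symbol lists above suggest a simple way to guess the build option from nm output. This is only a sketch: guess_regex_api is a hypothetical name, it only knows the symbols shown in this answer plus pcre_compile for the older PCRE library, and a bare regcomp could in principle also come from a non-POSIX regcomp implementation, which it does not try to tell apart.

```shell
# Hypothetical helper: guess the regex API from `nm -D` output on stdin.
guess_regex_api() {
  syms=$(cat)
  case $syms in
    *pcre2_compile*)      echo pcre2 ;;    # PCRE2 library
    *pcre_compile*)       echo pcre ;;     # original PCRE library
    *re_compile_pattern*) echo gnu ;;      # GNU regex API
    *regcomp*)            echo posix ;;    # assumed to be POSIX regcomp()
    *)                    echo unknown ;;
  esac
}
```

It would be used as `nm -D /path/to/less | guess_regex_api`, and only makes sense for a dynamically linked less.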

On most systems, unless specified otherwise with --with-regex at build time, the API will most likely be the POSIX one with its regcomp()/regexec() functions, as that's the first on that list above and it should be available on all modern systems.

regcomp() is called with the REG_EXTENDED flag if supported by the regexp library, which would mean you get POSIX Extended regular expressions as implemented on the system.

That means you should have at least the extended regexp operators specified by some version or other of the POSIX standard (at least ^$.\()|{}*+? and bracket expressions, which have been there since the first POSIX.2 edition in 1992), and that they should work as specified. Note, though, that some of the advanced locale-dependent things in bracket expressions such as [[:class:]], [[=character-equivalence=]] and [[.collating-element.]] are not always implemented, especially on embedded systems.
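To see what the extended flavour buys you, alternation is a convenient probe. The sketch below uses grep as a stand-in, since grep -E takes POSIX extended regexps and plain grep takes basic ones; it says nothing about which library your less was built with. The sample function is just a throwaway helper.

```shell
# Four sample lines; the last one contains a literal "a|e".
sample() { printf 'a\ne\nx\na|e\n'; }

# ERE: | is alternation, so every line containing an a or an e matches.
ere_count=$(sample | grep -cE 'a|e')

# BRE: | is an ordinary character, so only the literal "a|e" line matches.
bre_count=$(sample | grep -c 'a|e')

echo "ERE: $ere_count, BRE: $bre_count"   # prints "ERE: 3, BRE: 1"
```

If /a|e in less highlights both the a's and the e's, it is behaving like the ERE case.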

Some implementations will support extensions over what POSIX specifies, including some borrowed from perl such as \s, \w, *?, or from ex such as \<, \>. Looking at the documentation of your system's regcomp() function should give you pointers as to where to find the documentation for the system's extended regexp syntax.

Beware that some, like \<, \> (or its BSD equivalent [[:<:]]/[[:>:]] or perl equivalent \b for word boundary), don't work well with the POSIX API. For instance echo ababcab | less +/'\<ab' on systems where \< is supported will highlight the first two abs¹.
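The effect can be imitated outside less. The sketch below assumes a grep that supports \< and -o (GNU grep does; neither is required by POSIX), and it only mimics the "search again on the text after the last match" behaviour by hand; it is not how less is implemented.

```shell
# Searching the whole line in one go: only the first "ab" starts a word.
whole=$(printf 'ababcab\n' | grep -o '\<ab' | wc -l)

# Restarting on what is left after that first match ("abcab"): the engine
# no longer sees the preceding "ab", so the remainder looks like a word start.
rest=$(printf 'abcab\n' | grep -o '\<ab' | wc -l)

echo "whole line: $((whole)) match, remainder alone: $((rest)) more"
```

With an engine that is handed only the remainder, the second ab is therefore reported as well, which matches the two highlights seen in less.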

Depending on the regexp API, the input may or may not be decoded as text as per the locale's charmap, or the API may only support single-byte charsets or only UTF-8 as a multi-byte charmap, so a . for instance may match not a whole multibyte character but each byte of its encoding.

less's / and ? search operator can also handle the first character(s) you type afterward specially, as indicated in its man page, including:

^R Don't interpret regular expression metacharacters; that is, do a simple textual comparison.

This also applies to printable characters such as @ or !, which you'd need to escape with a \ for them to be passed literally to the regexp engine.
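If the pattern is built by a script before being handed to less, the escaping can be automated. This is a minimal sketch: less_safe_pattern is a hypothetical name, and it only handles the two printable characters named above, not the other special leading characters the man page lists.

```shell
# Hypothetical helper: backslash-escape a leading "@" or "!" so that less
# does not treat it as a search modifier. Other patterns pass through as-is.
less_safe_pattern() {
  case $1 in
    [@!]*) printf '\\%s\n' "$1" ;;
    *)     printf '%s\n' "$1" ;;
  esac
}
```

For example, `less +/"$(less_safe_pattern '@home')" file` would search for the text @home rather than applying the @ modifier.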

To match an ASCII decimal digit:

  • [0123456789] should work with all APIs and regexp flavours that less may use,
  • \d will work with pcre/pcre2, including inside bracket expressions ((*UTF)(*UCP)\d or (*UTF)\p{Nd}² would also match all UTF-8 encoded characters classified as decimal digits by Unicode), and might also work in a few others such as ast-open's, as used by ksh93, which does provide a POSIX API (though generally not within bracket expressions). The POSIX extended regexp specification leaves \d unspecified as long as it's outside bracket expressions, allowing implementations to give it whatever special meaning they like.
  • [[:digit:]] would work in a few but depending on locale and/or system may match on some other decimal digits besides the English ASCII ones
  • [0-9] should match on all of 0123456789 but may also match on other characters that happen to sort between 0 and 9 in the locale, including some not normally classified as decimal digits such as 🆛.
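The three bracket forms can be compared on a small sample. This sketch again uses grep -E as a stand-in for an extended regexp engine, so the results for [[:digit:]] and [0-9] reflect your grep and locale rather than necessarily your less. The sample helper and the choice of an Arabic-Indic digit as the non-ASCII test character are mine.

```shell
# Three sample lines: one with ASCII digits, one without any digit, and one
# holding only ARABIC-INDIC DIGIT THREE (U+0663) as raw UTF-8 bytes.
sample() { printf 'x42\nabc\n\331\243\n'; }

# Explicit enumeration: matches the ASCII digits and nothing else.
ascii_only=$(sample | grep -cE '[0123456789]')

# Character class and range: may or may not also count the third line,
# depending on locale and implementation.
by_class=$(sample | grep -cE '[[:digit:]]')
by_range=$(sample | grep -cE '[0-9]')

echo "enumerated: $ascii_only, class: $by_class, range: $by_range"
```

Only the enumerated form is certain to report 1 here. If the other two counts differ on your system, that is the locale dependence described in the list above.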

¹ Using the GNU or PCRE(2) APIs would normally allow this kind of problem to be avoided. Unfortunately, the way less invokes them exhibits the problem there as well: when repeating a search, it passes them only the text after the last match, like it does (and has to do) with the POSIX API, instead of passing the same input and telling them where to start looking. That problem also affects PCRE look-behind operators (edit: now fixed).

² That (*UTF), which forces the input to be interpreted as UTF-8 encoded, is only needed for PCRE2 when in UTF-8 locales, not for PCRE. There was explicit code to cover for that for PCRE. Its omission for PCRE2 looks like a bug to me (edit: now fixed).

3
  • Is there a way of figuring out what regular expression library a given less executable was compiled with? Commented Mar 3 at 7:17
  • 1
    @Kusalananda, nm will help. I'll add a paragraph about that. Commented Mar 3 at 7:38
  • 1
    @Kusalananda or just less --version, see edit. Commented Mar 3 at 8:02
