In grep command, can I change [:digit:] to [0-9]?

Question

Will grep [0-9] work in the same way as grep [:digit:]?

See stackoverflow.com/a/891741/4957508 and en.wikipedia.org/wiki/Numerals_in_Unicode
– Jeff Schaller ♦, Commented Apr 13, 2016 at 18:16

Guido · Accepted Answer · 2016-04-13 17:45:20Z

17

No, [0-9] is not the same as [:digit:].

[0-9] matches the numerals 0 to 9.

[:digit:] matches 0 to 9 and also numerals from non-Western scripts (e.g. Eastern Arabic).

edited Apr 13, 2016 at 17:45

answered Apr 13, 2016 at 17:37


Please, could you provide a specific name for, or a link (with a table) to, the (preferably) shortest such character set?
– agc, Commented Apr 13, 2016 at 17:50
Also, are you sure that '[0-9]' always matches only 0 to 9? Perhaps in some other character set, '[0-9]' pulls in other chars.
– agc, Commented Apr 13, 2016 at 17:55
@agc [0-9] matches the literal ASCII characters "0", "1", …, "9", the same way that [A-Z] matches ASCII characters "A" through "Z". These patterns are restricted to the ASCII character set by definition. On the other hand, [:digit:] designates a broader character class, which also includes Unicode characters for digits in other languages.
– Guido, Commented Apr 13, 2016 at 18:07
[charX-charY] implies low-level library code that looks up the character code for charX in the current character set and counts from there up to the code for charY. [A-Z] on an ASCII system matches 26 codes: {65, ..., 90}. On EBCDIC it matches 41 codes: {193, ..., 233}. [:upper:] would always match only 26 codes on ASCII, EBCDIC, etc. Thankfully, [0-9] matches 10 codes in both ASCII and EBCDIC; what's uncertain is whether any character sets exist (or shall exist) for which [0-9] matches more or fewer than 10 codes. If such a set exists, [:digit:] is useful.
– agc, Commented Apr 13, 2016 at 21:11
@Guido Actually, [A-Z] sometimes matches more than the ASCII letters: unix.stackexchange.com/questions/15980/… On the other hand, I think all extant locales have [0-9] match only the ASCII digits.
– Gilles 'SO- stop being evil', Commented Apr 13, 2016 at 22:33


muru · Accepted Answer · 2016-04-13 17:40:17Z

You can change [[:digit:]] to [0-9]; note that [:digit:] has to be inside […]. Whether the two are equivalent depends on the encoding of the input. If it is ASCII, I don't think there will be a problem. With other encodings, the digits may not be contiguous, or the byte range may be different. You could also miss special numbers in other writing systems.

abligh · Accepted Answer · 2016-04-14 05:11:53Z

To be precise, [0-9] is only guaranteed to be equivalent to [:digit:] if:

the regexp parser supports [:digit:] (if it does not, the existing [:digit:] probably doesn't do what you think it does), and
the input character set is one such as ASCII, where the only digits are the characters 0-9 and they are adjacent. This might not hold in (e.g.) Unicode, where the digits include characters other than 0-9, or in other 8-bit character sets where 0-9 may not be adjacent (as it happens, in EBCDIC the digits 0-9 are adjacent).
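
The first condition can be illustrated with a regex engine that has no POSIX character classes. The following is only a sketch, using Python's re module as a stand-in (my choice for illustration; grep is not involved). In such an engine, [:digit:] is simply a set of the literal characters ":", "d", "i", "g" and "t".

```python
import re

# Python's re module does not implement POSIX classes such as [:digit:],
# so the pattern below is an ordinary set of the characters : d i g t.
bare = re.compile(r"[:digit:]")
ascii_digits = re.compile(r"[0-9]")

samples = ["42", "dig", "a:b", "xyz"]
bare_hits = [s for s in samples if bare.search(s)]
digit_hits = [s for s in samples if ascii_digits.search(s)]

print(bare_hits)   # ['dig', 'a:b']
print(digit_hits)  # ['42']
```

The pattern never matches "42", which is why an unsupported or mis-bracketed class tends to go unnoticed until the results look wrong.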

Examples of Unicode exceptions are shown here. As you can see, the set of Unicode characters in the category 'Number, Decimal Digit' (Nd) includes rather more than the 10 ASCII digits matched by [0-9]; it includes Arabic-Indic, Extended Arabic-Indic, N'Ko, etc.

More information on numerals in Unicode can be found here.
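
To make the Unicode point concrete, here is a small sketch that collects the code points in category Nd with Python's unicodedata module. Python is used only for illustration. Whether a particular grep's [[:digit:]] matches any of these characters depends on the implementation and locale, which this sketch does not test.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is Nd
# ("Number, Decimal Digit").
nd_chars = [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Nd"]

# [0-9] is an explicit range over the ten ASCII digits only.
ascii_only = [ch for ch in nd_chars if re.fullmatch(r"[0-9]", ch)]

print(len(ascii_only))                   # 10
print(len(nd_chars) > 10)                # True
print(unicodedata.name("\u0663"))        # ARABIC-INDIC DIGIT THREE
print(re.fullmatch(r"[0-9]", "\u0663"))  # None
```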

The ASCII digits are adjacent and in order in all extant locales, so [0-9] is a safe way to match them.
– Gilles 'SO- stop being evil', Commented Apr 13, 2016 at 22:34
@Gilles they are adjacent in ASCII but not in all 8-bit character sets. And it is not true that all digits are adjacent in Unicode (as there are digits other than [0-9]).
– abligh, Commented Apr 14, 2016 at 5:10

agc · Accepted Answer · 2016-04-13 17:57:19Z

'[:digit:]' is theoretically more portable, the advantage being that it would not depend on one's local character set clumping digits all together.

Related example: with '[:upper:]' vs '[A-Z]' there's no difference in ASCII, but there is a difference on an old IBM EBCDIC system, where '[A-Z]' would span 41 chars, not 26 (EBCDIC codes 193-233), and would therefore also match EBCDIC "}" and "\", among others.
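
The EBCDIC figures can be checked with a short sketch. It assumes code page 037 (Python's cp037 codec) as the EBCDIC variant, which the answer above does not specify. Other EBCDIC code pages may differ in the non-letter positions.

```python
# Decode EBCDIC code points 193..233 (0xC1..0xE9), the span from "A" to "Z".
# cp037 is an assumed EBCDIC variant; the answer does not name one.
span = bytes(range(193, 234)).decode("cp037")

letters = [ch for ch in span if "A" <= ch <= "Z"]
extras = [ch for ch in span if not ("A" <= ch <= "Z")]

print(len(span))     # 41
print(len(letters))  # 26
print("}" in extras, "\\" in extras)  # True True

# The digits, by contrast, are contiguous in cp037 (0xF0..0xF9).
print(bytes(range(0xF0, 0xFA)).decode("cp037"))  # 0123456789
```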
