Will grep [0-9] work in the same way as grep [:digit:]?
- Can you please explain a bit? – Pacifist, Apr 13, 2016 at 17:35
- See stackoverflow.com/a/891741/4957508 and en.wikipedia.org/wiki/Numerals_in_Unicode – Jeff Schaller, Apr 13, 2016 at 18:16
4 Answers
No, [0-9] is not the same as [:digit:].
[0-9] matches the numerals 0 to 9.
[:digit:] matches 0 to 9 and, depending on the regex engine and locale, may also match numerals from non-Western scripts (e.g. Eastern Arabic).
- Please, could you provide a specific name for, or a link (with a table) to, the (preferably) shortest such character set? – agc, Apr 13, 2016 at 17:50
- Also, are you sure that '[0-9]' always matches only 0 to 9? Perhaps in some other character set, '[0-9]' pulls in other chars. – agc, Apr 13, 2016 at 17:55
- @agc [0-9] matches the literal ASCII characters "0", "1", …, "9", the same way that [A-Z] matches ASCII characters "A" through "Z". These patterns are restricted to the ASCII character set by definition. On the other hand, [:digit:] designates a broader character class, which also includes Unicode characters for digits in other languages. – Guido, Apr 13, 2016 at 18:07
- [charX-charY] implies low level library code that looks up the character code for charX in the current character set, and counts from there up to the code for charY. [A-Z] on an ASCII system matches 26 codes: {65,...90}. EBCDIC matches 41 codes: {193,...233}. [:upper:] would always match only 26 codes on ASCII, EBCDIC, etc. Thankfully, [0-9] matches 10 codes with ASCII or EBCDIC codes -- what's uncertain is whether any character sets exist (or shall exist) for which [0-9] matches more than or less than 10 codes. If such a set exists, [:digit:] is useful. – agc, Apr 13, 2016 at 21:11
- @Guido Actually [A-Z] matches more than ASCII letters sometimes: unix.stackexchange.com/questions/15980/… On the other hand, I think all extant locales have [0-9] match only the ASCII digits. – Gilles 'SO- stop being evil', Apr 13, 2016 at 22:33
You can change [[:digit:]] to [0-9]; note that [:digit:] has to appear inside […]. Whether the two are equivalent depends on the encoding of the input. If it is ASCII, I don't think there will be a problem. With other encodings, the digits may not be contiguous, or the byte range may be different. You could also miss special numbers in other writing systems.
To be precise, [0-9] is only guaranteed to be equivalent to [:digit:] if:
- the regexp parser supports [:digit:] (i.e. if it does not, then the existing [:digit:] probably doesn't do what you think it does), and
- the input character set is one such as ASCII where the only digits are the characters 0-9 and they are adjacent. This might not be true in (e.g.) Unicode (where the digits may include characters other than the digits 0-9), or even in other 8-bit character sets where 0-9 may not be adjacent (as it happens, in EBCDIC the digits 0-9 are adjacent).
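As a rough illustration of the first condition: Python's `re` module is one parser that does not support POSIX classes, so it reads `[:digit:]` as an ordinary set containing the characters `:`, `d`, `i`, `g` and `t`. This is a sketch in Python rather than grep, and the sample strings are made up for the example.

```python
import re

# Python's re module has no POSIX character classes, so this pattern is
# just a set containing the characters ':', 'd', 'i', 'g' and 't'.
naive = re.compile("[:digit:]")

# The explicit range matches the ten ASCII digits.
ascii_digits = re.compile("[0-9]")

samples = ["room 42", "dig it", "xyz"]  # hypothetical inputs
naive_hits = [s for s in samples if naive.search(s)]
range_hits = [s for s in samples if ascii_digits.search(s)]

print(naive_hits)   # lines containing one of : d i g t
print(range_hits)   # lines containing an ASCII digit
```

Other engines behave differently here (some reject the bare `[:digit:]` outright), so it is worth checking what a given tool actually does with the pattern.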
Examples of Unicode exceptions are shown here. As you can see, the set of Unicode characters in the category 'Number, Decimal Digit' includes rather more than the 10 ASCII digits matched by [0-9]; it includes Arabic-Indic, Extended Arabic-Indic, NKo, etc.
More information on numerals in Unicode can be found here.
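The size of that Unicode category can be checked with a short Python sketch using the standard `unicodedata` module. Note the assumption: Python's `\d` stands in here for a Unicode-aware digit class. It is not grep's [:digit:], and whether a particular grep matches these characters depends on its implementation and locale.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category is "Nd"
# (Number, Decimal Digit).
decimal_digits = [
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
]

# Of those, keep only the ones the plain range [0-9] accepts.
ascii_only = [ch for ch in decimal_digits if re.fullmatch("[0-9]", ch)]

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(len(decimal_digits))   # depends on the Unicode version in use
print("".join(ascii_only))   # the ten ASCII digits
print(unicodedata.name(arabic_indic_three))
print(re.fullmatch("[0-9]", arabic_indic_three) is None)     # range rejects it
print(re.fullmatch(r"\d", arabic_indic_three) is not None)   # \d accepts it
```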
- The ASCII digits are adjacent and in order in all extant locales, so [0-9] is a safe way to match them. – Gilles 'SO- stop being evil', Apr 13, 2016 at 22:34
- @Gilles they are adjacent in ASCII but not in all 8 bit character sets. And it is not true that all digits are adjacent in Unicode (as there are digits other than [0-9]) – abligh, Apr 14, 2016 at 5:10
'[:digit:]' is theoretically more portable, the advantage being that it would not depend on one's local character set clumping digits all together.
Related example: with '[:upper:]' vs '[A-Z]' there's no difference in ASCII, but there is a difference on an old IBM EBCDIC system, where '[A-Z]' would span 41 codes rather than 26 (EBCDIC codes 193-233) and would therefore also match non-letters such as "}" and "\".
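The EBCDIC figures can be reproduced with a small Python sketch. It assumes the `cp037` codec bundled with Python as a representative EBCDIC code page; other EBCDIC variants may place punctuation differently.

```python
# cp037 is one EBCDIC code page shipped with Python; it is used here as
# an assumed stand-in for "an old IBM EBCDIC system".
CODEC = "cp037"

first = "A".encode(CODEC)[0]
last = "Z".encode(CODEC)[0]

# Every character whose code lies between the codes of "A" and "Z".
span = bytes(range(first, last + 1)).decode(CODEC)
letters = [ch for ch in span if "A" <= ch <= "Z"]
extras = [ch for ch in span if not ("A" <= ch <= "Z")]

# Codes of the ten digits, to check that they are contiguous.
digit_codes = [ch.encode(CODEC)[0] for ch in "0123456789"]

print(first, last, len(span), len(letters))
print("}" in extras, "\\" in extras)
print(digit_codes == list(range(digit_codes[0], digit_codes[0] + 10)))
```

The letters fall into three separate runs in this code page, which is why a code-based range from "A" to "Z" picks up extra characters, while the digits sit in one unbroken run.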