How costly is case insensitive search as compared to case sensitive search?

Question

I don't think grep -i is exponentially (with respect to the number of characters to be grepped) more costly (time wise) than a normal grep because the run times are not too different.

But theoretically it should be. For example

egrep -i abc *

is equivalent to

egrep "abc|abC|aBc|aBC|Abc|AbC|ABc|ABC" *

How do utilities like grep avoid the exponential time in case insensitive queries? Is there a case insensitive comparison operator that Unix inherently supports that such utilities can use?

Just because the first is equivalent to the second it doesn't mean that it "translates to" that. — alex
– alex, Commented May 23, 2011 at 15:37
How capital and small letters are mapped depends on the locale and in some locales this function is partial or even irreversible. German: capitalized "ß" is "SS" (officially) and small again is "ss". And there is a discussion about capital "ß", but no one uses this. — Martin Sugioarto
– Martin Sugioarto, Commented Sep 13, 2016 at 18:44

muru · Accepted Answer · 2015-05-24 16:13:25Z

An i-match between abC and aBc can easily be done, if abC is transformed to lowercase (once), and every input like aBc is translated to lowercase too. Then normal matching.

But maybe it is done, just by ignoring some bits. 'A' is 65, and 'a' is 97. The difference is 32, a power of 2, so it can easily be masked. Even 'ä' (228) and 'Ä' (196) have a difference of 32, but I'm not sure whether it holds for all characters in extended ASCII.

This is a fair guess but would only work for fixed string case folding and therefore doesn't fit with the rest of the mechanism most efficient for general regular expression matching. — msw
– msw, Commented May 24, 2011 at 15:10

msw · Accepted Answer · 2011-05-24 15:06:29Z

grep like most regular expression engines translates the pattern that you specify into a deterministic finite-state automaton (DFA).

A common way of expressing case insensitivity is by using character classes for each alphabetic, therefore your example would be more like [aA][bB][cC]. Individual character class matches are often implemented as bit-set lookups where a bit set containing 1s in the positions corresponding to a and A is built at regular expression -> DFA compile time.

That means that to match [aA] the DFA need only take the value of the input character, use it as an index to the bit-set - which is an O(1) operation - so you don't see the combinatoric time explosion that your equivalent

"abc|abC|aBc|aBC|Abc|AbC|ABc|ABC"

would suggest. DFA construction from regular expressions is an application of "if you are willing to spend some time up front (DFA building) you can really save cycles later (DFA recognition).

Two things: Which grep? On linux grep, egrep, fgrep are often or always links t the same executable, but that hasn't traditionally been true. Egrep often used different algorithms than grep, fgrep always used to. See: Andrew Hume, "A tale of two greps", Software Practice and Experience, 18, #11, Nov. 1988. See Russ Cox Regular Expression Matching Can Be Simple And Fast (swtch.com/~rsc/regexp/regexp1.html) for timings of DFA and NFA based regexp matching. — user732
– user732, Commented May 24, 2011 at 15:57

Stack Exchange Network

How costly is case insensitive search as compared to case sensitive search?

2 Answers 2

You must log in to answer this question.

Hot Network Questions

How costly is case insensitive search as compared to case sensitive search?

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions