3
$\begingroup$

The following is a distilled-down version of a previous sprawling question of mine.

Suppose I place one million copies of the uppercase English alphabet (i.e., the uppercase ISO basic Latin alphabet) into a list alphabets:

charRange = CharacterRange["A", "Z"]; alphabets = Flatten[ConstantArray[charRange, 10^6]]; // AbsoluteTiming ScientificForm[Length[alphabets], DigitBlock -> 3] (* OUTPUT (Mathematica 14.0.0 for Microsoft Windows 64-bit): *) (* {0.245727, Null} *) (* 26,000,000 *) 

Now suppose I wish to generate a "pick list" that has the element True for each corresponding element in the list alphabets that is either "A" or "B", and that has the element False otherwise. I can accomplish this by mapping either StringMatchQ ("Method 1") or MatchQ ("Method 2") across the list alphabets:

(* METHOD 1: StringMatchQ, mapped *) boolList1 = Map[StringMatchQ[#, "A" | "B"] &, alphabets]; // AbsoluteTiming (* METHOD 2: MatchQ, mapped *) boolList2 = Map[MatchQ[#, "A" | "B"] &, alphabets]; // AbsoluteTiming boolList1 == boolList2 (* OUTPUT: *) (* {18.4239, Null} *) (* {6.59128, Null} *) (* True *) 

The MatchQ method ("Method 2") is approximately 2.8 times faster than the StringMatchQ method ("Method 1"). This is somewhat surprising to me.

  • Naively, I would've expected the StringMatchQ method to be faster since I assume that the pattern supplied to StringMatchQ must be a string pattern (a more restrictive condition), rather than any type of pattern (a less restrictive condition).
  • But in fact, my assumption is incorrect; according to the first two entries ("overloads") in the documentation for StringMatchQ, the pattern supplied to StringMatchQ can be a string pattern, a regular expression, or a list of patterns.

I realize that without being able to "look under the hood" of StringMatchQ and MatchQ and see how they are implemented in Wolfram language, it's difficult to gain much understanding of why a built-in function behaves the way it does, performance-wise. But can anyone speculate on possible (and reasonable) explanations for the performance behavior of StringMatchQ and MatchQ demonstrated above?

(Note: A much faster method to generate the "pick list" is to use the listable form of StringMatchQ, rather than using Map, as shown below. But my question does not pertain to this faster method.)

boolList1a = StringMatchQ[alphabets, "A" | "B"]; // AbsoluteTiming (* OUTPUT: *) (* {1.20825, Null} *) 
$\endgroup$
1
  • $\begingroup$ ps. you forgot to include charRange = CharacterRange["A", "Z"]; in your code. $\endgroup$ Commented Nov 27, 2024 at 23:59

2 Answers 2

6
$\begingroup$

But can anyone speculate on possible (and reasonable) explanations for the performance behavior

It seems this is related to handling of the alternative (Or) patterns in StringMatchQ vs. MatchQ

You can see this by comparing the timing when pattern is only one character, then now the timing is almost same. When add the Or |, now StringMatchQ slows down by a much larger amount compared to MatchQ.

So whatever the cause, is related to use Or | in the pattern with StringMatchQ.

charRange = CharacterRange["A", "Z"]; alphabets = Flatten[ConstantArray[charRange, 10^6]]; // AbsoluteTiming ScientificForm[Length[alphabets], DigitBlock -> 3] 

Now compare the timings

boolList1=Map[StringMatchQ[#,"A"]&,alphabets];//RepeatedTiming (* {6.48611,Null} *) boolList2 = Map[MatchQ[#, "A"] &, alphabets]; // RepeatedTiming (* {5.1013, Null} *) 

We see the timing is not that much different, even though MatchQ seems a little faster but by not by as much as when adding Or |.

Now add the alternative patterns and notice how much slower StringMatchQ now becomes

boolList1 = Map[StringMatchQ[#, "A" | "B"] &, alphabets]; // RepeatedTiming (* {17.6302, Null} *) boolList2 = Map[MatchQ[#, "A" | "B"] &, alphabets]; // RepeatedTiming (*{5.77427, Null}*) 

Adding more or's Alternatives does not affect the timing any more

boolList1 = Map[StringMatchQ[#, "A" | "B" | "C"] &, alphabets]; // RepeatedTiming (*{17.7701, Null}*) boolList2 = Map[MatchQ[#, "A" | "B" | "C"] &, alphabets]; // RepeatedTiming (*{6.02637, Null}*) 

So something in handling of alternative patterns for StringMatchQ is the cause.

Both StringMatchQ and MatchQ are kernel functions so can't look at the implementation.

Needs["GeneralUtilities`"]; PrintDefinitions@MatchQ 

enter image description here

PrintDefinitions@StringMatchQ 

enter image description here

When Mathematica one day becomes open source, will look more closely to find out :)

$\endgroup$
3
  • 4
    $\begingroup$ I think this may have been a speed optimization added to MatchQ a couple of years back. $\endgroup$ Commented Nov 28, 2024 at 4:22
  • $\begingroup$ nit: | is Alternatives, not Or. $\endgroup$ Commented Nov 28, 2024 at 20:30
  • $\begingroup$ @att thanks, fixed. changed all OR with | (i.e. alternative) in answer. $\endgroup$ Commented Nov 28, 2024 at 20:38
4
$\begingroup$

StringMatchQ is more general than MatchQ, for string expressions. A more specific function can be optimized. I mean StringMatchQ treats strings as composite, and MatchQ treats strings as atoms. Pattern-matching atoms seems easier to me. I don't know the internals of how Mathematica instantiates and stores strings, but some optimizations are possible. For instance the length of the string seems to part of the internal String data structure. Possibly the character data of some strings are stored in a way that allows multiple references, so that two references to the same string would match True by comparing their pointers to the raw data.

It seems that string length is used by MatchQ[]. If that tests False, then False is returned. This is inferred from the speed of MatchQ[] and StringLength[]. I have no other proof.

If the string length is the same, sometimes the test takes almost no time. I suggested above when the strings are the same, it could be because the strings point to the same raw character array. I have no proof of this.

When the strings differ early, MatchQ[] returns fast, and StringMatchQ[] seems (to take enough time) to process the whole string and pattern.

I'm ignoring possible issues with complicated patterns and just wish to complement the points @Nasser has made.

Examples

(* LARGE vs VERY LARGE --> False *) foo = ExampleData[{"Text", "AeneidEnglish"}]; murf = StringJoin[Table[foo, 10]]; (* StringLength[] does not depend on string length *) StringLength[foo] // RepeatedTiming StringLength[murf] // RepeatedTiming (* {1.08668*10^-7, 606071} *) (* {1.10919*10^-7, 6060710} *) MatchQ[foo, murf] // RepeatedTiming (* {1.18261*10^-7, False} *) StringMatchQ[foo, murf] // RepeatedTiming (* {0.00347659, False} *) 

(* LARGE, SAME --> True *) foo = ExampleData[{"Text", "AeneidEnglish"}]; murf = ExampleData[{"Text", "AeneidEnglish"}]; MatchQ[foo, murf] // RepeatedTiming (* Immediate: same ptr? *) (* {1.1821*10^-7, False} *) StringMatchQ[foo, murf] // RepeatedTiming (* 3000x slower *) (* {0.000379477, False} *) 

(* LARGE, MODIFIED --> True ; both slower (why?) *) foo = StringJoin["Hi! ", ExampleData[{"Text", "AeneidEnglish"}]]; murf = StringJoin["Hi! ", ExampleData[{"Text", "AeneidEnglish"}]]; StringLength[foo] == StringLength[murf] (* True *) MatchQ[foo, murf] // RepeatedTiming (* diff. ptr.? *) (* {0.0000149543, True} <-- done in C? *) StringMatchQ[foo, murf] // RepeatedTiming (* ~100x slower *) (* {0.0017701, True} <-- done in WL? *) 

(* LARGE, EARLY DIFF. --> False *) foo = ExampleData[{"Text", "AeneidEnglish"}]; murf = StringReverse@ExampleData[{"Text", "AeneidEnglish"}]; MatchQ[foo, murf] // RepeatedTiming (* {1.21304*10^-7, False} *) StringMatchQ[foo, murf] // RepeatedTiming (* 3000x slower *) (* {0.000378394, False} *) 

(* LARGE, MODIFIED --> False ; MatchQ[] fast *) foo = StringJoin["Hi! ", ExampleData[{"Text", "AeneidEnglish"}]]; murf = StringDrop[ StringJoin["Hi! ", ExampleData[{"Text", "AeneidEnglish"}]], 1]; MatchQ[foo, murf] // RepeatedTiming (* Unequal lengths? *) (* {1.16776*10^-7, False} *) StringMatchQ[foo, murf] // RepeatedTiming (* {0.000380289, False} *) 

(* SHORT, SAME LENGTH --> False *) foo = "ab"; murf = "ac"; MatchQ[foo, murf] // RepeatedTiming (* {1.20145*10^-7, False} *) StringMatchQ[foo, murf] // RepeatedTiming (* 40% slower *) (* {1.71088*10^-7, False} *) 
$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.