Timeline for Why is StringExpression faster than RegularExpression?
Current License: CC BY-SA 3.0
21 events
| when | what | action | by | license | comment |
|---|---|---|---|---|---|
| Apr 13, 2017 at 12:55 | history | edited | CommunityBot | replaced http://mathematica.stackexchange.com/ with https://mathematica.stackexchange.com/ | |
| Nov 26, 2014 at 7:20 | answer | added | m00nlight | timeline score: 13 | |
| May 23, 2013 at 7:53 | comment | added | Stefan | @Mr.Wizard After my first approach, which showed nearly the same performance, and after Albert's further observation, I came up with the following: `First@Timing[r3=StringCases[textBig,"(ICD-9-CM "~~x:RegularExpression[".+?"]~~")"->x];]`, which also performs similarly. So we may conclude that you can use RegularExpression in Mma, but you should avoid capturing/substitution if you want it to be fast; also, capturing inside a StringExpression context does not work at all. | |
| May 23, 2013 at 0:58 | history | edited | Mr.Wizard | CC BY-SA 3.0 | added 298 characters in body |
| May 23, 2013 at 0:56 | comment | added | Mr.Wizard | @Albert Excellent observation! | |
| May 22, 2013 at 23:07 | history | tweeted | twitter.com/#!/StackMma/status/337344111206072320 | ||
| May 22, 2013 at 19:56 | comment | added | Albert Retey | I have no time to investigate this more closely, but I think the difference could well come from how the resubstitution of the matches is implemented. If you try the same thing without such substitutions, the runtime differences are marginal, e.g. `StringCases[textBig, RegularExpression["(?ms)\\(ICD-9-CM .+?\\)"]]` vs. `StringCases[textBig, Shortest["(ICD-9-CM " ~~ __ ~~ ")"]]` seem to be equally fast (or probably slow when compared to pcregrep :-) | |
| May 22, 2013 at 18:07 | comment | added | Mr.Wizard | @SEngstrom Actually I think it's the other way around; StringExpression can have programmatic conditions, etc., which is why the conversion with PatternConvert has additional fields (besides the regular expression). Of course in this simple case the additional fields are not used, but you can see it in ``Shortest["(ICD-9-CM " ~~ code__ ~~ ")" /; StringLength[code] < 7] // StringPattern`PatternConvert`` or ``Shortest["(ICD-9-CM " ~~ code__?LetterQ ~~ ")"] // StringPattern`PatternConvert`` | |
| May 22, 2013 at 18:02 | comment | added | SEngstrom | Could StringPattern be a subset of regular expressions and therefore allow better optimization? Just a guess and I can't think of a way to test it... | |
| May 22, 2013 at 17:40 | comment | added | Stefan | @Mr.Wizard During tracing I found that out as well :( ... I'm working on how to get the captured expressions... | |
| May 22, 2013 at 17:32 | comment | added | Mr.Wizard | @Stefan that is not quite the same operation as it is matching the entire (ICD-9-CM 268.9) section rather than just the number. Nevertheless there does seem to be something going on. | |
| May 22, 2013 at 17:06 | comment | added | Stefan | It seems that StringExpression does something magical with RegularExpression. Presumably it compiles/caches the pattern. | |
| May 22, 2013 at 17:05 | comment | added | Stefan | This helps a lot! `First@Timing[ r2 == StringCases[textBig, "" ~~ x : RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"] -> x];]` | |
| May 22, 2013 at 16:26 | comment | added | Stefan | @Szabolcs `?:` means clustering without capturing: you can group subpatterns within `(?:)`, but unlike `()` it doesn't create a backreference. | |
| May 22, 2013 at 16:24 | history | edited | Mr.Wizard | CC BY-SA 3.0 | deleted 4 characters in body |
| May 22, 2013 at 16:17 | comment | added | Szabolcs | I tried `time pcregrep --buffer-size=100000000 '(?ms)(?:(?ms)\(ICD-9-CM (.+?)\))' test.txt >/dev/null` with pcregrep 8.32. This doesn't replace, it only matches, so it may not be correct. It takes 0.09 s here. | |
| May 22, 2013 at 16:11 | comment | added | Szabolcs | Do you know what the ?: means? (I don't.) Maybe this is something worth mentioning to support then? | |
| May 22, 2013 at 16:06 | comment | added | Mr.Wizard | @Szabolcs It seems that has an effect, but not of the same magnitude. For example, nesting it ten times: ``re2 = Nest[ RegularExpression @ First @ StringPattern`PatternConvert[#] &, se, 10 ]`` yields a timing of 2.137 -- a minor increase, compared to the se/re difference. | |
| May 22, 2013 at 16:00 | comment | added | Szabolcs | The answer might be in ``StringPattern`PatternConvert[re]``... which is not the same as `re`. | |
| May 22, 2013 at 15:59 | comment | added | Szabolcs | I think you have v7. Just mentioning that it's the same in 9 too. | |
| May 22, 2013 at 15:52 | history | asked | Mr.Wizard | CC BY-SA 3.0 | |
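
Several comments in the timeline turn on the difference between matching the whole `(ICD-9-CM ...)` section, capturing only the code inside it, and grouping with `(?:...)` without capturing. The following is a minimal sketch of those three cases. It uses Python's `re` module rather than Mathematica, purely to illustrate the regex semantics; the sample text is hypothetical (the question's `textBig` is not shown here), and it says nothing about the timings discussed in the comments.

```python
import re

# Hypothetical sample text; it stands in for the question's textBig,
# which does not appear in this timeline.
text = "Rickets (ICD-9-CM 268.0) and\nvitamin D deficiency (ICD-9-CM 268.9)."

# No group: each match is the entire parenthesised section.
whole = re.findall(r"(?ms)\(ICD-9-CM .+?\)", text)

# Capturing group: findall returns only the captured code.
captured = re.findall(r"(?ms)\(ICD-9-CM (.+?)\)", text)

# Non-capturing group (?:...): the label is grouped but gets no group
# number, so the code is still the only captured group.
pattern = re.compile(r"(?ms)\((?:ICD-9-CM) (.+?)\)")
noncapturing = pattern.findall(text)

print(whole)
print(captured)
print(noncapturing, pattern.groups)
```

Here `pattern.groups` reports the number of capturing groups, which shows that `(?:...)` adds structure without adding a backreference.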