21 events
when what by license comment
Apr 13, 2017 at 12:55 history edited CommunityBot
replaced http://mathematica.stackexchange.com/ with https://mathematica.stackexchange.com/
Nov 26, 2014 at 7:20 answer added m00nlight timeline score: 13
May 23, 2013 at 7:53 comment added Stefan @Mr.Wizard After my first approach, which showed similar performance behaviour, and after Albert's further observation, I came up with the following: First@Timing[r3=StringCases[textBig,"(ICD-9-CM "~~x:RegularExpression[".+?"]~~")"->x];] which also shows similar performance. So we may conclude that you can use RegularExpression in Mma, but you should avoid capturing/substitution if you want it to be fast; in addition, capturing inside a StringExpression context does not work at all.
May 23, 2013 at 0:58 history edited Mr.Wizard CC BY-SA 3.0
added 298 characters in body
May 23, 2013 at 0:56 comment added Mr.Wizard @Albert Excellent observation!
May 22, 2013 at 23:07 history tweeted twitter.com/#!/StackMma/status/337344111206072320
May 22, 2013 at 19:56 comment added Albert Retey I have no time to investigate that more closely, but I think the difference could well come from how the resubstituting of the matches is implemented. If you try the same thing without such substitutions, then the runtime differences are marginal, e.g.: StringCases[textBig, RegularExpression["(?ms)\\(ICD-9-CM .+?\\)"]] vs. StringCases[textBig, Shortest["(ICD-9-CM " ~~ __ ~~ ")"]] seem to be equally fast (or probably slow when compared to pcregrep :-)
May 22, 2013 at 18:07 comment added Mr.Wizard @SEngstrom Actually I think it's the other way around; StringExpression can have programmatic conditions, etc., which is why the conversion with PatternConvert has additional fields (besides the regular expression). Of course in this simple case the additional fields are not used, but you can see it in Shortest["(ICD-9-CM " ~~ code__ ~~ ")" /; StringLength[code] < 7] // StringPattern`PatternConvert or Shortest["(ICD-9-CM " ~~ code__?LetterQ ~~ ")"] // StringPattern`PatternConvert
May 22, 2013 at 18:02 comment added SEngstrom Could StringPattern be a subset of regular expressions and therefore allow better optimization? Just a guess and I can't think of a way to test it...
May 22, 2013 at 17:40 comment added Stefan @Mr.Wizard While tracing I found that out as well :( ... I'm working on how to get the captured expressions.
May 22, 2013 at 17:32 comment added Mr.Wizard @Stefan that is not quite the same operation as it is matching the entire (ICD-9-CM 268.9) section rather than just the number. Nevertheless there does seem to be something going on.
May 22, 2013 at 17:06 comment added Stefan It seems that StringExpression does something magical with RegularExpression. Presumably it compiles/caches the pattern.
May 22, 2013 at 17:05 comment added Stefan This helps a lot! First@Timing[ r2 == StringCases[textBig, "" ~~ x : RegularExpression["(?ms)\\(ICD-9-CM (.+?)\\)"] -> x];]
May 22, 2013 at 16:26 comment added Stefan @Szabolcs ?: means clustering without capturing: you can group sub-patterns within (?:...), but unlike (...) it does not create a backreference.
May 22, 2013 at 16:24 history edited Mr.Wizard CC BY-SA 3.0
deleted 4 characters in body
May 22, 2013 at 16:17 comment added Szabolcs I tried time pcregrep --buffer-size=100000000 '(?ms)(?:(?ms)\(ICD-9-CM (.+?)\))' test.txt >/dev/null with pcregrep 8.32. This doesn't replace, it only matches, so it may not be correct. It takes 0.09 s here.
May 22, 2013 at 16:11 comment added Szabolcs Do you know what the ?: means? (I don't.) Maybe this is something worth mentioning to support then?
May 22, 2013 at 16:06 comment added Mr.Wizard @Szabolcs It seems that has an effect, but not of the same magnitude. For example, nesting it ten times: re2 = Nest[ RegularExpression @ First @ StringPattern`PatternConvert[#] &, se, 10 ] yields a timing of 2.137 -- a minor increase, compared to the se/re difference.
May 22, 2013 at 16:00 comment added Szabolcs The answer might be in StringPattern`PatternConvert[re]... which is not the same as re.
May 22, 2013 at 15:59 comment added Szabolcs I think you have v7. Just mentioning that it's the same in 9 too.
May 22, 2013 at 15:52 history asked Mr.Wizard CC BY-SA 3.0