Revisions to Readable regular expressions without losing their power?

added 16 characters in body

edited Apr 23, 2022 at 22:42


The key to documenting the regular expression is documenting it. Far too often people toss in what appears to be line noise and leave it at that.

Within Perl, a single /x modifier tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a bracketed character class. That lets one lay the pattern out over several lines and document it.

The above regular expression would then become:

    $re = qr/
      ^\s*
      (?:
        (?:
          ([\d]+)\s*:\s*
        )?
        (?:
          ([\d]+)\s*:\s*
        )
      )?
      ([\d]+)
      (?:
        \s*[.,]\s*([\d]+)
      )?
      \s*$
    /x;

Yes, it's a bit consuming of vertical whitespace, though one could shorten it up without sacrificing too much readability.

And then, what the earlier regexp does is this: parse a string of numbers in the format 1:2:3.4, capturing each number, where spaces are allowed around the separators and only the third number (the 3) is required.

Looking at this regular expression one can see how it works (and doesn't work). In this case, this regex will match the string 1.

Similar approaches can be taken in other languages. In Python, the re.VERBOSE option does the same job.
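Java has a comparable switch in the Pattern.COMMENTS flag, which skips unescaped whitespace and treats # to the end of the line as a comment. The following is only a sketch of the same pattern written that way; the class name VerboseRegex and the field RE are made up for illustration, and the per-line comments are my reading of the pattern rather than anything from the original.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VerboseRegex {
    // Pattern.COMMENTS ignores unescaped whitespace in the pattern and
    // everything from '#' to the end of the line, so each piece can be
    // laid out and annotated. Each line must end in "\n" for that to work.
    public static final Pattern RE = Pattern.compile(
        "^\\s*                        # leading spaces\n" +
        "(?:                          # optional prefix\n" +
        "  (?: (\\d+) \\s*:\\s* )?    # group 1, optional\n" +
        "  (?: (\\d+) \\s*:\\s* )     # group 2\n" +
        ")?\n" +
        "(\\d+)                       # group 3, required\n" +
        "(?: \\s*[.,]\\s* (\\d+) )?   # group 4, optional\n" +
        "\\s*$",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = RE.matcher(" 1 : 2 : 3 . 4 ");
        if (m.matches()) {
            // Print the four captured numbers separated by spaces.
            System.out.println(m.group(1) + " " + m.group(2) + " "
                + m.group(3) + " " + m.group(4));
        }
    }
}
```

The trade-off is that any literal space or # in such a pattern has to be escaped, which is easy to forget when converting an existing regex.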

Perl 6 (the example above is Perl 5) takes this further with the concept of rules, which lead to even more powerful structures than PCRE: they provide access to other grammars (context-free and context-sensitive), not just regular and extended regular ones.

In Java (where this example draws from), one can use string concatenation to form the regex.

    Pattern re = Pattern.compile(
      "^\\s*"+
      "(?:"+
        "(?:"+
          "([\\d]+)\\s*:\\s*"+  // Capture group #1
        ")?"+
        "(?:"+
          "([\\d]+)\\s*:\\s*"+  // Capture group #2
        ")"+
      ")?"+ // First groups match 0 or 1 times
      "([\\d]+)"+ // Capture group #3
      "(?:\\s*[.,]\\s*([0-9]+))?"+ // Capture group #4 (0 or 1 times)
      "\\s*$"
    );
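One way to see what the pattern does and does not accept is to run a few inputs through it. The sketch below reuses the concatenated pattern; the class name NumberFields and the parse helper are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberFields {
    // The same pattern as above, built by string concatenation.
    public static final Pattern RE = Pattern.compile(
        "^\\s*" +
        "(?:" +
          "(?:" +
            "([\\d]+)\\s*:\\s*" +        // Capture group #1
          ")?" +
          "(?:" +
            "([\\d]+)\\s*:\\s*" +        // Capture group #2
          ")" +
        ")?" +                           // First groups match 0 or 1 times
        "([\\d]+)" +                     // Capture group #3
        "(?:\\s*[.,]\\s*([0-9]+))?" +    // Capture group #4 (0 or 1 times)
        "\\s*$"
    );

    // Hypothetical helper: returns the four capture groups in order
    // (null for a group that did not take part), or null if no match.
    public static String[] parse(String input) {
        Matcher m = RE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[4];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] samples = {"1:2:3.4", " 1 : 2 : 3 , 4 ", "2:3", "1", "1:"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + Arrays.toString(parse(s)));
        }
    }
}
```

Note that a two-number input such as 2:3 fills groups 2 and 3 and leaves group 1 empty, which is the kind of detail the spread-out layout makes easier to spot.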

Admittedly, this creates many more " in the string, possibly leading to some confusion there, but it can be more easily read (especially with syntax highlighting in most IDEs) and documented.

The key is recognizing the power and "write once" nature that regular expressions often fall into. Writing the code defensively to avoid this, so that the regular expression remains clear and understandable, is key. We format Java code for clarity - regular expressions are no different when the language gives you the option to do so.


Add language hints.


edited Oct 1, 2015 at 16:53

user40980


Fix perl code.


edited Apr 15, 2013 at 18:54

user40980



answered Apr 15, 2013 at 14:54

user40980
