13

I have only needed regular expressions a few times in my work. However, in those few times I discovered a very powerful form of expression that enabled me to do some extremely useful things.

The problem is that the language used for regular expressions is wrong - full stop.

It is wrong from a psychological point of view: disembodied symbols provide a useful reference only to those with an eidetic memory. While the syntactic rules are clearly laid out, in my experience and in what I have learnt from others, evolving a regular expression that works correctly is difficult in all but the most trivial situations. This is understandable, since a regex is a symbolic analogue of set theory, which is a fairly complicated thing.

One of the things that can prove difficult is decomposing the expression you are working on into its discrete parts. Because of the nature of the language, a regular expression can be read in multiple ways if you don't know its primary goal, so interpreting other people's regexes is complicated. In the study of natural language, I believe this is called pragmatics.

The question I'd like to ask then is this - is there such a thing as a regular expression compiler? Or can one even be built?

Metaphorically, regexes could be considered a kind of assembly language - there are some similarities. Could a compiler be designed that turns a more natural, higher-level language into regular expressions? Then in my code I could define my regexes in that higher-level language in a header file and reference them where necessary by a symbolic name. I and others could refer from my code to the header file and more easily appreciate what I am trying to achieve with my regexes.
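Even without a dedicated compiler, part of the "define once, reference symbolically" idea can be approximated by composing named pattern strings in ordinary code. A minimal Python sketch (all names here are hypothetical, not an existing library):

```python
import re

# Patterns defined once, in named, composed form, then referenced
# symbolically wherever they are needed. Names are illustrative.
DIGITS  = r"[0-9]+"
DECIMAL = rf"{DIGITS}\.{DIGITS}"
NUMBER  = rf"(?:{DECIMAL}|{DIGITS})"

print(re.findall(NUMBER, "pi is 3.14, answer is 42"))  # ['3.14', '42']
```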

I know it can be done from a logical point of view - otherwise computers wouldn't be possible - but if you have read this far, would you consider investing the time in realising it?

4
  • 5
    I find Regular expressions fairly easy to read. Commented Nov 4, 2009 at 14:56
  • 1
    I think the ambiguity of a natural language might add to rather than ease complications. Regex does seem daunting early on, especially things like backtracking and non-greedy operators. However, having recently relearned regex, it took me an evening of intense study to get most of it. Still, I'm sure someone with more experience could easily out-regex me. Like anything worth knowing, it comes down to practice and persistence. Commented Mar 14, 2014 at 6:57
    Andrea Ambu's answer is a great help for anyone who has problems with regex. It's sort of what I meant, but I really hoped there might be a library of macros or similar for various programming languages that could define textual searches with a combinatorial syntax. People have different mental models, and some programmers, myself included, find it virtually impossible to work with regex, although there are third parties available at a small cost to write them. If you work with regex regularly it is easier, but when it is only the occasional project it is difficult. Commented Mar 15, 2016 at 14:19
    You can build a regex compiler, of course. Thing is, it will take a long time and a lot of effort. It's very hard work, and keep in mind all the optimizations that the current engines implement for you, which in this case you will have to figure out :) I tell you because I'm actually writing a regex compiler. I am doing it for fun and to learn about compilers, parsing and NFAs, but it is not simple! Commented Jul 23, 2016 at 23:06

14 Answers 14

12

1) Perl permits the /x switch on regular expressions to enable comments and whitespace to be included inside the regex itself. This makes it possible to spread a complex regex over several lines, using indentation to indicate block structure.
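For comparison, Python exposes the same facility through the re.VERBOSE flag, which ignores whitespace in the pattern and treats # as starting a comment. A small sketch (the pattern name is illustrative):

```python
import re

# Python's analogue of Perl's /x: re.VERBOSE lets the regex be spread over
# several lines, with indentation and per-line comments.
QUOTED_STRING = re.compile(r"""
    ^
    "            # opening quote
    (?:
        [^\\"]   # any character except backslash or quote
      | \\.      # or a backslash-escaped character
    )*
    "            # closing quote
""", re.VERBOSE)

print(bool(QUOTED_STRING.match(r'"hello \"world\""')))  # True
```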

2) If you don't like the line-noise-resembling symbols themselves, it's not too hard to write your own functions that build regular expressions. E.g. in Perl:

sub at_start       { '^'; }
sub at_end         { '$'; }
sub any            { "."; }
sub zero_or_more   { "(?:$_[0])*"; }
sub one_or_more    { "(?:$_[0])+"; }
sub optional       { "(?:$_[0])?"; }
sub remember       { "($_[0])"; }
sub one_of         { "(?:" . join("|", @_) . ")"; }
sub in_charset     { "[$_[0]]"; }   # I know it's broken for ']'...
sub not_in_charset { "[^$_[0]]"; }  # I know it's broken for ']'...

Then e.g. a regex to match a quoted string (/^"(?:[^\\"]|\\.)*"/) becomes:

at_start . '"' .
zero_or_more(
    one_of(
        not_in_charset('\\\\"'),  # Yuck, 2 levels of escaping required
        '\\\\' . any
    )
) . '"'

Using this "string-building functions" strategy lends itself to expressing useful building blocks as functions (e.g. the above regex could be stored in a function called quoted_string(), you might have other functions for reliably matching any numeric value, an email address, etc.).
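The same string-building strategy translates directly to other languages. A minimal Python sketch (the helper names mirror the Perl ones above and are illustrative, not an existing library):

```python
import re

# Regex built from small named helper functions instead of raw symbols.
AT_START = "^"
ANY      = "."

def zero_or_more(p):   return f"(?:{p})*"
def one_of(*ps):       return "(?:" + "|".join(ps) + ")"
def not_in_charset(s): return f"[^{s}]"  # still broken for ']', as above

quoted_string = AT_START + '"' + zero_or_more(
    one_of(
        not_in_charset(r'\\"'),  # two levels of escaping, as in the Perl version
        r"\\" + ANY,
    )
) + '"'

print(bool(re.match(quoted_string, r'"a \"b\""')))  # True
```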


Comments

5

I have never stumbled across anything like that, and I don't think it would be useful.

That higher-level language would be very verbose and my guess is that you'd need pretty long statements to come up with a regular expression of average complexity.

Maybe you just haven't been using regular expressions often enough. Believe me, my memory is far from being eidetic (or even good), but I rarely have problems crafting regular expressions or understanding those of my coworkers.

Comments

5

What about writing them with RegexBuddy and pasting the description it generates as a comment in your code?

1 Comment

+1: regex is extremely hard to read, but this is a tooling issue, not a language issue
5

Regular Expressions (well, "real" regular expressions, none of that modern stuff ;) are finite state machines. Therefore, you can create a syntax that describes a regular expression in terms of states, edges, input and possibly output labels. The fsmtools [webarchive] of AT&T support something like that, but they are far from a tool ready for everyday use.
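To make the states-and-edges view concrete, here is a hand-rolled sketch in Python of the DFA for the regular expression (ab)*, written directly as a transition table (this is illustrative only, not the fsmtools or XFST syntax):

```python
# DFA for (ab)*: two states, two edges, state 0 accepting.
START = 0
TRANSITIONS = {
    (0, "a"): 1,  # saw 'a', now expecting 'b'
    (1, "b"): 0,  # completed one 'ab'; back to the accepting state
}
ACCEPTING = {0}

def matches(s):
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no edge for this input: reject
            return False
    return state in ACCEPTING

print(matches("abab"), matches("aba"))  # True False
```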

The language in XFST [webarchive], the Xerox finite state toolkit, is also more verbose.

Apart from that, I'd say that if your regular expression becomes too complex, you should move on to something with more expressive power.

Comments

3

There are ways to make REs in their usual form more readable (such as the perl /x syntax), and several much wordier languages for expressing them. See:

I note, however, that a lot of old hands don't seem to like them.

There is no fundamental reason you couldn't write a compiler for a wordy RE language targeting a compact one, but I don't see any great advantage in it. If you like the wordy form, just use it.

Comments

3

XML Schema's "content model" is an example of what you want.

c(a|d)+r 

can be expressed as a content model in XML Schema as:

<sequence>
  <element name="c" type="xs:string"/>
  <choice minOccurs="1" maxOccurs="unbounded">
    <element name="a" type="xs:string"/>
    <element name="d" type="xs:string"/>
  </choice>
  <element name="r" type="xs:string"/>
</sequence>

Relax NG has another way to express the same idea. It doesn't have to be an XML format itself (Relax NG also has an equivalent non-XML syntax).

The readability of regex is lowered by all the escaping necessary, and a format like the above reduces the need for that. Regex readability is also lowered when the regex becomes complex, because there is no systematic way to compose larger regular expressions from smaller ones (though you can concatenate strings). Modularity usually helps. But for me, the shorter syntax is tremendously easier to read (I often convert XML Schema content models into regex to help me work with them).
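As a quick sanity check that the content model shown above and the original regex describe the same language, the regex form can be exercised directly (the test strings, Lisp-style accessor names, are illustrative):

```python
import re

# c(a|d)+r: 'c', one or more of 'a' or 'd', then 'r'.
pat = re.compile(r"^c(a|d)+r$")
print([s for s in ("car", "cdr", "caddr", "cr", "cad") if pat.match(s)])
# ['car', 'cdr', 'caddr']
```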

Comments

2

One way you can bypass this problem is by using programs like QuickREx, which show how a regex works on multiple pieces of test data (with highlights). You could save the test data in a file near your regex, and later, when you want to change, understand or fix it, that would be much easier.

Comments

1

I see a lot of answers trying to solve the problem, but I think I have an answer for you.

I believe the entire regex syntax dates back at least to the late 70s. (I wish I could find some kind of history of the subject.) I picked up a 1979 book on automata, and the entire book is filled with mathematical proofs about finding patterns in text. I will get the title when I get home and update it here.

The thing is that this book used some very complicated symbols, related to calculus, such that if I had not taken such a class I would not have been able to understand it. I bet, however, that a mathematician who regularly uses this notation could read it like a novel.

It took me a good month to get a handle on how to read regular expressions, to the point where I just need to glance at one. To the lay person it looks like complicated asm with all these weird symbols in it. I don't consider regular expressions to be assembly; they are a mathematical formula for finding patterns in text. Considering the syntax, and that it came originally from mathematicians, I don't think that's far off.

So as for a compiler, I doubt there can ever be such a thing. As dmckee mentioned, "I note, however, that a lot of old hands don't seem to like them." Cartoons and sitcoms depict complicated math equations on whiteboards as a joke about how hard a certain subject is, but in reality anyone with experience could understand it, given the subtext and a bit of training. Regex isn't hard. Once you get the basics down, it just boils down to the particular parser you're using. It's like some kids telling me they don't want to learn C/C++ because it's harder than JavaScript, even though it has similar syntax. It's perception rather than difficulty.

Once you have learned regex, it's the engines that give you issues. Visual Studio uses brackets instead of parentheses for grouping. The simple regex library SLRE that I use supports a small subset versus PCRE's more complete syntax. At this point we are talking about a new language rather than a tool for text matching.

Also, most programmers use a single short line for their regex matches rather than building a full regex grammar, because they just want to parse some random data. Regex matching is a tool, like Bison, yacc or ANTLR. A hand-built parser will always be better, so in essence you can compile your own regex - why spend the time on two pages of code for a regex match when a simple ANSI C while loop is faster?

If you want regex to be more dynamic and readable, it's better to build your parser in the native language you're using for your program. Regex is meant to be a tool, not a full-fledged language.
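A minimal sketch of that suggestion: replacing a small regex with a plain scanning loop in the host language. This hand-rolled function is the equivalent of searching with [0-9]+ (the function name is illustrative):

```python
# Find the first run of digits in a string without using a regex engine:
# skip non-digits, then collect the digit run.
def first_number(s):
    i = 0
    while i < len(s) and not s[i].isdigit():
        i += 1
    start = i
    while i < len(s) and s[i].isdigit():
        i += 1
    return s[start:i] if i > start else None

print(first_number("order #4521 shipped"))  # 4521
```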

As a side note, look at some of the Lua source code between Lua 3.0 and 3.2.2. They changed from a Bison parser to a hand-built one. You realize how much more freedom they gained compared with using a tool for their text parsing, especially in later feature releases. Of course, it also makes the code more complicated to keep updated. It was a choice between the clarity of the *.y files and the robustness of being hand-built.

Comments

1

Perhaps some JavaScript tools can help:

Sadly, I have not yet found any ready-to-use "point and click" JS tool to easily build and manipulate RegEx. The power of RegEx (PCRE, POSIX, Python) is that they

  • are extremely compact (one can argue quite too compact)
  • can be used nearly everywhere
  • always look the same (one awkward size fits all) and therefore are easy to spot in code

So reinventing the wheel perhaps isn't the best choice, and regular expressions are internally compiled already to speed things up a lot. If you are looking for something more elaborate, there are LEX and YACC (and their successors), but most of the time both are overkill compared to the simple way RegEx can be applied.

The following might be useful to others, but is not for Linux, so I was unable to test it:

If you find other good links, perhaps add them as a comment. I know this is a little bit of an SO abuse to request this, but it's so incredibly helpful. Thanks.

Comments

0

Have you considered using a parser generator (aka compiler compiler) such as ANTLR?

ANTLR also has some kind of IDE (ANTLR Works) where you can visualize/debug parsers.

On the other hand, a parser generator is not something to throw into your app in a few seconds like a regex - and it would also be total overkill for something like checking email address format.

For simple situations like that, maybe a better approach is just to write comments for your regex explaining what it does.

Comments

0

I agree that the line-noise syntax of regexps is a big problem, and frankly I don't understand why so many people accept or defend it; it's not human-readable.

Something you don't mention in your post, but which is almost as bad, is that nearly every language, editor, or tool has its own variation on regexp syntax. Some of them support POSIX syntax as it was defined so many years ago, some support Perl syntax as it is today. But many have their own independent ways of expressing things, or which characters are "special" (special characters is another topic) and which are not. What is escaped and what isn't. Etc. Not only is it difficult to read a regexp written for one language or tool, but even if you totally memorize the syntax rules for your favorite variation, they can trip you up in a different language, where {2,3} no longer means what you expect. It's truly a mess.
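The {2,3} example can be seen concretely. In Python/Perl/ERE, braces are an interval quantifier; in POSIX BREs (classic grep/sed), unescaped braces are literal text and the interval is written \{2,3\}. A small Python sketch contrasting the two readings:

```python
import re

# In Python, {2,3} is an interval quantifier, and \{...\} matches the
# braces as literal characters - the opposite convention from POSIX BREs.
print(bool(re.search(r"a{2,3}", "aaaa")))      # True: matches a run of a's
print(bool(re.search(r"a\{2,3\}", "a{2,3}")))  # True: escaped braces are literal
```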

Furthermore, I think there are many non-programmers who (if they knew it existed) would appreciate having a pattern-matching language they could use in everyday tools like Google or Microsoft Word. But there would need to be an easier syntax for it.

So, to answer your question, I have often thought of making some kind of cross-platform, cross-language, cross-everything library that would allow you to "translate" from any regexp syntax (be it Perl, or POSIX, or Emacs, etc) into any other regexp syntax. So that you wouldn't have to worry if Python regexps could do negative look-behind, or if character-class brackets should be escaped in an Emacs regexp. You could just memorize one syntax, then make a function call to get out the equivalent syntax for whatever you happened to be using.

From there, it could be extended with a new pattern-matching language, that would be a bit more verbose or at least more mnemonic. Something for people who don't want to spend half-an-hour studying a regexp to figure out what it does. (And people who think regexps are fine as they are have obviously never had to maintain anything they didn't write themselves, or they would understand the need for other people to be able to parse what they've written.)

Will I ever attempt such a beast? I don't know, it's been on my to-do list for a long time, and there are a lot of easier and more entertaining projects on there as well. But if you are contemplating something similar, let me know.

Comments

0

regular expression compiler:

ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/83/972/CS-TR-83-972.pdf

Comments

0

If you read the Dragon Book on compilers, it tells you to use regexes for lexical analysis when parsing your higher-level language. So regexes seem to be something lower-level. I use them very often in my daily frontend/backend development work, and yes, I have found some of them to be kind of cryptic. However, that doesn't make regex wrong; you can always write a new language if you don't like them, given that 1) you have the time, 2) you are willing to put in the proper effort, and 3) the force is strong within you :)

Comments

0

This is an old thread, but I feel compelled to comment. Regexes are definitely problematic for many tasks, such as extracting patterns of interest that span many lines of text, extracting content from context-free patterns that embed regular subpatterns, etc. The fact is that not much has changed in this area since the 70s with awk, grep, etc.

I have looked long and hard for a simple regular expression compiler that enables regular patterns to be expressed in terms of the full complement of operations that regular sets are closed under. My goal was to use it to build transducers that map regular sequences (e.g. a text stream) onto target effectors (simple functions that receive a single input symbol (e.g. a Unicode ordinal) and use it to effect some action on a target object that is bound to the transduction). A wide range of simple data-extraction tasks can be accomplished using a small handful of simple select/cut/copy/paste/clear effectors.

The sweet thing is that this enables a clear separation of concerns between syntax (explication of the source pattern) and semantics (target/effector implementations). The transduction runtime provides a transduction stack, enabling complex or context-free patterns to be expressed as a tree of simpler transducers, and an input stack, enabling called transducers to return meaningful signals to the caller. And the target/effector complex has full application access to the host runtime and RAM so much more can be accomplished than with straight text-to-text transducers.

I could find only one such compiler, and I convinced the author to publish it in an open source forum. You can find it here: https://github.com/ntozubod/ginr. It was originally written in the late 1980s and hasn't changed much since, but I've used it extensively without issues. It was once used to build a transducer with >1M transitions that transduced the prepress corpus (with proprietary markup) of the Oxford English Dictionary to SGML. As I said, I use it to define transducers, which I package and deploy for runtime use here: https://github.com/jrte/ribose (a WIP).

A couple of examples.

HelloWorld = (nil, out[`hello world`]); 
Fibonacci = (
   # ~r, ~q, ~p are sequences of 0s preset to empty string
   ^ (
      # fib(0): ~q <- 0
      ('0', select[`~q`] paste)
      # fib(n>1): cycle (~r) <- (~q) <- (~p) <- (~p)(~r), (~r) <- ^
      ('0', select[`~r`] cut[`~p`] select[`~p`] copy[`~q`] select[`~q`] cut[`~r`])*
   )?
   # (~q) is empty or selected and holds the result, so append nl and print result
   (nl, paste out stop)
);

(Fibonacci$(0,1 2)):prsseq;
(START)  0   [ select ~q paste ]                                      1
(START)  nl  [ paste out stop ]                                       (FINAL)
1        0   [ select ~r cut ~p select ~p copy ~q select ~q cut ~r ]  1
1        nl  [ paste out stop ]                                       (FINAL)

$ for n in '' 0 00 000 0000 00000 000000 0000000 00000000 000000000; do echo $n | jrte Fibonacci; done

0
0
00
000
00000
00000000
0000000000000
000000000000000000000
0000000000000000000000000000000000

Comments