RegExplanations

Happy, happy happy! I've finally grokked (at least momentarily) the connection between Finite State Automata and Regular Expressions ... and at last I begin to understand why some regexps with nested quantifiers are so incredibly slow, a puzzle that has pestered me for several years now.

But first back up to get a modicum of context: what's all this about?

Finite state automata (FSAs) are little artificial "machines", logical constructs which respond to a digital input by changing their internal configuration in a restricted way, according to fixed rules. They're kind of like tiny primitive creatures that can flip among a few modes of operation — hunting, eating, sleeping, etc. — based on external triggers in combination with their current status. "If you're hungry, start hunting"; "If you're hunting and catch something, eat it and stop being hungry"; "If you're not hungry, go to sleep"; "If you're sleeping and time passes, wake up and feel hungry"; and so forth.
Regular expressions (regexps) are patterns that describe strings of characters with various types of optional wild cards: "ABC" means the letter "A" followed by "B" followed by "C"; "[a-z]" means "any lower-case Latin alphabetic letter"; "Frog|Toad" means "either Frog or Toad"; "*" means "0 or more copies of whatever came before"; "." means "any single character, I don't care what"; etc. Regexps are thus a language, an extraordinarily useful one when you want to tell a computer what to do with the symbols that it receives. And regexps are fundamental tools used by the UNIX operating system, by the Awk and Perl programming languages, and by a variety of other clever interfaces between people and computational systems.

Now you know enough to go on. So we've got these little mathematical machines, FSAs ... and we've got this little logical language, regexps ... hmmm! Maybe they can work together? Maybe one could be translated into the other? Maybe they can jointly be used to solve complex and important problems with great speed? Yes to all of the above!

But heretofore I had only been told that FSAs and regexps were intimately linked; I had never seen an explicit explanation of how they cohabit. Textbook assertions like "It can be demonstrated ..." failed to scratch my itch.

Then, by sheer good fortune, Darren Neimke read a ZhurnalWiki note on a completely unrelated theme (SemioticArsenal, 20 Nov 2003) and posted a comment. That led me to Darren's http://regexblogs.com/ site where I found a thicket of good information on the technology of regular expressions ("regexes" to Darren, "regexen" to the affected — but traditarian that I am, still "regexps" to me). My thirst for understanding the FSA-regexp link was aroused from its long slumber.

So I asked Darren, and he kindly pointed me to the lovely essay by Mark-Jason Dominus, "How Regexes Work" (http://perl.plover.com/Regex/article.html). I read MJD's words and the scales fell from my eyes — much like the Eureka! moment of revelation that I experienced when I figured out how to solve a Rubik's Cube (see RubikCubism1, 16 Mar 2001).

And the other question that had nagged me? Some regexps are woefully inefficient — they seem to take forever to run. Why? A Perl programming book had cautioned me not to use nested wild-card quantifiers, but it didn't explain the sources of the danger. Dominus did, beautifully. It's all a matter of how Perl implements regexps as FSAs, and in particular the backtracking depth-first search strategy that Perl uses to match patterns. (see BreadthAndDepth, 11 Jun 1999)

In a letter to me Darren said that "... finite state machines sit about 4 layers of abstraction below my level of competency :-) ...". Me too! But it's important to occasionally go spelunking and get a glimpse of those deeper levels, just as it's important (and fun, for a physicist's personality anyway) to have at least some vague notion of how everything works — and thereby get a better feel for the limitations of tools that are built on those foundations. (see KnowHowAndFearNot, 19 Nov 1999)

And in another sense, maybe it's not down but up that the progression is going — to greater levels of abstraction, richer and more powerful conceptual spaces. (see CreativeDevices, 1 Jan 2001, and No Concepts At All, 22 Feb 2001, and Darren Neimke's "Death by Abstraction" within http://regexblogs.com/, 26 Nov 2003)

TopicProgramming - TopicScience - TopicLanguage - TopicPersonalHistory - 2003-12-03

(correlates: WorseIsBetter, WikiWord, DarrenNeimke, ...)

RegExplanations

1 person liked this