SnipPattern

 

A minor epiphany came to me a few days ago. I was starting to scratch my head about how to remove "stuff" from the top (or bottom) of a set of Wiki pages — tags, mark-up material, or obsolete hyperlinks (e.g., strings automatically created by CorrelOracle or the like). I needed to snip such material off so that I could re-process the textual core of each file.

How to do this? The Perl programming language offers fine pattern-matching facilities, so if I could specify a string that marked the end of a header (or the beginning of a footer) then, thinks I, perhaps a search routine could locate that marker, note its offset from the beginning of the file, and proceed to rewrite the file without the unwanted part before (or after) it. That's a typical way to solve such a problem in a linear array-oriented language like C or FORTRAN: scan, recognize, measure, cut. Like extracting a gene from a strand of DNA, or clipping a piece of cloth from a patterned fabric.

But as I turned the problem over in my mind, I suddenly realized that Perl offered a much simpler and better way: let the pattern itself specify what to delete. There's no need to count bytes or look for start/stop sequences. "Regular expression" patterns can include controlled wild cards and symbols for the beginning and end of a file. So by moving the work out of the program and into the pattern the job becomes almost trivial. If the pattern to snip out is called "$snip", then the one-line Perl command:

 $body =~ s/$snip//so;

finds $snip in the file's body and cuts it out. No tricky looping, byte offsets, or other overhead. (see SnipPattern01SourceCode for the full program, which has the framework needed to handle a whole directory of Wiki pages plus specific examples)
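The same idea can be sketched outside Perl. What follows is a minimal Python analogue, not the SnipPattern01SourceCode program: the sample page, the "----" separator lines, and the names `snip_header`, `snip_footer`, and `snip` are all assumptions made up for illustration. It shows the anchors doing the work: `\A` pins a pattern to the start of the text, `\Z` pins one to the end, and `re.DOTALL` plays the role of Perl's `/s` flag by letting `.` cross line breaks.

```python
import re

# Hypothetical page layout: the tag line, the "----" separators, and the
# trailing "(correlates: ...)" line are assumptions for illustration only.
page = (
    "TagOne TagTwo\n"
    "(see also: OldLink)\n"
    "----\n"
    "The textual core of the page.\n"
    "It spans several lines.\n"
    "----\n"
    "(correlates: PageA, PageB)\n"
)

# Header: from the start of the text (\A) through the first separator line.
# The non-greedy ".*?" stops at the earliest separator it can reach.
snip_header = re.compile(r"\A.*?^----\n", re.DOTALL | re.MULTILINE)

# Footer: from a separator line through the end of the text (\Z).
# Applied after the header is gone, so only one separator remains.
snip_footer = re.compile(r"^----\n.*\Z", re.DOTALL | re.MULTILINE)


def snip(body, pattern):
    """Cut the first match of pattern out of body; no offsets, no loops."""
    return pattern.sub("", body, count=1)


core = snip(snip(page, snip_header), snip_footer)
print(core, end="")
```

The footer pattern is deliberately naive: if the body itself contained a separator line, everything after it would be cut too. That is a limit of this sketch, and the fix would again belong in the pattern rather than in the program.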

This sort of productive aha! experience happens more frequently as one gets into the spirit of a programming language, or any other complex system. Beginners fight against constraints; experts leverage the strengths and avoid the weaknesses of their tools. A pencil can do things that a pen cannot, and vice versa. A spreadsheet doesn't make a very good word processor. A screwdriver isn't a chisel. High-extensibility languages, like FORTH or LISP, work best when they're used to transform tasks into simpler sub-problems and sub-sub-problems; high-efficiency mathematical languages like FORTRAN work best for deep numerical calculations; non-procedural languages like PROLOG work best when what to do is clear but how to do it requires a deep or subtle search. (see ResolutionAndUnification and StrandsOfTruth)

Another example: a few years ago I was playing around and trying to write some simple Awk programs to identify which human language (English, German, Spanish, etc.) various web pages were written in. My approach was simple: look for common words and award "points" to the languages in which those words appear most often. "THE" suggests English (though yes, it could mean "tea" in French); "DER", "DIE", and "DAS" imply German (though yes, "die" is a fine English word too); and so forth. I built my recognizer-patterns, saved them in a file, and wrote the Awk code to load them and apply them. Then I started running tests, at which point it became clear that my approach was incredibly, intolerably slow. Argggh!
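The point-scoring idea itself is easy to sketch. The Python below is not the original Awk code, and it sidesteps the speed problem by using set lookups instead of applying patterns one at a time. The cue-word lists and the names `CUE_WORDS` and `guess_language` are invented for illustration, and the lists are far too small for real use.

```python
import re
from collections import Counter

# Tiny, made-up cue-word lists (an assumption for this sketch); a usable
# recognizer would need many more words per language.
CUE_WORDS = {
    "English": {"the", "and", "of", "is"},
    "German": {"der", "die", "das", "und"},
    "Spanish": {"el", "la", "los", "y"},
}


def guess_language(text):
    """Award one point per cue word and return the top-scoring language."""
    # Split into runs of letters, ignoring digits, punctuation, underscores.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = Counter()
    for word in words:
        for language, cues in CUE_WORDS.items():
            if word in cues:
                scores[language] += 1
    if not scores:
        return None  # no cue word seen, so no guess
    return scores.most_common(1)[0][0]


print(guess_language("Der Hund und die Katze"))
print(guess_language("The end of the story is near"))
```

Ambiguous words such as "die" simply add a point to whichever lists contain them; with enough text the real language should usually pull ahead, though a tie is resolved arbitrarily here.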

My son Merle looked at what I had done and had an aha! moment: instead of interpreting my patterns, he saw that he could write a program to compile them into another, far more efficient, Awk program. Merle's tiny compiler (itself written in Awk) was a tool to build a tool. Obvious? Only after one gets into the spirit of pattern-transformation. (see AwesomelySimple and DoMeta)
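How Merle's compiler actually worked is not described here, so the following is only a guess at the general shape of the trick, written in Python rather than Awk. The pattern table `PATTERNS` and the function names are hypothetical. Instead of looping over the patterns for every word at run time, `compile_patterns` writes out the source text of a new function with the patterns baked in as literal comparisons, and that generated program is then loaded and run.

```python
# Hypothetical pattern table: (word, language) pairs standing in for the
# recognizer patterns that were saved in a file.
PATTERNS = [
    ("the", "English"),
    ("der", "German"),
    ("die", "German"),
    ("das", "German"),
    ("el", "Spanish"),
]


def compile_patterns(patterns):
    """Return Python source text for a score(words) function."""
    if not patterns:
        raise ValueError("need at least one pattern to compile")
    lines = [
        "def score(words):",
        "    points = {}",
        "    for w in words:",
    ]
    keyword = "if"
    for word, language in patterns:
        # Each pattern becomes one branch of an if/elif chain.
        lines.append(f"        {keyword} w == {word!r}:")
        lines.append(
            f"            points[{language!r}] = points.get({language!r}, 0) + 1"
        )
        keyword = "elif"
    lines.append("    return points")
    return "\n".join(lines) + "\n"


# Build the tool, then use the tool.
source = compile_patterns(PATTERNS)
namespace = {}
exec(source, namespace)
score = namespace["score"]

print(source)
print(score("der hund und die katze".split()))
```

In Python the generated code is no faster than a dictionary lookup would be; the sketch is about the structure of the idea (a program that writes a program), not about the speedup that mattered in Awk.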


My co-author Ward tells me that he finds that snipping out everything except operators in source code (using Perl) condenses megabyte-sized code to a few pages of cryptic symbols. What's interesting about this is that he finds the result a workable way to analyze "syntax" and discover things about how the program is coded. For example, he's found it a good way to locate sections of code or methods that are very similar or identical. In effect he's leveraging the human brain's remarkable ability to identify subtle pattern signatures for code analysis. – Bo Leuf
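A rough sketch of that kind of condensation, in Python rather than Perl and without any claim to match Ward's actual script: the function name `operator_skeleton` and the choice to keep line breaks are assumptions. It deletes letters, digits, underscores, and whitespace from each line, so only punctuation and operators survive.

```python
import re


def operator_skeleton(source):
    """Strip word characters and whitespace from each line of source."""
    return "\n".join(
        re.sub(r"[\w\s]+", "", line) for line in source.splitlines()
    )


code = (
    "for (i = 0; i < n; i++) {\n"
    "    total += a[i];\n"
    "}\n"
)
print(operator_skeleton(code))

# Two statements with different names but the same shape leave the same
# residue, which is what makes near-duplicates easy to spot by eye.
print(operator_skeleton("x = a[i] + b;") == operator_skeleton("total = values[k] + offset;"))
```

This naive version also keeps punctuation found inside comments and string literals, which a more careful tool would probably want to skip.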


TopicProgramming - TopicThinking - TopicPersonalHistory - 2001-09-06

