While looking through my old notebooks I found some fantasy plans for information retrieval software development projects which, as far as I know, still haven't been accomplished. I'd like to see:
- morel --- a relevance-ranked version of the UNIX "more" command. Classic "more" displays a file one screen's worth at a time but also has a neat frill: type "/" followed by a pattern, and you jump ahead to the next occurrence of that pattern. Extend that to get morel, which takes as arguments a file name (or names) and a set of search terms. Morel (= "more" + "relevance") scans the file(s) and comes back with chunks (screens) that best match most of the requested terms. (See FuzzyProximity, the ^zhurnal entry of 18 March 2000, for a sketch of one simple yet relatively successful matching method.) A little pattern-description language could allow run-time customization of the relevance ranking (e.g., put "+" in front of the most important words, use "-" to indicate negative weighting, insert
"|" to separate alternatives, and maybe even support some regular expression wildcards). Morel could be especially useful with big, heterogeneous, ill-structured collections of raw text, whenever a needle in a chaotic landscape of haystacks needs to be found.
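Since morel doesn't exist yet, here's one minimal sketch of the scoring idea in Python: split the file into screen-sized chunks, weight each query term ("+" boosts, "-" penalizes; the "|" alternatives and regex wildcards are left out), and return the best-scoring screens. The function names and the simple count-based weighting are my own assumptions, not the FuzzyProximity method itself.

```python
import re

def parse_query(terms):
    """Turn query words into (word, weight) pairs.
    '+' prefix boosts a word, '-' gives it negative weight,
    plain words count as 1. (Weights here are arbitrary choices.)"""
    parsed = []
    for t in terms:
        if t.startswith('+'):
            parsed.append((t[1:].lower(), 2.0))
        elif t.startswith('-'):
            parsed.append((t[1:].lower(), -1.0))
        else:
            parsed.append((t.lower(), 1.0))
    return parsed

def morel(text, terms, screen_lines=24, top=3):
    """Return the `top` screen-sized chunks of `text` that best
    match the weighted query terms, best first."""
    lines = text.splitlines()
    chunks = ['\n'.join(lines[i:i + screen_lines])
              for i in range(0, len(lines), screen_lines)]
    weights = parse_query(terms)
    scored = []
    for chunk in chunks:
        words = re.findall(r'\w+', chunk.lower())
        score = sum(w * words.count(term) for term, w in weights)
        scored.append((score, chunk))
    scored.sort(key=lambda sc: -sc[0])
    return [chunk for score, chunk in scored[:top] if score > 0]
```

A real morel would wrap this in a pager loop, but the ranking core is just "score each screen, show the winners."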
- autolinker --- a system to build hyperlinks between related chunks of information. Consider a collection of web pages or otherwise delimited articles, such as this ^zhurnal itself. Humans can add cross-references (e.g., the pointer in the previous bullet), but that takes time and some amount of wit ... both of which are, alas, in limited supply. An autolinker takes the data collection, correlates items, and identifies clusters that cohere: material which has a common vocabulary, for example, or repeatedly uses certain phrases, or possesses other similarities based on a statistical metric. The autolinker then supplies bridges between related items, resulting in an enhanced set of files ready for fast and effective browsing. Auto-generated cross-links are there if needed, but can be ignored if they seem irrelevant. (Why, what do you know, it exists! CorrelOracle!)
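One plausible statistical metric for "material which has a common vocabulary" is cosine similarity over word-frequency vectors; a toy autolinker might look like the following (the names and the 0.5 threshold are illustrative assumptions, not a prescription):

```python
import math
import re
from collections import Counter
from itertools import combinations

def vocabulary_vector(text):
    """Word-frequency vector for one article."""
    return Counter(re.findall(r'\w+', text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors:
    1.0 for identical vocabulary profiles, 0.0 for disjoint ones."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def autolink(articles, threshold=0.5):
    """Given {name: body} articles, return pairs of names whose
    shared vocabulary pushes similarity above `threshold` ---
    candidates for auto-generated cross-links."""
    vecs = {name: vocabulary_vector(body) for name, body in articles.items()}
    return [(x, y) for x, y in combinations(sorted(vecs), 2)
            if cosine(vecs[x], vecs[y]) >= threshold]
```

A production version would strip common stop words and weight rare terms more heavily (e.g., tf-idf), but even raw counts surface the obvious clusters.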
- kwicker --- an on-the-fly generator for Key Word In Context (KWIC) displays and other information retrieval via a Web interface. The primitive free-text indexer/browser work that I did back in the late 1980s and early 1990s was helpful, but was also very much stand-alone. How about making that service available via a *.cgi or other remote computational resource? The user picks a database (e.g., the works of Shakespeare, or Gibbon's Decline and Fall) and gets a page showing an alphabetized list of all the words in that database (or an excerpt thereof, if the list is too long). Scroll around in the word list and click to get a KWIC showing all the instances of the selected word with half a line of context on each side. Scroll around in the KWIC and click to grab a page of the database centered on the selected instance. Simple, fast, intuitive, and useful. Add fuzzy proximity search for the (relatively rare) hard retrieval cases. (See http://www.his.com/~z/ftirp.html , "Free Text Information Retrieval Philosophy" for an overview essay, and the http://www.his.com/~z/c/ "Free Text Archive" for commentary, source code, and DOS executables for the old indexer/browser. See also the ^zhurnal entries of 29 October 1999, 31 January 2000, and 15 May 2000.)
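The KWIC display itself is the easy part; a CGI wrapper would just call something like this hypothetical helper, which finds every occurrence of a word and prints it with a half-line of context on each side, keywords aligned in a column:

```python
import re

def kwic(text, word, context=30):
    """Key Word In Context: one line per occurrence of `word`,
    with up to `context` characters of surrounding text on each
    side and the keyword aligned down the middle."""
    out = []
    for m in re.finditer(r'\b%s\b' % re.escape(word), text, re.IGNORECASE):
        left = text[max(0, m.start() - context):m.start()]
        right = text[m.end():m.end() + context]
        # right-justify the left context so keywords line up
        out.append('%*s %s %s' % (context, left.replace('\n', ' '),
                                  m.group(0), right.replace('\n', ' ')))
    return out
```

Click-through from a KWIC line back to the full page is then just a matter of remembering each match's character offset. The word-list page would come from the same index that drives the matching.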
My resolution for the Third Millennium: get to work on implementing the above, or induce someone cleverer to do it first!
Thursday, January 04, 2001 at 20:12:04 (EST) = Datetag20010104
(correlates: CorrelOracle, IndexerBrowserFlashback, KwicsChinksAndChunks, ...)