FuzzyProximity


Search engines have replaced the weather as a subject for complaint. Who hasn't experienced problems with relevance ranking? The simplest and most "obvious" questions seem to give the poorest answers. Why?

Part of the difficulty originates with the user, in the form of ill-posed queries. One can scarcely expect an information retrieval (IR) system to be a mind reader! Yet the majority of searches are for single words or short ambiguous phrases. A human expert with decades of real-world experience could not parse and respond to such a question. It's unfair to expect a machine to do better.

But even sophisticated search patterns are too often unsuccessful. Most relevance-ranking algorithms do a poor job of identifying the documents (or web pages) that best answer a request. Weakness seems to be associated with several factors:

  • long documents which happen to include many of the terms in a query are ranked too high, even when those terms are in disparate, unrelated parts of the text;
  • query languages which lack the ability to specify synonyms and alternative words, and to control thesaurus expansion of search terms at run time; and
  • Boolean syntax (AND/OR/NOT) which leads to sharp-edged binary relevance estimates, wiping out all useful differentiation among documents with similar vocabularies.

But there is hope for improvement. One tiny example: my work in the first TREC (Text REtrieval Conference of 1992, sponsored by the US National Institute of Standards and Technology) led me to write a prototype document-ranking program that did surprisingly well in an impartial, refereed fly-off competition. The results are included in the conference proceedings (edited and with commentary by Dr. Donna Harman). In brief, I began with the simpleminded and subjective observation that the documents which people like most tend to have strong local clustering of "interesting" words.

So, I took the 50 TREC benchmark questions and constructed obvious regular expressions ("regexps") for each of the key terms in them. Thus for topic #001, a search for "pending antitrust cases", I built the pattern /ANTITRUST/ & /CASE/ & /PEND/. For topic #002, "acquisitions or mergers involving US and foreign companies", I came up with /ACQUISITION|BUYOUT|MERGER|TAKEOVER/ & /US|U.S.|AMERICAN/ & /FOREIGN/. For equivalent terms I wrote a regexp with "|" (the "OR" symbol) joining the words. I spent at most two minutes writing these patterns per query — not a lot of deep thought. I converted all documents to upper-case before processing them, so my program was able to ignore capitalization issues.
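
In gawk terms, the setup might have looked something like this (a sketch written for illustration, not the original code; the pattern strings are the topic #002 examples above, and the file and variable names are arbitrary):

    # topic002.awk -- hypothetical sketch of the topic #002 patterns as gawk regexps
    BEGIN {
        pat[1] = "ACQUISITION|BUYOUT|MERGER|TAKEOVER"   # synonyms joined by "|"
        pat[2] = "US|U.S.|AMERICAN"
        pat[3] = "FOREIGN"
        npat = 3
    }
    {
        line = toupper($0)            # ignore capitalization, as described above
        for (i = 1; i <= npat; i++)
            if (line ~ pat[i])
                printf "line %d matches /%s/\n", FNR, pat[i]
    }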

To handle fuzzy proximity relationships among separate terms (the "&" symbol used above) and to generate relevance measures, I invented a quick-and-dirty point-count system. Each search regexp started out with 0 points. Every time a line in a document matched a search regexp, I added an arbitrary 5 points to that regexp's score. A line in the document then received a net interest rating equal to the product of the separate pattern scores for that query. ("&" thus became a multiplication.) When moving on to the next line of a document, I multiplied all regexp point counts by 0.9, to make them fade away with a characteristic length scale of about 10 lines.

The estimated relevance of a document as a whole was then just the maximum relevance of any line in that document. If a term needed a negative weight (Boolean-NOT-like), I multiplied in 5 minus the pattern's point score instead of the score itself. I experimented with different weights and relevance length scales for different terms, but it made little difference to the overall rankings.
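
Putting the last two paragraphs together, the whole scoring scheme fits in a few lines of gawk. The sketch below is illustrative rather than the original program: the 5-point reward, the 0.9-per-line fade, the product-of-scores line rating, the 5-minus-score treatment of NOT-like terms, and the max-over-lines document score all come from the description above, while the variable names, the neg[] flag, and the gawk-specific ENDFILE rule are arbitrary choices:

    # fuzzyprox.awk -- sketch of the ranking scheme (illustrative, not the original)
    BEGIN {
        pat[1] = "ANTITRUST"; pat[2] = "CASE"; pat[3] = "PEND"; npat = 3
        # neg[3] = 1                    # would treat the third term as NOT-like
    }
    FNR == 1 {                          # start of a new document: reset scores
        for (i = 1; i <= npat; i++) score[i] = 0
        docScore = 0
    }
    {
        line = toupper($0)              # ignore capitalization
        lineScore = 1
        for (i = 1; i <= npat; i++) {
            score[i] *= 0.9             # fade away, ~10-line length scale
            if (line ~ pat[i])
                score[i] += 5           # 5 points for a match on this line
            # "&" is multiplication; a NOT-like term contributes 5 minus its score
            lineScore *= ((i in neg) ? 5 - score[i] : score[i])
        }
        if (lineScore > docScore)
            docScore = lineScore        # document score = best line score
    }
    ENDFILE {                           # gawk extension: runs after each input file
        printf "%.2f\t%s\n", docScore, FILENAME
    }

With the collection split into one file per document, something like gawk -f fuzzyprox.awk doc*.txt | sort -rn would then list the highest-scoring documents first.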

The bottom line of this almost-trivial algorithm is that high point scores go to documents which have many relevant words within a few lines of each other. The user has total control over thesaurus (synonym) expansion of terms, and can change relative weights and scoring ad lib. Long documents are neither penalized nor rewarded.

And it worked! With minimal effort this system came in among the top three of the TREC competition. My prototype implementation was slow as molasses: it took about a third of a second to rank each document against each query, so total run time was a couple of weeks(!). But of course the process ran politely in background and did not interfere with other activities on the workstation. If it were rewritten in a compiled language (I used gawk, the GNU project's free version of Awk) and executed on faster hardware (I had an original NeXT machine, the famously lethargic black cube) then it could easily run many orders of magnitude faster.

No magic elixIR (an Information Retrieval joke ... get it(^_^)?!) — just recognition of the fact that "good" documents tend to have "good" words near each other. Fuzzy proximity search and a straightforward point-count system did the rest.

Saturday, March 18, 2000 at 08:22:49 (EST) = 2000-03-18

(see IrWishes)

TopicProgramming - TopicPersonalHistory


(correlates: ScienceWiki, HowDoYouDo, IrWishes, ...)