FreeTextDesiderata

 

People need help in working with mountains of information. They need tools for real-time high-bandwidth access to large-scale collections of disorganized free-text data. Those tools have to be inexpensive, robust, efficient, and above all invisible — so they don't get in the way of human creativity.

"Real-time" means that simple retrieval requests should be answered in less than a second. (Complex Boolean proximity searches can take a bit longer to set up and execute, but even there, speed is important.) "High-bandwidth" means that the person has to get information back from the computer in a useful form which can be evaluated quickly. It's not good enough to return a list of documents or files, each of which has to be paged through in order to find the few relevant items. "Free-text" means that the input stream can't be assumed to include any regular structure beyond words separated by delimiters (spaces, punctuation, etc.).
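To make the "free-text" assumption concrete, here is a minimal Python sketch, not taken from any particular system. It treats the input as nothing more than words separated by delimiters, records where each word occurs, and answers a simple proximity question. The names tokenize, build_positions, and near are hypothetical, chosen only for illustration, and the rule for what counts as a delimiter is an assumption.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split free text into lowercase words.

    Assumption: any character other than a letter, digit, or apostrophe
    is a delimiter (spaces, punctuation, etc.).
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positions(text):
    """Map each word to the list of word positions where it occurs."""
    positions = defaultdict(list)
    for i, word in enumerate(tokenize(text)):
        positions[word].append(i)
    return positions


def near(positions, a, b, window):
    """Return (i, j) position pairs where a and b lie within `window` words."""
    return [
        (i, j)
        for i in positions.get(a, [])
        for j in positions.get(b, [])
        if abs(i - j) <= window
    ]
```

Because the positions are computed once when the text is indexed, a proximity request only has to compare short lists of integers, which is one plausible way to keep simple requests fast.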

People need tools for the earliest stages of research, when they have to be able to browse and free-associate without even knowing what the right questions are. Users have to be able to work with a database without intimate knowledge of the details of what's in it, since nobody will have time to read much of the accelerating flood of information. A big database in a particular area has to be a "corporate memory" for groups of scholars who are working on related topics. A new person has to be able to come in and do research without extensive training, and experienced users have to find the system transparent, so that it becomes an extension of their memories, not a barrier to getting at the data and thinking with it.

Good free-text information retrieval (IR) tools coexist with other types of tools, such as structured databases, which are appropriate for answering well-formulated queries at later phases of the research effort. There are four fundamental operations that a good real-time high-bandwidth large-scale free-text IR system must support:

  • viewing lists of words that occur in a database;
  • selecting subsets of a big database to work within;
  • browsing through candidate items efficiently; and
  • reading and taking notes on retrieved information.
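The four operations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design of the system mentioned below: the class TinyIndex and its methods word_list, select, browse, read, and take_note are hypothetical names, documents are assumed to fit in memory as a dict of id to text, and browsing is assumed to mean showing each hit with a little surrounding context.

```python
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy in-memory index illustrating the four operations (hypothetical)."""

    def __init__(self, docs):
        self.docs = docs                    # doc_id -> raw text
        self.postings = defaultdict(set)    # word -> ids of docs containing it
        self.counts = Counter()             # word -> total occurrences
        self.notes = defaultdict(list)      # doc_id -> user notes
        for doc_id, text in docs.items():
            for word in re.findall(r"[a-z0-9']+", text.lower()):
                self.postings[word].add(doc_id)
                self.counts[word] += 1

    def word_list(self, prefix=""):
        """View the words in the database (with counts), alphabetically."""
        return sorted((w, n) for w, n in self.counts.items()
                      if w.startswith(prefix))

    def select(self, *words):
        """Select the subset of documents that contain every given word."""
        sets = [self.postings.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set(self.docs)

    def browse(self, word, subset=None, width=30):
        """List each hit with `width` characters of context on either side."""
        ids = self.docs if subset is None else subset
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        lines = []
        for doc_id in sorted(ids):
            text = self.docs[doc_id]
            for m in pattern.finditer(text):
                start = max(0, m.start() - width)
                end = min(len(text), m.end() + width)
                lines.append((doc_id, text[start:end].strip()))
        return lines

    def read(self, doc_id):
        """Return the full text of one retrieved item."""
        return self.docs[doc_id]

    def take_note(self, doc_id, note):
        """Attach a free-form note to a retrieved item."""
        self.notes[doc_id].append(note)
```

A session might build the index once, call word_list("fa") to see which words exist, narrow with select("free", "text"), skim the results with browse("tools", subset=...), and finally read an item and call take_note on it. Returning hits in context rather than whole documents is the point of the browse step: the reader evaluates a line at a time instead of paging through files.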

See http://www.his.com/~z/ftirp.html = "Free Text Information Retrieval Philosophy" for notes on how one simple (and free!) IR system was implemented over a decade ago.

Friday, October 29, 1999 at 21:15:14 (EDT) = 1999-10-29

TopicProgramming


(correlates: PolicyMaking, IndexerBrowserFlashback, PlusUltra, ...)