During a recent lecture on programs to organize and arrange collections of text, I suddenly realized why so many document clustering systems are so dissatisfying to use. The problem? They're just like letting a monkey or a bird sort brightly-colored objects into piles.

The statistical correlations that produce the clusters are a black box to the human user, so she spends most of her time trying to puzzle out why things are grouped the way they are. "Ah, these are all in Portuguese" ... "OK, these all mention cancer" ... "All of these are either about undersea vessels or sandwiches" ... "These seem to be the leftovers that are all longer than a dozen pages" ... and so on.

And contrariwise, to the software all the documents are equally opaque. In the absence of real understanding, the only thing it can do is group items based on statistical correlations among whatever features it can recognize. A smart person handed a text corpus in an unknown language could do no better.
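To make that concrete, here is a minimal sketch of what "grouping on statistical correlations of recognizable features" might look like. It is not the method of any particular clustering system; the function names (`features`, `cosine`, `cluster`), the word-count features, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words counts: the only 'recognizable features' this sketch sees."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Greedy one-pass grouping (hypothetical, illustrative only).

    Each document joins the first pile whose seed document is similar
    enough; otherwise it starts a new pile of its own.
    """
    piles = []  # list of (seed_features, [document indices])
    for i, doc in enumerate(docs):
        f = features(doc)
        for seed, members in piles:
            if cosine(seed, f) >= threshold:
                members.append(i)
                break
        else:
            piles.append((f, [i]))
    return [members for _, members in piles]
```

Nothing in the code knows what a submarine or a sandwich is. Two documents land in the same pile only because they share surface tokens, and a common word like "the" counts toward that overlap just as much as a meaningful one.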

TopicProgramming - TopicScience - Datetag20060522

