CorrelOracle

ZhurnalWiki pages that wrestle with related themes deserve to be linked together, but who has the energy to do so? As the number of pages approaches 1,000, the task becomes even less feasible for a person.

Hence, the Correl Oracle: a small software experiment in building connections between files. Early this year ^z fantasized (see IrWishes, 2001 Jan 4) about:

autolinker --- a system to build hyperlinks between related chunks of information. Consider a collection of web pages or otherwise delimited articles, such as this ^zhurnal itself. Humans can add cross-references (e.g., the pointer in the previous bullet), but that takes time and some amount of wit ... both of which are, alas, in limited supply. An autolinker takes the data collection, correlates items, and identifies clusters that cohere: material which has a common vocabulary, for example, or repeatedly uses certain phrases, or possesses other similarities based on a statistical metric. The autolinker then supplies bridges between related items, resulting in an enhanced set of files ready for fast and effective browsing. Auto-generated cross-links are there if needed, but can be ignored if they seem irrelevant.

The Correl Oracle is a baby step in that direction, via a hundred or so lines of Perl (see CorrelOracle01SourceCode). It ran for about half an hour (on my little machine) to produce the cross-links at the bottom of ~700 ZhurnalWiki pages. The method it uses is quite straightforward:

The "similarity" metric I used in Correl Oracle version 0.1 is one that seems reasonable, but it's rather arbitrary and has little scientific or mathematical foundation (translation: I made it up!). Essentially, two files are similar if they each contain a disproportionate fraction of the occurrences of many words. That is, they share a common vocabulary which isn't shared by lots of other files. My similarity measure also gives more weight to smaller files, since otherwise the larger files win too often simply because they have more words. (Read the source code for details.)
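To make the idea concrete, here is a minimal sketch of a metric in the same spirit, written in Python rather than the Perl of the real thing. It is not the formula in CorrelOracle01SourceCode: the tokenizer, the product-of-fractions score, and the square-root size normalization are all assumptions chosen for illustration, and the function names are hypothetical. Its numbers will not be on the same scale as Correl Oracle's own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters as words (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def word_fractions(pages):
    """Return (fractions, sizes) for a dict mapping page name -> page text.

    fractions[name][word] is the share of ALL occurrences of `word`
    (across every page) that fall on this page.
    sizes[name] is the page's total word count.
    """
    counts = {name: Counter(tokenize(text)) for name, text in pages.items()}
    totals = Counter()
    for page_counts in counts.values():
        totals.update(page_counts)
    fractions = {
        name: {word: n / totals[word] for word, n in page_counts.items()}
        for name, page_counts in counts.items()
    }
    sizes = {name: sum(page_counts.values()) for name, page_counts in counts.items()}
    return fractions, sizes


def similarity(a, b, fractions, sizes):
    """Score two pages by the vocabulary they disproportionately share.

    A word that is spread thinly over many pages contributes little, because
    each page holds only a small fraction of its occurrences. Dividing by the
    square root of the two page sizes (an assumed normalization) keeps long
    pages from winning just because they contain more words.
    """
    if sizes[a] == 0 or sizes[b] == 0:
        return 0.0
    shared = fractions[a].keys() & fractions[b].keys()
    overlap = sum(fractions[a][w] * fractions[b][w] for w in shared)
    return overlap / math.sqrt(sizes[a] * sizes[b])


# Made-up sample pages, for illustration only.
pages = {
    "PageA": "perl script counts words in wiki pages",
    "PageB": "a perl script links wiki pages by words",
    "PageC": "running shoes and marathon training",
}
fractions, sizes = word_fractions(pages)
print(similarity("PageA", "PageB", fractions, sizes))  # shared vocabulary: positive score
print(similarity("PageA", "PageC", fractions, sizes))  # no words in common: 0.0
```

Comparing every page against every other page takes on the order of N squared comparisons, which fits with a run of about half an hour over roughly 700 pages on a small machine.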

The bottom line is that when two Wiki pages have a similarity greater than 1, they tend to have quite a lot in common, at least on a word-by-word basis. On the other hand, when the best that Correl Oracle can come up with is a similarity less than 1, it means that a particular Wiki page is relatively distinctive: few other pages share its vocabulary.
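The rule of thumb above can be sketched as a small ranking step: given one page's similarity scores against all the others, keep the best few and note whether even the top score falls below 1. This is a hypothetical helper, not code from Correl Oracle; the function name, the choice of three links, and the footer layout (modeled on the "(correlates: ...)" line at the bottom of a page) are assumptions.

```python
def correlates_footer(scores, top=3, threshold=1.0):
    """Build a cross-link footer from a dict mapping page name -> similarity score.

    Returns (footer, distinctive). `distinctive` is True when even the best
    match scores below `threshold`, meaning the page has little in common,
    word for word, with any other page.
    """
    # Highest similarity first; keep only the `top` best matches.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]
    names = [name for name, _ in ranked]
    distinctive = not ranked or ranked[0][1] < threshold
    footer = "(correlates: " + ", ".join(names + ["..."]) + ")"
    return footer, distinctive


# Made-up scores for illustration; these are not real Correl Oracle output.
scores = {"PageB": 2.4, "PageC": 0.3, "PageD": 1.9, "PageE": 1.2}
footer, distinctive = correlates_footer(scores)
print(footer)       # (correlates: PageB, PageD, PageE, ...)
print(distinctive)  # False
```

The links are emitted even when the page is flagged as distinctive, since a weak suggestion can simply be ignored if it seems irrelevant.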

Much more remains to be done to make a better Correl Oracle:

But Correl Oracle version 0.1 is at least a start!

TopicProgramming - Datetag20010826


(correlates: IrWishes, CorrelOracle2, NecessityAndSufficiency, ...)