CorrelOracle

ZhurnalWiki pages that wrestle with related themes deserve to be linked together, but who has the energy to do so? As the number of pages approaches 1,000, the task becomes even less feasible for a person.

Hence, the Correl Oracle: a small software experiment in building connections between files. Early this year ^z fantasized (see IrWishes, 2001 Jan 4) about:

autolinker — a system to build hyperlinks between related chunks of information. Consider a collection of web pages or otherwise delimited articles, such as this ^zhurnal itself. Humans can add cross-references (e.g., the pointer in the previous bullet), but that takes time and some amount of wit ... both of which are, alas, in limited supply. An autolinker takes the data collection, correlates items, and identifies clusters that cohere: material which has a common vocabulary, for example, or repeatedly uses certain phrases, or possesses other similarities based on a statistical metric. The autolinker then supplies bridges between related items, resulting in an enhanced set of files ready for fast and effective browsing. Auto-generated cross-links are there if needed, but can be ignored if they seem irrelevant.

The Correl Oracle is a baby step in that direction, via a hundred or so lines of Perl (see CorrelOracle01SourceCode). It ran for about half an hour (on my little machine) to produce the cross-links at the bottom of ~700 ZhurnalWiki pages. The method it uses is quite straightforward:

go through all the files in a directory (I omitted the "Index" and "Topic" files here since they lack much textual content)
take all contiguous alphanumeric strings in the files and turn them into all-capital letter "words" (e.g., "TAKE", "ALL", "CONTIGUOUS", "2001-08-26", etc.)
build tables of how many times each word occurs in each file (using Perl's fine associative arrays, aka "hashes")
compare the words used in every file with those in every other file and compute a "similarity" measure
for each file, take the three most "similar" other files and build links to them
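
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl of CorrelOracle01SourceCode. The names `tokenize`, `count_words`, `top_links`, and `read_pages` are hypothetical, and so is the guess that the "Index" and "Topic" files are skipped by name prefix. The pattern used here treats only ASCII letters and digits as word characters, so a date like 2001-08-26 splits into three tokens; the original script may handle hyphens differently. The `shared_word_score` function is only a placeholder so that the sketch runs end to end.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: "alphanumeric" means ASCII letters and digits only.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def read_pages(directory, skip_prefixes=("Index", "Topic")):
    """Read every file in a directory, skipping names with the given prefixes."""
    pages = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and not path.name.startswith(skip_prefixes):
            pages[path.name] = path.read_text(errors="ignore")
    return pages


def tokenize(text):
    """Return the upper-cased alphanumeric runs found in text."""
    return [word.upper() for word in WORD_RE.findall(text)]


def count_words(pages):
    """Map each page name to a Counter of how often each word occurs in it."""
    return {name: Counter(tokenize(body)) for name, body in pages.items()}


def shared_word_score(a, b):
    """Placeholder similarity: the number of distinct words two pages share."""
    return len(a.keys() & b.keys())


def top_links(counts, similarity=shared_word_score, k=3):
    """For every page, list the k other pages with the highest similarity."""
    links = {}
    for name, words in counts.items():
        scored = [
            (similarity(words, other_words), other_name)
            for other_name, other_words in counts.items()
            if other_name != name
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by name
        links[name] = [other_name for _, other_name in scored[:k]]
    return links


# Tiny in-memory example instead of a real directory of pages.
demo_pages = {
    "A": "perl hashes count words",
    "B": "perl hashes are associative arrays",
    "C": "running in the rain",
    "D": "count words in perl",
}
print(top_links(count_words(demo_pages), k=1))
```

Comparing every file with every other file means the work grows with the square of the number of pages, which fits the long run time reported for ~700 pages.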

The "similarity" metric I used in Correl Oracle version 0.1 is one that seems reasonable, but it's rather arbitrary and lacks much of a scientific/mathematical foundation (translation: I made it up!). Essentially, two files are similar if they each contain a disproportionate fraction of the occurrences of many words — that is, if they share a common vocabulary which isn't shared by lots of other files. My similarity measure also gives more weight to smaller files, since otherwise the larger files win too often simply because they have more words. (Read the source code for details.)
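
To make the idea concrete, here is one possible measure in the spirit of that description. It is an illustrative Python sketch and explicitly not the formula used in Correl Oracle version 0.1 (which is in the Perl source). The functions `corpus_totals` and `similarity`, and the square-root size normalization, are assumptions made up for this example. Each shared word contributes the product of the fractions of its total occurrences that fall in the two files, so words concentrated in just those two files count most, and the sum is divided by a term that grows with file size so that long files do not win automatically.

```python
from collections import Counter
from math import sqrt


def corpus_totals(counts):
    """Sum each word's occurrences over every page in the collection."""
    totals = Counter()
    for words in counts.values():
        totals.update(words)
    return totals


def similarity(a, b, totals):
    """Illustrative score: shared words weighted by how concentrated they are in a and b."""
    size_a, size_b = sum(a.values()), sum(b.values())
    if size_a == 0 or size_b == 0:
        return 0.0
    score = 0.0
    for word in a.keys() & b.keys():
        share_a = a[word] / totals[word]  # fraction of all occurrences found in page a
        share_b = b[word] / totals[word]  # fraction of all occurrences found in page b
        score += share_a * share_b
    # Assumption: dividing by the geometric mean of the sizes damps long pages.
    return score / sqrt(size_a * size_b)


counts = {
    "A": Counter({"PERL": 2, "THE": 1}),
    "B": Counter({"PERL": 2, "THE": 1}),
    "C": Counter({"THE": 4, "RAIN": 2}),
}
totals = corpus_totals(counts)
print(similarity(counts["A"], counts["B"], totals))  # A and B share the rarer word PERL
print(similarity(counts["A"], counts["C"], totals))  # A and C share only the common word THE
```

In this toy collection the first score comes out higher than the second, because "PERL" appears only in A and B while "THE" is spread over all three pages. The numeric scale of this sketch differs from the original metric, so thresholds quoted for version 0.1 do not carry over to it.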

The bottom line is that when two Wiki pages have a similarity greater than 1, they tend to have quite a lot in common, at least on a word-by-word basis. On the other hand, when the best similarity that Correl Oracle can find for a page is less than 1, that page shares little of its distinctive vocabulary with any other page and is relatively unusual.

Much more remains to be done to make a better Correl Oracle:

experiment with other definitions of "similarity" to sharpen correlations and reduce noise in auto-generated linkages
improve the definition of a "word", perhaps using a bit of linguistic knowledge (e.g., "stemming" to remove endings like "-s", "-ing", "-ed", etc. before computing similarity measures, or "stop word lists" to remove low-information-content terms such as "the", "a", "an", etc.)
explore the use of multi-word phrases as the basis of "similarity" (e.g., look at all adjacent two-word groups)
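
As a rough idea of what the second and third items might look like, here is a small Python sketch. Everything in it is an assumption for illustration: the stop word list holds only the three examples named above, `crude_stem` strips a single ending with no linguistic knowledge (a real stemmer such as Porter's does far better), and `normalize` and `bigrams` are hypothetical helper names.

```python
import re

STOP_WORDS = {"THE", "A", "AN"}  # only the examples from the text; a real list is longer
SUFFIXES = ("ING", "ED", "S")    # endings to strip, longest first


def crude_stem(word, min_stem=3):
    """Strip one common ending, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Upper-case the alphanumeric runs, drop stop words, and stem what remains."""
    words = [w.upper() for w in re.findall(r"[A-Za-z0-9]+", text)]
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def bigrams(words):
    """Return all adjacent two-word groups as single strings."""
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


words = normalize("The links linked a linking page")
print(words)
print(bigrams(words))
```

Such naive suffix stripping over-stems some words (for instance, "GLASS" becomes "GLAS"), which is part of why the definition of a "word" is worth experimenting with. The word or bigram lists produced this way could be counted per file in place of the raw words.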

But Correl Oracle version 0.1 is at least a start!

TopicProgramming - 2001-08-26

(correlates: IrWishes, CorrelOracle2, NecessityAndSufficiency, ...)