CorrelOracle2

The Correl Oracle takes a set of files from a Wiki and analyzes them. It looks for co-occurrences of words and builds links among pages which share a common vocabulary. (See CorrelOracle for an introductory discussion, CorrelOracle02SourceCode for the Perl program itself, and CorrelationLog for a big and gory table of inter-page correlation factors.)

Correl Oracle version 0.2 includes a few new features which should improve the quality of its output:

"words" no longer include digits
stopwords ("the", "and", "of", etc.) are filtered out, and stemming (truncation of endings like "-s", "-ing", "-ed", etc.) is performed before correlating
the words which trigger a link are given (in stemmed form), along with the strength of the connection
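
To make the three changes above concrete, here is a minimal sketch of that kind of preprocessing. It is written in Python rather than the Perl of the actual program, and it is not taken from CorrelOracle02SourceCode: the stopword list, the suffix list, the three-letter minimum, and the function names (stem, page_words, shared_words) are all assumptions made for illustration.

```python
import re

# Hypothetical stopword list; the real program's list is not shown on this page.
STOPWORDS = {"the", "and", "of", "a", "an", "in", "to", "is", "it"}

# Hypothetical suffix list, longest first so "-ing" is tried before "-s".
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Crudely strip one common ending, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def page_words(text):
    """Return the set of stemmed, non-stopword, letters-only words in text."""
    # Matching only runs of letters means digits never become part of a "word".
    tokens = re.findall(r"[a-z]+", text.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}


def shared_words(text_a, text_b):
    """Return the sorted stemmed words that two pages have in common."""
    return sorted(page_words(text_a) & page_words(text_b))
```

With these assumptions, shared_words("Linking pages and words", "A page links to a word") gives the stemmed trigger words link, page, and word. A real stemmer would handle many more endings than this crude suffix chop does.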

How good is it? I would rate the current Correl Oracle as promising, maybe useful at times, but still in need of much work. Too many pages are connected because of the coincidental use of a few odd words. Perhaps the quality of links can be improved by looking at two-word phrases, or by adjusting the "similarity" metric which underlies the correlations? Perhaps the correlations will improve naturally as this Wiki gets larger? Hard to say...
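
The two ideas floated above (two-word phrases, and a different "similarity" metric) might be tried along the following lines. This is only a sketch in Python, not the metric the Correl Oracle actually uses: cosine similarity over counts is one common choice picked here as an assumption, and the function names bigrams and cosine_similarity are hypothetical.

```python
import math
from collections import Counter


def bigrams(words):
    """Return the adjacent two-word phrases from a list of words."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def cosine_similarity(features_a, features_b):
    """Cosine of the angle between two bags of features (words or phrases)."""
    counts_a, counts_b = Counter(features_a), Counter(features_b)
    # Only features present on both pages contribute to the dot product.
    common = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[f] * counts_b[f] for f in common)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page correlates with nothing
    return dot / (norm_a * norm_b)
```

Feeding bigrams(words) instead of single words into the same metric would mean two pages only score well when they share whole phrases, which might cut down on links triggered by a few coincidental odd words. Whether it helps in practice would have to be checked against the real pages.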

And in any event, the code should be rewritten; I'm in the process of learning Perl as I go along, and it shows!

TopicProgramming - 2001-09-09

(correlates: CorrelOracle, CorrelOracle3, LowProfile, ...)