Department of Mathematics and Computing Science
World Wide Web Organization
Paul De Bra

Determining Relevance in General

Most information retrieval software currently available follows Luhn's assumption that the frequency of word occurrence in an article furnishes a useful measurement of word significance. There are a few pitfalls, though:

Stripping, stemming and looking for synonyms must be done both on the words in the search string and on the words in the documents.
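To illustrate why the same processing has to be applied on both sides, here is a minimal Python sketch. The stop-word list, the synonym table and the crude suffix stripper are all hypothetical stand-ins chosen for this example; a real system would use a proper stemmer and thesaurus.

```python
# Illustrative assumptions: a tiny stop-word list and synonym table.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}
SYNONYMS = {"automobile": "car"}  # maps a word to one canonical synonym


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Apply stripping, stemming and synonym mapping to a piece of text."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    stems = [crude_stem(w) for w in words]
    return [SYNONYMS.get(s, s) for s in stems]


query = "Repairing automobiles"
document = "The car was repaired in the garage."

# Without normalization the query and the document share no words.
raw_overlap = set(query.lower().split()) & set(document.lower().split())

# With the same normalization on both sides, every query term matches.
query_terms = set(normalize(query))
document_terms = set(normalize(document))
all_matched = query_terms <= document_terms
```

If only the documents were normalized, the query word "automobiles" would never match the stored term "car", and vice versa.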



The figure below (taken from van Rijsbergen's book) shows a plot of the hyperbolic curve relating the frequency of occurrence of words to their rank order. Words that occur very frequently (above the upper cut-off) are not useful for discriminating relevant from non-relevant documents. Words that occur in very few documents (below the lower cut-off) are also not very useful for most queries, since few queries will ask for them.

Figure showing resolving power of words
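The cut-off idea can be sketched in a few lines of Python. This is only an illustration: it counts the number of documents each word occurs in, and the cut-off values and the sample collection are arbitrary assumptions, not values taken from the book.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each word, the number of documents it occurs in."""
    df = Counter()
    for doc in documents:
        # set() ensures a word is counted once per document.
        df.update(set(doc.lower().split()))
    return df


def discriminating_words(documents, lower_cutoff, upper_cutoff):
    """Keep only words whose document frequency lies between the cut-offs."""
    df = document_frequencies(documents)
    return {w for w, n in df.items() if lower_cutoff <= n <= upper_cutoff}


# A hypothetical four-document collection.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog barked at the moon",
    "the bird sang",
]

df = document_frequencies(docs)
useful = discriminating_words(docs, lower_cutoff=2, upper_cutoff=3)
```

Here "the" occurs in every document and falls above the upper cut-off, words such as "moon" occur in only one document and fall below the lower cut-off, and "cat" and "dog" remain as the discriminating words.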