Tutorial: Robot Technology and Applications

Index Databases

Building an index-database is often considered a matter of "inverting" the document database: instead of a database of documents containing words, one creates a database of words, each linked to the documents in which it occurs.
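The inversion can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular robot: the function name `build_inverted_index` and the sample documents are hypothetical, and words are simply lowercased and split on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive word splitting: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample data: a database of documents containing words.
documents = {
    "doc1": "robots index the web",
    "doc2": "the web is large",
}

# The inverted form: a database of words linked to documents.
index = build_inverted_index(documents)
```

Looking up `index["web"]` then returns the ids of both sample documents, while `index["robots"]` returns only `doc1`.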

Reducing words to a canonical form (by stripping suffixes and by stemming) is difficult to automate, and may be context sensitive.
For "non-standard" words it is difficult to decide where words start and end. For instance, is "index-database" a single word, or the combination of the words "index" and "database"?
Foreign languages use special characters that may be represented as HTML character entities, as extended ASCII characters, or simplified to plain characters. There may also be different ways to spell the same word. (In Dutch, "copieren", "kopieren", "copiëren", "kopiëren", "copi&euml;ren" and "kopi&euml;ren" all mean the same thing.)
HTML code may contain meta-information, e.g. in the header of a document. An indexer may also either use or ignore comments, which are not visible in the browser.
Ideally an index-database contains enough information from a document to be able to regenerate the document from its entry in the database.
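Some of the word-level difficulties above can be illustrated with a small Python sketch. It assumes one possible policy among many: HTML entities are decoded, diacritics are stripped, and hyphenated words are either split or kept whole depending on a flag. The names `normalize`, `tokenize` and `split_hyphens` are hypothetical. Stemming and spelling variants such as "copieren" versus "kopieren" are not handled here; those would need language-specific rules.

```python
import html
import re
import unicodedata


def normalize(text):
    """Decode HTML entities, lowercase, and strip diacritics."""
    decoded = html.unescape(text)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", decoded.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, split_hyphens=True):
    """Split text into normalized words; hyphen handling is a policy choice."""
    letters = r"[^\W\d_]+"  # one or more letters (no digits, no underscore)
    pattern = letters if split_hyphens else letters + r"(?:-" + letters + r")*"
    return re.findall(pattern, normalize(text))
```

With this policy, `normalize("copi&euml;ren")` and `normalize("copiëren")` both give `"copieren"`, and `tokenize("index-database")` gives two words by default but a single word when `split_hyphens=False`.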

An index-database that contains all information from the indexed documents will be at least as large as the documents themselves. With the World Wide Web at over 100 gigabytes, it is not feasible to generate and maintain databases this large. Hence all databases omit some information, in the hope that it will not be the information users wish to search for.
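The trade-off can be made concrete with a positional index, sketched below under simplifying assumptions (whitespace-separated words, no markup). The names `build_positional_index` and `regenerate` are hypothetical. Because the index stores one position for every word occurrence, a document's word sequence can be rebuilt from it, but the index also grows with every word of every document.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to {doc_id: [positions]} so that word order is kept."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def regenerate(index, doc_id):
    """Rebuild a document's word sequence from the positional index."""
    placed = {}
    for word, postings in index.items():
        for position in postings.get(doc_id, []):
            placed[position] = word
    return " ".join(placed[i] for i in sorted(placed))


# Hypothetical sample data.
documents = {"doc1": "the web is the web", "doc2": "robots index the web"}
positional = build_positional_index(documents)
```

Dropping the position lists reduces the index to plain word-to-document links, which is smaller but can no longer reproduce the text or answer phrase queries.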