Building an index-database is often considered a matter of "inverting"
the document database: instead of a database of documents containing words,
one creates a database of words, each linked to the documents it occurs in.
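The inversion described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name build_inverted_index and the sample documents are made up for this example, and words are split on whitespace with no further processing.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization: lowercase, then split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical sample data: a document database keyed by document id.
documents = {
    "doc1": "building an index database",
    "doc2": "a database of documents",
}
index = build_inverted_index(documents)
print(index["database"])  # ['doc1', 'doc2']
```

The document database maps documents to words; the result maps words back to documents, which is what a search needs.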
In practice, this inversion raises several problems:
- Reducing words to a canonical form (by stripping suffixes and by
stemming) is difficult to automate, and may be context-sensitive.
- For "non-standard" words it is difficult to decide where words start
and end. For instance, is "index-database" a single word, or the combination
of the words "index" and "database"?
- Foreign languages use special characters that may be represented by
HTML codes, by extended ASCII characters, or simplified to
plain characters. There may also be different ways to spell the same word.
(In Dutch, "copieren", "kopieren", "copiëren", "kopiëren", "copi&euml;ren" and
"kopi&euml;ren" all mean the same thing.)
- HTML codes may contain meta-information, e.g. in the header of a document.
One must also decide whether to index or ignore comments (which are not
visible in the browser).
- Ideally an index-database contains enough information from a document
to be able to regenerate the document from its entry in the database.
An index-database which contains all information from the indexed
documents will be at least as large as the documents themselves.
With the World Wide Web holding over 100 gigabytes of documents, it is not
feasible to generate and maintain databases this large.
Hence all index-databases omit some information, in the hope that this is
not the information users wish to search for.
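Two of the problems listed above, character representation and word boundaries, can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a complete solution: normalize_word and index_terms are hypothetical helper names, and indexing a hyphenated word both whole and by its parts is just one possible policy. Spelling variants such as "copieren" and "kopieren" are not unified here; that would need a language-specific word list.

```python
import html
import unicodedata

def normalize_word(word):
    """Reduce HTML entities, case, and accents to one plain-character form."""
    decoded = html.unescape(word)  # "copi&euml;ren" -> "copiëren"
    decomposed = unicodedata.normalize("NFKD", decoded.lower())
    # Drop combining marks (such as the diaeresis), keeping the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(word):
    """Hypothetical policy: index a hyphenated word whole and by its parts."""
    parts = [p for p in word.split("-") if p]
    return [word] + parts if len(parts) > 1 else [word]

for variant in ["copieren", "copiëren", "copi&euml;ren"]:
    print(normalize_word(variant))  # prints "copieren" each time

print(index_terms("index-database"))  # ['index-database', 'index', 'database']
```

Simplifying to plain characters is itself an example of omitting information: after normalization the database can no longer tell "copiëren" from "copieren".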