The most typical data structure for complete index databases is
obtained as follows:
- The text is divided into unique semi-infinite strings,
or sistrings. Each sistring has a starting position in the text,
and continues to the right until it is unique.
- The sistrings are typically stored in (the leaves of) a tree,
the suffix tree. (Common parts are stored only once.)
- Each sistring can be associated with a location within a document
where the sistring occurs. Subtrees below a certain node represent all
occurrences of the substring represented by that node.
- Suffix trees (and similar suffix arrays) have a size of the same
order of magnitude as the input documents.
Below is an example suffix tree for the following words:
- begin
- beginning
- between
- bread
- break