dtSearch Version 7 Index Format

The dtSearch Version 7 Index Format is supported by dtSearch versions 7.00 and later. Note: dtSearch versions 6.40 and 6.50 included a beta implementation of the version 7 index format, which was enabled in the developer API with the flag dtsIndexCreateVersion7. For production use, create version 7 indexes only with dtSearch 7.00 or later.


    Features of the new index format
    How to use the new index format in dtSearch Desktop
    How to use the new index format with the developer API
    Memory use and performance

    Indexing speed examples   
    Caching text
    Version compatibility

Features of the new index format

How to use the new index format in dtSearch Desktop

Building an index
dtSearch Desktop 7 uses the new index format by default for new indexes.

Building an index with cached text
(1) Click Index > Create Advanced in Index Manager,
(2) Check "Cache text" and/or "Cache documents" to specify the type of caching.   See "Caching Text," below, for more information.

Converting an index
To convert an existing index to the new format,
(1) Start dtSearch Indexer
(2) Check the "Compress index" and the "Upgrade to version 7 format" boxes in the Update Index dialog box, and leave the other boxes unchecked.
This will compress the index and convert it to the new format.

Searching an index
dtSearch will automatically detect when an index was created with the new format.

Resuming an interrupted update
(1) Click Index > Update Index
(2) Make sure the "Clear index before adding documents" box is not checked
(3) Click Start Indexing.  If the update can be resumed, dtSearch will ask if you want to resume the prior update.  Click Yes to have the prior index update continue.

How to use the new index format with the developer API

To create an index, set the dtsIndexCreateVersion7 flag in IndexJob.indexingFlags or dtsIndexJob.indexingFlags.  To specify caching of text, set the dtsIndexCacheText and/or dtsIndexCacheOriginalFile flags in indexingFlags.

To resume an update that was interrupted, set the dtsIndexResumeUpdate flag in IndexJob.indexingFlags or dtsIndexJob.indexingFlags.
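As an illustration of how these flags combine, the following Python sketch models indexingFlags as a bit field. The flag names come from the text above, but the IndexingFlags class and its bit values are placeholders invented for this example; they are not the constants defined by the dtSearch API, which should be taken from the API headers or assemblies.

```python
from enum import IntFlag


class IndexingFlags(IntFlag):
    """Stand-in for the dtSearch indexing flags (bit values are placeholders)."""
    dtsIndexCreateVersion7 = 0x01
    dtsIndexCacheText = 0x02
    dtsIndexCacheOriginalFile = 0x04
    dtsIndexResumeUpdate = 0x08


# Creating a version 7 index that caches both the text and the original files:
create_flags = (
    IndexingFlags.dtsIndexCreateVersion7
    | IndexingFlags.dtsIndexCacheText
    | IndexingFlags.dtsIndexCacheOriginalFile
)

# Resuming an interrupted update:
resume_flags = IndexingFlags.dtsIndexResumeUpdate

# Individual flags can be tested with "in".
print(IndexingFlags.dtsIndexCacheText in create_flags)
print(IndexingFlags.dtsIndexResumeUpdate in create_flags)
```

In a real application the combined value would be assigned to IndexJob.indexingFlags or dtsIndexJob.indexingFlags before the job is executed.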

The indexer commits data periodically during an update, rather than all at once at the end of the update. This is what makes the indexing process more robust and makes new data available sooner. The interval between commits can be changed. An interval of at least 1024 MB is recommended to avoid excessive index fragmentation. For larger indexes (more than 100 GB), an interval of at least 8192 MB is recommended. To change the interval, set the autoCommitIntervalMB member of IndexJob or dtsIndexJob. To change the interval in dtSearch Desktop, set this registry entry:

[HKEY_CURRENT_USER\Software\dtSearch Corp.\dtSearch\Settings]
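The commit interval recommendations above can be expressed as a small helper. This is a minimal sketch: the function names are hypothetical, and it assumes that "larger indexes" refers to the size of the index itself in GB.

```python
def min_recommended_commit_interval_mb(index_size_gb: float) -> int:
    """Return the smallest recommended commit interval, in MB.

    Indexes of more than 100 GB get 8192 MB; all others get 1024 MB.
    """
    return 8192 if index_size_gb > 100 else 1024


def interval_meets_recommendation(interval_mb: int, index_size_gb: float) -> bool:
    """Check a proposed autoCommitIntervalMB value against the recommendation."""
    return interval_mb >= min_recommended_commit_interval_mb(index_size_gb)


print(min_recommended_commit_interval_mb(20))
print(interval_meets_recommendation(2048, 250))
```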

Memory use and performance

Two IndexJob settings that can significantly affect indexing performance are MaxMemToUseMB and AutoCommitIntervalMB.

MaxMemToUseMB controls the size of the memory buffers that dtSearch can use to sort words.    If possible, dtSearch will use memory for all sorting operations; otherwise, some disk-based buffers will be used.   For large indexes (10 GB or more of text), some disk-based sort buffers are always necessary and there is little benefit to MaxMemToUseMB values above 512. 

IndexJob.AutoCommitIntervalMB determines how often index updates are forced to commit.   Higher values improve indexing performance.  

For best performance building large indexes, ensure that the drive where the index is located has free space of more than 60% of the size of the documents to be indexed.
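The free-space guideline can be checked before a large indexing job starts. The sketch below is one possible way to do this in Python; the function names are hypothetical, and the document size is assumed to be known in bytes.

```python
import shutil


def has_recommended_free_space(free_bytes: int, document_bytes: int) -> bool:
    """True when free space is more than 60% of the size of the documents."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return free_bytes * 100 > document_bytes * 60


def index_drive_ok(index_path: str, document_bytes: int) -> bool:
    """Apply the check to the drive that holds index_path."""
    free_bytes = shutil.disk_usage(index_path).free
    return has_recommended_free_space(free_bytes, document_bytes)


GB = 1024 ** 3
# 300 GB free is enough for 441 GB of documents (60% is about 265 GB).
print(has_recommended_free_space(300 * GB, 441 * GB))
```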

Indexing speed examples

Index 1
Data: 441 GB, consisting of 3,104,817 large, gzipped HTML and text files
Indexing time:  64.2 hours (6.8 GB/hour)
Unique words: 30,713,035
Index size:  12% of original document size

Index 2
Data: 170 GB, consisting of 4,373,004 mixed-type files (HTML, office documents, text)
Indexing time: 24.7 hours (6.8 GB/hour)
Unique words:  48,508,831
Index size: 12% of original document size

Index 3
Data: 457 GB, consisting of 25,232,225 zipped HTML and text files
Indexing time:  87.8 hours (5.2 GB/hour)
Unique words: 103,589,394
Index size: 16% of original document size

Index 4 (merged contents of Index 1, 2, and 3)
Data: 1068 GB, 32 million files
Indexing time:  Merged from Index 1, Index 2, and Index 3 in 11.2 hours
Unique words: 167,995,346
Index size:  13% of original document size

    Pentium® 4 Processor 550 (3.40 GHz, 800 FSB), 2 GB RAM.

Search speed:  generally less than a second.
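The throughput figures in the examples above follow from dividing the data size by the indexing time. The sketch below reproduces that arithmetic from the listed numbers; truncating to one decimal place (rather than rounding) is an assumption made here because it matches the published values.

```python
import math

# (data size in GB, indexing time in hours), copied from the examples above
runs = {
    "Index 1": (441, 64.2),
    "Index 2": (170, 24.7),
    "Index 3": (457, 87.8),
}


def gb_per_hour(data_gb: float, hours: float) -> float:
    """Indexing throughput in GB/hour, truncated to one decimal place."""
    return math.floor(data_gb / hours * 10) / 10


rates = {name: gb_per_hour(gb, hours) for name, (gb, hours) in runs.items()}
total_gb = sum(gb for gb, _ in runs.values())

for name, rate in rates.items():
    print(name, rate, "GB/hour")
print("Total data in the merged index:", total_gb, "GB")
```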

Caching text

dtSearch 7 indexes can cache documents in either, or both, of two ways: (1) the entire original file can be cached, or (2) just the text of the file can be cached.  Cached documents are stored using ZIP compression.

The benefit of caching documents is faster and easier highlighting of hits.  This is especially true when the index was created using the dtSearch Spider or the "Data Source" indexing API.  In these types of indexes, the document names in the index do not correspond to local disk files, so access to the original document may be slow or even impossible.  With cached document text in the index, dtSearch can generate hit-highlighted document displays and search reports with no need to access the original data.  Because of these benefits, both types of caching (text and original document) are recommended for indexes created using the dtSearch Spider.

Using cached text - highlighting hits

To display the original file with hits highlighted, caching of the original file is best so formatting can be preserved in the hit-highlighted display.  (When HTML files are cached, only the HTML is stored.) 

dtSearch Desktop and dtSearch Web will automatically use cached original documents as input for hit-highlighting if an index contains cached original documents.

To use a cached document as input for FileConverter,
(1) Set FileConverter.IndexRetrievedFrom to the index path,
(2) Set FileConverter.InputDocId to the document id of the file (which can be obtained from search results as DocDetailItem("_docId"))
(3) Set dtsConvertGetFromCache in FileConverter.Flags

Using cached text - search reports

Caching documents in text form can make generation of search reports faster, especially generation of the synopsis in search results.   The text is cached in small chunks as compressed UTF-8, so dtSearch can quickly locate the context around hits, even in long documents.  

dtSearch Desktop and dtSearch Web will automatically use cached text as input for generation of search reports and the synopsis in search results.

To use cached documents as input for SearchReportJob, set the dtsReportGetFromCache flag in SearchReportJob.Flags.

Performance Implications of Caching Text

Caching text has no effect on search speed.

Caching text will make indexing slower due to the need to compress and store the text in the index.  It will of course make the index larger.   Compression reduces the size of stored documents by about 70-80%.
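To get a rough idea of how much space cached documents will add to an index, the 70-80% compression figure can be applied to the size of the documents being cached. This is only an estimate under that stated range; the function name is hypothetical and actual results depend on the data.

```python
def cached_size_range(original_size: float) -> tuple:
    """Estimate the stored size of cached documents.

    A 70-80% reduction leaves 20-30% of the original size.
    Returns (low, high) in the same unit as original_size.
    """
    low = original_size * (100 - 80) / 100
    high = original_size * (100 - 70) / 100
    return low, high


low_gb, high_gb = cached_size_range(100)
print(f"100 GB of documents: about {low_gb:.0f}-{high_gb:.0f} GB of cache")
```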

Security Implications of Caching Text

When a document is retrieved from the cache in an index, any security settings on the original file are not checked.   Instead, only access to the index itself is checked.   Therefore, a user who is able to search an index will also be able to access any cached documents stored in that index.

Version Compatibility


dtSearch Version               Index Format

                               read, write, create
                               read, write, create    read, write, create
6.5 (early beta)               read, write, create    read, write, create
6.5 (builds 6601 and later)    read, write, create    read, write            read, write, create
                               read, write, create    read, write            read, write, create

This chart shows which dtSearch versions can work with each index format.  The version 7.00 format was a preliminary implementation of the new index structure in dtSearch versions from June-December 2004.   Newer dtSearch versions can read and update this format, but will create indexes in the 7.01 format.  Starting with dtSearch version 7.0, the version 7.01 format is the default format for new indexes.