The dtSearch Version 7 Index Format is supported by dtSearch versions 7.00 and later. Note: dtSearch versions 6.40
and 6.50 included a beta implementation of the version 7 index format that was
enabled in the developer API with the flag dtsIndexCreateVersion7. For production use, only dtSearch
7.00 and later should be used to create version 7 indexes.
Contents
Features of the new index format
How to use the new index format in dtSearch Desktop
How to use the new index format with the developer API
Memory use and performance
Indexing speed examples
Caching text
Version compatibility
Features of the new index format
How to use the new index format in dtSearch Desktop
Building an index
dtSearch Desktop 7 uses the new index format by default for new indexes.
Building an index with cached text
(1) Click Index > Create Advanced in Index Manager.
(2) Check "Cache text" and/or "Cache documents" to specify the type of caching. See "Caching Text," below, for more information.
Converting an index
To convert an existing index to the new format:
(1) Start dtSearch Indexer.
(2) Check the "Compress index" and the "Upgrade to version 7 format" boxes in the Update Index dialog box, and leave the other boxes unchecked.
This will compress the index and convert it to the new format.
Searching an index
dtSearch will automatically detect when an index was created with the new format.
Resuming an interrupted update
(1) Click Index > Update Index.
(2) Make sure the "Clear index before adding documents" box is not checked.
(3) Click Start Indexing. If the update can be resumed, dtSearch will ask if you want to resume the prior update. Click Yes to have the prior index update continue.
How to use the new index format with the developer API
To create an index, set the dtsIndexCreateVersion7 flag in IndexJob.indexingFlags or dtsIndexJob.indexingFlags. To specify caching of text, set the dtsIndexCacheText and/or dtsIndexCacheOriginalFile flags in indexingFlags.
To resume an update that was interrupted, set the dtsIndexResumeUpdate flag in IndexJob.indexingFlags or dtsIndexJob.indexingFlags.
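The flags above are combined as bit flags in a single indexingFlags value. The Python sketch below only illustrates that combination. The flag names come from the text, but the numeric values are placeholders generated by `auto()`, not the real dtSearch constants, and the `IndexingFlags` class is a stand-in rather than part of the dtSearch API.

```python
from enum import IntFlag, auto


class IndexingFlags(IntFlag):
    # Placeholder values only: the real constants are defined by the dtSearch API.
    dtsIndexCreateVersion7 = auto()
    dtsIndexCacheText = auto()
    dtsIndexCacheOriginalFile = auto()
    dtsIndexResumeUpdate = auto()


# Create a version 7 index that caches both the text and the original files.
create_with_cache = (
    IndexingFlags.dtsIndexCreateVersion7
    | IndexingFlags.dtsIndexCacheText
    | IndexingFlags.dtsIndexCacheOriginalFile
)

# Resume an interrupted update of an existing index.
resume_update = IndexingFlags.dtsIndexResumeUpdate

print(IndexingFlags.dtsIndexCacheText in create_with_cache)
print(IndexingFlags.dtsIndexResumeUpdate in create_with_cache)
```

In real code, the equivalent OR of the API's own constants would be assigned to IndexJob.indexingFlags or dtsIndexJob.indexingFlags.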
The more robust indexing process and faster availability of new data features are implemented by having the indexer commit data periodically, rather than all at once at the end of the update. The interval between commits can be changed. An interval of at least 1024 MB is recommended to avoid excessive index fragmentation. For larger indexes (more than 100 GB), an interval of at least 8192 MB is recommended. To change the interval, set the autoCommitIntervalMB member of IndexJob or dtsIndexJob. To change the interval in dtSearch Desktop, set this registry entry:
[HKEY_CURRENT_USER\Software\dtSearch Corp.\dtSearch\Settings]
"IndexAutoCommitIntervalMB"=dword:00004096
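One detail worth checking when editing this value by hand: in .reg file syntax the digits after `dword:` are hexadecimal. Read that way, `dword:00004096` is 16534 in decimal, while 4096 in decimal would be written `dword:00001000`. The Python sketch below converts in both directions and applies the minimum intervals recommended above. The helper names are illustrative and not part of dtSearch.

```python
def dword_to_int(reg_value: str) -> int:
    """Parse a .reg-style 'dword:XXXXXXXX' value (hexadecimal digits)."""
    prefix = "dword:"
    if not reg_value.lower().startswith(prefix):
        raise ValueError("expected a value of the form dword:XXXXXXXX")
    return int(reg_value[len(prefix):], 16)


def int_to_dword(value: int) -> str:
    """Format a non-negative 32-bit integer as a .reg-style dword value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a 32-bit dword")
    return f"dword:{value:08x}"


def minimum_commit_interval_mb(index_size_gb: float) -> int:
    """Minimum from the text: 8192 MB for indexes over 100 GB, else 1024 MB."""
    return 8192 if index_size_gb > 100 else 1024


print(dword_to_int("dword:00004096"))    # the value as the registry reads it
print(int_to_dword(4096))                # how 4096 decimal is written
print(minimum_commit_interval_mb(250))   # recommended minimum for a 250 GB index
```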
Memory use and performance
Two IndexJob settings that can significantly affect indexing performance are:
IndexJob.MaxMemToUseMB
IndexJob.AutoCommitIntervalMB
MaxMemToUseMB controls the size of the memory buffers that dtSearch can use to sort words. If possible, dtSearch will use memory for all sorting operations; otherwise, some disk-based buffers will be used. For large indexes (10 GB or more of text), some disk-based sort buffers are always necessary and there is little benefit to MaxMemToUseMB values above 512.
IndexJob.AutoCommitIntervalMB determines how often index updates are forced to commit. Higher values improve indexing performance.
For best performance building large indexes, ensure that the drive where the index is located has free space of more than 60% of the size of the documents to be indexed.
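The free-space guideline can be checked before starting a large build. The following is a minimal Python sketch using only the standard library. The function names are illustrative, and the 60% threshold is taken from the sentence above.

```python
import shutil


def required_free_bytes(documents_bytes: int) -> int:
    """60% of the total size of the documents to be indexed (integer math)."""
    return documents_bytes * 60 // 100


def has_enough_free_space(index_dir: str, documents_bytes: int) -> bool:
    """True if the drive holding index_dir has more free space than the guideline."""
    free = shutil.disk_usage(index_dir).free
    return free > required_free_bytes(documents_bytes)


# Example: check the current directory against 500 GB of source documents.
documents_bytes = 500 * 1024**3
print(required_free_bytes(documents_bytes))
print(has_enough_free_space(".", documents_bytes))
```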
Indexing speed examples
Index 1
Data: 441 GB, consisting of 3,104,817 large, gzipped HTML and text files, with unique words
Indexing time: 64.2 hours (6.8 GB/hour)
Unique words: 30,713,035
Index size: 12% of original document size
Index 2
Data: 170 GB, consisting of 4,373,004 mixed-type files (HTML, office documents, text)
Indexing time: 24.7 hours (6.8 GB/hour)
Unique words: 48,508,831
Index size: 12% of original document size
Index 3
Data: 457 GB, consisting of 25,232,225 zipped HTML and text files
Indexing time: 87.8 hours (5.2 GB/hour)
Unique words: 103,589,394
Index size: 16% of original document size
Index 4 (merged contents of Index 1, 2, and 3)
Data: 1068 GB, 32 million files
Indexing time: Merged from Index 1, Index 2, and Index 3 in 11.2 hours
Unique words: 167,995,346
Index size: 13% of original document size
Hardware: Pentium® 4 Processor 550 (3.40 GHz, 800 FSB), 2 GB RAM.
Search speed: generally less than a second.
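The throughput figures can be recomputed from the data sizes and indexing times listed above. The Python sketch below does that arithmetic. Because the listed sizes and times are themselves rounded, the results are close to the stated GB/hour values but may differ in the last digit.

```python
# (data size in GB, indexing time in hours), taken from the examples above
runs = {
    "Index 1": (441, 64.2),
    "Index 2": (170, 24.7),
    "Index 3": (457, 87.8),
}


def gb_per_hour(size_gb: float, hours: float) -> float:
    """Average indexing throughput."""
    return size_gb / hours


for name, (size_gb, hours) in runs.items():
    print(f"{name}: {gb_per_hour(size_gb, hours):.2f} GB/hour")

# Index 4 is the merge of the three indexes, so its data size is their sum.
total_gb = sum(size_gb for size_gb, _ in runs.values())
total_hours = sum(hours for _, hours in runs.values())
print(f"Combined data: {total_gb} GB, indexed in {total_hours:.1f} hours before merging")
```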
Caching text
dtSearch 7 indexes can cache documents in either, or both, of two ways: (1) the entire original file can be cached, or (2) just the text of the file can be cached. Cached documents are stored using ZIP compression.
The benefit of caching documents is faster and easier highlighting of hits. This is especially true when the index was created using the dtSearch Spider or the "Data Source" indexing API. In these types of indexes, the document names in the index do not correspond to local disk files, so access to the original document may be slow or even impossible. With cached document text in the index, dtSearch can generate hit-highlighted document displays and search reports with no need to access the original data. Because of these benefits, both types of caching (text and original document) are recommended for indexes created using the dtSearch Spider.
Using cached text - highlighting hits
To display the original file with hits highlighted, caching of the original file is best so formatting can be preserved in the hit-highlighted display. (When HTML files are cached, only the HTML is stored.)
dtSearch Desktop and dtSearch Web will automatically use cached original documents as input for hit-highlighting if an index contains cached original documents.
To use a cached document as input for FileConverter:
(1) Set FileConverter.IndexRetrievedFrom to the index path.
(2) Set FileConverter.InputDocId to the document id of the file (which can be obtained from search results as DocDetailItem("_docId")).
(3) Set dtsConvertGetFromCache in FileConverter.Flags.
Using cached text - search reports
Caching documents in text form can make generation of search reports faster, especially generation of the synopsis in search results. The text is cached in small chunks as compressed UTF-8, so dtSearch can quickly locate the context around hits, even in long documents.
dtSearch Desktop and dtSearch Web will automatically use cached text as input for generation of search reports and the synopsis in search results.
To use cached documents as input for SearchReportJob, set the dtsReportGetFromCache flag in SearchReportJob.Flags.
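The idea of storing text in small compressed chunks, so that the context around a hit can be read without decompressing the whole document, can be illustrated with a short Python sketch. This is not dtSearch's actual storage format: the chunk size, the use of `zlib`, and the function names are all assumptions made for illustration.

```python
import zlib

CHUNK_CHARS = 1000  # illustrative chunk size, not dtSearch's actual value


def build_cache(text: str, chunk_chars: int = CHUNK_CHARS) -> list[bytes]:
    """Split text into fixed-size chunks and compress each one as UTF-8."""
    return [
        zlib.compress(text[i:i + chunk_chars].encode("utf-8"))
        for i in range(0, len(text), chunk_chars)
    ]


def context_around(cache: list[bytes], offset: int, width: int = 40,
                   chunk_chars: int = CHUNK_CHARS) -> str:
    """Return text near a character offset, decompressing only the chunks needed."""
    start = max(0, offset - width)
    end = offset + width
    first = start // chunk_chars
    last = min(end // chunk_chars, len(cache) - 1)
    joined = "".join(
        zlib.decompress(cache[i]).decode("utf-8") for i in range(first, last + 1)
    )
    base = first * chunk_chars  # character offset where the first chunk begins
    return joined[start - base:end - base]


text = "word " * 2000 + "needle" + " word" * 2000
cache = build_cache(text)
hit = text.index("needle")
print(context_around(cache, hit))
```

Only the one or two chunks that overlap the requested window are decompressed, which is why lookup cost stays roughly constant even for long documents.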
Performance Implications of Caching Text
Caching text has no effect on search speed.
Caching text will make indexing slower due to the need to compress and store the text in the index. It will of course make the index larger. Compression reduces the size of stored documents by about 70-80%.
Security Implications of Caching Text
When a document is retrieved from the cache in an index, any security settings on the original file are not checked. Instead, only access to the index itself is checked. Therefore, a user who is able to search an index will also be able to access any cached documents stored in that index.
Version compatibility
dtSearch version | Index format 6.00 | Index format 7.00 | Index format 7.01
6.3 | read, write, create | --- | ---
6.4 | read, write, create | read, write, create | ---
6.5 (early beta) | read, write, create | read, write, create | ---
6.5 (builds 6601 and later) | read, write, create | read, write | read, write, create
7.0 | read, write, create | read, write | read, write, create
This chart shows which dtSearch versions can work with each index format. The version 7.00 format was a preliminary implementation of the new index structure in dtSearch versions from June-December 2004. Newer dtSearch versions can read and update this format, but will create indexes in the 7.01 format. Starting with dtSearch version 7.0, the version 7.01 format is the default format for new indexes.
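For scripts that need to decide whether a given dtSearch version can open or create an index, the chart can be transcribed into a lookup table. The Python sketch below does so. The data is copied from the chart, while the structure and the `supports` helper are illustrative.

```python
# Operations each dtSearch version supports per index format, from the chart.
RWC = frozenset({"read", "write", "create"})
RW = frozenset({"read", "write"})
NONE = frozenset()

COMPATIBILITY = {
    "6.3": {"6.00": RWC, "7.00": NONE, "7.01": NONE},
    "6.4": {"6.00": RWC, "7.00": RWC, "7.01": NONE},
    "6.5 (early beta)": {"6.00": RWC, "7.00": RWC, "7.01": NONE},
    "6.5 (builds 6601 and later)": {"6.00": RWC, "7.00": RW, "7.01": RWC},
    "7.0": {"6.00": RWC, "7.00": RW, "7.01": RWC},
}


def supports(version: str, index_format: str, operation: str) -> bool:
    """True if the dtSearch version can perform the operation on that format."""
    return operation in COMPATIBILITY[version][index_format]


print(supports("7.0", "7.00", "read"))
print(supports("7.0", "7.00", "create"))
print(supports("6.3", "7.01", "read"))
```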