Indexing of Large Collections of Data
This article acts as a forensics supplement to the article
on tips for optimizing
indexing of large collections of data. Topics in that article include document
storage and the NTFS file system, general indexing
strategy, index and document location, indexing
resources, and efficient text processing.
dtSearch can index over a terabyte of text in a single
index, with search time typically less than
a second. There are no limits on the number
of indexes dtSearch can build and simultaneously
search. Please see the article on optimizing
indexing of large collections of data for additional information
on using the terabyte indexer.
dtSearch does not alter the original files, or their hash values, when indexing, searching, or displaying documents.
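One way to confirm independently that a set of files is unchanged is to hash the files before and after an indexing run and compare the results. The following is a minimal sketch in Python, not part of dtSearch; the function names `sha256_of`, `snapshot`, and `changed_files` are hypothetical, and SHA-256 is assumed as the hash algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Map each file path to its current SHA-256 digest."""
    return {str(p): sha256_of(p) for p in paths}


def changed_files(before, after):
    """List paths whose digest differs between two snapshots."""
    return sorted(p for p in before if after.get(p) != before[p])
```

Taking one snapshot before indexing and another afterward, an empty result from `changed_files` indicates that the file contents are byte-for-byte identical.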
A single terabyte-data index can
span multiple local and remote locations. For
example, a single index can include data from
hard drives, local area networks, Exchange
servers (see Outlook/Exchange topic below),
Intranet servers and public Web sites (see
Spider topic below). (For indexing SQL-type databases, please see the Databases and Field Searching topic on the developer FAQ Selected Articles by Subject page.)
dtSearch can rank federated or distributed indexed search
results collectively by relevance, displaying
all local and remote files with highlighted
hits. A scrolling "word list"
in dtSearch Desktop includes all words in an
index covering local and remote locations.
dtSearch can also output all indexed words
to a file.
Desktop: Click Index > List Index
API: Use ListIndexJob (.NET) or
The dtSearch Spider supports searching
of static browser-ready content (HTML, PDF,
XML/XSL); dynamic browser-ready content (MS
CMS, SharePoint, ASP.NET, etc.); as well as
browser-incompatible content (MS Office files,
OpenOffice files, etc.). The Spider can even
index and search web-accessible data on platforms
that dtSearch does not directly support, such as
Mac and Unix.
The Spider supports public sites as well as secure
content, including password-protected access and
forms-based authentication. Indexing with
the Spider involves simply selecting a URL
or URLs and indicating how many vertical or
horizontal links to follow. The Spider automatically
figures out the format of the data, so there
is no need to tell the Spider whether a retrieved
web page contains, for example, an MS Office
document or a PDF file.
The dtSearch Spider displays static and dynamic browser-ready
content WYSIWYG, including display of images,
formatting and links, with the sole addition
of highlighted hits. The Spider converts browser-incompatible
content (such as MS Office or OpenOffice) "on
the fly" to HTML
for browser display with highlighted hits. More
information (basic article); more
information (advanced article)
For convenient offline access, the dtSearch Spider
also includes a caching option, to store the
full spidered content along with the index. (Without
caching, the Spider has to return to the relevant
URL to display the full content with highlighted
hits.)
Desktop: To enable caching, use
the Create Index (Advanced) dialog box.
developer API: To enable caching,
set the caching flags in IndexJob.IndexingFlags.
Adobe Reader X and XI Users
Adobe Reader X and XI require a plug-in to support highlighting of hits after a search. More information
Recognition of Dates, Email Addresses, and Credit Card Numbers
dtSearch can automatically recognize
dates, email addresses, and credit card numbers,
and search for these items by type. Through
this feature, dtSearch can, for example, search
for a credit card number regardless of how
it may be formatted, or search for a range
of dates even if the dates are expressed in
different text formats (January 15, 2005, through
2/19/07). dtSearch can also extract all dates,
emails and credit card numbers from a collection
of documents. More information
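To illustrate the general idea of searching by type rather than by literal text, the sketch below normalizes dates written in a few different formats into comparable values, and finds card-like numbers regardless of spacing or hyphens. It is only an illustration in Python and does not reflect how dtSearch implements recognition. The three date formats, the Luhn checksum filter, and all function names are assumptions chosen for the example.

```python
import re
from datetime import datetime

# (regex, strptime format) pairs; only three illustrative formats are covered.
DATE_FORMATS = [
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # January 15, 2005
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b"), "%m/%d/%y"),        # 2/19/07
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),        # 02/19/2007
]

# 13 to 19 digits, each optionally followed by a space or hyphen.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_dates(text):
    """Return every recognized date in the text as a sorted list of date objects."""
    found = []
    for pattern, fmt in DATE_FORMATS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), fmt).date())
            except ValueError:
                pass  # looked like a date but is not one, e.g. "Section 12, 2005"
    return sorted(found)


def dates_in_range(text, start, end):
    """Return recognized dates that fall within the inclusive range."""
    return [d for d in find_dates(text) if start <= d <= end]


def luhn_valid(digits):
    """Check a digit string against the Luhn checksum used by payment cards."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text):
    """Return card-like numbers as bare digit strings, ignoring formatting."""
    numbers = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            numbers.append(digits)
    return numbers
```

Because every date is converted to the same representation, a range check works the same way whether the source text says "January 15, 2005" or "2/19/07".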
dtSearch offers a Unicode filtering feature for automatic
recovery of text from corrupt forensically-recovered
documents and large data blocks, such as those
recovered through an "undelete" process,
from unallocated computer space, or from partially
recovered file fragments. The filtering algorithm
can scan recovered data blocks using multiple
Unicode and other text encoding detection methods. More information
Desktop: Click Options > Preferences > Filtering
Options, and check the "Filter text" option
under "Binary files" to enable
filtering of binary files.
developer API: Set Options.BinaryFiles
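As a rough illustration of what recovering text from a binary data block involves, the sketch below scans a byte string for runs of printable characters under two encodings. It is a simplified stand-in written in Python, not the dtSearch filtering algorithm: it only looks for printable ASCII stored as single bytes or as UTF-16LE, and the function name `extract_text_runs` and the `min_len` threshold are assumptions.

```python
import re


def extract_text_runs(blob, min_len=4):
    """Return printable text runs found in a binary blob, in order of position.

    Two encodings are tried: single-byte ASCII, and UTF-16LE restricted to the
    ASCII range (each character followed by a zero byte).
    """
    count = str(min_len).encode("ascii")
    ascii_run = re.compile(rb"[\x20-\x7e]{" + count + rb",}")
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){" + count + rb",}")

    found = [(m.start(), m.group().decode("ascii")) for m in ascii_run.finditer(blob)]
    found += [(m.start(), m.group().decode("utf-16-le")) for m in utf16_run.finditer(blob)]
    return [text for _, text in sorted(found)]
```

A production filter would need to handle many more encodings and non-Latin scripts; the point here is only that readable fragments can be located without knowing the original file format.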
dtSearch includes multiple ways to index Outlook or Exchange
messages, contacts, tasks, and notes. All
methods include indexing and searching of the
underlying messages, including all metadata, as well as the full text
of all email attachments. dtSearch will highlight
hits in both messages and attachments, including ZIP and other nested attachments.
(i) Starting with Version 7.67, dtSearch supports native PST files, bypassing the need to go through MAPI or pre-convert the messages to .msg, as described below.
(ii) In the second approach, dtSearch indexes "live" content
in an Outlook profile. In addition to display
of search results in dtSearch with highlighted
hits, dtSearch supports launching a message,
contact, task, or note in the native application.
For example, you can search for a message in
dtSearch, launch the message in Outlook, and
then reply to the message using Outlook.
(iii) For Exchange data, as well as for certain archiving and forensic applications, dtSearch
supports extracting Outlook and Exchange data
to .msg files. The .msg conversion approach in
dtSearch works through a command-line tool to
extract Outlook items in bulk from larger volumes
of PST or Exchange data. The converted .msg files
will include all properties of the original Outlook
item, including any attachments. Following conversion,
dtSearch can index the resulting .msg files,
including highlighting hits in messages and attachments. More information
Normally, dtSearch indexes each .eml file and each .msg file as a single document. Attachments are recursively unpacked and appended to the message body, so no matter how many attachments there are, a single document is indexed for each message. Using the File Types table, you can set up rules to require each message to be treated as a container, with the message body and attachments each indexed as a separate document in the container. More information
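The container view of a message can be pictured as splitting one email into a body document plus one document per attachment. The sketch below shows that split using Python's standard `email` package. It is not dtSearch code, the function name `message_as_container` is hypothetical, and unlike the behavior described above it does not recurse into nested attachments.

```python
from email import message_from_bytes, policy


def message_as_container(raw_message):
    """Split a raw .eml message into (name, content) pairs.

    The first pair is the message body; each following pair is one top-level
    attachment. Nested attachments are not unpacked in this sketch.
    """
    message = message_from_bytes(raw_message, policy=policy.default)
    documents = []
    body = message.get_body(preferencelist=("plain", "html"))
    if body is not None:
        documents.append(("body", body.get_content()))
    for part in message.iter_attachments():
        documents.append((part.get_filename() or "attachment", part.get_content()))
    return documents
```

Treating the same message as a single document would instead amount to concatenating the text extracted from all of these parts.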
The above discussion applies to Outlook and
Exchange data. dtSearch can index Outlook Express
.dbx files just like any other supported file type.
dtSearch also supports Thunderbird (MBOX/EML), including nested email attachments.
Fuzzy searching uses a proprietary algorithm to find
search terms even if they are misspelled. dtSearch
recommends fuzzy searching for emails
or any other text that may contain misspellings.
Fuzziness adjusts from 0 to 10 so you can fine-tune
fuzziness to the level of OCR or typographical
errors in your files. A search for alphabet with a fuzziness of 1 would find alphaqet;
with a fuzziness of 3, it would find both alphaqet and alpkaqet. Fuzziness is not built into the
index, so you can vary fuzziness at the time
of each search. More information on fuzzy and
other search options
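The dtSearch fuzzy algorithm is proprietary, so the sketch below substitutes a different, well-known technique (Levenshtein edit distance) purely to show why fuzziness can be chosen at search time instead of being built into the index. Treating the fuzziness level as a maximum number of character edits is an assumption made for this example; it happens to agree with the alphabet example above but is not how dtSearch defines its levels. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, indexed_words, max_edits):
    """Return indexed words within max_edits of the search term."""
    return [w for w in indexed_words if edit_distance(term, w) <= max_edits]
```

The list of indexed words stays the same for every call; only the `max_edits` threshold changes from one search to the next.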
dtSearch includes Unicode-compatible
file parsing, to convert input data to Unicode.
dtSearch automatically recognizes all Unicode-supported
encodings, representing hundreds of international languages.
The following dtSearch search options work
automatically on text in any international
language: phrase; Boolean; proximity and directed
proximity; wildcard; macro; numeric range;
fielded data / metadata search options; fuzzy
searching (adjustable from 0 to 10 to account
for typographical or OCR errors); and relevancy-ranked
searching (including natural language vector-space
ranking, positional scoring options, general
variable term weighting, variable term weighting
in fields, and other API-based document classification
and sorting options). More information
Chinese, Japanese and Korean Text With No Word Breaks
Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so that each character is treated as a single word.
dtSearch Desktop: In Options > Preferences > Letters and Words, check the box to “Insert word breaks between Chinese, Japanese, and Korean characters in text.”
dtSearch Developer API: set dtsoTfAutoBreakCJK in Options.TextFlags.
Note: this setting will only affect text identified as Unicode Chinese, Japanese or Korean text; it will not affect text identified as other Unicode character sets.
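The effect of inserting word breaks around each Chinese, Japanese, or Korean character can be pictured with the short Python sketch below. It is an illustration only, not dtSearch code: the function name `insert_cjk_breaks` is hypothetical, and the Unicode ranges listed are a simplified assumption about which characters count as CJK.

```python
import re

# Simplified set of CJK ranges, assumed for illustration.
CJK_CHAR = re.compile(
    "[\u3040-\u30ff"   # Hiragana and Katakana
    "\u3400-\u4dbf"    # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"    # CJK Unified Ideographs
    "\uac00-\ud7af]"   # Hangul syllables
)


def insert_cjk_breaks(text):
    """Surround each CJK character with spaces so it becomes its own word.

    Runs of whitespace (including line breaks) are collapsed to single spaces.
    """
    spaced = CJK_CHAR.sub(lambda match: f" {match.group()} ", text)
    return " ".join(spaced.split())
```

Text outside the listed ranges passes through unchanged apart from whitespace normalization, which mirrors the note above that other character sets are not affected.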
For documents in certain
formats that do not include encoding information,
such as single-byte text files, dtSearch provides
a proprietary language recognition algorithm
for detecting text in a large variety of languages
(Western European, other European, Middle-Eastern,
etc.). This algorithm is enabled by default.
A search in dtSearch will always include white-on-white
text and similar "invisible" text
in files. dtSearch also includes options for
searching embedded objects in Microsoft Office
documents, and normally hidden content in HTML.
Although HTML comments, scripts, links, and styles are
not included in indexing by default, dtSearch
has an option to include them.
Desktop: Click Options > Preferences > Indexing
Options, and check the box to "Index
HTML scripts, styles, links and comments."
developer API: Set Options.FieldFlags
= to a combination of these flags: dtsoFfHtmlShowLinks,
A similar option searches hidden content (such
as macros or other embedded objects) in Microsoft
Office documents.
Desktop: Click Options > Preferences > Indexing
Options, and check the box to "Index Hidden
content in Office documents."
dtSearch developer API: This option is set
by default. To disable it, set dtsoFfOfficeSkipHiddenContent
Searching for List of Words or Concepts
dtSearch includes an option to search for a list of words. Under
this option, a special dialog box provides
a way to search for a long list of words, and
create a list of matching files, in a single
step. This option can work with the full range
of dtSearch search features (Boolean, fuzzy,
natural language, etc.). More information
For expanding a search for a specific word
or set of words to a user-defined list of concepts
or synonyms, dtSearch also offers a user-defined
thesaurus add-on to the comprehensive English-language
thesaurus included with dtSearch.
Desktop: Click Options > Preferences > Search
Options > User Thesaurus to add
synonym rings for specific terms.
View Log of Encrypted Files; Index Encrypted PDFs
After an index update completes, click "View
Log" to see a report that will include
information on any encrypted or unreadable
files that the indexer could not process. This
report can be accessed at any time in the index
folder in the file Index_LastUpdateErrors.html. The
report indicates which files were (a) encrypted,
(b) corrupt, (c) partially encrypted, and (d)
partially corrupt. Partially encrypted
or corrupt files are files that could be indexed
in part but that included some encrypted or
corrupt data (for example, an email with an encrypted attachment).
To index encrypted PDFs, make a temporary, decrypted copy of the encrypted files, index the decrypted copies, and then replace the temporary decrypted copies with the encrypted versions. This one-time decryption is sufficient for dtSearch operation: once the original index is complete, dtSearch does not need to decrypt the PDF files to search and display them with highlighted hits.
Copying Retrieved Files
dtSearch's Edit › Copy file function lets you copy all or selected documents retrieved from a search to a folder. You can optionally preserve the full path and filename in the copy, and you can preserve creation and last access times as well as the last modified date. More information.
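For readers scripting a similar export outside dtSearch, the sketch below copies a file into a destination folder while optionally re-creating its full path and keeping its timestamps. It is a hypothetical Python helper, not the dtSearch copy function. It relies on `shutil.copy2`, which attempts to preserve the last modified and last access times; the creation time is generally not preserved by this call, so that part of the behavior described above is not reproduced here.

```python
import shutil
from pathlib import Path


def copy_preserving(source, dest_root, keep_full_path=True):
    """Copy one file under dest_root, keeping timestamps where the OS allows.

    With keep_full_path=True the source's directory structure is re-created
    below dest_root; otherwise only the filename is kept.
    """
    source = Path(source).resolve()
    if keep_full_path:
        # Drop the root ("/" or a drive such as "C:\") so the rest of the
        # path can be re-created under dest_root.
        target = Path(dest_root) / source.relative_to(source.anchor)
    else:
        target = Path(dest_root) / source.name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)  # copies data plus modified/access times
    return target
```

Re-creating the full path avoids name collisions when two retrieved files in different folders share a filename.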
Making Retrieved Files Available on CD/DVD or Other Portable Media
The dtSearch Publish product can quickly publish forensically retrieved (or e-discovery retrieved) documents to CD, DVD or other portable media. The resulting product provides instant search and display access to the document set. The CD, DVD or other portable media can run with zero footprint, requiring no installation on the end-user's computer.
Please see the article Mirroring Searchable Web Content on Portable Media for an overview of how dtSearch Publish works.