Optimizing
Indexing of Large Collections of Data
This
article acts as a forensics supplement to the article
on tips for optimizing
indexing of large collections of data. Topics in that article include: document
storage and the NTFS file system, general indexing
strategy, index and document location, indexing
resources and efficient text processing.
dtSearch
can index over a terabyte of text in a single
index, with search time typically less than
a second. There are no limits on the number
of indexes dtSearch can build and simultaneously
search. Please see optimizing
indexing of large collections of data for additional information
on using the terabyte indexer.
Distributed/Federated
Searching
A single terabyte-data index can
span multiple local and remote locations. For
example, a single index can include data from
hard drives, local area networks, Exchange
servers (see Outlook/Exchange topic below),
Intranet servers and public Web sites (see
Spider topic below).
dtSearch
can rank federated or distributed indexed search
results collectively by relevance, displaying
all local and remote files with highlighted
hits. A scrolling "word
wheel" display
in dtSearch Desktop includes all words in an
index covering local and remote locations.
dtSearch can also output all indexed words
to a file.
dtSearch Desktop: Click Index > List Index Contents.
dtSearch Developer API: Use ListIndexJob (.NET) or DListIndexJob (C++).
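The collective ranking described above can be pictured as merging the hit lists from several indexes into one list ordered by relevance. The following is a minimal Python sketch of that idea only. It does not use the dtSearch API; the Hit type, the index names, the file locations, and the scores are all hypothetical, and it assumes the scores from different indexes are directly comparable.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Hit:
    index_name: str  # hypothetical label for the index that produced the hit
    location: str    # file path or URL of the matching document
    score: int       # relevance score; higher means more relevant


def merge_by_relevance(result_sets: Iterable[List[Hit]], limit: int = 10) -> List[Hit]:
    """Combine hits from several indexes and rank them together."""
    combined = [hit for hits in result_sets for hit in hits]
    # Sort by descending score; break ties by location so the order is repeatable.
    combined.sort(key=lambda h: (-h.score, h.location))
    return combined[:limit]


local_hits = [
    Hit("local", r"C:\cases\memo.docx", 87),
    Hit("local", r"C:\cases\notes.txt", 40),
]
remote_hits = [Hit("intranet", "http://intranet.example/report.pdf", 93)]

ranked = merge_by_relevance([local_hits, remote_hits])
```

Here the remote hit ranks first because its score is highest, regardless of which index it came from.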
Spider-Assisted
Searching
The dtSearch Spider supports searching
of static browser-ready content (HTML, PDF,
XML/XSL); dynamic browser-ready content (MS
CMS, SharePoint, ASP.NET, etc.); as well as
browser-incompatible content (MS Office files,
OpenOffice files, etc.). The Spider can even
index and search web-accessible data on platforms
that dtSearch does not directly support, such as
Mac and Unix.
The
Spider supports public sites as well as secure content,
including password-protected sites and sites that use
forms-based authentication. Indexing with
the Spider involves simply selecting a URL
or URLs and indicating how many vertical or
horizontal links to follow. The Spider automatically
figures out the format of the data, so there
is no need to tell the Spider whether a retrieved
web page contains, for example, an MS Office
document or a PDF file.
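The link-following limit mentioned above can be illustrated with a depth-limited, breadth-first walk over a link graph. This is only a sketch of the general technique, not the Spider's implementation: the in-memory `site` dictionary stands in for real pages, the URLs are made up, and a single `max_depth` parameter is an assumption that does not model the distinction between vertical and horizontal links.

```python
from collections import deque
from typing import Dict, List


def crawl(links: Dict[str, List[str]], start_urls: List[str], max_depth: int) -> List[str]:
    """Return the URLs reachable from start_urls within max_depth link hops."""
    seen = set(start_urls)
    order = list(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: each key is a page, each value is the list of pages it links to.
site = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/deep"],
    "http://example.com/b": ["http://example.com/"],
}

one_hop = crawl(site, ["http://example.com/"], max_depth=1)
two_hops = crawl(site, ["http://example.com/"], max_depth=2)
```

Tracking visited URLs in `seen` keeps the walk from looping when pages link back to each other.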
The
dtSearch Spider displays static and dynamic browser-ready
content WYSIWYG, including display of images,
formatting and links, with the sole addition
of highlighted hits. The Spider converts browser-incompatible
content (such as MS Office or OpenOffice) "on
the fly" to HTML
for browser display with highlighted hits.
More
information (basic article); more
information (advanced article)
For
convenient offline access, the dtSearch Spider
also includes a caching option, to store the
full spidered content along with the index. (Without
caching, the Spider has to return to the relevant
URL to display the full content with highlighted
hits.)
dtSearch Desktop: To enable caching, use the Create Index (Advanced) dialog box.
dtSearch developer API: To enable caching, set the caching flags in IndexJob.IndexingFlags.
Automatic
Recognition of Dates, Email Addresses, and Credit
Card Numbers
dtSearch can automatically recognize
dates, email addresses, and credit card numbers,
and search for these items by type. Through
this feature, dtSearch can, for example, search
for a credit card number regardless of how
it may be formatted, or search for a range
of dates even if the dates are expressed in
different text formats (January 15, 2005, through
2/19/07). dtSearch can also extract all dates,
emails and credit card numbers from a collection
of documents. More information
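To make the idea of searching by type more concrete, here is a minimal Python sketch that normalizes dates written in two different layouts and finds credit card numbers regardless of spacing or hyphens. It is an illustration under stated assumptions, not dtSearch's recognition logic: it handles only the two date layouts used in the example above, assumes US month/day order, and uses the standard Luhn checksum to filter digit runs. All function names are hypothetical.

```python
import re
from datetime import date, datetime
from typing import List

# Two illustrative layouts: "January 15, 2005" and "2/19/07" (US month/day order assumed).
DATE_PATTERNS = [
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}\b"), "%m/%d/%y"),
]

# 13 to 19 digits, each optionally followed by a space or hyphen.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_dates(text: str) -> List[date]:
    """Return every recognized date as a normalized date value, in sorted order."""
    found = []
    for pattern, layout in DATE_PATTERNS:
        for match in pattern.finditer(text):
            try:
                found.append(datetime.strptime(match.group(), layout).date())
            except ValueError:
                pass  # matched the layout but is not a real date
    return sorted(found)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right and sum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit-only card numbers, ignoring spaces and hyphens in the text."""
    cards = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            cards.append(digits)
    return cards


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
sample = ("Paid with 4111 1111 1111 1111 on January 15, 2005; "
          "refunded to 4111-1111-1111-1111 on 2/19/07.")
dates = find_dates(sample)
in_range = [d for d in dates if date(2005, 1, 1) <= d <= date(2007, 12, 31)]
cards = find_card_numbers(sample)
```

Because both dates are converted to the same normalized type, a range comparison works even though the source text expresses them differently. The same applies to the two card numbers, which reduce to identical digit strings.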
Forensics
Filtering Features
dtSearch
offers a Unicode filtering feature for automatic
recovery of text from corrupt forensically-recovered
documents and large data blocks, such as those
recovered through an "undelete" process,
from unallocated computer space, or from partially
recovered file fragments. The filtering algorithm
can scan recovered data blocks using multiple
Unicode and other text encoding detection methods. More information
dtSearch Desktop: Click Options > Preferences > Filtering Options, and check the "Filter text" option under "Binary files" to enable filtering of binary files.
dtSearch developer API: Set Options.BinaryFiles = dtsoFilterBinaryUnicode.
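The general idea of recovering text from a raw data block by trying more than one encoding can be sketched in a few lines of Python. This is a simplified stand-in, similar in spirit to the Unix strings utility, and not the dtSearch filtering algorithm. It assumes only two encodings (single-byte ASCII and UTF-16LE), restricts itself to printable ASCII characters, and uses an arbitrary minimum run length of four characters.

```python
import re
from typing import List

# Runs of at least 4 printable ASCII characters stored one byte per character.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters stored as UTF-16LE: each one is followed by a zero byte.
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def recover_text(block: bytes) -> List[str]:
    """Return readable fragments found in a raw data block, ordered by position."""
    fragments = []
    for match in ASCII_RUN.finditer(block):
        fragments.append((match.start(), match.group().decode("ascii")))
    for match in UTF16LE_RUN.finditer(block):
        fragments.append((match.start(), match.group().decode("utf-16-le")))
    return [text for _, text in sorted(fragments)]


# A made-up data block: binary noise around one ASCII and one UTF-16LE fragment.
block = (b"\x00\x01\xff" + b"invoice 2007" + b"\x00\x02"
         + "secret".encode("utf-16-le") + b"\xfe\xfe")
recovered = recover_text(block)
```

A single-encoding scan would miss the second fragment, because its characters are interleaved with zero bytes and never form a four-character ASCII run.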
Outlook/Exchange
Support
dtSearch
includes two ways to index Outlook or Exchange
messages, contacts, tasks, and notes. Both
methods include indexing and searching of the
underlying messages as well as the full text
of all email attachments. dtSearch will highlight
hits in both messages and attachments.
In
the first approach, dtSearch indexes "live" content
in an Outlook profile. In addition to display
of search results in dtSearch with highlighted
hits, dtSearch supports launching a message,
contact, task, or note in the native application.
For example, you can search for a message in
dtSearch, launch the message in Outlook, and
then reply to the message using Outlook.
For
archiving and forensic applications, dtSearch
recommends extracting Outlook and Exchange data
to .msg files. The .msg conversion approach in
dtSearch works through a command-line tool to
extract Outlook items in bulk from larger volumes
of PST or Exchange data. The converted .msg files
will include all properties of the original Outlook
item, including any attachments. Following conversion,
dtSearch can index the resulting .msg files,
including highlighting hits in messages and attachments.
More information
(Note:
the above discussion applies to Outlook and
Exchange data. dtSearch can index Outlook Express
.dbx files just like any other supported file
type.)
Fuzzy
Searching
Fuzzy
searching uses a proprietary algorithm to find
search terms even if they are misspelled. dtSearch
recommends fuzzy searching for searching emails,
OCR’ed
text, or any other text that may contain misspellings.
Search
fuzziness adjusts from 0 to 10 so you can fine-tune
fuzziness to the level of OCR or typographical
errors in your files. A search for alphabet with a fuzziness of 1 would find alphaqet;
with a fuzziness of 3, it would find both alphaqet and alpkaqet. Fuzziness is not built into the
index, so you can vary fuzziness at the time
of each search. More information on fuzzy and
other search options
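The dtSearch fuzzy algorithm is proprietary, so the following Python sketch substitutes a well-known technique, Levenshtein edit distance, purely to illustrate how a tolerance chosen at search time widens the set of matching words. The `max_edits` parameter is an assumption of this sketch and does not correspond to the dtSearch fuzziness scale of 0 to 10.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, words: List[str], max_edits: int) -> List[str]:
    """Return the indexed words within max_edits edits of the search term."""
    return [word for word in words if edit_distance(term, word) <= max_edits]


indexed_words = ["alphabet", "alphaqet", "alpkaqet", "elephant"]
strict = fuzzy_matches("alphabet", indexed_words, max_edits=1)
loose = fuzzy_matches("alphabet", indexed_words, max_edits=2)
```

As in the article's example, the tolerance is applied when the search runs rather than when the word list is built, so the same word list serves both the strict and the loose search.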
International
Language Support
dtSearch includes Unicode-compatible
file parsing, to convert input data to Unicode.
dtSearch automatically recognizes all Unicode-supported
encodings, representing hundreds of international
languages.
The following dtSearch search options work
automatically on text in any international
language: phrase; Boolean; proximity and directed
proximity; wildcard; macro; numeric range;
fielded data / metadata search options; fuzzy
searching (adjustable from 0 to 10 to account
for typographical or OCR errors); and relevancy-ranked
searching (including natural language vector-space
ranking, positional scoring options, general
variable term weighting, variable term weighting
in fields, and other API-based document classification
and sorting options). More information
Chinese, Japanese and Korean Text With No Word Breaks
Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so that each character is treated as a single word.
dtSearch Desktop: In Options > Preferences > Letters and Words, check the box to “Insert word breaks between Chinese, Japanese, and Korean characters in text.”
dtSearch Developer API: Set dtsoTfAutoBreakCJK in Options.TextFlags.
Language
Group Identification
For documents in certain
formats that do not include encoding information,
such as single-byte text files, dtSearch provides
a proprietary language recognition algorithm
for detecting text in a large variety of languages
(Western European, other European, Middle-Eastern,
etc.). This algorithm is enabled by default.
Hidden
Content
A
search in dtSearch will always include white-on-white
text and similar "invisible" text
in files. dtSearch also includes options for
searching embedded objects in Microsoft Office
documents, and normally hidden content in HTML.
While
HTML comments, scripts, links, and styles are
not by default included in indexing, dtSearch
has an option to include these.
dtSearch Desktop: Click Options > Preferences > Indexing Options, and check the box to "Index HTML scripts, styles, links and comments."
dtSearch developer API: Set Options.FieldFlags to a combination of these flags: dtsoFfHtmlShowLinks, dtsoFfHtmlShowImgSrc, dtsoFfHtmlShowComments, dtsoFfHtmlShowScripts, dtsoFfHtmlShowStylesheets, and dtsoFfHtmlShowMetatags.
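To show what kinds of normally hidden HTML content such an option brings into scope, here is a small Python sketch built on the standard library's html.parser module. It is not how dtSearch parses HTML. The collector class and the sample page are hypothetical, and the sketch covers only comments, scripts, link targets, and image sources.

```python
from html.parser import HTMLParser


class HiddenContentCollector(HTMLParser):
    """Collect HTML content that a browser does not normally display."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.links = []
        self.scripts = []
        self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.links.append(attributes["src"])
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only keep text that appears inside a <script> element.
        if self._in_script and data.strip():
            self.scripts.append(data.strip())


page = ('<p>Visible <a href="http://example.com/x">text</a></p>'
        '<!-- draft: remove before release -->'
        '<script>var key = "abc";</script>')

collector = HiddenContentCollector()
collector.feed(page)
```

None of the three collected items would appear in the rendered page, yet each could be relevant in a forensic review.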
A
similar option searches hidden content (such
as macros or other embedded objects) in Microsoft
Office files.
dtSearch
Desktop: Click Options > Preferences > Indexing
Options, and check the box to "Index Hidden
content in Office documents."
dtSearch developer API: This option is set
by default. To disable it, set dtsoFfOfficeSkipHiddenContent
in Options.FieldFlags.
Search
for List of Words or Concepts
dtSearch provides
an option to search for a list of words. Under
this option, a special dialog box provides
a way to search for a long list of words, and
create a list of matching files, in a single
step. This option can work with the full range
of dtSearch search features (Boolean, fuzzy,
natural language, etc.). More information
For
expanding a search from a specific word or set of
words to a user-defined list of concepts
or synonyms, dtSearch also offers a user-defined
thesaurus add-on to the comprehensive English-language
thesaurus included with dtSearch.
dtSearch Desktop: Click Options > Preferences > Search Options > User Thesaurus to add synonym rings for specific terms.
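A synonym ring is a group of terms treated as interchangeable, so a search for any one of them also finds the others. The following Python sketch shows that expansion step in isolation. It says nothing about how dtSearch stores or applies its user thesaurus; the function names and the sample rings are hypothetical.

```python
from typing import Dict, List, Set


def build_lookup(rings: List[Set[str]]) -> Dict[str, Set[str]]:
    """Map every term to the full set of terms in the rings that contain it."""
    lookup: Dict[str, Set[str]] = {}
    for ring in rings:
        for term in ring:
            lookup.setdefault(term, set()).update(ring)
    return lookup


def expand_query(terms: List[str], lookup: Dict[str, Set[str]]) -> List[str]:
    """Replace each search term with all members of its synonym ring."""
    expanded: List[str] = []
    for term in terms:
        key = term.lower()
        # Terms that belong to no ring expand only to themselves.
        for candidate in sorted(lookup.get(key, {key})):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


rings = [{"car", "automobile", "vehicle"}, {"cash", "currency", "money"}]
lookup = build_lookup(rings)
expanded = expand_query(["Car", "invoice"], lookup)
```

In this example "Car" expands to every member of its ring, while "invoice" passes through unchanged because no ring contains it.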
View Log of Encrypted Files; Index Encrypted PDFs
After
an index update completes, click "View
Log" to see a report that will include
information on any encrypted or unreadable
files that the indexer could not process. This
report can be accessed at any time in the index
folder in the file Index_LastUpdateErrors.html. The
report indicates which files were (a) encrypted,
(b) corrupt, (c) partially encrypted, and (d)
partially corrupt. Partially encrypted
or corrupt files are files that could be indexed
in part but that included some encrypted or
corrupt data (for example, an email with an
encrypted attachment).
To index encrypted PDFs, make a temporary decrypted copy of the encrypted files, index the decrypted copy, and then replace the temporary decrypted copy with the original encrypted files. This one-time decryption is sufficient for dtSearch operation. Once the index is built, dtSearch does not need to decrypt the PDF files again to search them and display them with highlighted hits.
Making Available Retrieved Files on CD/DVD or Other Portable Media
The dtSearch Publish product can quickly publish forensically retrieved (or e-discovery retrieved) documents to CD, DVD or other portable media. The resulting product provides instant search and display access to the document set. The CD, DVD or other portable media can run with zero footprint, requiring no installation on the end-user's computer.
Please see the Mirroring Searchable Web Content on Portable Media article for an overview of how dtSearch Publish works.