Indexing and Searching Features of Special Interest to Forensics Users

 

Optimizing Indexing of Large Collections of Data

This article acts as a forensics supplement to the article on tips for optimizing indexing of large collections of data. Topics in that article include: document storage and the NTFS file system, general indexing strategy, index and document location, indexing resources and efficient text processing.

dtSearch can index over a terabyte of text in a single index, with search time typically less than a second. There are no limits on the number of indexes dtSearch can build and simultaneously search. Please see optimizing indexing of large collections of data for additional information on using the terabyte indexer.

dtSearch does not alter the original files, including their hash values, when indexing, searching, or displaying documents.
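An examiner may want to confirm independently that a collection is unchanged after indexing. The following is a minimal sketch, not part of dtSearch, that records SHA-256 digests before a job and compares them afterward. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_hashes(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List paths whose digest differs or that exist in only one snapshot."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Taking one snapshot before indexing and one after, an empty result from changed_files indicates that no file content changed in between.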

Distributed/Federated Searching

A single terabyte-data index can span multiple local and remote locations. For example, a single index can include data from hard drives, local area networks, Exchange servers (see Outlook/Exchange topic below), Intranet servers and public Web sites (see Spider topic below). (For indexing SQL-type databases, please see the Databases and Field Searching topic on the developer FAQ Selected Articles by Subject page.)

dtSearch can rank federated or distributed indexed search results collectively by relevance, displaying all local and remote files with highlighted hits. A scrolling "word wheel" display in dtSearch Desktop includes all words in an index covering local and remote locations. dtSearch can also output all indexed words to a file.

dtSearch Desktop: Click Index > List Index Contents
dtSearch Developer API: Use ListIndexJob (.NET) or DListIndexJob (C++)
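To illustrate what ranking results collectively means, here is a small sketch that merges per-index result lists into one relevance-ordered list. It does not use the dtSearch API. The Hit type is hypothetical, and the sketch assumes the scores from different indexes are directly comparable.

```python
import heapq
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    score: float     # relevance score, assumed comparable across indexes
    location: str    # file path or URL of the document
    index_name: str  # which index produced the hit


def merge_ranked(result_lists: Iterable[list[Hit]]) -> list[Hit]:
    """Merge result lists that are each already sorted by descending score."""
    return list(heapq.merge(*result_lists, key=lambda hit: hit.score, reverse=True))
```

Because each input list is already sorted, the merge does not need to re-sort the combined results, which matters when many indexes are searched at once.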

Spider-Assisted Searching

The dtSearch Spider supports searching of static browser-ready content (HTML, PDF, XML/XSL); dynamic browser-ready content (MS CMS, SharePoint, ASP.NET, etc.); as well as browser-incompatible content (MS Office files, OpenOffice files, etc.). The Spider can even index and search web-accessible data on platforms that dtSearch does not directly support, such as Mac and Unix.

The Spider supports public sites as well as password accessible, forms-based authentication, and other secure content access. Indexing with the Spider involves simply selecting a URL or URLs and indicating how many vertical or horizontal links to follow. The Spider automatically figures out the format of the data, so there is no need to tell the Spider whether a retrieved web page contains, for example, an MS Office document or a PDF file.

The dtSearch Spider displays static and dynamic browser-ready content WYSIWYG, including display of images, formatting and links, with the sole addition of highlighted hits. The Spider converts browser-incompatible content (such as MS Office or OpenOffice) "on the fly" to HTML for browser display with highlighted hits. More information (basic article); more information (advanced article)

For convenient offline access, the dtSearch Spider also includes a caching option, to store the full spidered content along with the index. (Without caching, the Spider has to return to the relevant URL to display the full content with highlighted hits.)

dtSearch Desktop: To enable caching, use the Create Index (Advanced) dialog box.
dtSearch developer API: To enable caching, set the caching flags in IndexJob.IndexingFlags.

Adobe Reader X and XI Users

Adobe Reader X and XI require a plug-in to support highlighting of hits after a search. More information

Automatic Recognition of Dates, Email Addresses, and Credit Card Numbers

dtSearch can automatically recognize dates, email addresses, and credit card numbers, and search for these items by type. Through this feature, dtSearch can, for example, search for a credit card number regardless of how it may be formatted, or search for a range of dates even if the dates are expressed in different text formats (January 15, 2005, through 2/19/07). dtSearch can also extract all dates, emails and credit card numbers from a collection of documents. More information
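The idea of matching by type rather than by literal text can be sketched as follows. This is an illustration only and is not how dtSearch implements the feature: it assumes card numbers are 13 to 19 digits separated by optional spaces or hyphens and validated with the Luhn checksum, and it assumes a short, fixed list of date formats.

```python
import re
from datetime import date, datetime
from typing import Optional

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")
# Date layouts tried in order (an assumed, deliberately short list).
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%y", "%m/%d/%Y", "%Y-%m-%d")


def luhn_valid(digits: str) -> bool:
    """Apply the Luhn checksum used by payment card numbers."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return card-like numbers in normalized, digits-only form."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


def parse_date(text: str) -> Optional[date]:
    """Parse a date written in any of the assumed formats, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Once dates are normalized, a range search reduces to an ordinary comparison, for example parse_date("January 15, 2005") <= parse_date("3/1/06") <= parse_date("2/19/07").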

Forensics Filtering Features

dtSearch offers a Unicode filtering feature for automatic recovery of text from corrupt forensically-recovered documents and large data blocks, such as those recovered through an "undelete" process, from unallocated computer space, or from partially recovered file fragments. The filtering algorithm can scan recovered data blocks using multiple Unicode and other text encoding detection methods. More information

dtSearch Desktop: Click Options > Preferences > Filtering Options, and check the "Filter text" option under "Binary files" to enable filtering of binary files.
dtSearch developer API: Set Options.BinaryFiles = dtsoFilterBinaryUnicode.
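As a rough picture of what filtering text out of a binary block involves, the sketch below scans raw bytes for runs of printable characters in two encodings, ASCII and UTF-16LE. It is a simplified stand-in, not the dtSearch filtering algorithm, and the four-character minimum run length is an assumed threshold.

```python
import re

# Runs of at least four printable characters (assumed threshold).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# The same characters stored as UTF-16LE: each byte followed by a zero byte.
UTF16LE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def recover_text(block: bytes) -> list[str]:
    """Return text fragments found in a raw data block, in order of offset."""
    found = []
    for match in UTF16LE_RUN.finditer(block):
        found.append((match.start(), match.group().decode("utf-16-le")))
    for match in ASCII_RUN.finditer(block):
        found.append((match.start(), match.group().decode("ascii")))
    return [text for _, text in sorted(found)]
```

A production filter would consider many more encodings and scripts; this sketch covers only printable ASCII characters in the two layouts.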

Email Support

dtSearch includes multiple ways to index Outlook or Exchange messages, contacts, tasks, and notes. All methods include indexing and searching of the underlying messages, including all metadata, as well as the full text of all email attachments. dtSearch will highlight hits in both messages and attachments, including ZIP and other nested attachments.

(i) Starting with Version 7.67, dtSearch supports native PST files, bypassing the need to go through MAPI or pre-convert the messages to .msg, as described below.

(ii) In the second approach, dtSearch indexes "live" content in an Outlook profile. In addition to display of search results in dtSearch with highlighted hits, dtSearch supports launching a message, contact, task, or note in the native application. For example, you can search for a message in dtSearch, launch the message in Outlook, and then reply to the message using Outlook.

(iii) For Exchange data, as well as for certain archiving and forensic applications, dtSearch supports extracting Outlook and Exchange data to .msg files. The .msg conversion approach in dtSearch works through a command-line tool to extract Outlook items in bulk from larger volumes of PST or Exchange data. The converted .msg files will include all properties of the original Outlook item, including any attachments. Following conversion, dtSearch can index the resulting .msg files, including highlighting hits in messages and attachments. More information

Normally, dtSearch indexes each .eml file and each .msg file as a single document.  Attachments are recursively unpacked and appended to the message body, so no matter how many attachments there are, a single document is indexed for each message. Using the File Types table, you can set up rules to require each message to be treated as a container, with the message body and attachments each indexed as a separate document in the container. More information
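The difference between the two modes can be sketched with Python's standard email package. This is not dtSearch code; the function names are hypothetical, and only text parts are handled. One function returns each text part separately (the container view), and the other appends them into a single document (the default view described above).

```python
from email.message import EmailMessage


def message_parts(message: EmailMessage) -> list[str]:
    """Return the body and every text attachment, however deeply nested.

    walk() descends into multipart containers and attached messages,
    so nested attachments are unpacked recursively.
    """
    return [
        part.get_content().strip()
        for part in message.walk()
        if part.get_content_maintype() == "text"
    ]


def message_as_single_document(message: EmailMessage) -> str:
    """Append all unpacked parts to the body to form one document."""
    return "\n".join(message_parts(message))
```

For a message whose attachment is itself an email with its own attachment, message_parts returns three items, while message_as_single_document returns one string containing all three.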

The above discussion applies to Outlook and Exchange data. dtSearch can index Outlook Express .dbx files just like any other supported file type.

dtSearch also supports Thunderbird (MBOX/EML), including nested email attachments.

Fuzzy Searching

Fuzzy searching uses a proprietary algorithm to find search terms even if they are misspelled. dtSearch recommends fuzzy searching for searching emails, OCR’ed text, or any other text that may contain misspellings.

Search fuzziness adjusts from 0 to 10 so you can fine-tune fuzziness to the level of OCR or typographical errors in your files. A search for alphabet with a fuzziness of 1 would find alphaqet; with a fuzziness of 3, it would find both alphaqet and alpkaqet. Fuzziness is not built into the index, so you can vary fuzziness at the time of each search. More information on fuzzy and other search options
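The dtSearch fuzzy algorithm is proprietary, so the sketch below substitutes a well-known technique, Levenshtein edit distance, purely to show how a tolerance chosen at search time widens or narrows the set of matches. The max_edits parameter is an assumption for illustration and does not correspond to the dtSearch 0 to 10 fuzziness scale.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, indexed_words: list[str], max_edits: int) -> list[str]:
    """Return indexed words within max_edits of the search term."""
    return [w for w in indexed_words if edit_distance(term, w) <= max_edits]
```

Because the tolerance is applied when the query runs, the same word list can be searched strictly or loosely without rebuilding anything, which mirrors the point that fuzziness is not built into the index.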

International Language Support

dtSearch includes Unicode-compatible file parsing, to convert input data to Unicode. dtSearch automatically recognizes all Unicode-supported encodings, representing hundreds of international languages.

The following dtSearch search options work automatically on text in any international language: phrase; Boolean; proximity and directed proximity; wildcard; macro; numeric range; fielded data / metadata search options; fuzzy searching (adjustable from 0 to 10 to account for typographical or OCR errors); and relevancy-ranked searching (including natural language vector-space ranking, positional scoring options, general variable term weighting, variable term weighting in fields, and other API-based document classification and sorting options). More information

Chinese, Japanese and Korean Text With No Word Breaks

Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so each character will be treated as a single word.

dtSearch Desktop: In Options > Preferences > Letters and Words, check the box to “Insert word breaks between Chinese, Japanese, and Korean characters in text.”
dtSearch Developer API: set dtsoTfAutoBreakCJK in Options.TextFlags.

Note: this setting will only affect text identified as Unicode Chinese, Japanese or Korean text; it will not affect text identified as other Unicode character sets.
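The effect of inserting word breaks around CJK characters can be sketched as follows. This is an approximation for illustration, not the dtSearch implementation, and the Unicode ranges chosen (kana, CJK unified ideographs with Extension A, and Hangul syllables) are an assumed subset.

```python
import re

# Assumed subset of CJK ranges: Hiragana/Katakana, CJK ideographs, Hangul.
CJK_CHAR = re.compile("([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af])")


def insert_cjk_breaks(text: str) -> str:
    """Surround each CJK character with spaces, then collapse extra spaces."""
    spaced = CJK_CHAR.sub(r" \1 ", text)
    return " ".join(spaced.split())


def tokenize(text: str) -> list[str]:
    """Split text into words after CJK word breaks have been inserted."""
    return insert_cjk_breaks(text).split()
```

Non-CJK text passes through unchanged, which matches the note above that the setting affects only text identified as Chinese, Japanese, or Korean.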

Language Group Identification

For documents in certain formats that do not include encoding information, such as single-byte text files, dtSearch provides a proprietary language recognition algorithm for detecting text in a large variety of languages (Western European, other European, Middle-Eastern, etc.). This algorithm is enabled by default.

Hidden Content

A search in dtSearch will always include white-on-white text and similar "invisible" text in files. dtSearch also includes options for searching embedded objects in Microsoft Office documents, and normally hidden content in HTML.

While HTML comments, scripts, links, and styles are not by default included in indexing, dtSearch has an option to include these.

dtSearch Desktop: Click Options > Preferences > Indexing Options, and check the box to "Index HTML scripts, styles, links and comments."
dtSearch developer API: Set Options.FieldFlags to a combination of these flags: dtsoFfHtmlShowLinks, dtsoFfHtmlShowImgSrc, dtsoFfHtmlShowComments, dtsoFfHtmlShowScripts, dtsoFfHtmlShowStylesheets, and dtsoFfHtmlShowMetatags.
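To show the kind of HTML content that is invisible in a browser but can still carry evidence, here is a minimal sketch using Python's standard HTML parser. It is independent of dtSearch, and the class name is hypothetical. It collects comments, link targets, and script text from a page.

```python
from html.parser import HTMLParser


class HiddenContentCollector(HTMLParser):
    """Collect comments, link targets, and script text from HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.comments: list[str] = []
        self.links: list[str] = []
        self.scripts: list[str] = []
        self._in_script = False

    def handle_comment(self, data: str) -> None:
        self.comments.append(data.strip())

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag == "a":
            # Keep the href value of each anchor, if present.
            self.links.extend(v for name, v in attrs if name == "href" and v)
        elif tag == "script":
            self._in_script = True

    def handle_endtag(self, tag: str) -> None:
        if tag == "script":
            self._in_script = False

    def handle_data(self, data: str) -> None:
        if self._in_script and data.strip():
            self.scripts.append(data.strip())
```

After calling feed() and close() on a page, the three lists hold text that a reader of the rendered page would never see.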

A similar option searches hidden content (such as Macros or other embedded objects) in Microsoft Office files.

dtSearch Desktop: Click Options > Preferences > Indexing Options, and check the box to "Index Hidden content in Office documents."
dtSearch developer API: This option is set by default. To disable it, set dtsoFfOfficeSkipHiddenContent in Options.FieldFlags.

Search for List of Words or Concepts

dtSearch provides an option to search for a list of words. Under this option, a special dialog box provides a way to search for a long list of words, and create a list of matching files, in a single step. This option can work with the full range of dtSearch search features (Boolean, fuzzy, natural language, etc.). More information
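The outcome of a word-list search, a report of which files matched which words, can be sketched as below. This is a plain-Python illustration over in-memory text, not the dtSearch dialog or API, and it performs exact, case-insensitive word matching only.

```python
import re
from collections import defaultdict


def files_matching_words(documents: dict[str, str], words: list[str]) -> dict[str, list[str]]:
    """Map each listed word to the names of the documents that contain it."""
    wanted = {word.lower() for word in words}
    report: dict[str, list[str]] = defaultdict(list)
    for name, text in sorted(documents.items()):
        tokens = set(re.findall(r"\w+", text.lower()))
        for word in sorted(wanted & tokens):
            report[word].append(name)
    return dict(report)
```

Words from the list that appear in no document are simply absent from the report.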

To expand a search for a specific word or set of words to a user-defined list of concepts or synonyms, dtSearch also offers a user-defined thesaurus add-on to the comprehensive English-language thesaurus included with dtSearch.

dtSearch Desktop: Click Options > Preferences > Search Options > User Thesaurus to add synonym rings for specific terms.

View Log of Encrypted Files; Index Encrypted PDFs

After an index update completes, click "View Log" to see a report that will include information on any encrypted or unreadable files that the indexer could not process. This report can be accessed at any time in the index folder in the file Index_LastUpdateErrors.html. The report indicates which files were (a) encrypted, (b) corrupt, (c) partially encrypted, and (d) partially corrupt. Partially encrypted or corrupt files are files that could be indexed in part but that included some encrypted or corrupt data (for example, an email with an encrypted attachment).

To index encrypted PDFs, make a temporary decrypted copy of the encrypted files, index the decrypted copy, and then replace the temporary decrypted copy with the original encrypted files. This one-time decryption is sufficient for dtSearch operation. Once the original index is complete, dtSearch does not need to decrypt the PDF files to search and display them with highlighted hits.

Copying Retrieved Files

dtSearch's Edit › Copy file function lets you copy all or selected documents retrieved from a search to a folder. You can optionally preserve the full path and filename in the copy, and you can preserve creation and last access times as well as the last modified date. More information.
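For readers scripting a similar export outside dtSearch, the following sketch copies a retrieved file to a destination folder while optionally preserving its relative path. It relies on Python's shutil.copy2, which copies last modified and last access times where the operating system allows; creation time is generally not preserved by this call, so it is not a full equivalent of the dtSearch function. The function and parameter names are hypothetical.

```python
import shutil
from pathlib import Path


def copy_preserving(source: Path, source_root: Path, dest_root: Path,
                    keep_path: bool = True) -> Path:
    """Copy one file, keeping its path under source_root if requested."""
    relative = source.relative_to(source_root) if keep_path else Path(source.name)
    target = dest_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 copies file data plus modified/access times where supported.
    shutil.copy2(source, target)
    return target
```

Keeping the relative path avoids collisions when two retrieved files share a name but come from different folders.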

Making Available Retrieved Files on CD/DVD or Other Portable Media

The dtSearch Publish product can quickly publish forensically retrieved (or e-discovery retrieved) documents to CD, DVD or other portable media.  The resulting product provides instant search and display access to the document set.  The CD, DVD or other portable media can run with zero footprint, requiring no installation on the end-user's computer.

Please see the Mirroring Searchable Web Content on Portable Media article for an overview of how dtSearch Publish works.


 
 