OCR and Imaging

dtSearch supports the PDF "image with hidden text" format, and can highlight hits directly on the scanned image in this format.
dtSearch also supports combined text and image displays in HTML.
dtSearch Desktop and Network include a built-in image viewer.
dtSearch recommends using fuzzy searching for sifting through possible OCR errors.
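
To illustrate why fuzzy searching helps with OCR errors, here is a minimal sketch in Python. It is not dtSearch's algorithm or API; it simply assumes that "fuzzy" means tolerating a small number of single-character edits (Levenshtein distance). The function names `edit_distance` and `fuzzy_find` and the sample word list are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_find(query: str, words: list[str], max_edits: int = 1) -> list[str]:
    """Return the words within max_edits edits of query (case-insensitive)."""
    q = query.lower()
    return [w for w in words if edit_distance(q, w.lower()) <= max_edits]


# Hypothetical OCR output in which "o" was read as "0" and "i" as "l".
ocr_words = ["The", "c0ntract", "was", "signed", "by", "both", "partles"]

print(fuzzy_find("contract", ocr_words))  # ['c0ntract']
print(fuzzy_find("parties", ocr_words))   # ['partles']
```

An exact search for "contract" would miss the misrecognized word, while a search that tolerates one edit still finds it. Raising `max_edits` catches more OCR errors at the cost of more false matches.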

OCR and PDF

The Adobe PDF file format provides two ways to combine scanned images and OCR’ed text in a single file. OCR’ed text is text obtained from an image through Optical Character Recognition (OCR) software.

(1) The "image with hidden text" format stores the complete original image of a scanned document, along with the text obtained through OCR. The text is "hidden" in the sense that simply opening the PDF file displays only the scanned image, not the underlying OCR'ed text. Because the OCR'ed text is nonetheless stored in the file, dtSearch can index and search it.



After a search, when a user clicks on an "image with hidden text" PDF document, the dtSearch product will display the scanned image. Because the actual OCR’ed text is "hidden," the display will appear to highlight hits directly on the image. A dtSearch Web demo shows hidden text highlighting.
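
The idea behind highlighting on the image can be sketched as follows. This is an illustrative model only, not the PDF format's internal structure and not dtSearch's implementation: it assumes each OCR'ed word is stored with the rectangle it occupies on the page, so a text hit can be turned into a region to highlight over the scanned image. The `OcrWord` class, the `hit_rectangles` function, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and the rectangle it covers on the page
    (hypothetical page units, origin and axes chosen arbitrarily)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def hit_rectangles(words, query):
    """Return (x, y, width, height) for every word matching query,
    ignoring case. A viewer could draw these over the page image."""
    q = query.lower()
    return [(w.x, w.y, w.width, w.height)
            for w in words if w.text.lower() == q]


# A hypothetical page: the image shows the scan, and the hidden text
# layer carries the words together with their positions.
page = [
    OcrWord("Invoice", 72.0, 700.0, 60.0, 12.0),
    OcrWord("total", 72.0, 680.0, 30.0, 12.0),
    OcrWord("Total", 300.0, 100.0, 32.0, 12.0),
]

print(hit_rectangles(page, "total"))
# [(72.0, 680.0, 30.0, 12.0), (300.0, 100.0, 32.0, 12.0)]
```

Because the search runs against the hidden text while the highlight is drawn at the matching positions, the reader sees the hit marked on the scanned image itself.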

(2) Another option for combining scanned images and OCR’ed text in a single PDF file uses "small images" for the parts of each scanned page that do not appear to be text. For example, the format would store a picture or a signature as a small image embedded in the page. The format would store the non-picture portion of the page only as OCR’ed text.

While the "small images" alternative does not preserve the true image of the original document, it does produce much more compact files than the "image with hidden text" option. The "small images" PDF file usually stores only a few images for each page, instead of a complete image of the whole document. The text detected through OCR in the "small images" format can also be more readable because the resulting PDF file stores it as text with font information rather than as an image.

For more information on both PDF / OCR options, including a list of some additional third-party products that OCR into the PDF format, click here.

 
Instantly Search Terabytes of Text
dtSearch document filters support a broad range of data
Supports MS Office through current versions (Word, Excel, PowerPoint, Access), OpenOffice, ZIP, HTML, XML/XSL, PDF and more
Supports Exchange, Outlook, Thunderbird and other popular email types, including nested and ZIP attachments
Spider supports public and secure, static and dynamic (ASP.NET, SharePoint, CMS, PHP, etc.) web data
APIs for SQL-type data, including BLOB data
Highlights hits in all supported data types
25+ full-text and fielded data search options
Federated searching
Special forensics search options
Advanced data classification objects
APIs for C++, Java and .NET through current versions
64-bit and 32-bit Win / Linux APIs; .NET Spider API
Document filters also available for separate licensing
 
 