OCR and Imaging

dtSearch supports the PDF "image with hidden text" format, and can highlight hits directly on the scanned image in this format.
dtSearch also supports combined text and image displays in HTML.
dtSearch Desktop and Network include a built-in image viewer.
dtSearch recommends using fuzzy searching to find terms that OCR errors may have misspelled.
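
To illustrate why fuzzy searching helps with OCR errors, here is a minimal sketch that matches a search term against words that differ from it by a small number of character edits. It uses plain Levenshtein edit distance as a stand-in; it is not dtSearch's own fuzzy algorithm, and the names `edit_distance`, `fuzzy_matches`, and the sample word list are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def fuzzy_matches(term, words, max_edits=1):
    """Return the words within max_edits character edits of term."""
    term = term.lower()
    return [w for w in words if edit_distance(term, w.lower()) <= max_edits]


# Illustrative words as they might come out of an OCR pass (assumed data).
ocr_words = ["c0ntract", "contract", "contact", "comtract", "signature"]
hits = fuzzy_matches("contract", ocr_words, max_edits=1)
print(hits)
```

With one edit allowed, the search for "contract" also finds the misrecognized "c0ntract" and "comtract". It also picks up the unrelated word "contact", which shows the usual trade-off: a higher fuzziness level recovers more OCR errors but returns more false hits.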

OCR and PDF

The Adobe PDF file format provides two ways to combine, in a single file, scanned images and OCR’ed text, that is, text produced from those images by Optical Character Recognition (OCR) software.

(1) The "image with hidden text" format stores the complete original image of a scanned document, along with the text obtained through OCR. The text is "hidden" in the sense that simply opening the PDF file displays only the scanned image, not the underlying OCR'ed text. Because the OCR'ed text is still stored in the file, however, dtSearch can index and search it.

After a search, when a user clicks on an "image with hidden text" PDF document, the dtSearch product will display the scanned image. Because the actual OCR’ed text is "hidden," the display will appear to highlight hits directly on the image. dtSearch offers a Web demo showing hidden text highlighting.
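
The idea behind highlighting on the image can be sketched as follows, assuming the hidden text layer records a bounding box for each OCR'ed word. This is only an illustration of the concept, not dtSearch's implementation; `OcrWord`, `highlight_boxes`, and the coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x0, y0, x1, y1 in page units


@dataclass(frozen=True)
class OcrWord:
    text: str   # word recognized by OCR (invisible when the PDF is displayed)
    page: int   # 1-based page number
    box: Box    # where the word sits on the scanned image


def highlight_boxes(words: List[OcrWord], query: str) -> List[Tuple[int, Box]]:
    """Return (page, box) for every hidden word that matches the query."""
    q = query.lower()
    return [(w.page, w.box)
            for w in words
            if w.text.lower().strip(".,;:!?") == q]


# Assumed hidden text layer for a two-page scan.
hidden_layer = [
    OcrWord("Lease", 1, (72.0, 700.0, 110.0, 714.0)),
    OcrWord("agreement,", 1, (114.0, 700.0, 190.0, 714.0)),
    OcrWord("lease", 2, (72.0, 650.0, 104.0, 664.0)),
]
rects = highlight_boxes(hidden_layer, "lease")
print(rects)
```

A viewer that draws a translucent rectangle over each returned box would appear to highlight the hit on the scanned image itself, even though the match was made against the hidden text.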

(2) Another option for combining scanned images and OCR’ed text in a single PDF file uses "small images" for the parts of each scanned page that do not appear to be text. For example, the format would store a picture or a signature as a small image embedded in the page. The format would store the non-picture portion of the page only as OCR’ed text.

While the "small images" alternative does not preserve the true image of the original document, it does produce much more compact files than the "image with hidden text" option. The "small images" PDF file usually stores only a few images for each page, instead of a complete image of the whole document. The text detected through OCR in the "small images" format can also be more readable because the resulting PDF file stores it as text with font information rather than as an image.

