|
| OCR and Imaging |
|
 |
dtSearch supports the PDF "image with hidden text" format, and can highlight hits directly on the scanned image in this format. |
 |
dtSearch also supports combined text and image displays in HTML. |
 |
dtSearch Desktop and Network include a built-in image viewer. |
 |
dtSearch recommends using fuzzy searching to sift through possible OCR errors. |
|
|
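To illustrate why fuzzy searching helps with OCR errors, here is a minimal Python sketch. It is not dtSearch's implementation; it assumes a plain Levenshtein edit distance and a hypothetical helper `fuzzy_find` that accepts any word within a set number of character errors of the query. The sample words are made up to mimic common OCR confusions such as "l" for "i" and "1" for "i".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_find(query: str, words: list[str], max_errors: int = 1) -> list[str]:
    """Hypothetical helper: keep words within max_errors edits of query."""
    q = query.lower()
    return [w for w in words if edit_distance(q, w.lower()) <= max_errors]


# Made-up OCR output in which "invoice" was misread in two different ways.
ocr_words = ["lnvoice", "invoice", "lnvo1ce", "voice", "payment"]
matches = fuzzy_find("invoice", ocr_words, max_errors=1)
print(matches)
```

Raising `max_errors` finds more of the misread variants but also admits unrelated words such as "voice", which is the usual trade-off when tuning fuzziness.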
OCR and PDF
The Adobe PDF file format provides two ways to combine images and OCRed
text in a single file. OCRed text is text that has been extracted from an
image by Optical Character Recognition (OCR) software.
(1) The "image with hidden text" format stores the complete
original image of a scanned document, along with the text obtained through
OCR. The text is "hidden" in the sense that simply opening
the PDF file displays only the scanned image, not the underlying OCRed
text. Because the OCRed text is still stored in the file, however,
dtSearch can index and search it.

After a search, when a user clicks on an "image with hidden text"
format PDF document, the dtSearch product will display the scanned
image. Because the actual OCRed text is "hidden," the
display will appear to highlight hits directly on the image. A dtSearch
Web demo shows hidden text highlighting.
(2) Another option for combining scanned images and OCRed text
in a single PDF file uses "small images" for the parts of each
scanned page that do not appear to be text. For example, the format would
store a picture or a signature as a small image embedded in the page.
The format would store the non-picture portion of the page only as OCRed
text.
While the "small images" alternative does not preserve the
true image of the original document, it does produce much more compact
files than the "image with hidden text" option. The "small
images" PDF file usually stores only a few images for each page,
instead of a complete image of the whole document. The text detected
through OCR in the "small images" format can also be more readable
because the resulting PDF file stores it as text with font information
rather than as an image.
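As a rough illustration of option (1), the sketch below builds a PDF page content stream that paints a full-page scanned image and then places OCR words over it in text rendering mode 3, which the PDF specification defines as invisible text (neither filled nor stroked). This is only a sketch of one common way such files are laid out, not a description of what any particular OCR product or dtSearch does. The resource names `Im0` and `F1`, the page size, and the word positions are hypothetical.

```python
def pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def hidden_text_page(image_name, page_w, page_h, words, font_name="F1"):
    """Return a content stream: full-page image plus invisible OCR text.

    words is a list of (text, x, y, font_size) tuples in page units.
    """
    lines = [
        "q",                              # save graphics state
        f"{page_w} 0 0 {page_h} 0 0 cm",  # scale the unit square to the page
        f"/{image_name} Do",              # paint the scanned image
        "Q",                              # restore graphics state
        "BT",                             # begin text object
        "3 Tr",                           # rendering mode 3: invisible text
    ]
    for text, x, y, size in words:
        lines.append(f"/{font_name} {size} Tf")  # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")      # position the word
        lines.append(f"{pdf_string(text)} Tj")   # show (invisibly) the word
    lines.append("ET")                    # end text object
    return "\n".join(lines)


# Hypothetical OCR result: two words with positions on a US Letter page.
stream = hidden_text_page(
    "Im0", 612, 792,
    [("Invoice", 72, 700, 12), ("(copy)", 140, 700, 12)],
)
print(stream)
```

Because each word keeps its page coordinates, a viewer that finds a hit in the invisible text knows where on the scanned image to draw the highlight.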
For more information on both PDF/OCR options, including a list of some
additional third-party products that OCR into the PDF format, click here. |