Article: Document Filters, Search Engines & The Anatomy Of A Binary Format
Document filters overview. dtSearch products embed dtSearch’s proprietary document filters to support a broad range of data types. (What are document filters? See white paper.)
- For all supported data types, support covers parsing, indexing and searching of retrieved full-text and metadata.
- For all supported data types, support also covers display of metadata and full-text data with highlighted hits. (Typically, dtSearch does this following dtSearch’s own automatic, built-in conversion of the data to HTML.)
- For many supported data types, display also includes integrated image display along with highlighted hits.
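The display step described above (convert to HTML, then mark the hits) can be illustrated with a minimal sketch. This is not dtSearch code or its API: the function name `highlight_hits` and the choice of `<mark>` tags are assumptions made for illustration, using only the Python standard library.

```python
import html
import re


def highlight_hits(text, terms):
    """Return HTML-safe text with each search hit wrapped in <mark> tags.

    Illustrative only: a real filter would also handle stemming, phrase
    queries, and markup already present in the converted document.
    """
    if not terms:
        return html.escape(text)
    # Longest terms first so a phrase wins over its own first word.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between hits, then the hit itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Escaping each fragment separately keeps characters such as `&` and `<` from the source text from being interpreted as markup, while the inserted `<mark>` tags stay intact.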
Supported data types. dtSearch’s proprietary document filters support parsing, indexing, searching and display with highlighted hits of text and metadata across a broad range of data types.
- Web-ready content: supports integrated images and text in HTML, XML/XSL, PDF, ASP.NET, CMS, PHP, SharePoint, etc.
- Other databases: supports XML, Access, XBASE, CSV, etc.; dtSearch Engine APIs support NoSQL and SQL-type databases, along with the full-text of BLOB data.
- MS Office formats: supports integrated browser-ready images and text in Word (RTF/DOC/DOCX), PowerPoint (PPT/PPTX), Excel (XLS/XLSX), Access (MDB/ACCDB) and OneNote (ONE).
- Other “Office” formats, PDF and other printer formats, compression formats: supports other “Office” suite formats; EMF Spool (SPL) files; compression formats like RAR, ZIP, GZIP and TAR; PDF, PDF Portfolio, and many encrypted PDFs.
- Emails and attachments: supports integrated browser-ready images, text and attachments in Outlook/Exchange (PST/OST/MSG) and Thunderbird (MBOX/EML).
- Recursively embedded objects: supports recursively embedded objects and images in supported email types and MS Office formats. For example, the dtSearch document filters would support an email attachment consisting of a ZIP container including both a PDF and an Access database, where the latter also includes an embedded PowerPoint with embedded images.
- Full list of supported document types.
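To make the idea of recursively embedded objects concrete, here is a minimal sketch that descends through nested ZIP containers using Python's standard `zipfile` module. It is not the dtSearch implementation: it handles ZIP only, and the names `walk_container`, `make_zip` and the `max_depth` guard are assumptions introduced for illustration.

```python
import io
import zipfile


def walk_container(data, path="", max_depth=10):
    """Yield (path, payload) for every leaf file, descending into nested ZIPs.

    max_depth is an illustrative guard against pathologically deep nesting.
    """
    if max_depth <= 0 or not zipfile.is_zipfile(io.BytesIO(data)):
        # Not a ZIP (or too deep): treat the bytes as a leaf document.
        yield path, data
        return
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = f"{path}/{info.filename}" if path else info.filename
            # Recurse: the member may itself be another container.
            yield from walk_container(archive.read(info), child, max_depth - 1)


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"slides.txt": b"deck"})
    outer = make_zip({"report.txt": b"hello", "archive.zip": inner})
    for member_path, payload in walk_container(outer, "attachment.zip"):
        print(member_path, len(payload))
```

A production filter would dispatch on many more container types (email stores, Office documents, databases) at the point where this sketch checks only for ZIP.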
Federated searching and the dtSearch Spider. dtSearch products provide federated search across any number of directories, emails (with nested attachments), and databases.
The dtSearch Spider adds local and remote online content to a search. The Spider can index sites to any level of depth, with support for public and secure online content, including log-ins and forms-based authentication. dtSearch products provide integrated relevancy ranking with highlighted hits across both online and offline data. Note: for developers, the Spider is presented as a .NET API.
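One way to picture integrated ranking across online and offline sources is a merge of per-source result lists into a single ordered list. The sketch below is an assumption-laden illustration, not dtSearch's ranking method, which the text above does not describe: the `Hit` type, the source labels, and the max-score normalisation are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source: str    # hypothetical label, e.g. "local" or "web"
    location: str  # file path or URL of the matching document
    score: float   # raw relevance score as reported by that source


def normalise(hits):
    """Scale one source's scores to the range 0..1 so sources are comparable."""
    if not hits:
        return []
    top = max(h.score for h in hits)
    if top <= 0:
        return [Hit(h.source, h.location, 0.0) for h in hits]
    return [Hit(h.source, h.location, h.score / top) for h in hits]


def federated_merge(result_sets, limit=10):
    """Merge several per-source result lists into one ranked list."""
    merged = [hit for hits in result_sets for hit in normalise(hits)]
    # Stable sort: ties keep the order in which the sources were supplied.
    merged.sort(key=lambda h: h.score, reverse=True)
    return merged[:limit]
```

Normalising by each source's top score is only one simple option; it prevents a source with a larger raw score scale from crowding out the others.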
Document filter APIs. All developer APIs (C++, Java and .NET through current versions) expose dtSearch’s text parsing, extraction, conversion and hit-highlighting capabilities.
- An “object extraction” API lets developers navigate through the structure of each embedded object as a hierarchy, and optionally extract each object. An example is an image inside an MS Word file that is itself embedded in an MS Access database, which in turn is compressed and attached to an email.
- General dtSearch Engine licenses include the document filters along with dtSearch indexing and searching functionality.
- The document filters are also available under a separate license for developers who require text parsing, extraction and conversion “only,” without search.