Document filters overview. dtSearch products
embed dtSearch’s proprietary document filters to support a broad range of data
types. (What are document filters? See
the white paper PDF.)
- For all supported data types, support covers parsing, indexing and searching
of retrieved full-text and metadata.
- The document filters can also convert non-web-ready content, such as Microsoft
Office documents and email formats, to HTML "on the fly" for
web-based display, etc., with highlighted
hits. See also dtSearch
Web and the dtSearch Engine.
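To make the idea of HTML display with highlighted hits concrete, here is a minimal Python sketch. It is not dtSearch code and does not use any dtSearch API: the function name `highlight_hits` and the choice of the `<mark>` tag are assumptions made purely for illustration, and it handles plain text only, whereas real document filters first have to parse the source format.

```python
import html
import re


def highlight_hits(text: str, terms: list[str]) -> str:
    """Return an HTML fragment with each whole-word hit wrapped in <mark>.

    Hypothetical helper for illustration; not part of any dtSearch API.
    """
    if not terms:
        return html.escape(text)
    # Longest terms first so a phrase wins over a shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between hits, then wrap the hit itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Escaping the text between hits matters here: without it, characters such as `&` or `<` in the source document would be interpreted as markup by the browser.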
Supported data types. dtSearch’s proprietary
document filters support parsing, indexing, searching and display with highlighted
hits of text and metadata across a broad range of data
types.
- Web-ready content: supports integrated images and text in
HTML, XML/XSL, PDF, ASP.NET, CMS,
PHP, WordPress, SharePoint, etc.
- Other databases and data sources: supports XML, Access,
XBASE, CSV, etc.; dtSearch Engine APIs support NoSQL and SQL-type databases, along
with the full-text of BLOB data; dtSearch Engine APIs also support disk images, network data streams
and other non-file data.
- MS Office formats: supports integrated browser-ready images
and text in Word (RTF/DOC/DOCX), PowerPoint (PPT/PPTX), Excel (XLS/XLSX),
Access (MDB/ACCDB) and OneNote (ONE); support includes documents saved from
Office 365.
- Other “Office” formats, PDF and other printer formats, compression
formats: supports other “Office” suite formats; EMF Spool (SPL)
files; compression formats like RAR, ZIP, GZIP and TAR; PDF, PDF Portfolio, and many
encrypted PDFs (through PDF 2.0).
- Emails and
attachments: supports integrated browser-ready images, text and
attachments in Outlook/Exchange (PST/OST/MSG) and Thunderbird (MBOX/EML);
support includes emails saved from Office 365.
- Recursively embedded objects: supports recursively embedded
objects and images in supported email types and MS Office formats. For
example, the dtSearch document filters would support an email attachment
consisting of a ZIP container including both a PDF and an Access database,
where the latter also includes an embedded PowerPoint with embedded images.
- Using dtSearch with
cloud storage (OneDrive, Dropbox, Amazon S3, SharePoint-synced
files, etc.)
- Version 2023.02 also adds new image/sound/video metadata support across 11
different file formats.
- Full list of supported document
types.
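The "recursively embedded objects" case in the list above can be sketched in a few lines of Python. This is only an illustration of the general technique, not dtSearch's implementation: it handles nested ZIP containers alone, using the standard `zipfile` module, and the names `walk_container` and `make_zip` are hypothetical helpers introduced for the example.

```python
import io
import zipfile


def walk_container(data: bytes, path: str = "", max_depth: int = 10):
    """Yield (path, payload) for every leaf file, descending into nested ZIPs.

    Hypothetical helper for illustration; handles ZIP containers only.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            inner_path = f"{path}/{info.filename}" if path else info.filename
            # Sniff the bytes rather than trusting the file extension.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                yield from walk_container(payload, inner_path, max_depth - 1)
            else:
                yield inner_path, payload


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> bytes mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in members.items():
            archive.writestr(name, payload)
    return buffer.getvalue()
```

Calling `walk_container` on a ZIP that contains another ZIP yields paths such as `attachments.zip/report.txt`, so each leaf keeps a record of the containers it was found in. The `max_depth` limit is a design choice to guard against pathologically deep nesting. Note that ZIP-based formats such as DOCX would also be opened as containers by this sketch.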
Federated searching and the dtSearch Spider.
dtSearch products provide federated search across any number of directories,
emails (with nested attachments), and databases. The dtSearch
Spider adds local and remote online content to a search. The Spider can index
sites to any level of depth, with support for public and secure online content,
including log-ins and forms-based authentication. dtSearch products provide
integrated relevancy ranking with highlighted
hits across both online and offline data. Note: for
developers, the Spider is presented as a .NET API.
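As a rough illustration of what indexing a site "to any level of depth" involves, the following Python sketch performs a depth-limited, breadth-first crawl. It is not the dtSearch Spider and does not reflect its .NET API: the function `crawl` is hypothetical, and the site is modeled as an in-memory dictionary of links standing in for pages that a real crawler would fetch over the network.

```python
from collections import deque


def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> dict[str, int]:
    """Breadth-first crawl from start; map each reached page to its link depth.

    `links` is a hypothetical stand-in for fetching a page and parsing its links.
    """
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depth_of:  # skip pages already queued or visited
                depth_of[target] = depth_of[page] + 1
                queue.append(target)
    return depth_of
```

Tracking visited pages in `depth_of` keeps the crawl from looping on sites whose pages link back to each other, and breadth-first order means each page is recorded at its shortest link distance from the start page.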
Document filter APIs. All developer APIs (C++,
Java and .NET through current versions) give developers access to dtSearch’s
text parsing, extraction, conversion and hit-highlighting capabilities.
- An “object extraction” API lets developers navigate through the structure of
each embedded object as a hierarchy, and optionally extract each object,
such as an image in an MS Word file that is embedded in an MS Access database,
which is itself compressed and attached to an email.
- General dtSearch Engine licenses include the document filters along with
dtSearch indexing and searching functionality.
- The document filters are also available under a separate license for developers
who need text parsing, extraction and conversion only, without search.