Document filters
overview. dtSearch products embed
dtSearch’s proprietary document filters to support
a broad range of data types. (What are document
filters? See the white paper PDF.)
- For all supported data types, support covers
parsing, indexing and searching of retrieved
full-text and metadata.
- The document filters can also convert
non-web-ready content, such as Microsoft Office
documents and email formats, to HTML "on the fly"
for web-based display with highlighted
hits. See also dtSearch
Web and the dtSearch Engine.
Supported data types.
dtSearch’s proprietary document filters support
parsing, indexing, searching and display, with highlighted
hits, of text and metadata
across a broad range of data types.
- Web-ready content: supports
integrated images and text in HTML, XML/XSL, PDF,
ASP.NET, CMS, PHP, WordPress, SharePoint,
etc.
- Other databases and data sources:
supports XML, Access, XBASE, CSV, etc.; dtSearch
Engine APIs support NoSQL
and SQL-type databases, along with the
full-text of BLOB data; dtSearch Engine APIs
also support
disk images, network data streams and other
non-file data.
- MS Office formats: supports
integrated browser-ready images and text in Word
(RTF/DOC/DOCX), PowerPoint (PPT/PPTX), Excel
(XLS/XLSX), Access (MDB/ACCDB) and OneNote
(ONE); support includes documents saved from
Office 365.
- Other “Office” formats, PDF and other
printer formats, compression formats:
supports other “Office” suite formats; EMF Spool
(SPL) files; compression formats like RAR, ZIP,
GZIP and TAR; PDF,
PDF Portfolio and many encrypted PDFs, with new
PDF 2.0 support.
- Emails
and attachments: supports integrated
browser-ready images, text and attachments in
Outlook/Exchange (PST/OST/MSG) and Thunderbird
(MBOX/EML); support includes emails saved from
Office 365.
- Recursively embedded objects:
supports recursively embedded objects and images
in supported email types and MS Office formats.
For example, the dtSearch document filters would
support an email attachment consisting of a ZIP
container including both a PDF and an Access
database, where the latter also includes an
embedded PowerPoint with embedded images.
- Using
dtSearch with cloud storage (OneDrive,
Dropbox, Amazon S3, SharePoint-synced files,
etc.)
- Full list of supported
document types.
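The idea of recursively embedded objects can be illustrated with a minimal sketch. It is not dtSearch code and does not use any dtSearch API; it relies only on Python's standard `zipfile` module, and the helper names `walk_zip` and `make_zip` and the sample file names are assumptions made for illustration. The sketch opens a ZIP container in memory and descends into any nested ZIP it finds, yielding each leaf file with a path that records where it was embedded.

```python
import io
import zipfile


def walk_zip(data, prefix=""):
    """Yield (path, bytes) for every leaf file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            payload = archive.read(name)
            path = prefix + name
            if zipfile.is_zipfile(io.BytesIO(payload)):
                # The entry is itself a container: recurse into it.
                yield from walk_zip(payload, path + "!/")
            else:
                yield path, payload


def make_zip(entries):
    """Build an in-memory ZIP from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical sample: a container holding a text file and a nested container.
inner = make_zip({"slides.txt": b"embedded slide text"})
outer = make_zip({"report.txt": b"report body", "database.zip": inner})
found = dict(walk_zip(outer))
```

A real document filter would also have to recognize many other container types (email stores, Office files, databases); the sketch handles ZIP only.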
Federated searching and
the dtSearch Spider. dtSearch products
provide federated search across any number of
directories, emails (with nested attachments), and
databases.
The dtSearch Spider adds local and remote online
content to a search. The Spider can index sites to
any level of depth, with support for public and
secure online content, including log-ins and
forms-based authentication. dtSearch products
provide integrated relevancy ranking with highlighted
hits across both online
and offline data. Note: for developers, the Spider
is presented as a .NET API.
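Indexing a site "to any level of depth" can be pictured as a depth-limited crawl. The following is a minimal sketch of that general technique, not the dtSearch Spider or its .NET API; the `SITE` link graph, the `crawl` function and its `max_depth` parameter are hypothetical names used only for illustration, and no network access is involved.

```python
from collections import deque

# Hypothetical link graph standing in for a website: page -> linked pages.
SITE = {
    "/": ["/products", "/support"],
    "/products": ["/products/engine", "/"],
    "/support": ["/support/faq"],
    "/products/engine": ["/products/engine/api"],
    "/support/faq": [],
    "/products/engine/api": [],
}


def crawl(start, max_depth, links=SITE):
    """Breadth-first crawl: pages reachable within max_depth link hops."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order
```

With this sample graph, `crawl("/", 1)` visits only the start page and the pages it links to directly, while a larger `max_depth` reaches the deeper pages as well.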
Document filter APIs.
All developer APIs (C++, Java and .NET, through
current versions) give developers access to
dtSearch’s text parsing, extraction, conversion
and hit-highlighting
capabilities.
- An “object extraction” API lets developers
navigate through the structure of each embedded
object as a hierarchy, and optionally extract
each object, such as an image in an MS Word file
embedded in an MS Access database, compressed
and attached to an email.
- General dtSearch Engine licenses include the
document filters along with dtSearch indexing
and searching functionality.
- The document filters are also available under a
separate license for developers requiring text
parsing, extraction and conversion “only,”
without search.
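Navigating embedded objects as a hierarchy can be sketched with a small tree model. This is an illustrative data structure only and does not reflect the actual dtSearch object extraction API; the `EmbeddedObject` class, the `find` helper and the sample file names are all hypothetical. The example mirrors the case above: an image in a Word file, inside an Access database, compressed and attached to an email.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EmbeddedObject:
    """One node in a hypothetical hierarchy of embedded objects."""
    name: str
    kind: str
    children: list[EmbeddedObject] = field(default_factory=list)
    data: bytes = b""

    def walk(self, parents=()):
        """Yield (path, object) for this node and all descendants, depth-first."""
        path = parents + (self.name,)
        yield path, self
        for child in self.children:
            yield from child.walk(path)


def find(root, kind):
    """Return the slash-joined paths of all objects of the given kind."""
    return ["/".join(path) for path, obj in root.walk() if obj.kind == kind]


# Hypothetical sample hierarchy, built from the innermost object outward.
image = EmbeddedObject("logo.png", "image", data=b"\x89PNG")
word = EmbeddedObject("memo.docx", "word", [image])
database = EmbeddedObject("records.accdb", "access", [word])
archive = EmbeddedObject("files.zip", "zip", [database])
email = EmbeddedObject("message.msg", "email", [archive])

image_paths = find(email, "image")
```

Extracting an object in this model simply means reading the `data` field of the node found at a given path.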