Businesses use SharePoint to archive and store a wide variety of
enterprise documents. Large organizations typically have multiple
business units. Each business unit generally creates one or more
SharePoint Site Collections to organize its business documents.
The ability to quickly find key business information in SharePoint is
a significant business differentiator. However, searching across
SharePoint documents to extract the necessary information poses
challenges, both because of the potential volume of data and also
because of the variety of possible file formats. This article
demonstrates using the dtSearch Engine as a solution for searching such data.
dtSearch Engine Overview
The dtSearch Engine (see dtSearch.com)
can perform lightning-fast text searching across terabytes of data. And
it does so for a wide variety of data formats. Supported data types
include MS Office through current versions (Word, Excel, PowerPoint,
Access, OneNote); other "Office" suite formats; compression formats
(RAR, ZIP, GZIP, TAR); Exchange, Outlook and other popular email types
(including multilevel nested attachments); PDF and PDF Portfolio; other
web-based formats (HTML, XML/XSL, etc.); and a wide variety of databases
(SQL, NoSQL, SharePoint, etc.).
There is no need to tell dtSearch what file format or other data type
it is working with – dtSearch figures that out for itself. And dtSearch
lets developers leverage over 25 full-text and metadata search options,
including faceted searching and multiple advanced data classification
options, as well as displaying search results with highlighted hits.
dtSearch products can index SharePoint data in two general ways. The
first way is through the dtSearch Spider, built into the dtSearch
product line and also available through the dtSearch Engine SDK. The
second approach is through the dtSearch Engine DataSource API.
The dtSearch Engine DataSource API offers maximum flexibility for
database content integration. This article’s SharePointDemo is a Visual
Studio Project that uses the dtSearch Engine DataSource API for indexing
SharePoint Site Collections.
Central Administration Server Requirements. Every SharePoint
Farm has one or more servers in the cluster designated as the Central
Administration Server. This server acts as a front-end to the SharePoint
Farm. All of the main work in this article occurs on the Central
Administration Server. You will of course need admin rights to this
server for most of this work.
The integration has two categories of work: (1) using the SharePoint
API to crawl SharePoint web sites and to enumerate documents to index in
each site, and (2) implementing the dtSearch Engine’s DataSource
interface so the documents can be passed to the dtSearch Engine for indexing.
Visual Studio and Windows Server (64-Bit) Installation. After
verifying that you have the appropriate admin rights, install the
current version of Visual Studio and the current 64-bit version of
Windows Server, including patches and OS drivers, on the Central
Administration Server.
Installing SharePoint DLLs. The SharePoint DLLs in this
project require the Windows Server 64-bit operating system. To simplify
the installation of the SharePoint DLLs, you can leverage the NuGet
capability in Visual Studio. NuGet is available through Visual Studio as follows:
Select Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution…
Select Browse. Then type in SharePoint. Then install the SharePoint DLLs.
Installing SharePointDemo. Download the ZIP file with the project files from the link attached to this article at CodeProject.com.
SharePointDemo includes two forms. AddWebForm.cs will retrieve all
the Site Collections within the SharePoint Farm. Form1.cs is the startup
form and also begins the indexing process.
Installing the dtSearch Engine. You’ll also need to install
the full dtSearch Engine for Windows package. You can request a
fully functional 30-day evaluation version at dtSearch.com/contact.html. Your job as a developer is to inherit from DataSource and override its methods, such as GetNextDoc() and Rewind(). More details below.
The dtSearch Engine’s default installation directory is C:\Program Files (x86)\dtSearch Developer\bin64\.
(Note that bin64 indicates 64-bit.) This article’s SharePointDemo
already includes References to the dtSearch Engine default installation
location. If you install the dtSearch Engine in a different directory,
right-click the References section of SharePointDemo and update the dtSearch Engine reference to point to that directory.
Verifying SharePoint Site Collections. At this
point, you’ll need to verify you have a SharePoint Farm with Site
Collection documents to index. Appendix A at the end of this paper walks
you through the creation of one or more Site Collections including the
addition of documents, if this is not already part of your SharePoint
setup. Note that everything in SharePoint is a web resource represented
by a URL; physical file paths are not available through SharePoint.
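Because every document is addressed by URL, an indexer generally needs to turn a site URL plus a server-relative path into one absolute URL before downloading. The following is a minimal sketch of that step. The helper name SharePointUrl and the document path used in the comments are assumptions made for illustration; they are not part of SharePointDemo.

```csharp
using System;

// Hypothetical helper, not part of SharePointDemo.
public static class SharePointUrl
{
    // Combines a site URL with a server-relative path such as
    // "/sites/mydocs/Docs/report.docx" into one absolute URL.
    public static string ToAbsolute(string siteUrl, string serverRelativePath)
    {
        if (string.IsNullOrEmpty(siteUrl))
            throw new ArgumentException("Site URL is required.", nameof(siteUrl));
        if (string.IsNullOrEmpty(serverRelativePath) || serverRelativePath[0] != '/')
            throw new ArgumentException("Path must start with '/'.", nameof(serverRelativePath));

        // A leading '/' makes the path relative to the host, so only the
        // scheme and host of siteUrl are carried over.
        var baseUri = new Uri(siteUrl, UriKind.Absolute);
        return new Uri(baseUri, serverRelativePath).AbsoluteUri;
    }
}
```

The resulting string is the kind of value that can later be handed to a download call, since no physical file path exists to open directly.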
Running SharePoint Demo
You are now ready to run the project, SharePointDemo. Go to the Debug menu and select Start Debugging.
Selecting Site Collections. Once running, Form1.cs will appear. The application defaults to the index path c:\index. Before clicking Start Indexing, click the Add button, which opens AddWebForm.cs.
AddWebForm.cs includes a ListBox along with a button, Find SharePoint Sites. This button will look for Site Collections. The sample below includes a couple of Site Collections.
The code that populates the ListBox contains a loop within a loop.
SharePoint Site Collections are children of SharePoint Web Services,
necessitating the nested loops.
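The nested enumeration described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from AddWebForm.cs: the class and method names are hypothetical, and the sketch walks from farm services to web applications to Site Collections using the SharePoint server object model, so it only compiles and runs on a machine with the SharePoint server assemblies installed.

```csharp
using System.Collections.Generic;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Administration;

// Hypothetical helper; SharePointDemo's own form code may differ.
public static class SiteCollectionFinder
{
    public static List<string> FindSiteCollectionUrls()
    {
        var urls = new List<string>();

        // Outer loop: every service in the local farm. Only web services
        // host web applications, so other service types are skipped.
        foreach (SPService service in SPFarm.Local.Services)
        {
            var webService = service as SPWebService;
            if (webService == null) continue;

            foreach (SPWebApplication webApp in webService.WebApplications)
            {
                // Inner loop: the Site Collections of each web application.
                foreach (SPSite site in webApp.Sites)
                {
                    try { urls.Add(site.Url); }
                    finally { site.Dispose(); } // SPSite objects hold unmanaged resources
                }
            }
        }
        return urls;
    }
}
```

The returned URLs are what a ListBox such as the one in AddWebForm.cs could display for selection.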
After selecting one or more Site Collections to index in AddWebForm.cs, return to Form1.cs.
Indexing. Next, select Start Indexing in Form1.cs to begin the dtSearch Engine indexing process.
The DataSource Interface
The sample code uses a SharePointDataSource class, which implements the dtSearch DataSource interface to provide documents to the dtSearch indexer.
To index the SharePoint data, the EnumerateFiles method is first called to generate a list of files in the sites to be indexed. For each file, EnumerateFile gets a URL that SharePointDataSource can use to download the file. The sample uses filename and extension filters to identify documents to be skipped.
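A filename and extension filter of the kind mentioned above might look like the following minimal sketch. The class name DocumentFilter, the skipped extensions, and the skipped name prefix are all assumptions chosen for illustration; the sample's actual filter lists may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: decides which document URLs are worth downloading.
public static class DocumentFilter
{
    // Assumed skip list: SharePoint page and styling files rather than documents.
    private static readonly HashSet<string> SkippedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".aspx", ".master", ".css", ".js" };

    // Assumed skip list: "~$" marks temporary Office lock files.
    private static readonly string[] SkippedNamePrefixes = { "~$" };

    public static bool ShouldIndex(string url)
    {
        if (string.IsNullOrWhiteSpace(url)) return false;

        // Ignore any query string, then isolate the last path segment.
        string path = url.Split('?')[0];
        string fileName = path.Substring(path.LastIndexOf('/') + 1);
        if (fileName.Length == 0) return false;

        int dot = fileName.LastIndexOf('.');
        string extension = dot >= 0 ? fileName.Substring(dot) : string.Empty;
        if (SkippedExtensions.Contains(extension)) return false;

        return !SkippedNamePrefixes.Any(
            p => fileName.StartsWith(p, StringComparison.OrdinalIgnoreCase));
    }

    // Returns only the URLs that pass the filter, in their original order.
    public static List<string> Apply(IEnumerable<string> urls)
    {
        return urls.Where(ShouldIndex).ToList();
    }
}
```

Filtering before download keeps the URL list short, so the indexer never requests pages or lock files it would only discard.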
Once the indexer starts, it will call GetNextDoc() repeatedly to get documents to index. On each call, GetNextDoc() will use one of the URLs in its list to get a document to index. For each URL, the SharePointDataSource uses a .NET WebClient object to download the bytes and assigns the downloaded bytes to DocBytes.
byte[] bytes = WebClient.DownloadData(filesToIndex[currFileOffset]);
DocBytes = bytes;
currFileOffset += 1;
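To show how the download step fits the GetNextDoc() and Rewind() pattern, here is a minimal, self-contained sketch. It does not reference the dtSearch assemblies: DocumentSourceBase is a hypothetical stand-in for the dtSearch DataSource base type, its member signatures are assumptions modeled on the names used in this article, and DocName is an assumed property. The download is passed in as a delegate so the logic can be exercised without a network; skipping URLs that fail to download is a design choice of this sketch, not necessarily what SharePointDemo does.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical stand-in for the dtSearch DataSource base type, declared here
// only so the sketch compiles on its own. Signatures are assumptions.
public abstract class DocumentSourceBase
{
    public string DocName { get; protected set; }  // assumed: identifies the current document
    public byte[] DocBytes { get; protected set; } // raw bytes handed to the indexer
    public abstract bool GetNextDoc();             // false when no documents remain
    public abstract bool Rewind();                 // restart from the first document
}

public class UrlListDataSource : DocumentSourceBase
{
    private readonly IList<string> filesToIndex;
    private readonly Func<string, byte[]> download;
    private int currFileOffset;

    public UrlListDataSource(IList<string> filesToIndex, Func<string, byte[]> download)
    {
        if (filesToIndex == null) throw new ArgumentNullException(nameof(filesToIndex));
        if (download == null) throw new ArgumentNullException(nameof(download));
        this.filesToIndex = filesToIndex;
        this.download = download;
    }

    public override bool GetNextDoc()
    {
        while (currFileOffset < filesToIndex.Count)
        {
            string url = filesToIndex[currFileOffset];
            currFileOffset += 1;
            try
            {
                DocBytes = download(url);
                DocName = url;
                return true;
            }
            catch (WebException)
            {
                // Sketch-level choice: skip documents that cannot be
                // downloaded and move on to the next URL.
            }
        }
        DocBytes = null;
        DocName = null;
        return false;
    }

    public override bool Rewind()
    {
        currFileOffset = 0;
        return true;
    }
}
```

In a real run, the delegate could wrap a WebClient, for example url => webClient.DownloadData(url), where webClient is an instance configured with suitable credentials.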
Conclusion and Next Steps
SharePointDemo as an Indexing Starting Point. This paper has
introduced a new Visual Studio Project which brings together two
separate and powerful APIs. By combining the SharePoint APIs and the
dtSearch Engine, the sample project makes it possible to index key
business data in SharePoint Site Collections. Developers can use
SharePointDemo as a starting point to build their own custom indexing solutions.
Search Options and Appendices. After indexing, developers will
typically turn to dtSearch developer products to enable web-based
concurrent indexed searching. dtSearch developer products provide
efficient multithreaded Internet or Intranet-based searching, with no
limit on the number of concurrent search threads. For more details,
please see dtSearch.com/PLF_engine_2.html
(You can also use dtSearch Desktop to search the indexes that you
create using this sample. To access the indexes in dtSearch Desktop,
click Index > Recognize Index in dtSearch Desktop.)
Appendix A – Setting up SharePoint 2016
Because the security model in SharePoint can become quite complex, we
assume in this article that the documents in the Site Collection allow
for anonymous access. That allows us to focus on the core aspects of the
code that do the indexing. To learn more about permissions and Site Collections, see this link, https://support.office.com/en-ie/article/Edit-and-manage-permissions-for-a-SharePoint-list-or-library-02d770f3-59eb-4910-a608-5f84cc297782?ui=en-US&rs=en-IE&ad=IE.
Creating the Site Collection
The central administration pages available for SharePoint
administrators allow them to create Site Collections. From the central
administration page, select Application Management -> Site
Collections -> Create Site Collections. A web form will appear
allowing you to define the attributes for the new Site Collection.
To create the site, enter a title and description, a URL, and a site
template (e.g., Team, Blog, or Developer Site). A message will
appear: "site was created successfully with the specified URL." The
example in this article is http://mybtsp/sites/mydocs
The Site Collection page will look something like this:
Adding documents to the new Site Collection (http://mybtsp/sites/mydocs)
Next, add documents to the Site Collection. These could include any
dtSearch-supported file types such as Word, Excel, PowerPoint, Access,
OneNote, ZIP, HTML, XML/XSL, PDF, etc.
Setting Anonymous Access
To set anonymous access, please see:
You can also find more information about managing permissions here, https://technet.microsoft.com/en-us/library/ff607719(v=office.16).aspx.