Enterprise Search Protocol Handlers
Protocol handlers extend search capabilities by making new content sources available for Enterprise Search in Microsoft Office SharePoint Server 2007. This topic provides an overview of protocol handlers and how they fit into the Enterprise Search architecture, and discusses using the protocol handler interfaces to implement a protocol handler for crawling custom content sources.
Enterprise Search Indexing System Overview
The Enterprise Search Indexing system is made up of several different components, as described in the following list.
Index engine: Manages the content crawling process using the content sources and crawl rules configured for the Search service. The Index engine maintains a crawl URL queue and passes the crawl URLs to the Filter Daemon during the content crawling process. The crawl URL queue is initially populated with the content source start addresses.
Content sources: Specify what content to crawl.
Crawl rules: Specify what content to exclude from the crawl, as well as the credentials to use for the crawl.
Filter Daemon: Handles crawl URL requests from the Index engine by determining the appropriate protocol handler to use. Using the protocol handler, the Filter Daemon fetches the content, extracting and parsing the text and properties, and then invokes the appropriate IFilter, if needed.
Protocol handlers: Open content sources in their native protocols and expose documents and other items to be filtered.
IFilters: Open documents and other content source items in their native formats and filter these into chunks of text and properties. The IFilter implementation can be part of the protocol handler component, or it can be a separate component.
Protocol Handler Overview
Protocol handlers are free-threaded COM objects that implement the ISearchProtocol interface.
Protocol handlers are registered on the index server at HKLM\Software\Microsoft\OfficeServer\12.0\Search\Setup\ProtocolHandlers.
URL Scheme
The format for a protocol handler's URL scheme is scheme://hostname/path.extension.
The Filter Daemon uses the URL scheme to determine which protocol handler to use for a particular crawl URL. For more information, see The Crawl Process.
Protocol Handler Types
Enterprise Search provides support for two types of protocol handlers:
Hierarchical: Works with structured content sources, such as file shares, which include structures such as directories or folders that must be traversed.
Link-based: Works with content sources such as Web sites, where links within the content indicate how the source is traversed.
Initializing the Protocol Handlers
The Filter Daemon initializes all the registered protocol handlers with a call to the Init method for the protocol handler's ISearchProtocol implementation. The Filter Daemon uses the ISearchProtocol methods to process crawl URLs from the Index engine. This process is described in the following section.
The Crawl Process
The Index engine initiates crawls of content sources. There are two types of crawls:
Full crawl: A crawl of all the content. The crawl URL queue is seeded with the start addresses for the content source being crawled. Duplicate entries are removed from the queue. As the crawl progresses, the Index engine adds crawl URLs to the queue as they are discovered during the filtering process. Deleted items are removed from the content index. The crawl process continues until the crawl queue is empty.
Incremental crawl: A crawl of only modified content. The crawl URL queue is seeded with start address URLs and the URLs from the crawl history for that content source. The Index engine passes the timestamp to the Filter Daemon with the crawl URL. For SharePoint content, the Index engine relies on the Change Log feature in Windows SharePoint Services 3.0, so that only content logged in the Change Log is crawled.
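The queue behavior of a full crawl can be modeled in a few lines. The following is a minimal conceptual sketch in Python, not the Index engine's implementation: the full_crawl function and the discover_links callback are hypothetical names, and discover_links stands in for the filtering step that reports newly discovered URLs.

```python
from collections import deque


def full_crawl(start_addresses, discover_links):
    """Crawl every reachable URL once and return the URLs in crawl order.

    discover_links is a hypothetical callback standing in for the filtering
    step: it takes a crawl URL and returns the URLs found in that item.
    """
    queue = deque()
    seen = set()
    # Seed the queue with the start addresses, dropping duplicate entries.
    for url in start_addresses:
        if url not in seen:
            seen.add(url)
            queue.append(url)
    crawled = []
    # Continue until the crawl queue is empty.
    while queue:
        url = queue.popleft()
        crawled.append(url)
        # Add URLs discovered during filtering, skipping ones already queued.
        for link in discover_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

For example, given a small in-memory link map such as site = {"a": ["b", "c"], "b": ["a", "c"], "c": []}, calling full_crawl(["a", "a"], lambda u: site.get(u, [])) visits each of the three items exactly once, even though the start address is listed twice and the items link back to each other.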
Selecting the Protocol Handler
The Filter Daemon determines the appropriate protocol handler for each crawl URL from the Index engine, based on the crawl URL and the URL scheme. For example, for the crawl URL https://www.microsoft.com/, the Filter Daemon selects the default HTTP protocol handler, which is a link-based protocol handler.
For the crawl URL \\CentralSales\Public\, the Filter Daemon selects the default File protocol handler, which is a hierarchical protocol handler.
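The selection step amounts to a lookup keyed on the URL scheme. The sketch below illustrates the idea in Python under stated assumptions: the HANDLERS table, the handler labels, and the select_handler function are hypothetical, and the real Filter Daemon resolves handlers from the registry rather than from a dictionary.

```python
from urllib.parse import urlsplit

# Hypothetical registry: scheme -> (handler label, handler type).
# The labels are illustrative and are not the registered COM class names.
HANDLERS = {
    "http": ("HTTP protocol handler", "link-based"),
    "https": ("HTTP protocol handler", "link-based"),
    "file": ("File protocol handler", "hierarchical"),
}


def select_handler(crawl_url):
    """Return the (label, type) pair for the handler matching the URL scheme."""
    # Assumption: a UNC path such as \\server\share carries no explicit
    # scheme, so this sketch maps it to the file handler.
    if crawl_url.startswith("\\\\"):
        scheme = "file"
    else:
        scheme = urlsplit(crawl_url).scheme.lower()
    if scheme not in HANDLERS:
        raise LookupError(f"no protocol handler registered for scheme {scheme!r}")
    return HANDLERS[scheme]
```

With this table, select_handler("https://www.microsoft.com/") resolves to the link-based HTTP entry, and select_handler("\\\\CentralSales\\Public\\") resolves to the hierarchical File entry. A custom protocol handler would correspond to one more entry keyed on its own scheme.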
Returning the URL Accessor
The CreateAccessor method of the ISearchProtocol interface is called separately for each crawl URL.
Note
Only one crawl URL is processed per CreateAccessor method call, but there can be multiple calls to this method simultaneously. As a result, multiple threads can be working in parallel.
The CreateAccessor method returns a URLAccessor object that the Filter Daemon uses to process the crawl URL. The URLAccessor object implements the IUrlAccessor interface.
Filtering the Content
The IUrlAccessor interface contains the BindToFilter method and the BindToStream method; you must implement at least one of these methods for each crawl URL.
BindToFilter
If the crawl URL is not associated with a binary stream that is parsed by one of the standard filters, you must implement the BindToFilter method. In this scenario, the IFilter must also be implemented as part of the URLAccessor object.
You can also implement the BindToFilter method to extract the metadata associated with content items. The protocol handler sends chunks of data containing the properties and links to the Index engine.
If the crawl URL is a folder or directory, you should implement the BindToFilter method for the protocol handler to enumerate the folder or directory contents. The protocol handler should then emit the PID_GTHR_DIRLINK_WITH_TIME property for each item. This property contains the item's URL and timestamp. During an incremental crawl, after the Index engine receives the PID_GTHR_DIRLINK_WITH_TIME property for a given item, it compares the timestamp with the value that is stored for that item in the crawl history. If the timestamp has not changed, the item is not crawled. If there are no changes in the directory, or if a single item has not changed with respect to the timestamp passed by the crawler, the protocol handler should return PRTH_S_NOT_MODIFIED for the content item, and no further processing of the item is required. For more information about PRTH_S_NOT_MODIFIED, see Protocol Handler Error Messages.
This makes incremental crawls more efficient, because the protocol handler does not need to bind to each item individually, only to those items that have changed.
Note
If the protocol handler's BindToFilter method does not emit PID_GTHR_DIRLINK_WITH_TIME, and the CreateAccessor method does not support returning PRTH_S_NOT_MODIFIED, incremental crawls perform essentially the same as full crawls.
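The timestamp comparison that drives an incremental crawl can be sketched as follows. This is a conceptual Python model, not the Index engine's code: items_to_crawl is a hypothetical name, the (url, timestamp) pairs stand in for the PID_GTHR_DIRLINK_WITH_TIME values emitted for a folder, and treating items that are absent from the crawl history as changed is an assumption.

```python
def items_to_crawl(dir_links, crawl_history):
    """Return the URLs whose timestamp differs from the crawl history.

    dir_links: iterable of (url, timestamp) pairs, standing in for the
        PID_GTHR_DIRLINK_WITH_TIME values emitted for a folder.
    crawl_history: dict mapping url -> timestamp from the previous crawl.
    """
    changed = []
    for url, timestamp in dir_links:
        # Assumption: an item with no history entry is new and is crawled.
        # An item whose timestamp matches the history is skipped.
        if crawl_history.get(url) != timestamp:
            changed.append(url)
    return changed
```

Only the URLs returned by this check would need a separate bind, which is where the efficiency gain over a full crawl comes from.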
BindToStream
Implement the BindToStream method if there is a binary stream associated with the crawl URL that must be parsed by one of the standard filters, such as the Text, HTML, or Microsoft Office filter. The BindToStream method invokes the appropriate filter to extract the item's content.
For more information about creating a filter, see How to Write a Filter for Use by SharePoint Portal Server 2003 and Other Microsoft Search-Based Products.
The Filter Daemon calls each of the BindToFilter and BindToStream methods only once for each crawl URL. At least one of the methods must succeed for the content item associated with the crawl URL to be filtered.
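The rule that at least one bind method must succeed can be illustrated with a small Python model. Everything here is hypothetical: the NotImplementedByHandler exception stands in for a failure result, the snake_case method names mirror BindToFilter and BindToStream, and the order in which the two methods are tried is an assumption rather than documented behavior.

```python
class NotImplementedByHandler(Exception):
    """Hypothetical stand-in for a failure result from a bind method."""


class FolderAccessor:
    """Example accessor for a folder: exposes a filter but no binary stream."""

    def bind_to_filter(self):
        # A folder is enumerated through the filter path.
        return ["child-1", "child-2"]

    def bind_to_stream(self):
        raise NotImplementedByHandler()


def bind_content(accessor):
    """Call each bind method once and return the results that succeeded."""
    results = {}
    for kind, method in (("filter", accessor.bind_to_filter),
                         ("stream", accessor.bind_to_stream)):
        try:
            results[kind] = method()
        except NotImplementedByHandler:
            pass  # This path is not available for the item.
    if not results:
        raise RuntimeError("neither bind method succeeded; the item cannot be filtered")
    return results
```

Calling bind_content(FolderAccessor()) returns only the "filter" entry, because the folder accessor in this sketch does not expose a stream.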
Security
The GetSecurityDescriptor method retrieves the security information associated with the content item, such as the different kinds of access allowed for particular users and groups of users. If you implement this method, the Filter Daemon provides the Index engine with security information about the content item. The Index engine incorporates this information into the full-text index with the document content.
The Query engine uses the security information when it executes queries against the full-text index to determine whether the user submitting a search query has access to the items in the results. Based on this, the Query engine performs security trimming, so that the search results display only the items that the user has access to. Therefore, if you do not implement the GetSecurityDescriptor method, all users are able to retrieve and view the contents of the item in their search query results. For more information about security trimming, see Enterprise Search Security Model.
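Security trimming can be pictured as a filter over the result set. The following Python sketch is a simplified model under stated assumptions: trim_results is a hypothetical name, each security descriptor is reduced to a set of allowed principal names, and None represents an item indexed without a security descriptor.

```python
def trim_results(results, user_principals):
    """Keep only the results that the querying user is allowed to see.

    results: list of (url, allowed) pairs, where allowed is a set of
        principal names, or None if no security descriptor was indexed.
    user_principals: set of principal names for the querying user.
    """
    visible = []
    for url, allowed in results:
        # With no descriptor stored, the item is visible to every user.
        if allowed is None or allowed & user_principals:
            visible.append(url)
    return visible
```

In this model, an item with allowed set to None appears in every user's results, which mirrors the consequence of leaving GetSecurityDescriptor unimplemented.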