Indexing Page Content With the FetchPageContentProcessor
The FetchPageContentProcessor processor executes an HTTP request, gets the content, and sets the data from the response in the BinaryData Coveo for Sitecore application code field.
In the October 2018 release of Coveo for Sitecore 4.1, FetchPageContentProcessor became the default HTML indexing processor on fresh installations of Coveo for Sitecore.
This HTTP request introduces a delay when indexing.
Enabling the Processor
You need to add the <coveoGetBinaryData> element and the <coveoPostItemProcessingPipeline> child <processor> element shown below to the Coveo.SearchProvider.Custom.config file.
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <coveoGetBinaryData>
        <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
          <!-- FetchPageContentProcessor processor configurations here -->
        </processor>
      </coveoGetBinaryData>
      <coveoPostItemProcessingPipeline>
        <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
      </coveoPostItemProcessingPipeline>
    </pipelines>
  </sitecore>
</configuration>
This can be achieved in two different ways depending on your version of Coveo for Sitecore 4.1 and whether you’re currently indexing HTML with the HtmlContentInBodyWithRequestsProcessor processor.
For October 2018 or More Recent Releases When Currently Using HtmlContentInBodyWithRequestsProcessor
The October 2018 release of Coveo for Sitecore 4.1 and subsequent releases provide a simple mechanism to switch your HTML indexing from the HtmlContentInBodyWithRequestsProcessor processor to the FetchPageContentProcessor processor, without having to edit your configuration files.
Follow the instructions in Step 3: Enable the FetchPageContent Processor of the September 2018 to October 2018 Coveo for Sitecore upgrade steps.
Once the FetchPageContentProcessor processor is enabled, you might need to configure it further, for example, to set up HTTP request authentication for secured items.
Proceed to Configuring the Processor.
For All Other Situations
You must edit the Coveo.SearchProvider.Custom.config file directly.
- Using a text editor, open the file App_Config\Include\Coveo\Coveo.SearchProvider.Custom.config (or App_Config\Modules\Coveo\Coveo.SearchProvider.Custom.config for Sitecore 9 instances).
- In the coveoPostItemProcessingPipeline element, delete the following element.
  <processor type="Coveo.SearchProvider.Processors.HtmlContentInBodyWithRequestsProcessor, Coveo.SearchProviderBase" />
- In the coveoPostItemProcessingPipeline element, add a new processor which executes the coveoGetBinaryData pipeline.
  <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
      <pipelines>
        <coveoPostItemProcessingPipeline>
          <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
        </coveoPostItemProcessingPipeline>
      </pipelines>
    </sitecore>
  </configuration>
- Add the coveoGetBinaryData pipeline.
  <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
      <pipelines>
        <coveoGetBinaryData>
        </coveoGetBinaryData>
      </pipelines>
    </sitecore>
  </configuration>
- In the coveoGetBinaryData pipeline, add the FetchPageContentProcessor processor.
  <coveoGetBinaryData>
    <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
      <inboundFilter hint="list:AddInboundFilter">
        <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
      </inboundFilter>
      <preAuthentication hint="list:AddPreAuthenticator"></preAuthentication>
      <postProcessing hint="list:AddPostProcessing"></postProcessing>
    </processor>
  </coveoGetBinaryData>
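Taken together, the steps above should leave the relevant part of Coveo.SearchProvider.Custom.config looking roughly like the following. This is only a sketch assembled from the fragments shown in the steps: it assumes the HtmlContentInBodyWithRequestsProcessor element has already been deleted, and it omits any other elements your file may contain.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <!-- Pipeline that holds the FetchPageContentProcessor processor -->
      <coveoGetBinaryData>
        <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
          <inboundFilter hint="list:AddInboundFilter">
            <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
          </inboundFilter>
          <preAuthentication hint="list:AddPreAuthenticator"></preAuthentication>
          <postProcessing hint="list:AddPostProcessing"></postProcessing>
        </processor>
      </coveoGetBinaryData>
      <coveoPostItemProcessingPipeline>
        <!-- Executes the coveoGetBinaryData pipeline defined above -->
        <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
      </coveoPostItemProcessingPipeline>
    </pipelines>
  </sitecore>
</configuration>
```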
Once the FetchPageContentProcessor processor is enabled, you might need to configure it further, for example, to set up HTTP request authentication for secured items.
Proceed to Configuring the Processor.
Configuring the Processor
The FetchPageContentProcessor processor contains the <inboundFilter>, <preAuthentication>, and <postProcessing> sections shown below.
<processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
  <inboundFilter hint="list:AddInboundFilter">
    <!-- inboundFilter configurations here -->
  </inboundFilter>
  <preAuthentication hint="list:AddPreAuthenticator">
    <!-- preAuthentication configurations here -->
  </preAuthentication>
  <postProcessing hint="list:AddPostProcessing">
    <!-- postProcessing configurations here -->
  </postProcessing>
</processor>
If you change anything in the configuration, you must rebuild or reindex your items for the new settings to be applied.
Here are more details about the configuration options in each section.
The <inboundFilter> Section
This section allows you to specify the Sitecore items that you want to provide an HTML representation for. Filtering can reduce the number of requests that log an error in your Sitecore logs.
Available Configurations:
- <itemsWithLayout>
  This filter specifies that only items that have a layout must be processed, eliminating many unnecessary requests for items that most likely don’t have any HTML content.
  We recommend that you keep this filter at all times, but you can remove it in specific scenarios, such as when using Wildcard items, which don’t necessarily have a layout.
- Custom Processor
  You can implement your own processor using the Coveo.SearchProvider.Processors.FetchPageContent.Filters.IFetchPageContentInboundFilterProcessor interface.
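As an illustration, a custom filter would presumably be registered in the <inboundFilter> list next to <itemsWithLayout>. In the sketch below, the element name myCustomFilter, the type MyCompany.Indexing.MyInboundFilter, and the assembly MyCompany.Indexing are hypothetical placeholders; the class is assumed to implement the IFetchPageContentInboundFilterProcessor interface.

```xml
<inboundFilter hint="list:AddInboundFilter">
  <!-- Filter provided by Coveo for Sitecore: only items with a layout are processed -->
  <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
  <!-- Hypothetical custom filter; replace the type and assembly with your own -->
  <myCustomFilter type="MyCompany.Indexing.MyInboundFilter, MyCompany.Indexing" />
</inboundFilter>
```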
The <preAuthentication> Section
This section allows you to authenticate the request that will be sent to fetch the HTML content.
Available Configurations:
- FormsRequest
  If your page requires authentication, the FormsRequest processor allows you to authenticate the request like an end user would. The FormsRequest processor is similar to the previous Form Authentication method (see Configuring Form Authentication for the HTML Content In Body With Requests Processor).
  Here is an overview of what this processor does:
  - Takes the configuration and builds a POST request.
  - Executes the POST request to the login page.
  - Takes the response and stores its cookies.
  - Takes the cookies and assigns them to the HTTP request used to get the binary data.
  FormsRequest configurations:
  - The credentialsExpireIn attribute is used to keep the cookies for a period of time. In the example below, it’s set to 5 minutes.
  - The formsAuthConfiguration object contains attributes used to configure the POST request sent to authenticate the user.
    - formsAuthLoginPage: The URL of the login page.
    - formsAuthUserControl: The name attribute value of the input control used by the user to enter their username.
    - formsAuthPasswordControl: The name attribute value of the input control used by the user to enter their password.
    - formsAuthLoginCommand: The name attribute value of the submit control that the user clicks when logging in, followed by the value attribute value of the same submit control.
  - The username is the Sitecore username used to authenticate the request.
  - The password is the password used to authenticate the request.
  Example
  <preAuthentication hint="list:AddPreAuthenticator">
    <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.FormsRequest, Coveo.SearchProviderBase" singleInstance="true">
      <credentialsExpireIn>00:05:00</credentialsExpireIn>
      <formsAuthConfiguration type="Coveo.Framework.Configuration.FormsAuthConfiguration, Coveo.Framework">
        <formsAuthLoginPage>http://mysitecoresite.local/sitecore/login</formsAuthLoginPage>
        <formsAuthUserControl>UserName</formsAuthUserControl>
        <formsAuthPasswordControl>Password</formsAuthPasswordControl>
        <formsAuthLoginCommand>ctl07=Log+in</formsAuthLoginCommand>
      </formsAuthConfiguration>
      <username>sitecore\coveocrawler</username>
      <password>b</password>
    </processor>
  </preAuthentication>
  When this processor is enabled, if you’re logging into the Sitecore default login page, you should see the following log during the indexing operation:
  37148 14:27:28 INFO AUDIT (sitecore\coveocrawler): Login
- AddSingleSignOnHeaders
  This processor ensures that the HTTP request has the required headers to follow single sign-on redirections.
  To configure it, you need to add a list of URLs that are single sign-on logins.
  Example
  With the following configuration, if the HTTP request to get the binary data gets redirected to http://myssosite.local/login.aspx, the username mycustomdomain\unicorns and password ra1nb0ws are used to try to log in.
  <preAuthentication hint="list:AddPreAuthenticator">
    <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.AddSingleSignOnHeaders, Coveo.SearchProviderBase" singleInstance="true">
      <LoginUrls hint="list">
        <myExampleSite>http://myssosite.local/login.aspx</myExampleSite>
      </LoginUrls>
      <Username>mycustomdomain\unicorns</Username>
      <Password>ra1nb0ws</Password>
    </processor>
  </preAuthentication>
- Custom Processor
  You can implement your own authentication processor by using the Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.IFetchPageContentPreAuthenticatorProcessor interface.
  This is useful if you have a custom authentication process and need to add headers or set a specific cookie to allow the request to get the content.
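To see how the formsAuthConfiguration values in the FormsRequest example relate to a login page, consider the following hypothetical, simplified login form. It is not the actual Sitecore login page markup, and the action URL is a placeholder; the control names are chosen only so that they line up with the values used in the FormsRequest example.

```xml
<!-- Hypothetical, simplified login form; real login pages will differ -->
<form method="post" action="/sitecore/login">
  <!-- name="UserName" is the value to use for formsAuthUserControl -->
  <input type="text" name="UserName" />
  <!-- name="Password" is the value to use for formsAuthPasswordControl -->
  <input type="password" name="Password" />
  <!-- name="ctl07" and value="Log in" combine into formsAuthLoginCommand: ctl07=Log+in -->
  <input type="submit" name="ctl07" value="Log in" />
</form>
```

Note that the space in the submit control value appears as a + sign in the formsAuthLoginCommand example, which is consistent with how form data is URL-encoded in a POST request.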
The <postProcessing> Section
This section allows you to process the content of the HTML page before sending it to the index, removing some sections that are useless in the index or that drive relevance down.
Post Processing comes at a performance cost. It requires the HTML byte array to be decoded, modified, and re-encoded.
Available Configurations:
- CleanHtml
  The CleanHtml processor removes the sections located between two comments.
  Example
  The following configuration removes the content between a BEGIN NOINDEX and an END NOINDEX comment.
  <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
    <startComment>BEGIN NOINDEX</startComment>
    <endComment>END NOINDEX</endComment>
  </processor>
  For example, given the following markup:
  <head>
    <title>My Site</title>
  </head>
  <body>
    <!-- BEGIN NOINDEX -->
    <header>I don't want to index this header.</header>
    <!-- END NOINDEX -->
    <div>
      Some content.
    </div>
  </body>
  The result in the BinaryData field is:
  <head>
    <title>My Site</title>
  </head>
  <body>
    <div>
      Some content.
    </div>
  </body>
- Custom Processor
  You can implement your own HTML processing by using the Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.IFetchPageContentHtmlPostProcessingProcessor interface.
  This is useful if you want to remove content in a different manner than the implementations provided by Coveo for Sitecore.
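The CleanHtml example shows only the <processor> element itself. Assuming it is registered like the entries in the other sections, it would sit inside the <postProcessing> section of the FetchPageContentProcessor processor, roughly as sketched below. The other two sections are left empty here for brevity.

```xml
<processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
  <inboundFilter hint="list:AddInboundFilter"></inboundFilter>
  <preAuthentication hint="list:AddPreAuthenticator"></preAuthentication>
  <postProcessing hint="list:AddPostProcessing">
    <!-- Removes everything between the BEGIN NOINDEX and END NOINDEX comments -->
    <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
      <startComment>BEGIN NOINDEX</startComment>
      <endComment>END NOINDEX</endComment>
    </processor>
  </postProcessing>
</processor>
```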