Index Page Content With the FetchPageContentProcessor

The FetchPageContentProcessor processor sends an HTTP request for the item's page, retrieves the content, and stores the response data in the BinaryData field used by the Coveo for Sitecore application code.

The FetchPageContentProcessor is the default and recommended Coveo for Sitecore HTML indexing processor.

Because the processor sends an HTTP request for each item it processes, it introduces a delay when indexing.

Enabling the Processor

You need to add the <coveoGetBinaryData> element and the <coveoPostItemProcessingPipeline> child <processor> element shown below to the Coveo.SearchProvider.Custom.config file.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <coveoGetBinaryData>
        <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
          <!-- FetchPageContentProcessor processor configurations here -->
        </processor>
      </coveoGetBinaryData>
      <coveoPostItemProcessingPipeline>
        <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
      </coveoPostItemProcessingPipeline>
    </pipelines>
  </sitecore>
</configuration>

This can be achieved in two different ways, depending on whether you’re currently indexing HTML with the HtmlContentInBodyWithRequestsProcessor processor.

When Currently Using HtmlContentInBodyWithRequestsProcessor

If you upgraded from Coveo for Sitecore 4.1 to Coveo for Sitecore 5, you might still be using the HtmlContentInBodyWithRequestsProcessor processor. Coveo for Sitecore 5 provides a simple mechanism to switch your HTML indexing from the HtmlContentInBodyWithRequestsProcessor processor to the FetchPageContentProcessor processor without editing your configuration files.

To switch from the HtmlContentInBodyWithRequestsProcessor processor to the FetchPageContentProcessor processor

  1. Go to the Configuration section of the Command Center, accessible at http://<INSTANCE_HOSTNAME>/coveo/command-center/index.html#configuration/.

  2. In the Configure options section, if the Index rendered HTML option is already selected:

    1. Select Only index Sitecore item data.

    2. Click Apply and Restart.

  3. In the Configure options section, select the Index rendered HTML option.

  4. Click Apply and Restart.

You should now see the following configuration in your App_Config\Include\Coveo\Coveo.SearchProvider.Custom.config file.

<coveoPostItemProcessingPipeline>
  <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
</coveoPostItemProcessingPipeline>
<coveoGetBinaryData>
  <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
    <inboundFilter hint="list:AddInboundFilter">
      <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
    </inboundFilter>
    <preAuthentication hint="list:AddPreAuthenticator" />
    <postProcessing hint="list:AddPostProcessing">
      <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
        <startComment>BEGIN NOINDEX</startComment>
        <endComment>END NOINDEX</endComment>
      </processor>
    </postProcessing>
  </processor>
</coveoGetBinaryData>

Once the FetchPageContentProcessor processor is enabled, you might need to configure it further, for example, to set up HTTP request authentication for secured items. Proceed to Configuring the Processor.

Enabling the FetchPageContentProcessor Processor in All Other Situations

You must edit the Coveo.SearchProvider.Custom.config file directly.

  1. Using a text editor, open the Coveo.SearchProvider.Custom.config file.

    Sitecore 7 and 8: The Coveo.SearchProvider.Custom.config file is located in the <SITECORE_INSTANCE_ROOT>\Website\App_Config\Include\Coveo folder.

    Sitecore 9 and 10: The Coveo.SearchProvider.Custom.config file is located in the <SITECORE_INSTANCE_ROOT>\App_Config\Include\Coveo folder.

  2. In the coveoPostItemProcessingPipeline element, delete the following element.

    <processor type="Coveo.SearchProvider.Processors.HtmlContentInBodyWithRequestsProcessor, Coveo.SearchProviderBase" />
    
  3. In the coveoPostItemProcessingPipeline element, add a new processor which executes the coveoGetBinaryData pipeline.

    <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
      <sitecore>
        <pipelines>
          <coveoPostItemProcessingPipeline>
            <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
          </coveoPostItemProcessingPipeline>
        </pipelines>
      </sitecore>
    </configuration>
    
  4. Add the coveoGetBinaryData pipeline.

    <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
      <sitecore>
        <pipelines>
          <coveoGetBinaryData>
          </coveoGetBinaryData>
        </pipelines>
      </sitecore>
    </configuration>
    
  5. In the coveoGetBinaryData pipeline, add the FetchPageContentProcessor processor.

    <coveoGetBinaryData>
      <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
        <inboundFilter hint="list:AddInboundFilter">
            <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
        </inboundFilter>
        <preAuthentication hint="list:AddPreAuthenticator"></preAuthentication>
        <postProcessing hint="list:AddPostProcessing"></postProcessing>
      </processor>
    </coveoGetBinaryData>
    

Once the FetchPageContentProcessor processor is enabled, you might need to configure it further, for example, to set up HTTP request authentication for secured items. Proceed to Configuring the Processor.

Configuring the Processor

The FetchPageContentProcessor processor contains the <inboundFilter>, <preAuthentication>, and <postProcessing> sections shown below.

<processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
  <inboundFilter hint="list:AddInboundFilter">
    <!-- inboundFilter configurations here -->
  </inboundFilter>
  <preAuthentication hint="list:AddPreAuthenticator">
    <!-- preAuthentication configurations here -->
  </preAuthentication>
  <postProcessing hint="list:AddPostProcessing">
    <!-- postProcessing configurations here -->
  </postProcessing>
</processor>

If you change anything in the configuration, you must rebuild or reindex your items for the new settings to be applied.

Here are more details about the configuration options in each section.

The <inboundFilter> Section

This section allows you to specify the Sitecore items that you want to provide an HTML representation for. Filtering can reduce the number of requests that log an error in your Sitecore logs.

Available Configurations:

  • <itemsWithLayout>

    This filter ensures that only items that have a layout are processed, eliminating many unnecessary requests for items that most likely don’t have any HTML content.

    We recommend that you keep this filter at all times, but you can remove it in specific scenarios, such as when using Wildcard items, which don’t necessarily have a layout.

  • Custom Processor

    You can implement your own processor using the Coveo.SearchProvider.Processors.FetchPageContent.Filters.IFetchPageContentInboundFilterProcessor interface.
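
    A custom filter would presumably be registered in the <inboundFilter> list the same way as <itemsWithLayout>. The following is a minimal sketch, assuming a hypothetical MyCompany.Indexing.Filters.SkipRedirectItems class in a hypothetical MyCompany.Indexing assembly that implements the interface above. Neither name comes from Coveo for Sitecore, and the child element name is arbitrary.

    ```xml
    <inboundFilter hint="list:AddInboundFilter">
      <!-- Filter shipped with Coveo for Sitecore -->
      <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
      <!-- Hypothetical custom filter: the class and assembly names are placeholders -->
      <skipRedirectItems type="MyCompany.Indexing.Filters.SkipRedirectItems, MyCompany.Indexing" />
    </inboundFilter>
    ```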

The <preAuthentication> Section

This section lets you authenticate the request that will be sent to retrieve the HTML content.

Available Configurations:

  • FormsRequest

    If a page requires authentication, the FormsRequest processor lets you authenticate the request the way an end user would. The FormsRequest processor is similar to the previous Form Authentication method (see Configuring Form Authentication).

    Here is an overview of what this processor does:

    1. Takes the configuration and builds a POST request.

    2. Executes the POST request to the login page.

    3. Takes the response and stores its cookies.

    4. Takes the cookies and assigns them to the HTTP request used to get the binary data.

    FormsRequest configurations:

    • The credentialsExpireIn attribute specifies how long the cookies are kept before the processor authenticates again. In the example below, it’s set to 5 minutes (00:05:00).

    • The formsAuthConfiguration object contains attributes used to configure the POST request sent to authenticate the user.

      • formsAuthLoginPage: The URL of the login page.
      • formsAuthUserControl: The name attribute value of the input control used by the user to enter their username.
      • formsAuthPasswordControl: The name attribute value of the input control used by the user to enter their password.
      • formsAuthLoginCommand: The name attribute value of the submit control that the user clicks when logging in, followed by the value attribute value of the same submit control.

        Spaces must be replaced with the + symbol.

    • The username is the Sitecore username used to authenticate the request.

    • The password is the password used to authenticate the request.

    For example, suppose you want to index content from your http://www.secured.com Sitecore website, whose authentication page is available at http://www.secured.com/sitecore/login.

    Inspecting this authentication page in your browser, you see the following markup:

    [Image: Sitecore Log In button HTML]

    The corresponding <preAuthentication> section configuration would look as follows:

    <preAuthentication hint="list:AddPreAuthenticator">
      <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.FormsRequest, Coveo.SearchProviderBase" singleInstance="true">
        <credentialsExpireIn>00:05:00</credentialsExpireIn>
        <formsAuthConfiguration type="Coveo.Framework.Configuration.FormsAuthConfiguration, Coveo.Framework">
          <formsAuthLoginPage>http://www.secured.com/sitecore/login</formsAuthLoginPage>
          <formsAuthUserControl>UserName</formsAuthUserControl>
          <formsAuthPasswordControl>Password</formsAuthPasswordControl>
          <formsAuthLoginCommand>LogInBtn=Log+in</formsAuthLoginCommand>
        </formsAuthConfiguration>
        <username>sitecore\coveocrawler</username>
        <password>b</password>
      </processor>
    </preAuthentication>
    

    When this processor is enabled, if you’re logging into the Sitecore default login page, you should see the following log during the indexing operation:

    37148 14:27:28 INFO  AUDIT (sitecore\coveocrawler): Login
    
  • AddSingleSignOnHeaders

    This processor ensures that the HTTP request has the required headers to follow Single Sign On redirections.

    To configure it, you need to add a list of URLs that are Single Sign On logins.

    With the following configuration, if the HTTP request to get the binary data is redirected to http://myssosite.local/login.aspx, the username mycustomdomain\unicorns and the password ra1nb0ws are used to try to log in.

    <preAuthentication hint="list:AddPreAuthenticator">
      <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.AddSingleSignOnHeaders, Coveo.SearchProviderBase" singleInstance="true">
        <LoginUrls hint="list">
          <myExampleSite>http://myssosite.local/login.aspx</myExampleSite>
        </LoginUrls>
        <Username>mycustomdomain\unicorns</Username>
        <Password>ra1nb0ws</Password>
      </processor>
    </preAuthentication>
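
    Because <LoginUrls> is declared as a list, it should accept more than one login page. The following sketch assumes a second, hypothetical login URL (http://myothersso.local/login.aspx) and arbitrary, distinct child element names. Only the first URL comes from the example above, and the same credentials would apply to both login pages.

    ```xml
    <preAuthentication hint="list:AddPreAuthenticator">
      <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.AddSingleSignOnHeaders, Coveo.SearchProviderBase" singleInstance="true">
        <LoginUrls hint="list">
          <!-- Login page from the example above -->
          <myExampleSite>http://myssosite.local/login.aspx</myExampleSite>
          <!-- Hypothetical second login page, shown for illustration only -->
          <myOtherSite>http://myothersso.local/login.aspx</myOtherSite>
        </LoginUrls>
        <Username>mycustomdomain\unicorns</Username>
        <Password>ra1nb0ws</Password>
      </processor>
    </preAuthentication>
    ```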
    
  • Custom Processor

    You can implement your own authentication processor by using the Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.IFetchPageContentPreAuthenticatorProcessor interface.

    This is useful if you have a custom authentication process and need to add headers or set a specific cookie to allow the request to get the content.

The <postProcessing> Section

This section allows you to process the content of the HTML page before sending it to the index, removing some sections that are useless in the index or that drive relevance down.

To remove HTML content using CSS selectors, you can also use an indexing pipeline extension (IPE) (see Remove HTML Sections From Indexed Sitecore Items).

Post-processing comes at a performance cost, because the HTML byte array must be decoded, modified, and re-encoded.

Available Configurations:

  • CleanHtml

    The CleanHtml processor is used to remove sections that are between two comments.

    The following configuration removes the content between a BEGIN NOINDEX and an END NOINDEX comment.

    <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
      <startComment>BEGIN NOINDEX</startComment>
      <endComment>END NOINDEX</endComment>
    </processor>
    

    For example, given the following markup:

    <head>
      <title>My Site</title>
    </head>
    <body>
      <!-- BEGIN NOINDEX -->
      <header>I don't want to index this header.</header>
      <!-- END NOINDEX -->
      <div>
          Some content.
      </div>
    </body>
    

    The result in the BinaryData field is:

    <head>
      <title>My Site</title>
    </head>
    <body>
      <div>
          Some content.
      </div>
    </body>
    
  • CleanHtmlWithSimpleSelectors

    Deprecated: Coveo for Sitecore (February 12, 2021), Coveo for Sitecore (March 26, 2021)

    The CleanHtmlWithSimpleSelectors processor had to be deprecated because it was incompatible with content that includes HTML tags with attributes that don’t have a value (e.g., <input type="text" name="lastname" disabled>). When a web page included tags like these, the processor threw an exception such as the following and failed to remove any content from the page:

    ManagedPoolThread #1 20:10:10 ERROR An error occurred while cleaning HTML of item {110D559F-DEA5-42EA-9C1C-8A5DF7E70EF9} at http://sitecore82u7/de-DE
    Exception: System.Xml.XmlException
    Message: 'class' is an unexpected token. The expected token is '='. Line 9, position 17.
    Source: System.Xml
       at System.Xml.XmlTextReaderImpl.Throw(Exception e)
       at System.Xml.XmlTextReaderImpl.ParseAttributes()
       at System.Xml.XmlTextReaderImpl.ParseElement()
       at System.Xml.XmlTextReaderImpl.ParseElementContent()
       at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
       at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
       at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
       at System.Xml.Linq.XDocument.Parse(String text, LoadOptions options)
       at Coveo.SearchProvider.Utils.HtmlCleanerWithSimpleSelectors.CleanHtmlContentWithSimpleSelectors(String p_HtmlContent, List`1 p_SimpleSelectors)
       at Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtmlWithSimpleSelectors.Process(FetchPageContentHtmlPostProcessingArgs p_Args)
    
  • Custom Processor

    You can implement your own HTML processing by using the Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.IFetchPageContentHtmlPostProcessingProcessor interface.

    This is useful if you want to remove content in a different manner than the implementations provided by Coveo for Sitecore.
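
Putting the three sections together, a complete Coveo.SearchProvider.Custom.config patch could look like the following sketch. It only combines fragments shown earlier on this page (the ItemsWithLayout filter, the FormsRequest example, and the CleanHtml example); the URL, control names, and credentials are the sample values from those examples and must be replaced with your own.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <!-- Runs the coveoGetBinaryData pipeline for each indexed item -->
      <coveoPostItemProcessingPipeline>
        <processor type="Coveo.SearchProvider.Processors.ExecuteGetBinaryDataPipeline, Coveo.SearchProviderBase" />
      </coveoPostItemProcessingPipeline>
      <coveoGetBinaryData>
        <processor type="Coveo.SearchProvider.Processors.FetchPageContentProcessor, Coveo.SearchProviderBase">
          <!-- Only request pages for items that have a layout -->
          <inboundFilter hint="list:AddInboundFilter">
            <itemsWithLayout type="Coveo.SearchProvider.Processors.FetchPageContent.Filters.ItemsWithLayout, Coveo.SearchProviderBase" />
          </inboundFilter>
          <!-- Log in through the login form before fetching secured pages -->
          <preAuthentication hint="list:AddPreAuthenticator">
            <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.FormsRequest, Coveo.SearchProviderBase" singleInstance="true">
              <credentialsExpireIn>00:05:00</credentialsExpireIn>
              <formsAuthConfiguration type="Coveo.Framework.Configuration.FormsAuthConfiguration, Coveo.Framework">
                <formsAuthLoginPage>http://www.secured.com/sitecore/login</formsAuthLoginPage>
                <formsAuthUserControl>UserName</formsAuthUserControl>
                <formsAuthPasswordControl>Password</formsAuthPasswordControl>
                <formsAuthLoginCommand>LogInBtn=Log+in</formsAuthLoginCommand>
              </formsAuthConfiguration>
              <username>sitecore\coveocrawler</username>
              <password>b</password>
            </processor>
          </preAuthentication>
          <!-- Strip everything between the BEGIN NOINDEX and END NOINDEX comments -->
          <postProcessing hint="list:AddPostProcessing">
            <processor type="Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtml, Coveo.SearchProviderBase">
              <startComment>BEGIN NOINDEX</startComment>
              <endComment>END NOINDEX</endComment>
            </processor>
          </postProcessing>
        </processor>
      </coveoGetBinaryData>
    </pipelines>
  </sitecore>
</configuration>
```

As with any change to this configuration, rebuild or reindex your items afterward so that the new settings are applied.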
