---
title: Index page content with the FetchPageContentProcessor
slug: '2326'
canonical_url: https://docs.coveo.com/en/2326/
collection: coveo-for-sitecore-v5
source_format: adoc
---

# Index page content with the FetchPageContentProcessor

The `FetchPageContentProcessor` is the default and recommended Coveo for Sitecore HTML indexing processor.
The processor performs an HTTP request, gets the page content, and sets the value of the Coveo for Sitecore `BinaryData` field with this content.

> **Important**
>
> This HTTP request introduces a delay when indexing.

## Enabling the processor

You may want to enable the processor if:

* You're currently [not indexing rendered HTML](#when-currently-not-indexing-rendered-html) and want to start indexing HTML.
* You were using the [HtmlContentInBodyWithRequestsProcessor](#when-currently-using-htmlcontentinbodywithrequestsprocessor) in Coveo for Sitecore 4.1 and upgraded to Coveo for Sitecore 5.

### When currently not indexing rendered HTML

The Coveo Command Center has an `Index rendered HTML` option.
When you select this option, Coveo for Sitecore automatically enables the `FetchPageContentProcessor`.

1. Open the **Indexing Options** section of the **Command Center**, accessible at `http:///coveo/command-center/index.html#indexing-options/`.
2. Select `Index rendered HTML`.
3. Click **Apply and Restart**.

> **Note**
>
> This procedure produces the following configuration structure in the `Coveo.SearchProvider.Custom.config` file:
>
> ```xml
> ...
> ```
>
> The `Coveo.SearchProvider.Custom.config` file is located in the `\App_Config\Include\Coveo\` folder.

Once the `FetchPageContentProcessor` is enabled, you might need to [configure it](#configuring-the-processor) (for example, by setting up HTTP request authentication for secured items).

### When currently using HtmlContentInBodyWithRequestsProcessor

If you upgraded from Coveo for Sitecore 4.1 to Coveo for Sitecore 5, you might still be using the `HtmlContentInBodyWithRequestsProcessor`.
Coveo for Sitecore 5 provides a mechanism to switch from the `HtmlContentInBodyWithRequestsProcessor` to the `FetchPageContentProcessor` without having to edit your configuration files.

1. Open the **Indexing Options** section of the **Command Center**, accessible at `http:///coveo/command-center/index.html#indexing-options/`.
2. If the `Index rendered HTML` option is selected:
   1. Select `Only index Sitecore item data`.
   2. Click **Apply and Restart**.
3. Select `Index rendered HTML`.
4. Click **Apply and Restart**.

> **Note**
>
> This procedure produces the following configuration structure in the `Coveo.SearchProvider.Custom.config` file:
>
> ```xml
> ...
> ```
>
> The `Coveo.SearchProvider.Custom.config` file is located in the `\App_Config\Include\Coveo\` folder.

Once the `FetchPageContentProcessor` is enabled, you might need to [configure it](#configuring-the-processor) (for example, by setting up HTTP request authentication for secured items).

## Configuring the processor

The `FetchPageContentProcessor` contains the ``, ``, and `` sections shown below.

```xml
```

> **Important**
>
> If you change anything in the configuration, rebuild or reindex your items for the new settings to be applied.

Here are more details about the configuration options in each section.

### The `` section

This section lets you specify the Sitecore items that you want to provide an HTML representation for.
Filtering can reduce the number of requests that log an error in your Sitecore logs.

**Available configurations**:

* ``

  This filter specifies that only items that have a layout must be processed, eliminating many unnecessary requests for items that most likely don't have any HTML content.
  We recommend that you keep this filter at all times, but you can remove it in specific scenarios, such as when using Wildcard items, which don't necessarily have a layout.

* Custom Processor

  You can implement your own processor using the `Coveo.SearchProvider.Processors.FetchPageContent.Filters.IFetchPageContentInboundFilterProcessor` interface.

### The `` section

This section lets you authenticate the request that will be sent to retrieve the HTML content.

**Available configurations**:

* `FormsRequest`

  If your page requires authentication, the `FormsRequest` processor lets you authenticate the request like an end user would.

  Here is an overview of what this processor does:

  1. Takes the configuration and builds a POST request.
  2. Sends the POST request to the login page.
  3. Takes the response and stores its cookies.
  4. Takes the cookies and assigns them to the HTTP request used to get the binary data.

  `FormsRequest` configurations:

  * The `credentialsExpireIn` attribute is used to keep the cookies for a period of time. In the example below, it's set to 5 minutes.
  * The `formsAuthConfiguration` object contains attributes used to configure the POST request sent to authenticate the user.
    * `formsAuthLoginPage`: The URL of the login page.
    * `formsAuthUserControl`: The `name` attribute value of the input control used by the user to enter their username.
    * `formsAuthPasswordControl`: The `name` attribute value of the input control used by the user to enter their password.
    * `formsAuthLoginCommand`: The `name` attribute value of the submit control that the user clicks when logging in, followed by the `value` attribute value of the same submit control.

      > **Note**
      >
      > Spaces must be replaced with the `+` symbol.
  * The `username` is the Sitecore username used to authenticate the request.
  * The `password` is the password used to authenticate the request.

  **Example**

  You want to index content from your `http://www.secured.com` Sitecore website.
  You can access the authentication page of the site through `http://www.secured.com/sitecore/login`.
  Inspecting this authentication page in your browser, you see the following markup:

  ![Sitecore Log In Button HTML](https://docs.coveo.com/en/assets/images/c4sc-v5/forms-auth-login-command-screenshot.png)

  The corresponding `` section configuration would look as follows:

  ```xml
  00:05:00
  http://www.secured.com/sitecore/login
  UserName
  Password
  LogInBtn=Log+in
  sitecore\coveocrawler
  b
  ```

  When this processor is enabled, if you're logging into the Sitecore default login page, you should see the following log during the indexing operation:

  ```text
  37148 14:27:28 INFO AUDIT (sitecore\coveocrawler): Login
  ```

* `AddSingleSignOnHeaders`

  This processor ensures that the HTTP request has the required headers to follow Single Sign-On redirections.
  To configure it, add a list of URLs that are Single Sign-On logins.

  **Example**

  With the following configuration, if the HTTP request to get the binary data gets redirected to `http://myssosite.local/login.aspx`, the username `mycustomdomain\unicorns` and password `ra1nb0ws` are used to try to log in.

  ```xml
  http://myssosite.local/login.aspx
  mycustomdomain\unicorns
  ra1nb0ws
  ```

* Custom Processor

  You can implement your own authentication processor by using the `Coveo.SearchProvider.Processors.FetchPageContent.PreAuthenticators.IFetchPageContentPreAuthenticatorProcessor` interface.
  This is useful if you have a custom authentication process and need to add headers or set a specific cookie to allow the request to get the content.

### The `` section

This section lets you process the content of the HTML page before sending it to the index, removing sections that are useless in the index or that drive relevance down.

> **Leading practice**
>
> To remove HTML content using CSS selectors, you can also use an [indexing pipeline extension (IPE)](https://docs.coveo.com/en/206/).
> See [Example: Removing unwanted HTML sections](https://docs.coveo.com/en/3094#example-removing-unwanted-html-sections) for more details.

> **Important**
>
> Post-processing comes at a performance cost.
> It requires the HTML byte array to be decoded, modified, and re-encoded.

**Available configurations**:

* `CleanHtml`

  The `CleanHtml` processor removes sections that are between two comments.

  **Example**

  The following configuration removes the content between a `BEGIN NOINDEX` and an `END NOINDEX` comment.

  ```xml
  BEGIN NOINDEX
  END NOINDEX
  ```

  For example, given the following markup:

  ```html
  My Site
  I don't want to index this header.
  Some content.
  ```

  The result in the `BinaryData` field is:

  ```xml
  My Site
  Some content.
  ```

* `CleanHtmlWithSimpleSelectors`

  [Obsolete](https://docs.coveo.com/en/l22b0522#release-notes) (Coveo for Sitecore 5.0.943.3, March 26, 2021)

  > **Important**
  >
  > The `CleanHtmlWithSimpleSelectors` processor had to be deprecated because it was incompatible with content that includes HTML tags with attributes that don't have a value (for example, ``).
  > When a web page included tags like these, the processor threw an exception such as the following example and failed to remove any content from the page:
  >
  > ```text
  > ManagedPoolThread #1 20:10:10 ERROR An error occurred while cleaning HTML of item {110D559F-DEA5-42EA-9C1C-8A5DF7E70EF9} at http://sitecore82u7/de-DE
  > Exception: System.Xml.XmlException
  > Message: 'class' is an unexpected token. The expected token is '='. Line 9, position 17.
  > Source: System.Xml
  >    at System.Xml.XmlTextReaderImpl.Throw(Exception e)
  >    at System.Xml.XmlTextReaderImpl.ParseAttributes()
  >    at System.Xml.XmlTextReaderImpl.ParseElement()
  >    at System.Xml.XmlTextReaderImpl.ParseElementContent()
  >    at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
  >    at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
  >    at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
  >    at System.Xml.Linq.XDocument.Parse(String text, LoadOptions options)
  >    at Coveo.SearchProvider.Utils.HtmlCleanerWithSimpleSelectors.CleanHtmlContentWithSimpleSelectors(String p_HtmlContent, List`1 p_SimpleSelectors)
  >    at Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.CleanHtmlWithSimpleSelectors.Process(FetchPageContentHtmlPostProcessingArgs p_Args)
  > ```

* Custom Processor

  You can implement your own HTML processing by using the `Coveo.SearchProvider.Processors.FetchPageContent.PostProcessing.IFetchPageContentHtmlPostProcessingProcessor` interface.
  This is useful if you want to remove content using custom logic.
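
To make the post-processing behavior described in this section more concrete, here is a minimal Python sketch. It is not Coveo code and doesn't use any Coveo API. The function names (`strip_between_comments`, `clean_page`, `parses_as_strict_xml`) and the sample markup are hypothetical. The sketch assumes that the markers are written as HTML comments such as `<!-- BEGIN NOINDEX -->`, that the marker comments are removed along with the content between them, and that the page is UTF-8 encoded. It also uses Python's strict XML parser as a stand-in to illustrate why markup with value-less attributes can't be parsed as XML, which is the failure mode reported for `CleanHtmlWithSimpleSelectors`.

```python
import re
import xml.etree.ElementTree as ET


def strip_between_comments(html: str, start: str, end: str) -> str:
    """Remove each span from a `start` comment through the next `end` comment.

    Assumption: the marker comments themselves are removed too.
    """
    begin = r"<!--\s*" + re.escape(start) + r"\s*-->"
    finish = r"<!--\s*" + re.escape(end) + r"\s*-->"
    # Non-greedy match so that several marked sections are removed independently.
    pattern = re.compile(begin + r".*?" + finish, re.DOTALL)
    return pattern.sub("", html)


def clean_page(raw: bytes, start: str, end: str, encoding: str = "utf-8") -> bytes:
    """Decode the page bytes, strip the marked sections, and re-encode.

    This decode/modify/re-encode round trip is the cost that post-processing adds.
    """
    text = raw.decode(encoding)
    return strip_between_comments(text, start, end).encode(encoding)


def parses_as_strict_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical page content, used only for illustration.
page = (
    b"<html><body>"
    b"<!-- BEGIN NOINDEX --><header>I don't want to index this header.</header><!-- END NOINDEX -->"
    b"<p>Some content.</p>"
    b"</body></html>"
)
cleaned = clean_page(page, "BEGIN NOINDEX", "END NOINDEX")

# Valid HTML with a value-less attribute is not well-formed XML.
valueless_ok = parses_as_strict_xml('<input disabled class="x" />')
```

In this sketch, `cleaned` keeps only the paragraph inside the body, and `valueless_ok` is `False`, because an XML parser expects `=` after every attribute name. A custom post-processing processor that needs selector-based removal would therefore want an HTML-aware parser rather than an XML one.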