Index XML sitemap metadata

The Sitemap source supports indexing additional metadata included in an XML sitemap file. This metadata can come from third-party extensions or from Coveo-specific custom metadata.

Moreover, the Sitemap source can also index metadata retrieved from the meta tags in the head of the web pages listed in your sitemap.

Third-party extensions

Some sites, such as Google, offer extensions that add extra metadata to your sitemap (see Image sitemaps). Alternatively, you can build your own extension (see Extending the Sitemaps protocol). Either way, the data added to your sitemap can be retrieved and made searchable by Coveo. See Configuring fields and mappings to configure Coveo accordingly.

See also Video sitemaps and video sitemap alternatives for another example.

Coveo-specific custom metadata

A developer can include custom metadata in an XML sitemap file specifically for Coveo indexing purposes. A developer who can generate or modify the sitemap XML file of a repository to index can also declare a Coveo namespace and add coveo:metadata elements that provide information on items that isn’t found in default fields (i.e., Sitemap standard source fields and Coveo default fields).

Example

Since you have control over the sitemap file (it isn’t generated by a third party), you decide to create your XML sitemap file dynamically and add all the custom metadata you need.

Although the added Coveo metadata will only be read by the Coveo crawler and connector and ignored by all other processes, it still respects the Sitemap protocol (see Sitemaps XML format).

The following procedure requires a user who has the permissions and skills to modify or create an XML sitemap file, as well as the required privileges in the Coveo Administration Console.

To add Coveo-specific custom metadata in an XML sitemap

You must code a third-party process to modify or create an XML sitemap file as follows:

  1. In the urlset XML element start tag (<urlset>), extend the Sitemap protocol using the Coveo namespace by adding the following line:

    xmlns:coveo="https://www.coveo.com/en/company/about-us"

    Note

    From a Coveo perspective, the value of the xmlns:coveo attribute (i.e., the URI) is irrelevant. The Coveo sitemap crawler ignores this value. However, other web search engine indexing services may need to validate this URI.

    The attribute name (i.e., xmlns:coveo) is important as the sitemap XML file will contain elements in the coveo namespace scope.

    Example
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
      xmlns:coveo="https://www.coveo.com/en/company/about-us">
  2. For each url element (<url></url>) in the sitemap, create a new XML element named coveo:metadata (<coveo:metadata></coveo:metadata>).

    Example
    <url>
      <loc>http://example.com/about/</loc>
      <lastmod>2015-02-10T13:47:23+00:00</lastmod>
      <changefreq>weekly</changefreq>
      <priority>1.00</priority>
      <coveo:metadata>
      </coveo:metadata>
    </url>
  3. Within the coveo:metadata elements, add your custom metadata (name and value).

    Notes
    • To index special characters, you must wrap the node content in a CDATA section, starting with <![CDATA[ and ending with ]]> (see Character Data and Markup). The source then ignores the CDATA tag and indexes the rest of the node content, including special characters (e.g., &, %, $, ~, and <xml> tags), as text.

      Example:

      The companyname metadata in the following sitemap file content

      <coveo:metadata>
        <casenumber>18467</casenumber>
        <companyname>
          <![CDATA[
          Company XYZ Inc. <USA>
          ]]>
        </companyname>
      </coveo:metadata>

      is indexed in your Coveo index without the CDATA tag, and the rest of the node content (Company XYZ Inc. <USA>) is kept as text.

      [Image: Indexing sitemap metadata when adding CDATA tag]
    • Nested metadata inside the <coveo:metadata> element isn’t supported.

    Example

    You want to add the name of the author, the last modification date, and the document tags (if any), so you add the following XML elements:

    <coveo:metadata>
      <modificationdate>2015-02-10T13:47:23+00:00</modificationdate>
      <authorname>John Smith</authorname>
      <tags />
    </coveo:metadata>
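
The CDATA note in step 3 can be sketched in Python. This is only an illustration, not part of any Coveo tooling: the function names `cdata` and `metadata_element` are hypothetical. The helper wraps a value in a CDATA section and splits any `]]>` sequence found in the value, since that sequence would otherwise close the section early.

```python
import xml.etree.ElementTree as ET


def cdata(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper).

    A literal "]]>" inside the text would end the section early,
    so it is split across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def metadata_element(name: str, value: str) -> str:
    """Return one single-level metadata element with a CDATA-wrapped value."""
    return f"<{name}>{cdata(value)}</{name}>"


fragment = metadata_element("companyname", "Company XYZ Inc. <USA>")
print(fragment)

# A standard XML parser reads the value back without the CDATA wrapper.
parsed_value = ET.fromstring(fragment).text
print(parsed_value)
```

The resulting string can be placed inside a coveo:metadata element when the sitemap is written out as text.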

Once done, the sitemap could look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
  xmlns:coveo="https://www.example.com/schemas">
  <url>
    <loc>http://example.com/about/</loc>
    <lastmod>2015-02-10T13:47:23+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.00</priority>
    <coveo:metadata>
      <modificationdate>2015-02-10T13:47:23+00:00</modificationdate>
      <authorname>John Smith</authorname>
      <tags />
    </coveo:metadata>
  </url>
</urlset>
Note

For more information, contact Coveo Professional Services.
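
The third-party process described in the procedure above could look like the following minimal Python sketch, which uses only the standard library (`ET.indent` requires Python 3.9 or later). The `build_sitemap` function and the shape of its `pages` argument are assumptions made for illustration. The namespace URI is a placeholder, since the procedure states that Coveo ignores its value.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder URI: per the procedure, Coveo ignores the value itself.
COVEO_NS = "https://www.example.com/schemas"

# Register prefixes so the output uses xmlns="..." and xmlns:coveo="...".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("coveo", COVEO_NS)


def build_sitemap(pages):
    """Build a <urlset> with one <coveo:metadata> element per <url>.

    `pages` is a hypothetical input: a list of dicts with a "loc" key,
    an optional "lastmod" key, and a "metadata" dict of name/value pairs.
    """
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        if "lastmod" in page:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        metadata = ET.SubElement(url, f"{{{COVEO_NS}}}metadata")
        for name, value in page.get("metadata", {}).items():
            # Single level only: nested metadata isn't supported.
            # Unprefixed names fall in the default (Sitemap) namespace,
            # as in the sample sitemap above.
            ET.SubElement(metadata, f"{{{SITEMAP_NS}}}{name}").text = value
    return urlset


urlset = build_sitemap([
    {
        "loc": "http://example.com/about/",
        "lastmod": "2015-02-10T13:47:23+00:00",
        "metadata": {
            "modificationdate": "2015-02-10T13:47:23+00:00",
            "authorname": "John Smith",
            "tags": "",
        },
    }
])
ET.indent(urlset)  # pretty-print in place (Python 3.9+)
xml_text = ET.tostring(urlset, encoding="unicode")
print(xml_text)
```

The sketch omits the XML declaration, changefreq, priority, and the xsi:schemaLocation attribute for brevity. ElementTree escapes special characters as entities rather than emitting CDATA sections, so values that must follow the CDATA note in step 3 need separate handling.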

Indexing a Sitemap source by reference

By default, a Sitemap source is set to retrieve HTML and PDF items (i.e., to index their content and metadata). With the document content, Coveo produces the item quickview, excerpt, and summary.

If you don’t need the quickview, excerpt, and summary, and you have all the information you want to index in your sitemap file metadata, you may want to index by reference (see Customize the indexing process). Indexing by reference improves performance.

Important

Indexing by reference doesn’t mean your web scraping configuration is ignored. The Coveo sitemap crawler will still scrape the content of documents matching your address filter configuration. To prevent unexpected field values, avoid using the same metadata names in your web scraping configuration as the ones in your sitemap file.

To index a Sitemap source by reference

  1. On the Sources page of the Coveo Administration Console, add a Sitemap source.

  2. Access the Edit a Source JSON Configuration panel of the source you just created.

  3. In the documentConfig section of the JSON source configuration, find the extensionSettings section.

  4. In the extensionSettings section, delete the ByExtensions and ByContentTypes sections.

  5. Find the noExtension and the other sections.

    • In the noExtension section, change the action value from Retrieve to Reference.

    • In the other section, change the action value from Retrieve to Reference.

  6. Click Save and Rebuild Source.
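
The edits in steps 3 to 5 can be pictured with a short Python sketch that applies them to a dictionary. The real change is made in the Edit a Source JSON Configuration panel. The `index_by_reference` function and the trimmed `example` configuration are hypothetical, and only the section names and action values come from the procedure above.

```python
import copy
import json


def index_by_reference(source_config: dict) -> dict:
    """Return a copy of the configuration edited as in steps 3 to 5.

    The nesting (documentConfig -> extensionSettings) and the section
    names come from the procedure; the rest of the shape is assumed.
    """
    config = copy.deepcopy(source_config)
    settings = config["documentConfig"]["extensionSettings"]

    # Step 4: delete the per-extension and per-content-type sections.
    # Keys are compared case-insensitively because their exact casing
    # in a real configuration isn't confirmed here.
    for key in list(settings):
        if key.lower() in ("byextensions", "bycontenttypes"):
            del settings[key]

    # Step 5: switch the two remaining sections from Retrieve to Reference.
    for key in ("noExtension", "other"):
        if settings[key].get("action") == "Retrieve":
            settings[key]["action"] = "Reference"
    return config


# Hypothetical, heavily trimmed configuration for illustration only.
example = {
    "documentConfig": {
        "extensionSettings": {
            "byExtensions": {".pdf": {"action": "Retrieve"}},
            "byContentTypes": {"text/html": {"action": "Retrieve"}},
            "noExtension": {"action": "Retrieve"},
            "other": {"action": "Retrieve"},
        }
    }
}

updated = index_by_reference(example)
print(json.dumps(updated, indent=2))
```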

Meta tags of listed web pages

By default, the Sitemap source crawler doesn’t index the content of the meta tags in the head of the web pages listed in your sitemap. Indexing them is resource-intensive and may therefore slow down indexing.

If you want the Sitemap source crawler to index the content of the meta tags as source item metadata, add the following to the source JSON configuration:

"IndexHtmlMetadata": {
  "sensitive": false,
  "value": "true"
}
Important

IndexHtmlMetadata is a crawler parameter. After the crawler has handled an HTML page, the document is pushed to the document processing manager where an HTML converter also extracts page metadata. By default, metadata values extracted by the document processing manager override values indexed by the crawler.

Simplified Sitemap source metadata indexing workflow (see Coveo indexing pipeline).

For example, if you’re indexing document metadata specified in your sitemap XML file and your HTML files themselves contain meta tags for the same key, your documents will be indexed with the values in your HTML meta tags by default, whether IndexHtmlMetadata is set to true or false. To force Coveo to index the values set during the crawling stage, set the origin argument to crawler in your mapping rule.

With IndexHtmlMetadata enabled, the Sitemap crawler indexes the content attribute of each meta tag that is keyed with one of the following attributes: name, property, itemprop, or http-equiv.

Example

Given the <meta name="viewport" content="width=device-width, initial-scale=1.0" /> tag, the Sitemap crawler indexes the following metadata: "viewport": "width=device-width, initial-scale=1.0".
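
To preview which metadata keys a page would produce under the rule above, a rough Python sketch based on the standard `html.parser` module can help. It is not the crawler's implementation: `MetaTagCollector` and `extract_meta` are hypothetical names, and the choice of which key attribute wins when a tag carries several is an assumption.

```python
from html.parser import HTMLParser

# Attributes that key a meta tag, in the order listed above.
KEY_ATTRIBUTES = ("name", "property", "itemprop", "http-equiv")


class MetaTagCollector(HTMLParser):
    """Collect the content attribute of keyed <meta> tags (illustration only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if content is None:
            return  # e.g., <meta charset="utf-8"> has no content attribute
        for key_attribute in KEY_ATTRIBUTES:
            key = attributes.get(key_attribute)
            if key:
                # Assumption: the first matching key attribute is used.
                self.metadata[key] = content
                break


def extract_meta(html: str) -> dict:
    """Return a dict of meta tag keys and their content values."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.metadata


page = """
<html><head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta property="og:title" content="About us">
</head><body></body></html>
"""
meta = extract_meta(page)
print(meta)
```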

Configuring fields and mappings

Regardless of how the additional metadata was added to your sitemap, you must configure Coveo so that it indexes this information adequately.

  1. In the Coveo Administration Console, ensure that you have the required privileges.

  2. On the Fields page, for each piece of metadata you want to see in your item details, add the corresponding custom field.

  3. On the Sources page, add a mapping rule for each field you added.

    Important
    • The metadataName in mapping rules must match the XML element name.

    • XML element names are case-sensitive.

    Notes
    • Coveo supports a single level of metadata in the <coveo:metadata> element.

      For example:

      <coveo:metadata>
        <OSCodes>WW1</OSCodes>
        <product>Inspiron XPS;Dimension XPS</product>
      </coveo:metadata>
    • Coveo supports extensions to the Sitemap standard (e.g., the Google Video Sitemap). In this scenario, Coveo flattens the metadata, i.e., the key of each piece of data is the path to the corresponding value.

      For example, the sitemap excerpt below results in the following flattened metadata: "video.thumbnail_loc": "http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg".

        <url>
          <loc>http://www.example.com/videos/some_video_landing_page.html</loc>
          <video:video>
            <video:thumbnail_loc>
              http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg
            </video:thumbnail_loc>
          </video:video>
        </url>
    Example

    You want to have the video thumbnail in the results metadata, so you add the videothumbnail field and use the following mapping rule: %[video.thumbnail_loc].

  4. Save and rebuild your Sitemap source.

  5. On the Content Browser page, in the Fields tab located in the Properties panel of your Sitemap source items, ensure that the new metadata is available (see Access the "Fields" tab).
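
To anticipate the flattened keys to use in mapping rules for third-party extensions, the following Python sketch reproduces the video example from the notes in step 3. It rests on assumptions: keys are built by joining the local names of nested elements with dots, and surrounding whitespace is trimmed from values. This matches the documented "video.thumbnail_loc" key, but how Coveo derives keys in other cases isn't confirmed here. The function names are hypothetical, and the sketch doesn't apply to coveo:metadata elements, whose metadata names match the XML element names directly.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(element, path=()):
    """Yield (dotted_key, value) pairs for each leaf at or below `element`."""
    path = path + (local_name(element.tag),)
    if len(element) == 0:
        # Assumption: whitespace around the value is trimmed.
        yield ".".join(path), (element.text or "").strip()
        return
    for child in element:
        yield from flatten(child, path)


def extension_metadata(url_element) -> dict:
    """Flatten the children of <url> that are outside the Sitemap namespace."""
    result = {}
    for child in url_element:
        if child.tag.startswith(f"{{{SITEMAP_NS}}}"):
            continue  # loc, lastmod, and other standard elements
        result.update(flatten(child))
    return result


SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.example.com/videos/some_video_landing_page.html</loc>
    <video:video>
      <video:thumbnail_loc>
        http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg
      </video:thumbnail_loc>
    </video:video>
  </url>
</urlset>
"""

url = ET.fromstring(SAMPLE).find(f"{{{SITEMAP_NS}}}url")
flat = extension_metadata(url)
print(flat)
```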