Index XML sitemap metadata

On top of the various webpage metadata extraction features, the Sitemap source supports collecting metadata included in an XML sitemap file. This metadata can come from:

Third-party extensions

Some search providers, such as Google, offer Sitemap extensions that add extra metadata to your sitemap (see Image sitemaps). Alternatively, you can build your own extension (see Extending the Sitemaps protocol). Either way, Coveo can retrieve the data added to your sitemap and make it searchable. See Configuring fields and mappings to configure Coveo accordingly.

See also Video sitemaps and video sitemap alternatives for another example.

Coveo-specific custom metadata

A developer who can generate or modify the XML sitemap file of the repository to index can include custom metadata in that file specifically for Coveo indexing purposes. To do so, they declare a Coveo namespace and add coveo:metadata elements that provide information on items that isn’t found in default fields (that is, Sitemap standard source fields and Coveo default fields).

Example

Since you have control over the sitemap file (it’s not generated by a third party), you decide to create your XML sitemap file dynamically and add all the custom metadata you need.

Although the added Coveo metadata will only be read by the Coveo crawler and connector and ignored by all other processes, it still respects the Sitemap protocol (see Sitemaps XML format).

The following procedure requires a user who has the permissions and skills to modify or create an XML sitemap file, as well as the required privileges in the Coveo Administration Console.

To add Coveo-specific custom metadata in an XML sitemap

You must code a third-party process to modify or create an XML sitemap file as follows:

  1. In the urlset XML element start tag (<urlset>), extend the Sitemap protocol using the Coveo namespace by adding the following line:

    xmlns:coveo="https://www.coveo.com/en/company/about-us"

    Note

    From a Coveo perspective, the value of the xmlns:coveo attribute (that is, the URI) is irrelevant. The Coveo sitemap crawler ignores this value. However, other web search engine indexing services may need to validate this URI.

    The attribute name (that is, xmlns:coveo) is important as the sitemap XML file will contain elements in the coveo namespace scope.

    Example
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
      xmlns:coveo="https://www.coveo.com/en/company/about-us">
  2. For each url element (<url></url>) in the sitemap, create a new XML element named coveo:metadata (<coveo:metadata></coveo:metadata>).

    Example
    <url>
      <loc>http://example.com/about/</loc>
      <lastmod>2015-02-10T13:47:23+00:00</lastmod>
      <changefreq>weekly</changefreq>
      <priority>1.00</priority>
      <coveo:metadata>
      </coveo:metadata>
    </url>
  3. Within the coveo:metadata elements, add your custom metadata (name and value).

    Notes
    • To index special characters, you must wrap the node content in a CDATA section, which starts with <![CDATA[ and ends with ]]> (see Character Data and Markup). The source ignores the CDATA delimiters and indexes the node content, including special characters (for example, &, %, $, ~, and <xml> tags), as text.

      Example:

      The companyname metadata in the following sitemap file content

      <coveo:metadata>
        <casenumber>18467</casenumber>
        <companyname>
          <![CDATA[
          Company XYZ Inc. <USA>
          ]]>
        </companyname>
      </coveo:metadata>

      is indexed as follows in your Coveo index:

      [Image: Indexing sitemap metadata when adding CDATA tag]

    • Nested metadata inside the <coveo:metadata> element isn’t supported.

    Example

    You want to add the name of the author, the last modification date, and the document tags (if any), so you add the following XML elements:

    <coveo:metadata>
      <modificationdate>2015-02-10T13:47:23+00:00</modificationdate>
      <authorname>John Smith</authorname>
      <tags />
    </coveo:metadata>

Once done, the sitemap could look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
  xmlns:coveo="https://www.example.com/schemas">
  <url>
    <loc>http://example.com/about/</loc>
    <lastmod>2015-02-10T13:47:23+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.00</priority>
    <coveo:metadata>
      <modificationdate>2015-02-10T13:47:23+00:00</modificationdate>
      <authorname>John Smith</authorname>
      <tags />
    </coveo:metadata>
  </url>
</urlset>
Note

For more information, contact Coveo Professional Services.
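
The procedure above asks you to code your own process to create the sitemap file. As an illustration only, the following is a minimal Python sketch of such a process, using the standard `xml.dom.minidom` module. The `build_sitemap` function, the shape of the `pages` input, and the rule used to decide when to emit a CDATA section are assumptions made for this example; they aren't part of any Coveo tooling. Metadata values must be plain strings, which mirrors the single-level restriction on the coveo:metadata element.

```python
from xml.dom.minidom import Document

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Per the note in step 1, the Coveo crawler ignores this URI value.
COVEO_NS = "https://www.coveo.com/en/company/about-us"


def build_sitemap(pages):
    """Return a UTF-8 encoded sitemap (bytes) with one coveo:metadata per url.

    `pages` is a hypothetical input shape: a list of dicts with a "loc" key,
    optional "lastmod", "changefreq" and "priority" keys, and an optional
    "metadata" dict mapping element names to string values.
    """
    doc = Document()
    urlset = doc.createElement("urlset")
    urlset.setAttribute("xmlns", SITEMAP_NS)
    urlset.setAttribute("xmlns:coveo", COVEO_NS)  # step 1: declare the namespace
    doc.appendChild(urlset)

    for page in pages:
        url = doc.createElement("url")
        urlset.appendChild(url)
        for name in ("loc", "lastmod", "changefreq", "priority"):
            if name in page:
                element = doc.createElement(name)
                element.appendChild(doc.createTextNode(page[name]))
                url.appendChild(element)

        # Step 2: one coveo:metadata element per url element.
        metadata = doc.createElement("coveo:metadata")
        url.appendChild(metadata)

        # Step 3: flat name/value pairs only (nested metadata isn't supported).
        for name, value in page.get("metadata", {}).items():
            if not isinstance(value, str):
                raise TypeError(f"metadata '{name}' must be a string, not nested data")
            element = doc.createElement(name)
            if any(ch in value for ch in "<>&"):
                # Assumed rule: wrap values with markup characters in CDATA.
                element.appendChild(doc.createCDATASection(value))
            elif value:
                element.appendChild(doc.createTextNode(value))
            metadata.appendChild(element)

    return doc.toxml(encoding="UTF-8")


if __name__ == "__main__":
    example = [{
        "loc": "http://example.com/about/",
        "lastmod": "2015-02-10T13:47:23+00:00",
        "changefreq": "weekly",
        "priority": "1.00",
        "metadata": {
            "modificationdate": "2015-02-10T13:47:23+00:00",
            "authorname": "John Smith",
            "tags": "",
        },
    }]
    print(build_sitemap(example).decode("utf-8"))
```

The sketch writes the output on a single line; the content is equivalent to the indented sitemap shown above, apart from the optional xsi:schemaLocation attribute, which it leaves out for brevity.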

Indexing a Sitemap source by reference

By default, a Sitemap source is set to retrieve HTML and PDF items (that is, to index their content and metadata). With the document content, the Coveo Platform produces the item quickview, excerpt, and summary.

If you don’t need the quickview, excerpt, and summary, and you have all the information you want to index in your sitemap file metadata, you may want to index by reference (see Customize the indexing process). Indexing by reference improves performance.

Important

Indexing by reference doesn’t mean your web scraping configuration is ignored. The Coveo sitemap crawler will still scrape the content of documents matching your exclusion and inclusion rules. To prevent unexpected field values, avoid using the same metadata names in your web scraping configuration as the ones in your sitemap file.

To index a Sitemap source by reference

  1. On the Sources (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console, add a Sitemap source.

  2. Access the Edit configuration with JSON panel of the source you just created.

  3. In the documentConfig section of the JSON source configuration, find the extensionSettings section.

  4. In the extensionSettings section, delete the ByExtensions and ByContentTypes sections.

  5. Find the noExtension and other sections.

    • In the noExtension section, change the action value from Retrieve to Reference.

    • In the other section, change the action value from Retrieve to Reference.

  6. Click Save and Rebuild Source.
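
To make steps 3 to 5 concrete, here is a small Python sketch that applies the same changes to a source configuration loaded as a dictionary. The exact shape of the JSON (section nesting, key casing, and `action` stored as a plain string) is an assumption made for this illustration; check it against what the Edit configuration with JSON panel actually shows for your source. The `to_index_by_reference` name is hypothetical.

```python
import copy


def to_index_by_reference(source_config):
    """Return a copy of the configuration edited as in steps 3 to 5.

    Assumed layout (verify against your own source JSON):
    documentConfig -> extensionSettings -> byExtensions, byContentTypes,
    noExtension, other.
    """
    config = copy.deepcopy(source_config)  # leave the caller's dict untouched
    settings = config["documentConfig"]["extensionSettings"]

    # Step 4: delete the per-extension and per-content-type sections.
    # Keys are matched case-insensitively because the casing is an assumption.
    for key in list(settings):
        if key.lower() in ("byextensions", "bycontenttypes"):
            del settings[key]

    # Step 5: switch the two remaining sections from Retrieve to Reference.
    for section in ("noExtension", "other"):
        settings[section]["action"] = "Reference"

    return config


if __name__ == "__main__":
    # Hypothetical, heavily trimmed configuration used only for this demo.
    sample = {
        "documentConfig": {
            "extensionSettings": {
                "byExtensions": {".pdf": {"action": "Retrieve"}},
                "byContentTypes": {"text/html": {"action": "Retrieve"}},
                "noExtension": {"action": "Retrieve"},
                "other": {"action": "Retrieve"},
            }
        }
    }
    print(to_index_by_reference(sample))
```

You would still paste the edited JSON back into the panel and click Save and Rebuild Source, as in step 6.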

Configuring fields and mappings

Regardless of how the additional metadata was added to your sitemap, you must configure Coveo so that it indexes this information adequately.

  1. In the Coveo Administration Console, ensure that you have the required privileges.

  2. On the Fields (platform-ca | platform-eu | platform-au) page, for each piece of metadata you want to see in your item details, add the corresponding custom field.

  3. On the Sources (platform-ca | platform-eu | platform-au) page, add a mapping rule for each field you added.

    Important
    • The metadataName in mapping rules must match the XML element name.

    • XML element names are case-sensitive.

    Notes
    • Coveo supports a single level of metadata in the <coveo:metadata> element.

      For example:

      <coveo:metadata>
        <OSCodes>WW1</OSCodes>
        <product>Inspiron XPS;Dimension XPS</product>
      </coveo:metadata>
    • Coveo supports extensions to the Sitemap standard (for example, the Google Video Sitemap). In this scenario, Coveo flattens the metadata, that is, the key of each piece of data is derived from the path to the corresponding value.

      For example, the sitemap excerpt below results in the following flattened metadata: "video.thumbnail_loc": "http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg".

        <url>
          <loc>http://www.example.com/videos/some_video_landing_page.html</loc>
          <video:video>
            <video:thumbnail_loc>
              http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg
            </video:thumbnail_loc>
          </video:video>
        </url>
    Example

    You want to have the video thumbnail in the results metadata, so you add the videothumbnail field and use the following mapping rule: %[video.thumbnail_loc].

  4. Save and rebuild your Sitemap source.

  5. On the Content Browser (platform-ca | platform-eu | platform-au) page, in the Fields tab located in the Properties panel of your Sitemap source items, ensure that the new metadata is available (see Access the "Fields" tab).
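
To help you predict which metadata names to use in mapping rules, the flattening described in the notes of step 3 can be approximated with a short Python sketch. This is only an illustration that reproduces the single documented example (video.thumbnail_loc); it isn't Coveo's actual implementation. The `flatten_extensions` function, the choice to join local element names with a period, and the trimming of surrounding whitespace are assumptions. The sketch targets third-party extension elements and isn't meant for coveo:metadata content.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# The excerpt from step 3, wrapped in a urlset element that declares the
# namespaces (the video namespace URI is the one Google documents).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.example.com/videos/some_video_landing_page.html</loc>
    <video:video>
      <video:thumbnail_loc>
        http://img.youtube.com/vi/wejYF7l0kKQ/2.jpg
      </video:thumbnail_loc>
    </video:video>
  </url>
</urlset>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def flatten_extensions(url_element):
    """Map dotted element paths to text values for non-standard url children."""
    flat = {}

    def walk(element, path):
        children = list(element)
        if not children:
            text = (element.text or "").strip()
            if text:
                flat[".".join(path)] = text
            return
        for child in children:
            walk(child, path + [local_name(child.tag)])

    for child in url_element:
        # Skip standard Sitemap elements such as loc or lastmod.
        if child.tag.startswith("{" + SITEMAP_NS + "}"):
            continue
        walk(child, [local_name(child.tag)])
    return flat


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    url = root.find("{" + SITEMAP_NS + "}url")
    print(flatten_extensions(url))
```

Under these assumptions, the key produced for the excerpt matches the metadata name used in the %[video.thumbnail_loc] mapping rule. Always confirm the actual names in the Content Browser, as described in step 5.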