Index with a generic connector

This is for:

Developer

The Coveo Platform provides two generic connectors that may be used to index website content, namely the Sitemap and Web connectors.

The goals of this article are:

  • To present the Sitemap and Web connector characteristics and features. This information will help you determine the optimal indexing strategy for your use case.

  • To guide you in creating your Coveo organization source(s), once your indexing strategy has been determined.

Note

Though they represent different concepts, the terms source and connector are often used interchangeably in Coveo terminology.

Sitemap and Web connector comparison table

Consider the characteristics of the Sitemap and Web connectors in the table below when deciding how to index your Adobe Experience Manager content. Check marks highlight the advantages of each connector over the other.

Criteria | Sitemap connector | Web connector

Prerequisites

  • Sitemap connector: An existing sitemap or a new, Coveo-specific sitemap. Learn about AEM out-of-the-box sitemaps, which are available starting with AEM 6.5.9.

  • Web connector: ✓ None.

Content coverage (Ability to index all content from AEM. Ease of scoping the AEM content to be indexed.)

  • Sitemap connector: Covers all content. DAM is supported if sitemap includes links to assets. Content can be filtered using inclusion/exclusion rules.

  • Web connector: Covers all content. DAM is supported if assets are linked to parent web pages. Content can be filtered using inclusion/exclusion rules.

Indexing speed

  • Sitemap connector: ✓ Faster than with Web connector since the Sitemap connector simply fetches the web pages listed in the sitemap file.

  • Web connector: Slower than with the Sitemap connector because the Web connector has to discover content, reading a web page to find links to other pages.

Metadata (What kind of metadata can you index?)

Partial update support (Can the connector index only what has changed in the source since the last indexing operation? Is this indexing triggered manually or can it be scheduled?)

  • Sitemap connector: ✓ Manual and scheduled refresh is supported, provided the target sitemap file defines the optional Last Modification Date (for example, lastmod for XML sitemaps). The maximum refresh schedule frequency is every 5 minutes. A rescan (either manual or scheduled) or rebuild operation is required to take into account deleted and new sitemap entries.

  • Web connector: Refresh isn’t supported. Only rescans (either manual or scheduled) and rebuilds are available.

The Coveo Sitemap connector is the ideal choice for Adobe Experience Manager content, not only from an indexing performance perspective but also because of the many metadata indexing options it provides. The Coveo Web connector should only be considered as a fallback solution.
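
To make the role of the Last Modification Date more concrete, here is a minimal Python sketch of the idea behind a refresh: only entries whose lastmod is more recent than the previous indexing operation are candidates for reindexing. This is a conceptual illustration, not the connector's actual implementation. The sample URLs, the dates, and the changed_since function are hypothetical.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for illustration.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/en/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/en/articles/</loc>
    <lastmod>2024-03-01T10:00:00+00:00</lastmod>
  </url>
</urlset>
"""


def parse_lastmod(value):
    """Parse a date-only or full datetime lastmod value as a UTC-aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changed_since(sitemap_xml, last_refresh):
    """Return the URLs whose lastmod is later than last_refresh."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Entries without lastmod cannot be compared, so they are skipped here.
        if loc and lastmod and parse_lastmod(lastmod.strip()) > last_refresh:
            changed.append(loc.strip())
    return changed


previous_run = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(changed_since(SAMPLE_SITEMAP, previous_run))
```

Entries without a lastmod value are ignored in this sketch, which mirrors why the Last Modification Date is a precondition for refresh operations.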

Adobe Experience Manager sitemaps

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL, so that search engines can more intelligently crawl the site. Here’s an example of a Sitemap.
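
As an illustration of the format, the following Python sketch builds a minimal XML sitemap with the standard library. The page URLs and dates are hypothetical, and the build_sitemap helper is an illustrative name rather than part of AEM or Coveo.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages (URL, last modification date), used only for illustration.
PAGES = [
    ("https://www.example.com/en/", "2024-01-15"),
    ("https://www.example.com/en/articles/", "2024-02-01"),
]


def build_sitemap(pages):
    """Return a sitemap XML string with one <url> entry per (loc, lastmod) pair."""
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Each <url> element carries the page location in <loc> and optional metadata such as <lastmod>.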

Sitemaps in AEM are an opt-in feature. That is, you need to explicitly enable them.

There are two ways to do it:

  • Configure Sitemap Scheduler: This generates sitemap files at a defined frequency (for example, once a day). Using this approach, a background process creates the sitemap and caches it in AEM. The sitemap is then served from the cache when requested. This approach is recommended by Adobe for large websites, such as those indexed with Coveo.

  • On-demand Sitemap Generation: In this method, a sitemap is generated whenever one is requested. This approach is suitable only for small sites in production. Enable on-demand sitemap generation during development so you don’t have to wait until the scheduler kicks in.

AEM sitemap customization

Customizing sitemaps in AEM provides greater control over content indexing, enhancing SEO and search engine visibility. By leveraging the AEM SEO - Page Tree Sitemap Generator and extending the Apache Sling Sitemap Generator with Java, developers can enrich sitemap entries with custom metadata, optimized update frequencies, and additional SEO-friendly elements. These enhancements not only refine sitemap structures but also ensure that search engines, including Coveo, can efficiently index custom sitemap elements, improving content discoverability.

  • AEM offers the flexibility to enable or disable options such as Add Last Modified and Add Language Alternates through the Adobe AEM SEO - Page Tree Sitemap Generator configuration panel, which can be accessed via the OSGi Configuration Manager at http://[aem_server]:[port]/system/console/configMgr. This feature allows users to customize sitemaps according to specific SEO requirements, enhancing the management of page attributes directly within AEM’s interface.

    Figure: Sitemap Generator

    However, while injecting data via Java-based Sling extensions in a CMS is feasible, it may pose challenges related to implementation complexity, performance, and security.

Enable on-demand sitemap generation

  1. In your browser, open the http://localhost:4502/system/console/bundles page of your AEM SDK author instance.

  2. In the main menu, select OSGi > Configuration. This will display a list of all configurable services in your AEM instance.

    Figure: Web Console Configuration

  3. In the Apache Sling Sitemap - Sitemap Generator Manager configuration, enable All on-demand.

    Figure: All on-demand option to enable dynamic sitemap generation

  4. Open the AEM Sites management interface at http://localhost:4502/sites.html/content.

  5. Select the level (for example, site, page) where you want to enable sitemap generation.

    Figure: Select where you want to enable sitemap generation

  6. Click Properties in the context menu.

  7. Select the Advanced tab. Here, you’ll find an option to Generate Sitemap. Check this box to activate sitemap generation for the chosen level.

    Figure: Activate sitemap generation

  8. Save and publish your changes.

Browse to your sitemap to validate that it’s being generated as expected. Typically, you can access the sitemap by appending /sitemap.xml to the base URL of your site or specific page.

Key learnings

  1. Sitemaps can be generated by AEM at any level. The choice lies with the website author. By level, think of locale-based websites (website.com/en/), subdirectory-level websites (website.com/en/articles/), or country-based websites (website.com/ca/ or website.com/gb/).

    1. An AEM website can have multiple sitemaps, which are reflected in the root-level sitemap file. This file points to the other sitemap files under it.

  2. AEM internally uses the open-source library Apache Sling to generate sitemaps.
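
The relationship between a root-level sitemap file and the sitemaps it points to can be sketched as follows. This Python example assumes the root file uses the standard sitemapindex element from the Sitemap protocol. The sample URLs and the describe_sitemap helper are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}

# Hypothetical root-level sitemap pointing to two locale-level sitemaps.
ROOT_SITEMAP = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/en/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/ca/sitemap.xml</loc></sitemap>
</sitemapindex>
"""


def describe_sitemap(xml_text):
    """Return ("index", child sitemap URLs) or ("urlset", page URLs)."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        kind, entry_path = "index", "sm:sitemap"
    else:
        kind, entry_path = "urlset", "sm:url"
    locations = [
        entry.findtext("sm:loc", default="", namespaces=NS).strip()
        for entry in root.findall(entry_path, NS)
    ]
    return kind, locations


kind, locations = describe_sitemap(ROOT_SITEMAP)
print(kind, locations)
```

A quick check like this can help confirm which sitemap file to reference when several sitemaps exist for the same site.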

Adobe Experience Manager metadata indexing options

Depending on the way your Adobe Experience Manager website metadata is organized, one of the options below (or a combination thereof) will fulfill your needs. The options are presented in order of performance.

Adding Coveo metadata tags directly in your sitemap file

By default, when using the Sitemap connector, Coveo doesn’t index the content of the <meta> tags in the <head> of the web pages. This operation is costly resource-wise and may therefore impact the indexing performance.

Instead, by default, the Coveo Sitemap connector looks for item metadata added directly inside the <url> elements of the website sitemap file. The connector expects this metadata to be included in a <coveo:metadata> tag. A developer therefore needs to extend the Sitemap protocol and modify or generate the sitemap file with the necessary Coveo metadata tag structure and content (see Coveo-Specific Custom Metadata).
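
The following Python sketch shows one way such a sitemap entry could be generated. Several details are assumptions to verify against Coveo-Specific Custom Metadata: the namespace URI bound to the coveo prefix, the metadata names (pagetype, author), and the add_url helper, which is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: confirm the URI for the coveo prefix in the Coveo documentation.
COVEO_NS = "http://www.coveo.com/schemas/metadata"

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("coveo", COVEO_NS)


def add_url(urlset, loc, lastmod, metadata):
    """Append a <url> entry whose <coveo:metadata> child holds the given metadata."""
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    meta = ET.SubElement(url, f"{{{COVEO_NS}}}metadata")
    for name, value in metadata.items():
        # Metadata children are written unprefixed, inside the coveo:metadata wrapper.
        ET.SubElement(meta, f"{{{SITEMAP_NS}}}{name}").text = value
    return url


urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
add_url(
    urlset,
    "https://www.example.com/en/articles/intro.html",  # hypothetical page
    "2024-02-01",
    {"pagetype": "article", "author": "Jane Doe"},  # hypothetical metadata
)
xml_out = ET.tostring(urlset, encoding="unicode")
print(xml_out)
```

In an AEM project, the equivalent logic would typically live in a custom sitemap generator rather than in a standalone script.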

Using JSON-LD script tags in your web pages

If you’re already using JSON-LD <script> tags in your web pages as your metadata implementation format, Coveo has an Extract JSON-LD metadata Sitemap source option you can enable to extract that metadata.
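
For reference, JSON-LD metadata lives in script tags of type application/ld+json. The Python sketch below illustrates what extracting that metadata from a page involves. It is not Coveo's implementation, and the sample page and the JsonLdExtractor class are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical web page, used only for illustration.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Intro article</title>
    <script type="application/ld+json">
      {"@context": "https://schema.org", "@type": "Article", "headline": "Intro"}
    </script>
  </head>
  <body><p>Content</p></body>
</html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Ignore malformed JSON-LD blocks in this sketch.


extractor = JsonLdExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.items)
```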

Note

Coveo also provides the IndexJsonLdMetadata parameter in its Web connector. However, the Web connector also automatically parses the entire document, which is a drawback from a performance standpoint.

For general information on how to enable or configure connector parameters, see Edit a source JSON configuration.

Indexing web page head section metadata in a Sitemap source

As mentioned in Adding Coveo metadata tags directly in your sitemap file, the Sitemap connector doesn’t fetch the content of the <meta> tags in the <head> of web pages by default. However, the Sitemap connector has the IndexHtmlMetadata parameter you can enable to do just that. You can then create a field and mapping to store that metadata.
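
As a rough sketch of what enabling such a parameter could look like in a source JSON configuration, the Python example below adds IndexHtmlMetadata to a configuration held in memory. The shape of the parameters object (a sensitive flag and a string value) is an assumption to confirm in Edit a source JSON configuration, and the enable_parameter helper and sample configuration are hypothetical.

```python
import copy
import json


def enable_parameter(source_config, name, value="true"):
    """Return a copy of source_config with the named parameter set.

    Assumption: parameters are stored under a "parameters" key, each as an
    object with "sensitive" and "value" members. Verify against your own
    source JSON configuration before relying on this shape.
    """
    updated = copy.deepcopy(source_config)
    parameters = updated.setdefault("parameters", {})
    parameters[name] = {"sensitive": False, "value": value}
    return updated


# Hypothetical, heavily trimmed source configuration.
current_config = {"name": "AEM sitemap", "parameters": {}}

new_config = enable_parameter(current_config, "IndexHtmlMetadata")
print(json.dumps(new_config, indent=2))
```

The helper returns a modified copy, so the original configuration stays available for comparison before you save any change.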

Note

AEM supports advanced customization of sitemaps by extending the Apache Sling sitemap generator with Java code. This enables developers to enhance SEO by introducing custom tags, modifying update frequencies, or adding detailed metadata to sitemap entries. For those looking to implement such customizations, the blog "AEM Simplified by Nikhil" provides a practical guide with code examples and detailed instructions. This feature makes AEM more flexible and effective for specific organizational SEO needs.

  1. Titles and descriptions significantly enhance the quality of indexed metadata. Ensuring that each new page created in the CMS includes a default meta jcr:description tag can be useful. Developers can utilize Sling to set up an event listener that triggers when a new page is created, automatically injecting a default description into the page’s metadata. This custom implementation can then be deployed by logging into the AEM OSGi console and using the Install/Update option to upload and install the bundle.

Using a web scraping configuration

Unlike the previous options that automatically capture a site’s metadata because it’s presented in a standard format, setting up a web scraping configuration requires more work on your part. Moreover, the web scraping configuration may vary from one page to another. However, a web scraping configuration is more flexible than the previous metadata extraction options.

The Sitemap and Web connectors both support web scraping configurations. Once again, you should favor the Sitemap connector for performance reasons.

To more easily create and test web scraping configurations, consider using the Coveo Labs Web Scraper Helper Chrome extension.
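
To give an idea of what a web scraping configuration may contain, the Python sketch below assembles a small one and serializes it to JSON. The property names (for, urls, exclude, metadata, type, path) reflect the general shape of Coveo web scraping configurations as an assumption to check against the Coveo documentation. The CSS selectors and the author metadata name are hypothetical.

```python
import json

# Assumed structure: a list of rules, each targeting pages by URL pattern.
scraping_config = [
    {
        # Apply this rule to every page (regular expression on the URL).
        "for": {"urls": [".*"]},
        # Hypothetical selectors for page parts to leave out of the indexed body.
        "exclude": [
            {"type": "CSS", "path": "header"},
            {"type": "CSS", "path": "footer"},
        ],
        # Hypothetical metadata extracted from the page content.
        "metadata": {
            "author": {"type": "CSS", "path": "span.author::text"},
        },
    }
]

config_json = json.dumps(scraping_config, indent=2)
print(config_json)
```

Because selectors depend on page markup, pages built from different AEM templates may each need their own rule.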

Indexing sitemap alternate URLs

The Sitemap source supports alternate language links but, by default, these links aren’t parsed. You need to set the ParseSitemapAlternateLinks parameter to true to enable this feature (see ParseSitemapAlternateLinks).
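
Alternate language links are commonly expressed in sitemaps as xhtml:link elements with rel="alternate" and an hreflang attribute. The Python sketch below shows how such links can be read from a sitemap, assuming that convention. The sample URLs and the alternate_links helper are hypothetical, and this is not the connector's implementation.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Hypothetical sitemap entry with one alternate language link.
SITEMAP_WITH_ALTERNATES = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.example.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.example.com/fr/"/>
  </url>
</urlset>
"""


def alternate_links(xml_text):
    """Map each <loc> to a dictionary of {hreflang: alternate URL}."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result


alternates = alternate_links(SITEMAP_WITH_ALTERNATES)
print(alternates)
```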

Create your source

As a prerequisite, you need a Coveo organization. If you don’t have one, you can start a free trial.

Note

You can also create a test organization afterward.

To create a source

  1. Access the Sources (platform-ca | platform-eu | platform-au) page in your Coveo organization.

  2. Click Add source in the upper-right corner of the screen.

  3. In the Add a source of content dialog, select the source type you’ve chosen to use.

  4. Name and configure your source.

  5. Click Add and build source.

You can now browse the content you’ve indexed in the Content Browser (platform-ca | platform-eu | platform-au).

Notes

Web and Sitemap connector courses

Should you prefer a more guided approach to creating Web and Sitemap connectors, Level Up courses are ideally suited for your needs.

If you’re considering using the Sitemap connector to index your Adobe Experience Manager content, some Web connector course material provides valuable background. We therefore recommend the following learning path: