Index With a Generic Connector

The Coveo Platform provides two generic connectors that may be used to index website content, namely the Sitemap and Web connectors.

The goals of this article are:

  1. To present the Sitemap and Web connector characteristics and features. This information will help you determine the optimal indexing strategy for your use case.

  2. To guide you in creating your Coveo organization source(s), once your indexing strategy has been determined.

Though they represent different concepts, the terms source and connector are often used interchangeably in Coveo terminology.

AEM Metadata Indexing Options

The Coveo Sitemap connector is the ideal choice for AEM content, not only from an indexing performance perspective but also because of the many metadata indexing options it provides. The Coveo Web connector should only be considered as a fallback solution.

Depending on the way your AEM website metadata is organized, one of the options below (or a combination thereof) will fulfill your needs. The options are presented in order of indexing performance, from the most to the least efficient.

Adding Coveo Metadata Tags Directly in Your Sitemap File

By default, when using the Sitemap connector, Coveo doesn’t index the content of the <meta> tags in the <head> of the web pages. This operation is costly resource-wise and may therefore impact the indexing performance.

Instead, by default, the Coveo Sitemap connector is coded to look for item metadata added directly inside the website sitemap file <url> elements. The connector expects this metadata to be included in a <coveo:metadata> tag. A developer, therefore, needs to extend the Sitemap protocol, and to modify or generate the sitemap file with the necessary Coveo metadata tag structure and content (see Coveo-Specific Custom Metadata).
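As an illustration of what generating such a sitemap file could look like, here is a minimal Python sketch. It is not the Coveo reference implementation. The COVEO_NS namespace URI is a placeholder, and the layout of the child elements inside <coveo:metadata> (one unprefixed element per metadata field) is an assumption; check both against the Coveo-Specific Custom Metadata documentation. The build_sitemap function and the page dictionary keys are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Placeholder namespace URI (assumption): replace it with the one
# specified in the Coveo documentation.
COVEO_NS = "https://example.com/coveo-metadata-namespace"

# Serialize the sitemap namespace as the default one and use the
# "coveo" prefix for the metadata extension.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("coveo", COVEO_NS)


def build_sitemap(pages):
    """Return sitemap XML text with a <coveo:metadata> block in each <url>."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        metadata = ET.SubElement(url, f"{{{COVEO_NS}}}metadata")
        for name, value in page["metadata"].items():
            # Assumed layout: one child element per metadata field.
            ET.SubElement(metadata, name).text = value
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {
            "loc": "https://www.example.com/en/products.html",
            "lastmod": "2024-01-15",
            "metadata": {"author": "Jane Doe", "pagetype": "product"},
        }
    ]))
```

In an AEM project, the equivalent logic would typically live in whatever component generates the sitemap file, with the metadata values read from page properties.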

Using JSON-LD Script Tags in Your Web Pages

If you’re already using JSON-LD <script> tags in your web pages as your metadata implementation format, Coveo has a Sitemap connector parameter you can enable to extract that metadata.

For general information on how to enable or configure connector parameters, see Edit a Source JSON Configuration.

To enable the IndexJsonLdMetadata parameter in the Sitemap source JSON configuration, see IndexJsonLdMetadata.

Coveo also provides the IndexJsonLdMetadata parameter in its Web connector. However, the Web connector also automatically parses the entire document, which is a drawback from a performance standpoint.
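To check which JSON-LD metadata a page actually exposes before enabling the parameter, a small script can help. The sketch below uses only the Python standard library and illustrates the kind of content that lives in JSON-LD <script> tags; it is not how the Coveo connector extracts it. The JsonLdExtractor class and the extract_json_ld function are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every JSON-LD <script> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content is delivered as raw text, possibly in several chunks.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_json_ld(html):
    """Return the list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    sample = (
        '<html><head><script type="application/ld+json">'
        '{"@type": "Article", "headline": "Hello"}'
        "</script></head><body></body></html>"
    )
    print(extract_json_ld(sample))
```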

Indexing Web Page Head Section Metadata in a Sitemap Source

As mentioned in Adding Coveo Metadata Tags Directly in Your Sitemap File, the Sitemap connector doesn’t index the content of the <meta> tags in the <head> of web pages by default. However, the Sitemap connector has the IndexHtmlMetadata parameter you can enable to do just that.
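If you manage source configurations as JSON, enabling a connector parameter amounts to adding an entry to the configuration. The following Python sketch shows the idea on a plain dictionary. The shape of the parameter entry (a "parameters" object whose entries hold a string "value" and a "sensitive" flag) is an assumption to verify against Edit a Source JSON Configuration, and set_source_parameter is a hypothetical helper, not part of any Coveo library.

```python
import copy


def set_source_parameter(source_config, name, value):
    """Return a copy of the source configuration with one parameter set."""
    updated = copy.deepcopy(source_config)
    parameters = updated.setdefault("parameters", {})
    # Assumed entry shape: a string value plus a "sensitive" flag.
    parameters[name] = {"sensitive": False, "value": str(value).lower()}
    return updated


if __name__ == "__main__":
    config = {"name": "aem-sitemap"}  # hypothetical, heavily trimmed configuration
    print(set_source_parameter(config, "IndexHtmlMetadata", True))
```

Working on a copy leaves the original configuration untouched, which makes it easier to compare the two versions before saving anything.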

Using a Web Scraping Configuration

Unlike the previous alternatives, which automatically capture website metadata because it is presented in a standard format, setting up a web scraping configuration requires more work on your part. Moreover, the web scraping configuration may vary from one page to another. The Sitemap and Web connectors both support web scraping configurations. Once again, you should favor the Sitemap connector for performance reasons.

To more easily create and test web scraping configurations, consider using the Coveo Labs Web Scraper Helper Chrome extension.

Sitemap and Web Connector Comparison and Specificities

Whether you choose to index your AEM content with the Sitemap or Web connector, there are additional considerations you should know about both.

Indexing Speed

Whereas the Sitemap connector simply loops through the website sitemap file list of pages, the Web connector crawls a page to discover links to other pages before moving on to crawling these other pages. Discovering pages progressively in this manner is inherently slower and less efficient. This is why you should always choose the Sitemap connector over the Web connector, when possible.

Incremental Refresh

For a website sitemap file to be valid from the Coveo perspective, it must define the optional sitemap protocol Last Modification Date element (<lastmod>) for each URL. Having a Last Modification Date value for each URL makes source incremental refresh possible (see Refresh VS Rescan VS Rebuild). Of the three content update operations, a refresh is the least resource-intensive. Being able to refresh content with the Sitemap connector is another reason you should favor this option over the Web connector.
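The principle behind an incremental refresh can be sketched in a few lines of Python: given the date of the last update operation, only the URLs with a more recent Last Modification Date need to be processed again. This is an illustration of the idea under simple assumptions, not a description of the connector's internal logic; urls_modified_since and parse_lastmod are hypothetical names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value (date only or full timestamp) as an aware datetime."""
    value = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if value.tzinfo is None:
        # Assumption: values without a time zone are treated as UTC.
        value = value.replace(tzinfo=timezone.utc)
    return value


def urls_modified_since(sitemap_xml, last_refresh):
    """Return the <loc> values whose <lastmod> is later than last_refresh."""
    changed = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            # Without a date there is no way to tell, so treat the page as changed.
            changed.append(loc)
        elif parse_lastmod(lastmod) > last_refresh:
            changed.append(loc)
    return changed
```

The fallback for entries without a date shows why missing values defeat the purpose: every such page has to be reprocessed each time.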

Sitemap Alternate URLs

The Sitemap source supports alternate language links but, by default, these links aren’t parsed. You need to set the ParseSitemapAlternateLinks parameter to true to enable this feature (see ParseSitemapAlternateLinks).
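To see which alternate language links your sitemap file declares, you can list them with a short script. The Python sketch below assumes the alternate links use the common <xhtml:link rel="alternate" hreflang="..."> convention inside each <url> element; whether this matches exactly what the connector parses should be confirmed in the ParseSitemapAlternateLinks documentation. The alternate_links function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def alternate_links(sitemap_xml):
    """Map each <loc> to a {hreflang: href} dictionary of its alternate links."""
    result = {}
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result
```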

Create Your Source(s)

As a prerequisite, you need to have a Coveo organization. If you don’t, the easiest way to create an organization is to start a free 30-day trial.

You can also create a test organization afterwards.

To create a source

  1. Access the Sources page in your Coveo organization.

  2. Click Add source in the upper-right corner of the screen.

  3. In the Add a Source of Searchable Content dialog, select the source type you have chosen to use.

  4. Name and configure your source.

If necessary, refer to the Add or Edit a Sitemap Source or Add or Edit a Web Source documentation sections for help.

Web and Sitemap Connector Tutorials

Should you prefer a more guided approach to creating Web and Sitemap connectors, the Platform Developer Tutorials are ideally suited for your needs.

Because some Web connector tutorial material provides valuable background, we suggest the following learning path if you’re considering using the Sitemap source to index your AEM content:

Browse Your Source Content

You should now be able to browse the content you have indexed in the Content Browser.

What’s Next?

The next step is to develop and deploy a hosted search page that will tap into the content you have indexed.
