Index HTML page metadata
Index HTML page metadata
The Sitemap source supports many custom metadata indexing methods. You can extract metadata from the sitemap XML file and from the HTML content of the web pages listed in the sitemap.
There are three methods to index HTML page metadata: web scraping configurations, the JSON-LD metadata extraction feature, and the IndexHtmlMetadata
JSON parameter, which is the focus of this article.
Usage scenarios
IndexHtmlMetadata
is a crawler parameter that’s disabled by default, because it’s costly resource-wise and may impact indexing performance.
When you enable IndexHtmlMetadata
, the Sitemap crawler extracts the content
attribute value of <meta>
tags keyed with one of the following attributes: name
, property
, itemprop
, or http-equiv
.
Given the <meta name="viewport" content="width=device-width, initial-scale=1.0" />
tag, the Sitemap crawler extracts the following metadata:
Metadata name: viewport
Metadata value: width=device-width, initial-scale=1.0
After the crawling stage, the document is pushed to the document processing manager (DPM) where an HTML converter also extracts metadata from page <meta>
tags that have a name
attribute.
The converter is more efficient at extracting metadata from <meta>
HTML tags with a name
attribute compared to IndexHtmlMetadata
.
However, you might want to enable IndexHtmlMetadata
in the following scenarios:
-
If you need to extract metadata from
<meta>
tags that have aproperty
,itemprop
, orhttp-equiv
attribute, which the converter doesn’t support. -
If you’re using a web scraping configuration that excludes
<meta>
tags, from which the converter would otherwise extract metadata.
Enable IndexHtmlMetadata
If you’re in one of the usage scenarios, enable IndexHtmlMetadata
as follows:
-
On the Sources (platform-ca | platform-eu | platform-au) page, click the desired source, and then click More > Edit configuration with JSON in the Action bar.
-
Locate the
IndexHtmlMetadata
parameter in the JSON configuration."IndexHtmlMetadata": { "sensitive": false, "value": "false" }
-
Set its
value
totrue
."IndexHtmlMetadata": { "sensitive": false, "value": "true" }
-
Click Save and rebuild source.
Use the crawler metadata values in your fields
By default, most fields are populated with the values extracted by the converter when the same metadata is extracted both by the converter and the crawler.
To force Coveo to index metadata values coming from the crawler, set the origin
argument to crawler
in your mapping rule.
Suppose your source View and map metadata subpage shows values being extracted for the category
metadata both by the crawler and the converter.
To have your crawler values (for example, Developer Guide
) indexed in your field, you would use the following mapping rule:
%[category:crawler]
.