Web scraping configuration
Web scraping configuration
Web and Sitemap sources let you create web scraping configurations, with each configuration typically targeting a specific set of pages on your website.
A web scraping configuration lets you:
-
Precisely select the web page content to index by excluding specific parts.
ExampleFor three-pane web pages, you exclude the repetitive header, footer, and left navigation panel to only index the original content of each page.
-
Extract content to create metadata.
ExampleIn a question and answers site, no
<meta>
element provides the number of votes for the question, but the value appears on the page. You use a CSS selector rule to extract the string of this value and index it as question metadata. -
Create sub-items that become independent items in the index.
ExampleIn a blog post, you extract comments as sub-items.
After creating sub-items, you can apply a web scraping configuration that targets those sub-items specifically. For example, you could set rules to extract specific metadata from the blog post comments.
|
Note
To edit web scraping configurations, you must have the required privileges. |
Web scraping configuration editing modes
The Web source Web scraping tab features two modes to manage web scraping configurations:
UI assisted mode

The Web source lets you add (1), edit (2), and delete (3) one web scraping configuration at a time through a user interface that makes many technical aspects transparent. UI assisted mode is easier to use and more mistake-proof than Edit with JSON mode.
Use this mode except for sub-item related configurations (which are only supported in Edit with JSON mode).
Edit with JSON mode
The Edit with JSON button gives access to the aggregated web scraping JSON configuration of the source. Adding, editing, and deleting configurations directly in the JSON requires more technical skills than using UI assisted mode.


Use this mode to perform sub-item related configurations and when you want to test your aggregated web scraping configuration with the Coveo Labs Web Scraper Helper.
|
Note
The Web scraping tab displays a message when the aggregated web scraping configuration contains a sub-item related configuration. ![]() |
After saving changes in either mode, the changes become visible and editable in the other mode (except sub-items which are only visible in Edit with JSON mode).
Web scraping configuration sections
The aggregated web scraping configuration consists of a JSON array of configuration objects. Each configuration object can contain five high-level sections (or JSON properties) identified in the following JSON schema.
Further details on configuring each section through the recommended editing mode are provided further below.
[ -> Array of configurations.
The first matching "for" configuration is applied.
{
"name": string,
-> Name given to the current configuration object in the array.
"for": {
-> Specifies which pages to target with the current configuration.
This corresponds to the "Pages to target" setting in UI assisted mode.
"urls": string[] -> An array of REGEX to match specific URLs (or a part thereof).
"types": string[] -> An array of types when you want to match specific subItems.
},
"exclude": [
-> Array of selector objects to remove elements from the page.
These selectors can't be Boolean or absolute.
This corresponds to the "Elements to exclude" setting in UI assisted mode.
{
"type": string, -> "CSS" or "XPATH".
"path": string -> The actual selector value.
}
],
"metadata": {
-> Map of selector objects.
The key represents the name of the piece of metadata.
This corresponds to the "Metadata to extract" setting in UI assisted mode.
"<METADATA_NAME>": { -> Replace <METADATA_NAME> with the actual metadata name you want to use.
"type": string, -> "CSS" or "XPATH"
"path": string, -> The actual selector value.
"isBoolean": Boolean, -> Whether to evaluate this selector as a Boolean.
If the selected element is found, the returned value is "true".
This parameter is only supported in "Edit with JSON" mode.
"isAbsolute": Boolean -> Whether to retrieve the metadata from the parent page instead of
the current subItem.
This parameter is only supported in "Edit with JSON" mode.
}
},
"subItems": {
-> Map of selectors.
The key represents the type of subItem to retrieve.
These selectors can't be Boolean or absolute.
This parameter is only supported in "Edit with JSON" mode.
"<TYPES_VALUE>": { -> Replace <TYPES_VALUE> with the name you want to identify the subItems as.
You can target these <TYPES_VALUE> in latter configuration objects using
the "types" property.
"type": string, -> "CSS" or "XPATH"
"path": string -> The actual selector value.
}
}
},
{...}
]
In UI assisted mode, you can set the configuration name in the Basic configuration tab. | |
See Basic configuration tab. | |
See Elements to exclude tab. | |
See Metadata to extract tab. | |
See subItems property. |
Configurations in UI assisted mode
Basic configuration
Name
Provide a descriptive name
for your web scraping configuration as you will likely set up multiple web scraping configurations for your source.
Pages to target
The Pages to target settings generate the urls
property values for the current web scraping configuration in the aggregated JSON.
The urls
represent the web pages that are targeted by the current web scraping configuration.
To target sub-items instead of URLs, see the types
property.
For each crawled page, only the configuration associated with the first matching Pages to target rules is applied. The order in which the web scraping configurations appear may impact the scraping outcome. Your first web scraping configuration should have the most specific filters. As you add more configurations, make your filters more generic. Your last configuration should have the most generic filter (typically the Apply to all pages filter).
When you use the Apply to all pages option, Coveo automatically adds an all-inclusive rule behind the scenes for you. As a result, the associated web scraping configuration is applied to all pages of the source (or all remaining pages if a higher-priority web scraping configuration exists).
When you use the Apply to pages if they match at least one rule, you must then add one or multiple rules to specify the pages of the source you want to target (or the pages within the remaining pages if a higher priority web scraping configuration exists).
You can use any of the five available types of rules:
-
is and a URL which includes the protocol (e.g.,
https://myfood.com/
) -
contains and a string found in the URL (e.g.,
recipes
) -
begins with and a string found at the beginning of the URL and which includes the protocol (e.g.,
https://myfood
) -
ends with and a string found at the end of the URL (e.g.,
.pdf
) -
matches regex rule and a regex rule that matches the whole URL or a part of it
Examples-
\.html$
to capture all pages whose URL ends with.html
-
^.*company\.com\/employees\/.` to capture all employee profile pages like `+https://company.com/employees/Julie-Moreau
When using the matches regex rule type, test your regular expressions in a tool such as Regex101 to make sure they match the desired URLs.
-
Elements to exclude
The Elements to exclude settings generate the
exclude
property values for the current web scraping configuration in the aggregated JSON.
You can specify one or multiple HTML page elements that won’t be indexed in the pages targeted by the current web scraping configuration.
For each section that you want to exclude from indexing, choose the selector type (CSS or XPATH) and then input the selector itself.
Links in excluded parts are followed, so you can exclude navigation sections such as a table of contents, but the source crawler will still discover the pages listed in the table of contents.
You want to index Stack Overflow site pages.
Only the title, the question, and the answers matter to you, so you want to remove the top bar, the header, the top advertisement, the Google add below the title, and the sidebar on the right.
Your Elements to exclude could be configured as follows:

|
Note
Excluding sections may affect the processing performance as the page is reloaded after the exclusion, but the performance hit may be perceived only when you crawl at full speed and the website response is fast (>200,000 items/hour). |
Metadata to extract
|
The Web source automatically extracts default metadata. Make sure the metadata you want isn’t already extracted as default metadata before configuring web scraping metadata. |
The Metadata to extract settings generate the metadata
property values for the current web scraping configuration in the aggregated JSON.
You can configure one or multiple metadata to extract from the pages targeted by the current web scraping configuration.
For each metadata that you want to extract, provide a metadata name, a selector type (CSS or XPATH), and the selector itself.
When indexing Stack Overflow site pages, you want to set metadata for:
-
The number of votes for the question.
-
The date and time the question was asked.
Your Metadata to extract could be configured as follows:

After extracting custom metadata from your source, you can:
-
Add fields for this new custom metadata.
-
Add mappings to populate your fields with the desired metadata you extracted.
Configurations in Edit with JSON mode
subItems
property
The subItems
property defines how to create sub-items when you want to create multiple index source items from a single web page.
After indexing, your source will contain one item for the entire web page and as many sub-items as your subItems
property configuration detects.
The subItems
property is a map of selectors with each key representing a sub-item types
value.
When naming your sub-item types, take into consideration that types
values are mapped to the @documenttype
field.
For each types
value you define, you must specify a selector type
(CSS or XPATH) and a path
(the actual selector string).
|
Note
Sub-item indexing doesn’t include the CSS, so the Quick View of sub-items shows their content without the formatting. |
On a Q&A site, each page contains a question and several answers. You want one item for the question part, and one item for each answer.
Your aggregated JSON configuration would contain a configuration object with the following structure:
{
"name": "Q_and_A",
"for": {
"urls": [".*"]
},
"exclude": [{}],
"metadata": {},
"subItems": {
"answers": {
"type": "<CSS_OR_XPATH>",
"path": "<SELECTOR_FOR_THE_ANSWERS>"
}
}
}
where <CSS_OR_XPATH>
and <SELECTOR_FOR_THE_ANSWERS>
are replaced with the appropriate selector type and selector.
types
property
The for
property can contain arrays of urls
and types
.
The types
array lets you target sub-items you created in a previous configuration object.
To create a web scraping configuration that targets sub-items
-
Create a web scraping configuration below the one in which the sub-items are created. You can perform this step in UI assisted mode.
-
In the new web scraping configuration section, in the
for
section, use thetypes
parameter to match the sub-item <TYPES_VALUE> set in the sub-item creation configuration. -
In the new web scraping configuration section, specify the desired web scraping configurations (i.e., the
exclude
andmetadata
properties).
You have a web scraping configuration called Parent
that creates sub-items called comments
.
You want to create a web scraping configuration called Child
that extracts a details
metadata from these comments
sub-items.
Your aggregated JSON configuration would have the following structure:
[{
"name": "Parent",
"for": {
"urls": [".*"]
},
"exclude": [{}],
"metadata": {},
"subItems": {
"comments": {
"type": "<CSS_OR_XPATH>",
"path": "<SELECTOR_FOR_THE_COMMENTS_SUBITEMS>"
}
}
},
{
"name": "Child",
"for": {
"types": ["comments"]
},
"exclude": [{}],
"metadata": {
"details": {
"type": "<CSS_OR_XPATH>",
"path": "<SELECTOR_FOR_THE_DETAILS_METADATA>"
}
}
}]
where <CSS_OR_XPATH>
, <SELECTOR_FOR_THE_COMMENTS_SUBITEMS>
, and <SELECTOR_FOR_THE_DETAILS_METADATA>
are replaced with the appropriate selector types and selectors.
If you extract metadata from your sub-items, you can then:
-
Add fields for this new metadata.
-
Add mappings to populate your fields with the metadata you extracted.
NoteYou can create mapping rules that only apply to a given sub-item type.
isBoolean
property
The isBoolean
property is used to return true
or false
for the current metadata
object value rather than what the selector itself returns.
When the selector matches any element on the page, the metadata
object value is set to true
, and false
otherwise.
You want to create a metadata called questionHasAnswer
.
You want questionHasAnswer
to be set to true
if the web page contains at least one <div class="answer">
element.
Your aggregated JSON configuration would contain the following metadata configuration object:
"metadata":
{
"questionHasAnswer": {
"type": "CSS",
"path": "div.answer",
"isBoolean": true
}
}
isAbsolute
property
When extracting metadata from a sub-item, selectors are only applied to the sub-item body, by default.
Use the isAbsolute
property to apply the selectors to the parent page instead of the current sub-item.
Selectors
The web scraping configuration supports XPath or CSS (JQuery-style) selector types. Selectors let you select the HTML page elements (or their text content) that you want to include or exclude in your source for a given page.
You should know the following about selectors in a web scraping configuration:
-
You can use either XPath or CSS or both types in the same web scraping configuration.
-
When no type is specified, Coveo considers it’s CSS by default.
-
By default, if a selector matches many elements, they’re returned as a multi-value metadata (an array of strings).
-
If the selector path matches DOM elements, the elements are returned.
-
If the selector matches text nodes, such as when you use "
text()
" in an XPath, only the text values are returned. -
You can’t chain selectors.
|
Leading practice
You can use the developer tools of your browser, such as those of Google Chrome:
![]() |
CSS selectors
CSS selectors are the most commonly used web selectors (see CSS Selector Reference). They’re used extensively with jQuery (see Category: Selectors). CSS selectors rely on DOM element names, classes, IDs, and there hierarchy in HTML pages to isolate specific elements.
The following CSS selector selects the element which has the class content
that’s inside a span
under a div
element.
div > span > .content
The web scraping configuration supports use of CSS selector pseudo-elements to retrieve element text or attribute values.
-
Text
Add
::text
at the end of a CSS selector to select the inner text of an HTML element.ExampleThe following expression selects the text of a
span
element with a classtitle
that’s under adiv
element with a classpost
.div.post > span.title::text
-
Attribute
Add
::attr(<attributeName>)
at the end of a CSS selector to select an HTML element attribute value, where<attributeName>
is the name of the attribute you want to extract.Examples-
You want to get the URL from a post title link:
div.post > a.title::attr(href)
-
For a Stack Overflow website page, you want to extract the asked date that appears in the side bar, but you want to get the date and time in the
title
attribute, not the text.<p class="label-key" title="2012-06-27 13:51:36Z"> <b>4 years ago</b> </p>
The following expression selects the value of the
title
attribute of thep
element.div#sidebar table#qinfo p::attr(title)
-
XPath selectors
XPath lets you select nodes in an XML item in a tree-like fashion using URI-style expressions (see XPath Syntax). While XPath is an older technology and more verbose than CSS selectors, it offers features not available with CSS selectors, such as selecting an element containing a specific value or that has an attribute with a specific value.
-
The following expression returns the value of the
content
attribute for themeta
element (under thehead
element) that has an attributeproperty
with theog:url
value.//head/meta[@property="og:url"]/@content
-
The following expression returns the class of the paragraph that contains the sentence.
//p[text()='This is some content I want to match the element']/@class
An advantage of XPath over CSS is that you can use common XPath functions such as boolean()
, count()
, contains()
, and substrings()
to evaluate things that aren’t available using CSS.
-
You want to get a date string from a
title
attribute in a<strong>
element that can only be uniquely identified by the parent element that contains the textquestion asked
.<p>question asked: <strong title="Dec. 15, 2016, 12:18 p.m.">15 Dec, 12:18</strong> </p>
You can take advantage of the
contains()
function in the following XPath selector to get the attribute text://p[contains(.,'question asked')]/strong/@title
-
You want to extract the number of answers in a Q&A page. Each question is in a
<div class="answer">
. You can take advantage of thecount()
method to get the number of answers in the page:count(//tr[@class='answer'])
|
Note
The XPath selector must be compatible with XPath 1.0. |
Advanced web scraping JSON example
Context:
For Stack Overflow website pages, you want to split the question and each answer in separate index items. This enables result folding in the search interface to wrap the answers under the corresponding question item (see About result folding).
Solution:
You create the following web scraping configuration which:
-
Excludes non-content sections (header, herobox, advertisement, sidebar, footer).
-
Extracts some question metadata.
-
Defines
answer
sub-items. -
Extracts some answer metadata.
[
{
"name": "questions",
"for": {
"urls": [".*"]
},
"exclude": [
{
"type": "CSS",
"path": "body header"
},
{
"type": "CSS",
"path": "#herobox"
},
{
"type": "CSS",
"path": "#mainbar .everyonelovesstackoverflow"
},
{
"type": "CSS",
"path": "#sidebar"
},
{
"type": "CSS",
"path": "#footer"
},
{
"type": "CSS",
"path": "#answers"
}
],
"metadata": {
"askeddate":{
"type": "CSS",
"path": "div#sidebar table#qinfo p::attr(title)"
},
"upvotecount": {
"type": "XPATH",
"path": "//div[@id='question'] //span[@itemprop='upvoteCount']/text()"
},
"author":{
"type": "CSS",
"path": "td.post-signature.owner div.user-details a::text"
}
},
"subItems": {
"answer": {
"type": "CSS",
"path": "#answers div.answer"
}
}
},
{
"name": "answers",
"for": {
"types": ["answer"]
},
"metadata": {
"upvotecount": {
"type": "XPATH",
"path": "//span[@itemprop='upvoteCount']/text()"
},
"author": {
"type": "CSS",
"path": "td.post-signature:last-of-type div.user-details a::text"
}
}
}
]
Tips, tools, and troubleshooting
Working efficiently and using the proper tools will help you successfully and more rapidly develop a web scraping configuration. Here are a few pointers:
1- Use UI assisted mode whenever possible
-
UI assisted mode generates regexes for you, handles character escaping, and validates your input values. UI assisted mode is simpler and more mistake proof than Edit with JSON mode.
-
Create a web scraping configuration in UI assisted mode, even if you need to use Edit with JSON mode for some configurations later. For example, the left image below shows that you can just provide the configuration name in UI assisted mode and save to have the web scraping configuration JSON structure (right image below) created for you.
2- Work incrementally
-
Use a test source that includes only a few typical pages to test your web scraping configuration as you develop it. Rebuilding this test source will be quick. Once the configuration works as desired for your test source, apply it to more or all of the items and validate the results.
-
Incrementally add web scraping properties to your JSON configuration. Save functional web scraping configurations so you can roll back your changes, if necessary.
3- Use the right tools
-
Use the Content Browser (platform-ca | platform-eu | platform-au) to validate your configuration changes (see Inspect search results).
-
Use the Export to Excel option to view field values for many items at a time.
-
Use the Coveo Labs Web Scraper Helper available on the Chrome Web Store to test web scraping configurations.
-
In Edit with JSON mode, copy (
) your JSON configuration.
-
When viewing the web page you want to test, paste the contents of your clipboard into the Helper Text tab.
This will show the metadata values captured and the excluded HTML elements (which the Helper hides) on that page.
-
-
When working in Edit with JSON mode, the Web source validates your web scraping configuration JSON in real time, underlining content in red whenever it encounters an unexpected character. Hover over an error for more details. For example, note the missing comma at the end of line 3 in the following example:
-
Test your regular expressions in a tool such as Regex101 to make sure they match the desired URLs. If you copy your regex back into the aggregated web scraping JSON afterward (i.e., in Edit with JSON mode), remember to escape backslash (
\
) characters.Figure 1. Missing escape characterFigure 2. Properly escaped backslash
4- Get help
The Troubleshooting Web source issues article will help you solve most web scraping configuration-related problems.