Web scraping configuration
Web and Sitemap sources can include a web scraping configuration, a powerful tool that lets members with the required privileges precisely select the web page content to index, exclude specific parts, extract content to create metadata, and create sub-items.
However, as a member with these privileges, you must add the web scraping configuration to your Web or Sitemap source in JSON format, which requires more technical skills.
|
Note
Consider using the Coveo Labs Web Scraper Helper, available on the Chrome Web Store and on GitHub (web-scraper-helper project), to more easily create and, above all, test web scraping configurations. |
Application examples
-
Exclude parts of the page from being indexed.
Example: For 3-pane web pages, you exclude the repetitive header, footer, and left navigation panel to only index the original content of each page.
-
Create metadata with values extracted from the page.
Example: In a question-and-answer site, no meta element provides the number of votes for a question, but the value appears on the page. You use a CSS selector rule to extract this value and index it as question metadata.
-
Extract part of an item to create sub-items that become independent items in the index.
Example: In a blog post, you extract comments as sub-items.
-
Set different web scraping rules for different item types or sub-items.
Example: In the same blog post where comments are extracted as sub-items, you can extract different metadata for the parent item and for the comments.
Example: For Stack Overflow website pages, you want to split the question and each answer into separate index items and later use result folding in the search interface to group the answers under the corresponding question item (see About result folding).
You can do that with the following web scraping configuration, which:
-
Excludes non-content sections (header, herobox, advertisement, sidebar, footer).
-
Extracts some question metadata.
-
Defines answer sub-items.
-
Extracts some answer metadata.
[ { "name": "questions", "for": { "urls": [".*"] }, "exclude": [ { "type": "CSS", "path": "body header" }, { "type": "CSS", "path": "#herobox" }, { "type": "CSS", "path": "#mainbar .everyonelovesstackoverflow" }, { "type": "CSS", "path": "#sidebar" }, { "type": "CSS", "path": "#footer" }, { "type": "CSS", "path": "#answers" } ], "metadata": { "askeddate":{ "type": "CSS", "path": "div#sidebar table#qinfo p::attr(title)" }, "upvotecount": { "type": "XPATH", "path": "//div[@id='question'] //span[@itemprop='upvoteCount']/text()" }, "author":{ "type": "CSS", "path": "td.post-signature.owner div.user-details a::text" } }, "subItems": { "answer": { "type": "CSS", "path": "#answers div.answer" } } }, { "name": "answers", "for": { "types": ["answer"] }, "metadata": { "upvotecount": { "type": "XPATH", "path": "//span[@itemprop='upvoteCount']/text()" }, "author": { "type": "CSS", "path": "td.post-signature:last-of-type div.user-details a::text" } } } ]
Web scraping leading practices
When a web scraping JSON configuration causes an error, the Coveo Administration Console currently doesn’t return error details, making it hard to troubleshoot. The following procedure suggests steps that can help you develop a web scraping configuration successfully.
-
Use a test source that includes only a few typical pages to test your web scraping configuration as you develop it. Rebuilding the source will be quick.
-
Use a text editor to create your web scraping JSON configuration (see JSON configuration schema).
-
Incrementally add web scraping elements to your JSON configuration (see Filters (for), SubItems, Exclusion, Metadata).
-
Ensure that the syntax of your JSON configuration is valid. You can use online tools such as JSONLint.
-
In the source configuration panel, paste the valid JSON in the Web scraping configuration box.
-
Click Save and Rebuild to make sure that changes perform as expected.
When an error is returned, roll back your changes to a functional configuration to help identify the changes that cause the errors.
-
Use the Content Browser (platform-ca | platform-eu | platform-au) to validate your changes (see Inspect search result items):
-
Use the debug panel to view fields and their values for one item.
-
Use the Export to Excel option to easily compare field values for many items.
-
-
When your web scraping extracts metadata:
-
Ensure that the fields receiving the metadata exist (see Add or edit a field).
-
Ensure that the metadata is mapped to the appropriate fields (see Manage source mappings).
-
-
Once the configuration works as desired for your test source, apply it to more items or all of them and validate the results.
-
Apply the configuration to your production source.
JSON configuration schema
The configuration consists of a JSON array of setting objects. The JSON structure is defined and commented in the following schema.
[ -> Array of settings.
The first matching "for" setting is applied.
{
"name": string, -> Name given to the current settings object in the array (optional).
"for": { -> Filters which pages have these settings applied.
"urls": string[] -> An array of REGEX to match specific URLs (or a part thereof).
"types": string[] -> An array of types when you want to match specific subItems.
},
"exclude": [ -> Array of selector objects to remove elements from the page.
These selectors can't be Boolean or absolute.
{
"type": string, -> "CSS" or "XPATH".
"path": string -> The actual selector value.
}
],
"metadata": { -> Map of selector objects.
The key represents the name of the piece of metadata.
"meta1": {
"type": string, -> "CSS" or "XPATH"
"path": string, -> The actual selector value.
"isBoolean": Boolean, -> Whether to evaluate this selector as a Boolean.
If the selected element is found, the returned value is "true".
"isAbsolute": Boolean -> Whether to retrieve the metadata from the parent page instead of the current subItem.
}
},
"subItems": { -> Map of selectors.
The key represents the type of subItem to retrieve.
These selectors can't be Boolean or absolute.
"post": {
"type": string, -> "CSS" or "XPATH"
"path": string -> The actual selector value.
}
}
},
{...}
]
You can copy and paste the following web scraper configuration JSON sample, and then tailor the filter, exclusions, metadata, and sub-items to your needs.
[{
"name": "MyConfigName1",
"for": {
"urls": [".*"]
},
"exclude": [{
"type": "CSS|XPATH",
"path": "css|xpath selector"
}, {
"type": "CSS|XPATH",
"path": "css|xpath selector"
}],
"metadata": {
"meta1": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
}
},
"subItems": {
"MySubItemName": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
}
}
},
{
"name": "MyConfigName2",
"for": {
"types": ["MySubItemName"]
},
"metadata": {
"meta1": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
},
"title": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
},
"uri": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
},
"meta2": {
"type": "CSS|XPATH",
"path": "css|xpath selector"
}
}
}]
Filters (for)
The for element represents the restriction of the current setting object, that is, to which pages or sub-item types the settings are applied.
This element is mandatory. At least one for element must match the page URLs; otherwise, the settings aren’t applied to any pages.
For each page, only the first matching for element is applied. When you include more than one for element, the order in which the elements appear may impact the scraping outcome. You typically start with the most specific filters (see the ordering sketch after the examples below).
A for element specifies a matching value for either urls (pages) or types (sub-items, see SubItems). The value can be a REGEX expression.
-
You can apply a setting object to all site items with the .* REGEX:
"for": { "urls": [".*"] }
-
You can apply a setting object only to HTML pages:
"for": { "urls": ["\\.html$"] }
-
You can apply a setting object only to a post sub-item:
"for": { "types": ["post"] }
Selectors
The web scraping configuration supports XPath and CSS (jQuery-style) selector types. Selectors let you precisely select the HTML page elements (or their text content) that you want to include in or exclude from your source for a given page.
You should know the following about selectors in a web scraping configuration:
-
You can use either XPath or CSS or both types in the same web scraping configuration.
-
When no type is specified, Coveo treats the selector as CSS by default (see the sketch after this list).
-
By default, if a selector matches many elements, they’re returned as multi-value metadata (an array of strings).
-
If the selector path matches DOM elements, the elements are returned.
-
If the selector matches text nodes, such as when you use text() in an XPath, only the text values are returned.
-
You can’t chain selectors.
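As a minimal sketch of these two behaviors (the metadata names and selector paths are hypothetical), the author selector below omits type and is therefore treated as a CSS selector, while the tags selector matches several elements and is consequently returned as multi-value metadata (an array of strings):
"metadata": {
    "author": {
        "path": "div.user-details a::text"
    },
    "tags": {
        "type": "CSS",
        "path": "div.post-taglist a.post-tag::text"
    }
}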
|
Leading practice
You can use the developer tools of your browser, such as those of Google Chrome, to inspect page elements and test your selectors. |
CSS selectors
CSS selectors are the most commonly used web selectors (see CSS Selector Reference). They’re used extensively with jQuery (see Category: Selectors). CSS selectors rely on DOM element names, classes, IDs, and their hierarchy in HTML pages to isolate specific elements.
The following CSS selector selects the element that has the class content, inside a span element, under a div element.
div > span > .content
The web scraping configuration supports the use of CSS selector pseudo-elements to retrieve element text or attribute values.
-
Text
Add ::text at the end of a CSS selector to select the inner text of an HTML element.
Example: The following expression selects the text of a span element with a class title that’s under a div element with a class post.
div.post > span.title::text
-
Attribute
Add ::attr(<attributeName>) at the end of a CSS selector to select an HTML element attribute value, where <attributeName> is the name of the attribute you want to extract.
Examples
-
You want to get the URL from a post title link:
div.post > a.title::attr(href)
-
For a Stack Overflow website page, you want to extract the asked date that appears in the sidebar, but you want to get the date and time from the title attribute, not the text.
<p class="label-key" title="2012-06-27 13:51:36Z"> <b>4 years ago</b> </p>
The following expression selects the value of the title attribute of the p element.
div#sidebar table#qinfo p::attr(title)
XPath selectors
XPath allows you to select nodes in an XML document in a tree-like fashion using URI-style expressions (see XPath Syntax). While XPath is an older technology and more verbose than CSS selectors, it offers features not available with CSS selectors, such as selecting an element that contains a specific value or that has an attribute with a specific value.
-
The following expression returns the value of the content attribute for the meta element (under the head element) that has an attribute property with the og:url value.
//head/meta[@property="og:url"]/@content
-
The following expression returns the class of the paragraph that contains a specific sentence.
//p[text()='This is some content I want to match the element']/@class
An advantage of XPath over CSS is that you can use common XPath functions such as boolean(), count(), contains(), and substring() to evaluate things that aren’t possible using CSS selectors.
-
You want to get a date string from a title attribute in a <strong> element that can only be uniquely identified by the parent element that contains the text question asked.
<p>question asked: <strong title="Dec. 15, 2016, 12:18 p.m.">15 Dec, 12:18</strong> </p>
You can take advantage of the contains() function in the following XPath selector to get the attribute text:
//p[contains(.,'question asked')]/strong/@title
-
You want to extract the number of answers in a Q&A page. Each answer is in a <div class="answer"> element. You can take advantage of the count() function to get the number of answers in the page:
count(//div[@class='answer'])
|
Note
The XPath selector must be compatible with XPath 1.0. |
Exclusion
You can use the exclude element to remove one or more parts of the page body, meaning their content won’t be indexed.
However, links in excluded parts will still be followed, so you can exclude navigation sections such as a table of contents (TOC) while the source still discovers the pages linked by the TOC.
You want to index Stack Overflow site pages such as this one:
http://stackoverflow.com/questions/11227809/
You want to index only the title, the question, and the answers, so you want to remove the top bar, the header, the top advertisement, the Google ad below the title, and finally the sidebar on the right.
Your exclude element uses selectors and could be:
"exclude": [
{
"type": "CSS",
"path": "div.topbar"
},
{
"type": "CSS",
"path": "#header"
},
{
"type": "CSS",
"path": "#herobox"
},
{
"type": "CSS",
"path": "div.everyonelovesstackoverflow"
},
{
"type": "CSS",
"path": "#sidebar"
}
]
|
Note
Copying and pasting this example will return an error. You must integrate the example in the complete web scraping configuration schema (see JSON configuration schema). |
|
Note
Excluding sections may affect processing performance because the page is reloaded after the exclusion. However, the performance hit may be perceived only when you crawl at full speed against a website that responds quickly (>200,000 items/hour). |
Metadata
The metadata element defines how to retrieve specific metadata for the current page or sub-item. It’s a map of selectors, each key representing the metadata name.
In this section, each selector can be set to return a Boolean instead of values using the isBoolean property, which returns true when the selector matches any element on the page, and false otherwise.
When extracting metadata from a sub-item, selectors are evaluated on the sub-item body only by default. They can also be made absolute using the isAbsolute property so that they’re evaluated against the parent page instead of the current element. Use this setting only in sub-item settings, when you want to retrieve metadata that’s available in the parent page only (see the sketch after the following example).
From Stack Overflow site pages such as the one used in the Exclusion example, you want to set metadata for:
-
The number of votes for the question.
-
The date and time the question was asked.
-
Whether there’s at least one answer for the question.
"metadata":
{
"questionVotes": {
"type": "XPATH",
"path": "//*[@id='question']//div[@class='vote']/span/text()"
},
"questionAskedDate": {
"type": "CSS",
"path": "#question div.user-action-time > span::attr(title)"
},
"questionHasAnswer": {
"type": "CSS",
"path": "div.answer",
"isBoolean": true
}
}
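The example above doesn’t use isAbsolute. As a minimal sketch (the answer sub-item type and the selector path are hypothetical), the following settings object for an answer sub-item retrieves the question title from the parent page rather than from the answer element itself:
{
    "name": "answers",
    "for": {
        "types": ["answer"]
    },
    "metadata": {
        "parentquestiontitle": {
            "type": "CSS",
            "path": "#question-header h1 a::text",
            "isAbsolute": true
        }
    }
}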
SubItems
|
Note
The Sitemap source type doesn’t support the subItems element. |
The subItems element defines how to retrieve sub-items when you want to create several source items from a single web page.
-
A web forum page contains many posts. You want to create a source item for each post in the page to make them individually searchable.
-
On a Q&A site, each page contains a question and several answers. In your Coveo organization source, you want one item for the question part, and one item for each answer.
The subItems element is a map of selectors, with the keys representing the sub-item types.
-
In the subItems section:
-
Define one or more sub-item types.
-
You can choose any sub-item type name, but since this value will be mapped to the @documenttype field, you should enter a value that fits well with the other field values.
Example: In a Q&A website, a page may contain a question, answers, and comments. You could define answer and comment sub-items, while the main item would contain only the question.
-
When you want to perform web scraping tasks on matching parts of a sub-item type:
-
Add a section starting with a for statement, and use the types parameter to match the sub-item name.
-
Include the desired web scraping tasks (exclude, metadata).
[{
"name": "Config 1",
"for": {
"urls": [".*"]
},
"exclude": [{}],
"metadata": {},
"subItems": {
"mySubItemName": {
"type": "CSS",
"path": "css selector"
}
}
},
{
"name": "Config 2",
"for": {
"types": ["mySubItemName"]
},
"exclude": [{}],
"metadata": {
"meta1": {
"type": "CSS",
"path": "css selector"
},
"meta2": {
"type": "CSS",
"path": "css selector"
}
}
}]