Web source JSON modification
Many source configuration parameters can be set through the user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be configured in the source JSON configuration.
This article explains how to configure Web source parameters, whether they’re already listed in the JSON or not.
Configuring listed and unlisted parameters
If the parameter you want to change is already listed in the parameters section of the source JSON configuration, modify its value in the JSON configuration.
If the parameter isn’t listed in the parameters section, copy the entire parameter example object from the Reference section below, paste it into the parameters section of the source JSON configuration, and then modify the value as needed.
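For instance, after pasting the FollowAutoRedirects and MaxCrawlDepth example objects from the Reference section below, the parameters section of the source JSON configuration could look like the following (the surrounding layout is a sketch and may differ slightly in your actual source JSON):

```json
"parameters": {
  "FollowAutoRedirects": {
    "sensitive": false,
    "value": "false"
  },
  "MaxCrawlDepth": {
    "sensitive": false,
    "value": "10"
  }
}
```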
Document the changes you make to the source JSON configuration in the Change notes area below the JSON configuration. This ensures that you can easily revert to a previous configuration if needed.
Reference
This section provides information on the Web source parameters that you can only modify through the JSON configuration.
If a JSON configuration parameter isn’t documented in this article, configure it through the user interface instead.
AdditionalHeaders
(Object | Null)
Specifies a map of HTTP headers in JSON format to be sent with every HTTP request. The value must be a valid JSON map in the format {\"header1\": \"value1\", \"header2\": \"value2\", … }. The default value is null.
"AdditionalHeaders": {
"sensitive": true,
"value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"
}
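Because the value is itself a JSON map serialized as a string, additional headers go inside the same escaped object rather than in separate entries. The second header name and both values below are hypothetical:

```json
"AdditionalHeaders": {
  "sensitive": true,
  "value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\", \"X-Api-Key\": \"hypothetical-key-value\"}"
}
```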
EnableJavaScriptRenderingOptimizations
(Boolean)
Whether to enable JavaScript rendering optimizations. When set to true, the crawler doesn’t download images and external files. The default value is true.
Example: setting EnableJavaScriptRenderingOptimizations to false
On a page, you have a dynamically generated table. The data in the table comes from a JSON file that’s downloaded from a server using JavaScript. To index the table data, you would need to set EnableJavaScriptRenderingOptimizations to false.
"EnableJavaScriptRenderingOptimizations": {
"sensitive": false,
"value": "false"
}
ExpandBeforeFiltering
(Boolean)
Whether the crawler expands pages before applying the inclusion and exclusion filters. This might make the crawling process slower, since filtered pages are still downloaded. The default value is false.
You want to index the pages that are linked from https://mysite.com/, but you don’t want to index https://mysite.com/ itself. To achieve this, proceed as follows:
- Set the Starting URL value to https://mysite.com/ in the source configuration user interface.
- Add an exclusion rule for https://mysite.com/ and use the default Include all non-excluded pages inclusion rule in the source configuration user interface.
- Set ExpandBeforeFiltering to true.
"ExpandBeforeFiltering": {
"sensitive": false,
"value": "true"
}
FollowAutoRedirects
(Boolean)
Whether the crawler should follow automatic redirections. The default value is true.
"FollowAutoRedirects": {
"sensitive": false,
"value": "false"
}
FollowCanonicalLink
(Boolean)
Whether the crawler should redirect to the canonical link (if any) in the <head> section of the page. Using canonical links helps reduce duplicates, the canonical link being the "preferred" version of a page. The default value is false.
"FollowCanonicalLink": {
"sensitive": false,
"value": "true"
}
IgnoreUrlFragment
(Boolean)
Whether to ignore the fragment part of found pages. The fragment part of the URL is everything after the hash (#) sign. The default value is true. Setting IgnoreUrlFragment to false may be necessary, for example, with single-page web apps.
"IgnoreUrlFragment": {
"sensitive": false,
"value": "false"
}
IndexExternalPages
(Boolean)
Whether to index linked web pages that aren’t part of the domain of the specified Starting URL. The default value is false.
The IndexExternalPages parameter has no bearing on whether Starting URL subdomain pages are indexed.
Warning: enabling this setting is unsafe. With basic authentication, when the crawler is challenged for access to an external page, it sends your basic authentication credentials to the external domain server. With form authentication, when the crawler is denied access to an external page and redirected to a page that contains a login form, it submits your form authentication credentials to the external domain server.
"IndexExternalPages": {
"sensitive": false,
"value": "true"
}
IndexExternalPages applicability examples
Example 1: A local page contains a link to an external page
One of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://en.wikipedia.org/.
If IndexExternalPages is set to false, the linked Wikipedia page won’t be crawled or included in your searchable content.
Example 2: A local page that redirects to an external page
The IndexExternalPages setting isn’t taken into account in this scenario.
For example, one of your Starting URLs is https://www.mycompany.com and one of its pages contains a link to https://www.mycompany.com/mypage. When clicked, this link redirects the reader to https://en.wikipedia.org/.
In this scenario, the Wikipedia page will be crawled and indexed if it’s included in your crawling scope.
IndexExternalPagesLinks
(Boolean)
Whether linked pages in external web pages are indexed. The default value is false.
Note: including pages linked in external pages can lead anywhere on the Internet, infinitely discovering linked pages on other sites. If you do enable this option, you should add one or more rules, preferably inclusion rules, to restrict the discovery to identifiable sites.
"IndexExternalPagesLinks": {
"sensitive": false,
"value": "true"
}
IndexJsonLdMetadata
(Boolean)
Whether to index metadata from JSON-LD <script> tags. The default value is false.
When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child metadata format in your Coveo organization.
Given the following JSON-LD script tag in a web page:
<script id="jsonld" type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "NewsArticle",
"url": "http://www.bbc.com/news/world-us-canada-39324587",
"publisher": {
"@type": "Organization",
"name": "BBC News",
"logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
},
"headline": "Canada Strikes Gold in Olympic Hockey Final"
}
</script>
To index the publisher name value (that is, BBC News in this page) in an index field:
- Set your configuration as follows:
  "IndexJsonLdMetadata": { "sensitive": false, "value": "true" }
- Use the following mapping rule for your field:
  %[jsonld.publisher.name]
Note: Contextual JSON-LD key-value pairs whose keys begin with …
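As a sketch of the flattening described above, indexing the JSON-LD example could yield metadata along these lines (the exact key set is an assumption, and the handling of @-prefixed keys isn’t covered here):

```json
{
  "jsonld.url": "http://www.bbc.com/news/world-us-canada-39324587",
  "jsonld.publisher.name": "BBC News",
  "jsonld.publisher.logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1",
  "jsonld.headline": "Canada Strikes Gold in Olympic Hockey Final"
}
```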
IndexSubdomains
(Boolean)
Whether to index subdomains of the base Starting URL you entered when creating the Web source. The default value is false.
Setting IndexSubdomains to true in effect adds all subdomains to the list of starting URLs without you having to add them individually in the user interface. However, it doesn’t change your inclusion and exclusion rules. If you set IndexSubdomains to true, make sure you adjust your inclusion and exclusion rules accordingly.
When IndexSubdomains is set to true, the Web source strips the www subdomain when determining subdomains. For example, if your base Starting URL is https://www.abc.com, the Web source considers https://docs.abc.com to be a subdomain.
"IndexSubdomains": {
"sensitive": false,
"value": "true"
}
MaxAutoRedirects
(Integer)
The maximum number of redirects that the requests may follow. No maximum is applied when the value is 0. The default value is 7.
"MaxAutoRedirects": {
"sensitive": false,
"value": "10"
}
MaxCrawlDelayInSeconds
(Integer)
The maximum number of seconds to respect for the Crawl-delay: X directive in a website’s robots.txt file. For MaxCrawlDelayInSeconds to be used, the robots.txt directive override source crawling setting in the user interface must not be enabled. If MaxCrawlDelayInSeconds is set to 0, the crawler doesn’t respect the Crawl-delay directive of the robots.txt file. The default value is 5.
"MaxCrawlDelayInSeconds": {
"sensitive": false,
"value": "3"
}
MaxCrawlDepth
(Integer)
Maximum number of link levels under your Starting URLs to include in the source. When the value is 0, only the starting URLs are crawled and all their links are ignored. The default value is 100.
"MaxCrawlDepth": {
"sensitive": false,
"value": "10"
}
MaxPageSizeInBytes
(Integer)
The maximum size of a web resource in bytes. If the resource size exceeds this value, it’s not downloaded or processed. No maximum is applied when the value is 0. The default value is 536870912 (512 MB).
"MaxPageSizeInBytes": {
"sensitive": false,
"value": "400000000"
}
MaxPagesToCrawl
(Integer)
The maximum number of pages to index. No maximum is applied when the value is 0. The default value is 2500000.
Note: this setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary.
"MaxPagesToCrawl": {
"sensitive": false,
"value": "10000"
}
MaxPagesToCrawlPerDomain
(Integer)
The maximum number of pages to index per domain. No maximum is applied when the value is 0. The default value is 0.
"MaxPagesToCrawlPerDomain": {
"sensitive": false,
"value": "8000"
}
MaxRequestsRetryCount
(Integer)
The maximum number of request retries for a URL if an error occurs. No retry is attempted when the value is 0. The default value is 3.
"MaxRequestsRetryCount": {
"sensitive": false,
"value": "5"
}
MinCrawlDelayPerDomainInMilliseconds
(Integer)
The number of milliseconds between consecutive HTTP requests sent to retrieve content on a given domain. The default value is 1000, which means that a maximum of 3,600 items are indexed per hour on any given domain.
"MinCrawlDelayPerDomainInMilliseconds": {
"sensitive": false,
"value": "1500"
}
RequestsRetryDelayInMilliseconds
(Integer)
The minimum delay in milliseconds between a failed HTTP request and the ensuing retry. The default value is 1000.
"RequestsRetryDelayInMilliseconds": {
"sensitive": false,
"value": "1500"
}
RequestsTimeoutInSeconds
(Integer)
Timeout value in seconds for web requests. Web requests don’t time out when the value is set to 0. The default value is 60.
Consider increasing the value if a target website sometimes responds slowly. This prevents avoidable timeout errors.
"RequestsTimeoutInSeconds": {
"sensitive": false,
"value": "60"
}
RespectUrlCasing
(Boolean)
Whether page URLs that you include are case sensitive. The default value is true.
"RespectUrlCasing": {
"sensitive": false,
"value": "true"
}
When RespectUrlCasing is set to true
The crawler uses page URLs as is. If the same web page appears twice on the web server (that is, with different URL casings), you can expect duplicate items in your source, one copy for each URL casing variant found by the crawler. When web page URLs are case sensitive (that is, when …
When RespectUrlCasing is set to false
The crawler first converts the current page URL to lowercase letters. It then uses the lowercased page URL to perform an HTTP request to the web server. If the web server is case sensitive and resource URLs on the server contain uppercase letters, the web server won’t serve the request. The way the web server deals with this situation (for example, an HTTP error) varies from site to site. When web page URLs aren’t case sensitive (that is, when …
RobotsDotTextUserAgentString
(String | Null)
The User Agent used by the Coveo crawler to interact with the target web content. The User Agent is compared against website robots.txt file allow/disallow directives, provided the robots.txt directive override user interface setting isn’t enabled. The default value is CoveoBot.
"RobotsDotTextUserAgentString": {
"sensitive": false,
"value": "AcmeBot"
}
SendCookies
(Boolean)
Whether the crawler can exchange HTTP cookies with the crawled site. The default value is true.
"SendCookies": {
"sensitive": false,
"value": "false"
}
SkipOnStartingAddressError
(Boolean)
Whether the crawler should skip a Starting URL when it’s erroneous. This setting may be useful on a source that includes multiple starting URLs.
When SkipOnStartingAddressError is set to true, a rebuild doesn’t fail when unable to retrieve a given starting URL. Instead, it moves on to the next starting URL specified in the source configuration user interface.
The default value is true.
"SkipOnStartingAddressError": {
"sensitive": false,
"value": "false"
}
UseHiddenBasicAuthentication
(Boolean)
Whether the crawler should send a basic authentication header in all requests. The default value is false.
Warning: enabling this setting is unsafe, as your basic authentication credentials will be sent with every page your source requests, regardless of the domain.
"UseHiddenBasicAuthentication": {
"sensitive": false,
"value": "true"
}
UseProxy
(Boolean)
Whether the Crawling Module should use a proxy to access the content to be crawled. The default value is false.
"UseProxy": {
"sensitive": false,
"value": "false"
}
UserAgentString
(String)
Identifier used by the Coveo crawler when requesting pages from the web server. The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html).
"UserAgentString": {
"sensitive": false,
"value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}