Web source JSON modification
AdditionalHeaders (Object | null)
ExpandBeforeFiltering (Boolean)
FollowAutoRedirects (Boolean)
FollowCanonicalLink (Boolean)
IgnoreUrlFragment (Boolean)
IndexExternalPages (Boolean)
IndexExternalPagesLinks (Boolean)
IndexJsonLdMetadata (Boolean)
IndexSubdomains (Boolean)
MaxAutoRedirects (Integer)
MaxCrawlDelayInSeconds (Integer)
MaxCrawlDepth (Integer)
MaxPageSizeInBytes (Integer)
MaxPagesToCrawl (Integer)
MaxPagesToCrawlPerDomain (Integer)
MaxRequestsRetryCount (Integer)
MinCrawlDelayPerDomainInMilliseconds (Integer)
RequestsRetryDelayInMilliseconds (Integer)
RequestsTimeoutInSeconds (Integer)
RespectUrlCasing (Boolean)
RobotsDotTextUserAgentString (String | null)
SendCookies (Boolean)
SkipOnStartingAddressError (Boolean)
UseHiddenBasicAuthentication (Boolean)
UseProxy (Boolean)
UserAgentString (String)
Many source configuration parameters can be set through the Coveo Administration Console user interface. Others, such as rarely used parameters or new parameters that aren’t yet editable through the user interface, can only be changed or added in the JSON configuration.
This article presents Web source hidden parameters that you can change by modifying the source JSON configuration from the Coveo Administration Console. All parameters must be added to the parameters
section of the source JSON configuration.
|
For instructions on accessing the JSON file and best practices to follow, see Edit a source JSON configuration. |
AdditionalHeaders
(Object | null)
Specifies a map of HTTP headers in JSON format to be sent with every HTTP request. The value must be a valid JSON map in the format: {\"header1\": \"value1\", \"header2\": \"value2\", … }
. The value is null
by default.
"AdditionalHeaders": {
"sensitive": true,
"value": "{\"X-CSRF-Token\": \"6b609652ef8e448eb6de270139d489c3\"}"
}
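Because the value is a JSON map stored inside a JSON string, the inner quotes must be escaped. The following is a minimal sketch of one way to produce that string without escaping it by hand. The helper name `build_additional_headers` is hypothetical and isn't part of any Coveo tool; the header and token are the sample values from the example above.

```python
import json

# Sample header map (values taken from the example above).
headers = {"X-CSRF-Token": "6b609652ef8e448eb6de270139d489c3"}


def build_additional_headers(header_map, sensitive=True):
    """Hypothetical helper: serialize the header map into the string
    expected in the "value" key of the parameter entry."""
    return {"sensitive": sensitive, "value": json.dumps(header_map)}


entry = build_additional_headers(headers)

# Serializing the whole entry escapes the inner quotes automatically.
print(json.dumps({"AdditionalHeaders": entry}, indent=2))
```

Parsing the `value` string back with `json.loads` is a quick way to confirm that it's a valid JSON map before pasting it into the source configuration.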
ExpandBeforeFiltering
(Boolean)
Whether the crawler expands the pages (that is, discovers the links they contain) before applying the inclusion and exclusion filters. This can slow down crawling, since pages that end up being filtered out are still downloaded. The default value is false
.
You want to index the pages that are hyperlinked on https://mysite.com/
, but you don’t want to index https://mysite.com/
itself. You therefore proceed as follows:
-
You set a Starting URL value to
https://mysite.com/
in the source configuration user interface. -
You add an exclusion rule for
https://mysite.com/
in the source configuration user interface. -
You set
ExpandBeforeFiltering
to true
.
"ExpandBeforeFiltering": {
"sensitive": false,
"value": "true"
}
FollowAutoRedirects
(Boolean)
Whether the crawler should follow automatic redirections. The default value is true
.
"FollowAutoRedirects": {
"sensitive": false,
"value": "false"
}
FollowCanonicalLink
(Boolean)
Whether the crawler should redirect to the canonical link (if any) in the <head>
section of the page. Using canonical links helps reduce duplicates, the canonical link being the "preferred" version of a page. The default value is false
.
"FollowCanonicalLink": {
"sensitive": false,
"value": "true"
}
IgnoreUrlFragment
(Boolean)
Whether to ignore the fragment part of the URLs of discovered pages. The fragment part of a URL is everything after the hash (#
) sign. The default value is true
. Setting IgnoreUrlFragment
to false
may be necessary, for example, with one-page web apps.
"IgnoreUrlFragment": {
"sensitive": false,
"value": "false"
}
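To illustrate why this matters for one-page web apps, here is a minimal sketch of fragment handling using Python's standard library. The `normalize` function and the sample URLs are assumptions made for illustration; they don't reflect the crawler's actual implementation.

```python
from urllib.parse import urldefrag


def normalize(url, ignore_url_fragment=True):
    """Illustrative only: drop the fragment when ignore_url_fragment is true."""
    return urldefrag(url).url if ignore_url_fragment else url


# Two views of a hypothetical one-page web app that differ only by fragment.
found = ["https://mysite.com/app#/home", "https://mysite.com/app#/settings"]

# With fragments ignored, both URLs collapse into the same page.
print({normalize(u) for u in found})

# With fragments kept, each view remains a distinct URL.
print({normalize(u, ignore_url_fragment=False) for u in found})
```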
IndexExternalPages
(Boolean)
Whether to index linked web pages that aren’t part of the domain of the specified Starting URL.
The default value is false
.
|
Note
The |
"IndexExternalPages": {
"sensitive": false,
"value": "true"
}
One of your Starting URLs is https://www.mycompany.com
and one of its pages contains a link to https://en.wikipedia.org/
.
If IndexExternalPages
is set to false
, the linked Wikipedia page isn’t crawled and included in your searchable content.
However, a local URL that redirects the visitor to an external page will override this setting.
One of your Starting URLs is https://www.mycompany.com
and one of its pages contains a link to https://www.mycompany.com/mypage
.
When clicked, this link redirects the reader to https://en.wikipedia.org/
.
The Wikipedia page will be crawled and included in your searchable content.
IndexExternalPagesLinks
(Boolean)
Whether linked pages in external web pages are indexed. The default value is false
.
|
Note
|
|
Including pages linked in external pages can lead anywhere on the Internet, infinitely discovering linked pages on other websites. If you do enable this option, you should add one or more rules, preferably inclusion rules, to restrict the discovery to identifiable sites. |
"IndexExternalPagesLinks": {
"sensitive": false,
"value": "true"
}
IndexJsonLdMetadata
(Boolean)
Whether to index metadata from JSON-LD <script>
tags. The default value is false
.
When enabled, JSON-LD objects in the web page are flattened and represented in jsonld.parent.child
metadata format in your Coveo organization.
Given the following JSON-LD script tag in a web page:
<script id="jsonld" type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "NewsArticle",
"url": "http://www.bbc.com/news/world-us-canada-39324587",
"publisher": {
"@type": "Organization",
"name": "BBC News",
"logo": "http://www.bbc.co.uk/news/special/2015/newsspec_10857/bbc_news_logo.png?cb=1"
},
"headline": "Canada Strikes Gold in Olympic Hockey Final"
}
</script>
To index the publisher name value (i.e., BBC News
in this page) in an index field:
-
Set your configuration as follows:
"IndexJsonLdMetadata": { "sensitive": false, "value": "true" }
-
Use the following mapping rule for your field:
%[jsonld.publisher.name]
|
Note
Contextual JSON-LD key-value pairs whose keys begin with |
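The flattening into jsonld.parent.child names can be pictured with the short sketch below. It's only an approximation written for this article: the `flatten` function is hypothetical, and the way the Web source handles arrays or keys starting with special characters isn't modeled here.

```python
import json

# Abridged version of the JSON-LD object from the example above.
JSON_LD = """
{
  "@type": "NewsArticle",
  "publisher": {
    "@type": "Organization",
    "name": "BBC News"
  },
  "headline": "Canada Strikes Gold in Olympic Hockey Final"
}
"""


def flatten(obj, prefix="jsonld"):
    """Hypothetical helper: turn nested objects into dotted metadata names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into child objects, extending the dotted path.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


metadata = flatten(json.loads(JSON_LD))
print(metadata["jsonld.publisher.name"])
```

The resulting key, jsonld.publisher.name, matches the name used in the mapping rule above.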
IndexSubdomains
(Boolean)
Whether to index subdomains of the base Starting URL you entered when creating the Web source.
The default value is false
.
Setting IndexSubdomains
to true
in effect adds all subdomains to the list of starting URLs without having to add them individually in the user interface.
It doesn’t change your inclusion and exclusion rules though.
If you set IndexSubdomains
to true
, make sure you adjust your inclusion and exclusion rules.
When IndexSubdomains
is set to true
, the Web source strips the www
subdomain when determining subdomains.
For example, if your base Starting URL is https://www.abc.com
, the Web source considers https://docs.abc.com
as being a subdomain.
"IndexSubdomains": {
"sensitive": false,
"value": "true"
}
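The www stripping described above can be sketched as follows. This is an assumption-laden illustration rather than the Web source's actual logic: the function names are hypothetical, and only the `www` prefix rule stated above is modeled.

```python
from urllib.parse import urlsplit


def base_domain(starting_url):
    """Hypothetical helper: host of the Starting URL with a leading 'www.' removed."""
    host = urlsplit(starting_url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_subdomain(url, starting_url):
    """Return True when the URL's host is a subdomain of the base domain."""
    host = urlsplit(url).hostname or ""
    base = base_domain(starting_url)
    return host != base and host.endswith("." + base)


print(is_subdomain("https://docs.abc.com", "https://www.abc.com"))
print(is_subdomain("https://en.wikipedia.org/", "https://www.abc.com"))
```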
MaxAutoRedirects
(Integer)
The maximum number of redirects that the requests may follow. No maximum is applied when the value is 0
. The default value is 7
.
"MaxAutoRedirects": {
"sensitive": false,
"value": "10"
}
MaxCrawlDelayInSeconds
(Integer)
The maximum number of seconds to respect in the robots.txt
directive of Crawl-delay: X. The robots.txt directive override source crawling setting in the user interface must not be enabled for MaxCrawlDelayInSeconds
to be used. If MaxCrawlDelayInSeconds
is set to 0
, the crawler won’t respect the directive of the robots.txt
file. The default value is 5
.
"MaxCrawlDelayInSeconds": {
"sensitive": false,
"value": "3"
}
MaxCrawlDepth
(Integer)
Maximum number of link levels under your Starting URLs to include in the source. When the value is 0
, only the starting URLs are crawled and all of their links are ignored. The default value is 100
.
"MaxCrawlDepth": {
"sensitive": false,
"value": "10"
}
MaxPageSizeInBytes
(Integer)
The maximum size of a web resource in bytes. If the resource size exceeds this value, it isn’t downloaded or processed. No maximum is applied when the value is 0
. The default value is 536870912
(512 MB).
"MaxPageSizeInBytes": {
"sensitive": false,
"value": "400000000"
}
MaxPagesToCrawl
(Integer)
The maximum number of pages to index. No maximum is applied when the value is 0
. The default value is 2500000
.
|
This setting prevents excessively long rebuilds and rescans. Increase its value only when absolutely necessary. |
"MaxPagesToCrawl": {
"sensitive": false,
"value": "10000"
}
MaxPagesToCrawlPerDomain
(Integer)
The maximum number of pages to index per domain. No maximum is applied when the value is 0
. The default value is 0
.
"MaxPagesToCrawlPerDomain": {
"sensitive": false,
"value": "8000"
}
MaxRequestsRetryCount
(Integer)
The maximum number of request retries for a URL, if an error occurs. No retry is attempted when the value is 0
. The default value is 3
.
"MaxRequestsRetryCount": {
"sensitive": false,
"value": "5"
}
MinCrawlDelayPerDomainInMilliseconds
(Integer)
The number of milliseconds between consecutive HTTP requests sent to retrieve content on a given domain. The default value is 1000
, which means that a maximum of 3,600 items are indexed per hour on any given domain.
"MinCrawlDelayPerDomainInMilliseconds": {
"sensitive": false,
"value": "1500"
}
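The 3,600 figure follows from dividing the number of milliseconds in an hour by the delay. The sketch below applies the same arithmetic to other delay values; the function name is hypothetical, and the result is an upper bound that ignores download and processing time.

```python
MILLISECONDS_PER_HOUR = 60 * 60 * 1000


def max_requests_per_hour(min_crawl_delay_ms):
    """Upper bound on requests per hour on one domain for a given delay."""
    if min_crawl_delay_ms <= 0:
        raise ValueError("the delay must be a positive number of milliseconds")
    return MILLISECONDS_PER_HOUR // min_crawl_delay_ms


for delay in (1000, 1500):
    print(delay, max_requests_per_hour(delay))
```

With the default delay of 1000 the bound is 3,600 requests per hour, and with the 1500 value from the example above it drops to 2,400.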
RequestsRetryDelayInMilliseconds
(Integer)
The minimum delay in milliseconds between a failed HTTP request and the ensuing retry. The default value is 1000
.
"RequestsRetryDelayInMilliseconds": {
"sensitive": false,
"value": "1500"
}
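The way MaxRequestsRetryCount and RequestsRetryDelayInMilliseconds interact can be sketched as a simple retry loop. This is an illustration under stated assumptions, not the crawler's code: the `fetch_with_retries` and `flaky_fetch` names are hypothetical, and treating `OSError` as the retriable error type is an arbitrary choice made for the sketch.

```python
import time


def fetch_with_retries(fetch, url, max_requests_retry_count=3,
                       requests_retry_delay_ms=1000, sleep=time.sleep):
    """Call fetch(url); on error, wait and retry up to the configured count."""
    retries = 0
    while True:
        try:
            return fetch(url)
        except OSError:
            if retries >= max_requests_retry_count:
                raise  # retry budget exhausted; a count of 0 means no retry
            retries += 1
            sleep(requests_retry_delay_ms / 1000)


calls = []


def flaky_fetch(url):
    """Hypothetical fetcher that fails twice before succeeding."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://mysite.com/", sleep=waits.append)
print(page, len(calls), waits)
```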
RequestsTimeoutInSeconds
(Integer)
Timeout value in seconds for web requests. Web requests don’t timeout when the value is set to 0
. The default value is 60
.
Consider increasing the value if a target website is sometimes slow to respond. Doing so helps avoid timeout errors on pages that would otherwise load successfully.
"RequestsTimeoutInSeconds": {
"sensitive": false,
"value": "60"
}
RespectUrlCasing
(Boolean)
Whether page URLs that you include are case sensitive. The default value is true
.
"RespectUrlCasing": {
"sensitive": false,
"value": "true"
}
|
When RespectUrlCasing is Set to True
The crawler uses page URLs as is. If the same web page appears twice on the Web server (i.e., with different URL casings), you can expect duplicate items in your source, one copy for each URL casing variant found by the crawler. When web page URLs are case sensitive (i.e., when |
|
When RespectUrlCasing is Set to False
The crawler first converts the current page URL to lowercase letters. It then uses the lowercased page URL to perform an HTTP request to the Web server. If the Web server is case-sensitive and resource URLs on the server contain uppercase letters, the Web server won’t serve the request. The way the Web server deals with this situation (e.g., HTTP error) varies from site to site. When web page URLs aren’t case sensitive (i.e., when |
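The effect of both settings on duplicates can be pictured with the short sketch below. The `request_url` function and the sample URLs are hypothetical and only model the lowercasing step described above.

```python
def request_url(url, respect_url_casing=True):
    """Illustrative only: the URL that would be requested for a found link."""
    return url if respect_url_casing else url.lower()


# The same hypothetical page, linked with two different casings.
found = ["https://mysite.com/Products/Item", "https://mysite.com/products/item"]

# Casing respected: two distinct URLs, hence possible duplicate items.
print({request_url(u) for u in found})

# Casing not respected: both links collapse into one lowercased URL.
print({request_url(u, respect_url_casing=False) for u in found})
```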
RobotsDotTextUserAgentString
(String | null)
The User Agent used by the Coveo crawler to interact with the target Web content. The User Agent is compared against website robots.txt
file allow/disallow directives, provided the robots.txt directive override user interface setting isn’t enabled.
The default value is CoveoBot
.
"RobotsDotTextUserAgentString": {
"sensitive": false,
"value": "AcmeBot"
}
SendCookies
(Boolean)
Whether the crawler can exchange HTTP cookies with the crawled website. The default value is true
.
"SendCookies": {
"sensitive": false,
"value": "false"
}
SkipOnStartingAddressError
(Boolean)
Whether the crawler should skip a Starting URL that it can’t retrieve.
This setting may be useful on a source that includes multiple starting URLs.
When SkipOnStartingAddressError
is set to true
, a rebuild doesn’t fail when unable to retrieve a given starting URL.
Instead, it moves on to the next starting URL specified in the source configuration user interface.
The default value is true
.
|
When |
"SkipOnStartingAddressError": {
"sensitive": false,
"value": "false"
}
UseHiddenBasicAuthentication
(Boolean)
Whether the crawler should send a basic authentication header when the crawled website responds with an HTTP 404 Not Found
status code. This is unsafe in most cases and thus UseHiddenBasicAuthentication
is false
by default.
"UseHiddenBasicAuthentication": {
"sensitive": false,
"value": "true"
}
UseProxy
(Boolean)
Whether the Crawling Module should use a proxy to access the content to be crawled.
The default value is false
.
"UseProxy": {
"sensitive": false,
"value": "false"
}
UserAgentString
(String)
Identifier used by the Coveo crawler when requesting pages from the web server.
The default value is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)
.
"UserAgentString": {
"sensitive": false,
"value": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Coveobot/2.0;+http://www.coveo.com/bot.html)"
}