Creating a Crawling Module Source

The Coveo Cloud V2 On-Premises Crawling Module is still a beta product. Several manual installation and configuration steps will be automated in subsequent beta releases.

Moreover, the Crawling Module is not available in trial organizations yet.

Once you have started your workers, you can create a source to crawl (see Managing the On-Premises Crawling Module Using the REST API). For now, a configuration UI in the Coveo Cloud V2 administration console is available for File and Database sources only (see Creating a File or Database Source). To index any other source type, you must use the Coveo Cloud V2 Platform API (see Creating a Source Using the Source API).

For now, these procedures do not cover indexing the permissions associated with your secured content. If you want to index secured content and take access permissions into account, contact the Coveo Support team.

Creating a Crawling Module Source Using the Source API

For source types that have no administration console UI yet, you must create your On-Premises Crawling Module source via the Coveo Cloud V2 Platform API (see Source API). To do so, in the Create a source from simple configuration call, under Parameters, you must provide:

  • The organization ID of the organization you linked with Maestro (see Review the Organization ID and Linking Maestro to the Coveo Cloud V2 Platform). This replaces {organizationId} in the request path.

    The Crawling Module is not available in trial organizations yet.

  • The JSON source configuration, with placeholders replaced by your source values, in the request body. Each source type requires a different JSON configuration. The configurations to provide are shown in the source-specific sections below.

    If you need a more advanced configuration, contact the Coveo Support team for assistance.

    Basic Properties

    The following basic properties are common to all or most of the Coveo On-Premises Crawling Module source configurations:

    • The name property value is the source name as it should appear in the Coveo Cloud V2 administration console. Replace <SourceDisplayName> with the desired source name.

      You cannot change the name of a source once it has been created, so make sure the name you choose fits the content you intend to index with that source.

    • The sourceVisibility property determines who can see the source items in their search results (see Source Permission Types). Choose SHARED, SECURED, or PRIVATE.
    • The sourceType value determines the type of source you create. Provide the value corresponding to the source to index (see Possible sourceType Values).
    • Boolean properties pushEnabled and onPremisesEnabled must always be true.
    • The username and password values allow the crawler to access the secured content to index. Replace the value placeholders with the corresponding credentials.

Also in the Create a source from simple configuration call, leave the updateSecurityProviders and rebuild parameter values set to true unless otherwise instructed by the Coveo Support team.

For an example of a request and of a response body, see Creating a Basic Shared Web Source in the Source API documentation.
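As a rough illustration of the call described above, the following Python sketch assembles (without sending) a Create a source from simple configuration request. The base URL, the /rest/organizations/{organizationId}/sources path, the bearer-token header, the sample values, and the helper name build_create_source_request are assumptions made for this example. Check the Source API reference for the exact endpoint and authentication scheme.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL and path; verify both against the Source API reference.
PLATFORM_URL = "https://platform.cloud.coveo.com"


def build_create_source_request(organization_id, source_config, access_token):
    """Assemble (but do not send) a source creation request."""
    # Both query parameters stay set to true, as recommended above.
    query = urllib.parse.urlencode(
        {"rebuild": "true", "updateSecurityProviders": "true"}
    )
    url = "{}/rest/organizations/{}/sources?{}".format(
        PLATFORM_URL, urllib.parse.quote(organization_id, safe=""), query
    )
    return urllib.request.Request(
        url,
        data=json.dumps(source_config).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + access_token,  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
web_source = {
    "name": "Intranet",
    "sourceVisibility": "SHARED",
    "sourceType": "WEB2",
    "pushEnabled": True,
    "onPremisesEnabled": True,
    "urls": ["http://intranet.example.com"],
}
request = build_create_source_request(
    "myorganizationid", web_source, "my-access-token"
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would actually send the request. The sketch stops short of that so it can be inspected safely first.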


Confluence

Depending on the content you want to index, you might need to download the Coveo Plugin for Atlassian Confluence before creating a Confluence source (see Installing the Coveo Plugin for Atlassian Confluence).

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "CONFLUENCE2_HOSTED",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "urls": ["<URL>", "<URL>"],
  "username": "<User>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to replace each <URL> placeholder with an address to crawl.

Jira

Depending on the content you want to index, you might need to download the Coveo Plugin for Atlassian Jira before creating a Jira source (see Installing the Coveo Plugin for Atlassian Jira).

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "JIRA2_HOSTED",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "serverUrl": "<SERVER_URL>",
  "indexAttachments": <true|false>,
  "indexComments": <true|false>,
  "indexWorkLogs": <true|false>,
  "supportCommentPermissions": <true|false>,
  "username": "<User>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • Replace <SERVER_URL> with the address of your Jira server.

    http://MyJiraServer:8080/

  • Indicate true or false for:

    • indexAttachments, which determines whether binary files attached to an issue should be indexed. Attachments are indexed with the same level and sets as their parent issue.
    • indexComments, which determines whether comments on an issue should be indexed. Comments are indexed with the same level and sets as their parent issue. When permissions on the comments are supported, if a comment is restricted to a group or a project role, an additional set with the group or the role is added.
    • indexWorkLogs, which determines whether time entries on an issue should be indexed. Work logs are indexed with the same level and sets as their parent issue. If a work log is restricted to a group or a project role, an additional set with the group or the role is added.
    • supportCommentPermissions, which determines whether only users allowed to see a comment in Jira can also see it in their search results. If this property value is true, an issue and its comments are indexed as separate items, leading to lower search relevance. If the value is false, the issue and its comments are indexed as one item, so that users can find the item via either the issue or its comments. In that case, however, any user who can see the issue can also see all of its comments.
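The Jira properties above can be assembled programmatically. The following Python sketch is one possible way to do so. It assumes that the four indexing options are sent as JSON booleans, and the helper name build_jira_source and the sample values are hypothetical.

```python
import json

JIRA_FLAGS = (
    "indexAttachments",
    "indexComments",
    "indexWorkLogs",
    "supportCommentPermissions",
)


def build_jira_source(name, visibility, server_url, username, password, **flags):
    """Return a Jira source configuration as a dict ready for json.dumps."""
    if visibility not in ("SHARED", "SECURED", "PRIVATE"):
        raise ValueError("unknown sourceVisibility: " + repr(visibility))
    config = {
        "name": name,
        "sourceVisibility": visibility,
        "sourceType": "JIRA2_HOSTED",
        "pushEnabled": True,  # must always be true
        "onPremisesEnabled": True,  # must always be true
        "serverUrl": server_url,
    }
    for key in JIRA_FLAGS:
        value = flags.get(key)
        # Only real booleans are accepted here (assumption: the API expects
        # JSON true/false rather than the strings "true"/"false").
        if not isinstance(value, bool):
            raise TypeError(key + " must be True or False")
        config[key] = value
    config["username"] = username
    config["password"] = password
    return config


jira_source = build_jira_source(
    "My Jira Issues",
    "SECURED",
    "http://MyJiraServer:8080/",
    "crawler-account",
    "MyPassword",
    indexAttachments=True,
    indexComments=True,
    indexWorkLogs=False,
    supportCommentPermissions=False,
)
print(json.dumps(jira_source, indent=2))
```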

Jive

Depending on the content you want to index, you might need to download the Coveo Plugin for Jive before creating a Jive source (see Installing the Coveo Plugin for Jive).

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "JIVE_HOSTED",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "permissions": null,
  "instanceUrl": "<INSTANCE_URL>",
  "startingSpaceUrl": <null|"value">,
  "substitutionLocale": <null|"value">,
  "substitutionTheme": <null|"value">,
  "placesTitles": <null|"value">,
  "ignoreItemsOfTypes": <null|"value">,
  "indexSpaces": <true|false>,
  "indexProjects": <true|false>,
  "indexGroups": <true|false>,
  "indexSystemBlogs": <true|false>,
  "indexPeople": <true|false>,
  "indexPublishedOnly": <true|false>,
  "allowAnonymousAccess": <true|false>,
  "username": "<User>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • Replace <INSTANCE_URL> with the address of your Jive server.

    https://myjiveserver.mycompany.com

  • Enter a value for the following properties if you wish to leverage the corresponding feature or customization. If not, indicate null.

    • For startingSpaceUrl, enter the URL of the space at which the crawling should start. Only content within this space and its subspaces will be available in the index. This does not affect social groups and people.
    • If you use phrase substitutions in your Jive community, you must provide a value for both substitutionLocale and substitutionTheme.
    • For substitutionLocale, enter the phrase substitution locale that the connector should use. The default locale is default, meaning that the actual machine locale is used. Enter the locale in any of the following three formats:
      • Language only (e.g. en)
      • Language and country (e.g. en_US)
      • Language, country, and variant (e.g. en_US_NY)
    • For substitutionTheme, enter the phrase substitution theme that the connector should use. The default theme is custom.
    • For placesTitles, enter the metadata to use as a title for Jive places. The metadata must be specified in the following order: space_meta;group_meta;project_meta;person_meta;blog_meta.
    • For ignoreItemsOfTypes, enter a list of all Jive item types to ignore, using semicolons to separate entries. The crawler indexes all items, except those of the specified types. Possible values are: Announcement, Attachment, Checkpoint, Comment, Discussion, Dm, Document, File, Group, Idea, Message, Poll, Project, Space, SystemBlog, Task, Update, and Video.
  • Indicate true or false for:
    • indexSpaces, which determines whether Jive spaces and any item they contain should be indexed.
    • indexProjects, which determines whether Jive projects and any item they contain should be indexed.
    • indexGroups, which determines whether Jive groups and any item they contain should be indexed.
    • indexSystemBlogs, which determines whether system blogs and any item they contain should be indexed.
    • indexPeople, which determines whether Jive people and any item they contain should be indexed.
    • indexPublishedOnly, which determines whether only items with a Published status are indexed. Content that is incomplete, pending approval, or rejected is not indexed.
    • allowAnonymousAccess, which determines whether the source maps the Everyone group in Jive to the Everyone group of the Email identity security provider (see Coveo Cloud V2 Management of Security Identities and Item Permissions). Therefore, when this property value is true, all users (authenticated and anonymous) can see the Jive content allowed to Jive Everyone in search results.
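The semicolon-separated values described above for ignoreItemsOfTypes and placesTitles are easy to get wrong by hand. The following Python sketch shows one way to build them. The helper names and sample metadata names are hypothetical, and the list of item types is copied from the possible values listed above.

```python
JIVE_ITEM_TYPES = {
    "Announcement", "Attachment", "Checkpoint", "Comment", "Discussion",
    "Dm", "Document", "File", "Group", "Idea", "Message", "Poll",
    "Project", "Space", "SystemBlog", "Task", "Update", "Video",
}


def join_ignored_types(types):
    """Build the ignoreItemsOfTypes value, or None when nothing is ignored."""
    unknown = [t for t in types if t not in JIVE_ITEM_TYPES]
    if unknown:
        raise ValueError("unknown Jive item types: " + ", ".join(unknown))
    return ";".join(types) if types else None


def join_places_titles(space, group, project, person, blog):
    """Build the placesTitles value in the required metadata order."""
    return ";".join([space, group, project, person, blog])


# Hypothetical values, for illustration only.
print(join_ignored_types(["Poll", "Task", "Video"]))
print(join_places_titles("name", "name", "name", "displayName", "name"))
```

Returning None for an empty list matches the instruction to indicate null when a feature is not used, since json.dumps serializes None as null.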

Open Database Connectivity (Database)

You can configure a Database source through the Coveo Cloud V2 administration console if you use a 64-bit driver to connect to your database (see Creating a File or Database Source). If you use a 32-bit driver, you must use the Coveo Cloud V2 Platform API and the JSON configuration below.

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "DATABASE",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "connectionString": "<Driver=DRIVER;Server=SERVERNAME;Database=DBNAME;Uid=@uid;Pwd=@pwd;>",
  "ForceX86": <true|false>,
  "ConfigFileContent": "<Base64EncodedContent>",
  "DriverType": "<Odbc|SqlClient>",
  "ItemType": "<COMMA,SEPARATED,VALUES>",
  "username": "<User>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • For connectionString, replace <Driver=DRIVER;Server=SERVERNAME;Database=DBNAME;Uid=@uid;Pwd=@pwd;> with the connection string used to connect to the database.  

    • The connection string syntax differs from one database type to another. Refer to the appropriate documentation for the format of the connection string specific to your database (see Connection Strings).

    • The same connection string can be used for different sources. However, there can only be one connection string per source.

    • Specify the exact name of the desired driver. Refer to the list of available drivers, depending on the driver type you select using the ForceX86 property below (see Viewing the Available Drivers for an ODBC Source).

    • You can hide the password and the user ID in the connection string (see CES 7.0 Replacing the Identity in Database Connection Strings topic).

  • For ForceX86, indicate true for a 32-bit driver and false for a 64-bit driver. This property determines which driver type should be used.

    The driver type you choose must match the type of the driver you specified in your connection string.

  • For ConfigFileContent, provide an XML configuration file with the desired content, mappings, allowed users, etc. to enable Coveo to retrieve and copy the data from record fields to Coveo default and standard source fields (see Example of a Configuration File). For your JSON source configuration to be valid, this configuration file must be base64-encoded. Use the Base64 Encode and Decode online tool, and then replace <Base64EncodedContent> with the encoded output.
  • For DriverType, specify which type of driver provides access to the database: Odbc or SqlClient.

  • For ItemType, replace <COMMA,SEPARATED,VALUES> with the Mapping type values from your XML configuration file above. These values are the table or object names to retrieve and must be separated by commas.
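If you prefer not to paste your configuration file into an online tool, the base64 encoding step above can also be done locally. The following Python sketch encodes the XML text and assembles the Database properties. It assumes the file is UTF-8 text. The helper names, the sample table names, and the stand-in XML string are hypothetical, and the stand-in is not a valid configuration file.

```python
import base64
import json


def encode_config_file(xml_text):
    """Base64-encode the XML configuration text (UTF-8 is assumed)."""
    return base64.b64encode(xml_text.encode("utf-8")).decode("ascii")


def build_database_source(name, visibility, connection_string, xml_text,
                          item_types, force_x86, username, password,
                          driver_type="Odbc"):
    """Return a Database source configuration as a dict."""
    if driver_type not in ("Odbc", "SqlClient"):
        raise ValueError("DriverType must be Odbc or SqlClient")
    return {
        "name": name,
        "sourceVisibility": visibility,
        "sourceType": "DATABASE",
        "pushEnabled": True,
        "onPremisesEnabled": True,
        "connectionString": connection_string,
        "ForceX86": bool(force_x86),  # True for a 32-bit driver
        "ConfigFileContent": encode_config_file(xml_text),
        "DriverType": driver_type,
        "ItemType": ",".join(item_types),  # comma-separated Mapping types
        "username": username,
        "password": password,
    }


# Stand-in content only; a real file follows Example of a Configuration File.
SAMPLE_XML = "<PlaceholderConfig/>"

database_source = build_database_source(
    "My Database",
    "SHARED",
    "Driver=DRIVER;Server=SERVERNAME;Database=DBNAME;Uid=@uid;Pwd=@pwd;",
    SAMPLE_XML,
    ["Orders", "Customers"],
    True,
    "dbuser",
    "MyPassword",
)
print(json.dumps(database_source, indent=2))
```

In practice, you would read the XML text from your configuration file (for example with open(path, encoding="utf-8").read()) instead of using a literal string.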

SharePoint

You can use a SharePoint source to make your SharePoint 2016, 2013, or 2010 content searchable.

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "SHAREPOINT",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "urls": <["URL", "URL"]>,
  "authenticationType": "<WindowsClassic|WindowsUnderClaims|AdfsUnderClaims>",
  "crawlScope": "<WebApplication|SiteCollection|WebAndSubWebs|List>",
  "username": "<User>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • Replace <["URL", "URL"]> with the web application, site collection, website, or list addresses to crawl.

    • For a specific web application: https://site:8080/

    • For a specific site collection: https://site:8080/sites/support

    • For a specific website: https://site:8080/sites/support/subsite

    • For a specific list: https://site:8080/sites/support/lists/contacts/allItems.aspx

      Indexing a specific folder in a list is not supported.

  • For authenticationType, indicate the authentication type value corresponding to your SharePoint environment. Available values are:

    Value               Description
    WindowsClassic      Default Microsoft NTLM authentication mode
    WindowsUnderClaims  Windows authentication mode under claims
    AdfsUnderClaims     Authentication for Trusted Security Providers

  • For crawlScope, indicate the type of content to crawl in relation to the source urls that you specified. To crawl everything, indicate WebApplication, which is the default value and the highest element type in the SharePoint farm hierarchy.

    Value           Content to crawl
    WebApplication  All site collections of the specified web application
    SiteCollection  All websites of the specified site collection
    WebAndSubWebs   Only the specified website and its sub webs
    List            Only the specified list or document library

  • To crawl a web application:

      {
        "name": "My SharePoint Web Application",
        "sourceVisibility": "SECURED",
        "sourceType": "SHAREPOINT",
        "urls": ["http://mysharepointserver:35318/"],
        "authenticationType": "AdfsUnderClaims",
        "username": "john.smith@mycompany.com",
        "AdfsServerUrl": "https://adfs.server.com/",
        "SharePointTrustIdentifier": "urn:federation:MicrosoftOnline",
        "crawlScope": "WebApplication",
        "password": "MyPassword",
        "loadUserProfiles": true,
        "loadPersonalSites": true,
        "indexListFolders": true
      }
    
  • To crawl a sub web:

      {
        "name": "My SharePoint Sub Web",
        "sourceVisibility": "SECURED",
        "sourceType": "SHAREPOINT",
        "urls": ["http://mysharepointserver:35318/site/web/subweb"],
        "authenticationType": "WindowsUnderClaims",
        "username": "mycompany\\john.smith",
        "crawlScope": "WebAndSubWebs",
        "password": "MyPassword"
      }
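Because authenticationType and crawlScope only accept the values listed in the tables above, it can be useful to validate them before sending the request. The following Python sketch is one way to do so. The helper name build_sharepoint_source and the sample values are hypothetical.

```python
import json

AUTHENTICATION_TYPES = ("WindowsClassic", "WindowsUnderClaims", "AdfsUnderClaims")
CRAWL_SCOPES = ("WebApplication", "SiteCollection", "WebAndSubWebs", "List")


def build_sharepoint_source(name, visibility, urls, authentication_type,
                            username, password, crawl_scope="WebApplication"):
    """Return a SharePoint source configuration after checking enumerations."""
    if authentication_type not in AUTHENTICATION_TYPES:
        raise ValueError("unknown authenticationType: " + authentication_type)
    if crawl_scope not in CRAWL_SCOPES:
        raise ValueError("unknown crawlScope: " + crawl_scope)
    return {
        "name": name,
        "sourceVisibility": visibility,
        "sourceType": "SHAREPOINT",
        "pushEnabled": True,
        "onPremisesEnabled": True,
        "urls": list(urls),
        "authenticationType": authentication_type,
        "crawlScope": crawl_scope,  # WebApplication is the default value
        "username": username,
        "password": password,
    }


sharepoint_source = build_sharepoint_source(
    "My SharePoint Site Collection",
    "SECURED",
    ["https://site:8080/sites/support"],
    "WindowsClassic",
    "mycompany\\john.smith",
    "MyPassword",
    crawl_scope="SiteCollection",
)
print(json.dumps(sharepoint_source, indent=2))
```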
    

Sitemap

You can use a Sitemap source to make the web pages listed in a Sitemap (a Sitemap file or a Sitemap index file) searchable.

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|PRIVATE>",
  "sourceType": "SITEMAP",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "urls": <["URL", "URL"]>,
  "userAgent": "<UserAgent>",
  "enableJavaScript": <true|false>,
  "scrapingConfiguration": <""|SCRAPING_CONFIG>
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • Replace <["URL", "URL"]> with the addresses to crawl.

  • Provide the value of the user-agent HTTP header to use as userAgent. This is the identifier used when downloading web pages. Default is Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36.

  • Indicate true or false for enableJavaScript, which determines whether JavaScript should be evaluated and rendered before indexation. This option is useful when you want to index the dynamically rendered content of crawled pages. However, activating this option has a significant impact on the crawling performance.

  • Provide a JSON scraping configuration to use as scrapingConfiguration or leave the quotation marks empty (see Web Scraping Configuration).

Web

You can use the Coveo On-Premises Crawling Module to crawl an internal website, i.e. pages available on a certain network only.

If you want to crawl a public website, i.e. a website that is globally available on the Internet, you can use the Coveo Cloud V2 administration console (see Add a Web Source).

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|PRIVATE>",
  "sourceType": "WEB2",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "urls": ["<http://www.example.com>"]
}

File

You can also create a File source using the Coveo Cloud V2 administration console. However, if you must create a File source via an API call, you can use the following configuration.

{
  "name": "<SourceDisplayName>",
  "sourceVisibility": "<SHARED|SECURED|PRIVATE>",
  "sourceType": "FILE",
  "pushEnabled": true,
  "onPremisesEnabled": true,
  "startingAddresses": ["<file://Path/To/Shared/Folder/1>", "<file://Path/To/Shared/Folder/2>", ...],
  "expandMailArchives": <true|false>,
  "indexSharePermissions": <true|false>,
  "username": "<domain\\user>",
  "password": "<Password>"
}

In the request body, besides providing an adequate value for the basic properties listed above (see Basic Properties), make sure to:

  • Replace each <file://Path/To/Shared/Folder/...> placeholder with an address to crawl.

  • Indicate true or false for:

    • expandMailArchives, which determines whether the content of mail archives (.pst) should be indexed. Default value is false.
    • indexSharePermissions, which determines whether share and NTFS permissions should be taken into account and applied in Coveo Cloud V2.

      If you want to take permissions into account, contact the Coveo Support team.
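One detail that is easy to miss in a File source configuration is that the backslash in a domain\user name must be escaped in JSON. The following Python sketch shows that a JSON serializer handles this automatically. The folder address, account name, and other sample values are hypothetical.

```python
import json

# Hypothetical values, for illustration only.
file_source = {
    "name": "Shared Folders",
    "sourceVisibility": "SHARED",
    "sourceType": "FILE",
    "pushEnabled": True,
    "onPremisesEnabled": True,
    "startingAddresses": ["file://fileserver/share/docs"],
    "expandMailArchives": False,  # default value
    "indexSharePermissions": False,
    # In Python source, "\\" is a single backslash: mycompany\john.smith
    "username": "mycompany\\john.smith",
    "password": "MyPassword",
}

# json.dumps escapes the backslash, so the request body stays valid JSON.
body = json.dumps(file_source, indent=2)
print(body)
```

If you write the request body by hand instead, type the backslash twice, as in the SharePoint sub web example above.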

What’s Next?

Review the default refresh schedule and optionally change it to better fit your needs (see Edit a Source Schedule).