Prepare to index content

In this article

Locate content
Select the appropriate connectors
Identify content to include or exclude
What’s next?

A well-managed index should contain only items that are relevant to some or all of its intended audience. Before you start indexing, take the time to determine exactly what you need to index and which connectors your search project requires.

This article provides guidelines and examples to help you undertake those essential preliminary steps.

Locate content

Early in your search project, you must identify the various types of content your users need.

You also need to determine exactly where each of those types of content resides:

Is the content Cloud-accessible, or is it on-premises?
Is the content available through a specific application or platform (for example, Jive or SharePoint)?
- If so, what’s the application or platform version (if applicable)?
- If not, how would you describe the system through which the content is available?
Is the content secured, or is it publicly available?

Doing this will help you pinpoint the various connectors that your project requires.

Example content list

You carry out interviews and shadowing sessions with potential users to learn what content they expect to find in their search results.

After some investigation, you put together the following content list:

Content	Cloud-accessible	System	Secured
How-to videos	Yes	YouTube	No
Knowledge base articles	Yes	Salesforce	Yes
Support cases	Yes	Salesforce	Yes
Q&A forum discussions	Yes	Khoros Community	No
Internal documentation	No	Confluence Server 6.0.1	Yes
Public documentation	Yes	Static site	No
Project management data	Yes	Basecamp	Yes
Employee data	No	Custom intranet	Yes
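
Keeping the content list in a structured form can make the later steps easier to automate. The following is a minimal sketch of the list above, not a Coveo feature: the `ContentEntry` class and its field names are hypothetical and simply mirror the table columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentEntry:
    """One row of the content list (hypothetical structure mirroring the table)."""
    content: str
    cloud_accessible: bool
    system: str
    secured: bool


CONTENT_LIST = [
    ContentEntry("How-to videos", True, "YouTube", False),
    ContentEntry("Knowledge base articles", True, "Salesforce", True),
    ContentEntry("Support cases", True, "Salesforce", True),
    ContentEntry("Q&A forum discussions", True, "Khoros Community", False),
    ContentEntry("Internal documentation", False, "Confluence Server 6.0.1", True),
    ContentEntry("Public documentation", True, "Static site", False),
    ContentEntry("Project management data", True, "Basecamp", True),
    ContentEntry("Employee data", False, "Custom intranet", True),
]

# On-premises content and secured content usually constrain the connector choice.
on_premises = [entry.content for entry in CONTENT_LIST if not entry.cloud_accessible]
secured = [entry.content for entry in CONTENT_LIST if entry.secured]

print("On-premises:", on_premises)
print("Secured:", secured)
```

Listing the on-premises and secured entries separately highlights the rows that are most likely to narrow down your connector options.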

Select the appropriate connectors

The following diagram illustrates the process of determining what connector to use to index content from a given repository.

[Diagram: Determining which connector to use]

When a specific connector is available for the system that holds your content, use that connector. For example, to index content from a Salesforce instance, use a Salesforce source.

Otherwise, consider using one of the following generic source connectors:

Sitemap or Web source

To index an HTML site, or data that can be exported to HTML, consider using a Sitemap source (if a sitemap is available) or a Web source. Both feature a powerful web scraping tool that lets you exclude unwanted HTML elements, extract metadata key-values, and generate sub-items.

We recommend using Sitemap sources because they support the refresh operation, which is faster and more efficient than a rescan. Web sources don’t support refresh, so they must perform a daily rescan, which indexes reachable pages in random order.

If necessary, you can configure either of these connectors to rely on the Coveo Crawling Module. However, neither Web nor Sitemap sources support indexing permissions.

REST or GraphQL API source

To index data that’s accessible through an exposed API, consider using a REST API or GraphQL API source. While this connector is quite flexible, you must have a deep understanding of the API upon which you base its configuration.

The API sources use Coveo-hosted crawlers. They support incremental refresh and indexing permissions.

Database or File System source

To index data from an on-premises database or a Windows server, consider using a Database or File System source, respectively.

These connectors rely on the Coveo Crawling Module. They both support incremental refresh and indexing permissions.

Note

Behind the scenes, the Coveo Crawling Module interacts with the Push API directly, as do Sitecore sources.

Push source

If none of the existing specific or generic connectors meet your needs, then you will have to use a Push source to index your content. It supports indexing permissions and relies on a custom process to crawl content.

To index your content with a Push source, you will have to write, host, and execute your own custom connector to retrieve data, which may include permission models and security identities. You will also have to convert this data to JSON.

You can interact with a Push source in the following ways:

Use the Coveo CLI source:push commands (add, delete, and new)
Use one of the open-source SDKs (Python, Node.js, or Java)
Use the exposed Push API services to get your content into the index
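
To give an idea of the conversion step a custom connector performs, here is a minimal sketch that turns a record from a source system into a JSON string. The `to_push_document` helper, the sample record, and the property names (`documentId`, `title`, `data`, `department`) are assumptions made for illustration; verify the properties the Push API actually expects against its reference documentation. Permission models and security identities are left out of this sketch.

```python
import json


def to_push_document(record: dict) -> str:
    """Convert a source-system record to a JSON string (hypothetical mapping)."""
    document = {
        # Assumed property names; check them against the Push API reference.
        "documentId": record["url"],
        "title": record["name"],
        "data": record["body"],
        # Example of custom metadata carried over from the source record.
        "department": record.get("department", ""),
    }
    return json.dumps(document, ensure_ascii=False)


# Hypothetical record retrieved by your own crawling process.
record = {
    "url": "https://intranet.example.com/employees/42",
    "name": "Jane Doe",
    "body": "Senior support agent",
    "department": "Support",
}

payload = to_push_document(record)
print(payload)
```

Sending the resulting payload to the index (through the CLI, an SDK, or the Push API) is a separate step that this sketch doesn't cover.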

Example

You use the content list that you drafted to figure out which connectors or sources you require for your search project.

You determine that you need YouTube, Salesforce, and Khoros Community sources to index how-to videos, knowledge base articles and support cases, and Q&A forum discussions, respectively.
The Coveo Platform offers several Confluence connectors. After some investigation, you conclude that you need a Confluence (Server) source to index internal documentation from a Confluence Server 6.0.1 instance.
The public documentation site contains a sitemap, so a Sitemap (Cloud) source seems like the best choice to index content from that location.
Basecamp exposes a REST API which can apparently retrieve all of the project management data content your users need. Therefore, you will probably need a REST API source to index that content.
Employee data is available through a custom intranet which relies on a database. You could either use a Web (Crawling Module) source or a Database source to retrieve this content. However, since your users require item-level security (which a Web source doesn’t support), you opt for a Database source.

Content	Cloud-accessible	System	Secured	Required connector
How-to videos	Yes	YouTube	No	YouTube
Knowledge base articles	Yes	Salesforce	Yes	Salesforce
Support cases	Yes	Salesforce	Yes	Salesforce
Q&A forum discussions	Yes	Khoros Community	No	Khoros Community
Internal documentation	No	Confluence Server 6.0.1	Yes	Confluence (Server)
Public documentation	Yes	Static site	No	Sitemap (Cloud)
Project management data	Yes	Basecamp	Yes	REST API
Employee data	No	Custom intranet	Yes	Database
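
The selection rules described in this section can be sketched as a small decision function. This is an illustrative approximation, not Coveo's actual diagram: the `Repository` class, its attributes, the `choose_connector` function, and the order in which the generic connectors are tested are all assumptions. The `SPECIFIC_CONNECTORS` mapping only covers the systems named in this article.

```python
from dataclasses import dataclass

# Systems with a dedicated connector (assumption: limited to this article's examples).
SPECIFIC_CONNECTORS = {
    "YouTube": "YouTube",
    "Salesforce": "Salesforce",
    "Khoros Community": "Khoros Community",
    "Confluence Server": "Confluence (Server)",
}


@dataclass(frozen=True)
class Repository:
    """Hypothetical description of a content repository."""
    system: str
    secured: bool
    is_html_site: bool = False
    has_sitemap: bool = False
    has_api: bool = False
    is_database: bool = False
    is_file_server: bool = False


def choose_connector(repo: Repository) -> str:
    # A specific connector always wins when one is available.
    if repo.system in SPECIFIC_CONNECTORS:
        return SPECIFIC_CONNECTORS[repo.system]
    # Web and Sitemap sources don't support indexing permissions,
    # so they are only considered for content that isn't secured.
    if repo.is_html_site and not repo.secured:
        return "Sitemap" if repo.has_sitemap else "Web"
    if repo.has_api:
        return "REST API or GraphQL API"
    if repo.is_database:
        return "Database"
    if repo.is_file_server:
        return "File System"
    # Nothing else fits: fall back to a custom process and a Push source.
    return "Push"


intranet = Repository("Custom intranet", secured=True, is_html_site=True, is_database=True)
print(choose_connector(intranet))
```

In this sketch, the secured custom intranet resolves to a Database source because the Web option is ruled out by the permissions requirement, which matches the reasoning in the example above.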

Identify content to include or exclude

Almost every repository from which you need to index content will hold content that serves no purpose for your users.

Indexing irrelevant items would force you to write and maintain query-time filtering or demoting rules, which you should typically avoid. Populating hundreds of fields with inconsequential metadata would also be a questionable indexing strategy, as your index only supports a limited number of fields.

There are various ways to refine and enhance content before and during the indexing process. Rather than immediately indexing a content repository, you should therefore first determine what your users actually need.

Consider indexing content such as:

Database records
Knowledge base articles
MS Office files (Word, Excel, PowerPoint, etc.)
PDF files
Support case records
Technical documentation
User profiles

Avoid indexing content such as:

Archived data
Foreign key tables containing only IDs
Dump files
Log files

Example

Your users expect to be able to find up-to-date internal documentation in their search results.

However, you notice that the Confluence Server hosting this information also contains several archived spaces which no longer seem relevant. You confirm that none of your users are likely to search for that content, and therefore decide to ensure that it doesn’t land in your index at all.
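
The include and exclude decisions above can be modeled as a simple predicate while you plan your index. The sketch below is only a planning aid under stated assumptions: the item dictionaries, the `archived` flag, the `should_index` function, and the excluded file extensions are hypothetical, and how you actually apply such rules depends on the connector you use.

```python
# Hypothetical extensions for the log and dump files you want to keep out.
EXCLUDED_EXTENSIONS = (".log", ".dmp")


def should_index(item: dict) -> bool:
    """Return True when an item is worth indexing under the rules above."""
    # Archived content is no longer relevant to users.
    if item.get("archived", False):
        return False
    # Log and dump files serve no purpose in search results.
    name = item.get("name", "").lower()
    return not name.endswith(EXCLUDED_EXTENSIONS)


# Hypothetical inventory of items found in a repository.
items = [
    {"name": "Onboarding guide.docx", "archived": False},
    {"name": "2015 release notes.pdf", "archived": True},
    {"name": "server.log", "archived": False},
    {"name": "Troubleshooting.pdf", "archived": False},
]

to_index = [item["name"] for item in items if should_index(item)]
print(to_index)
```

Writing the rules down this way makes it easier to review them with stakeholders before you configure any source.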

What’s next?

The Apply indexing techniques article explains how you can ensure that your index only contains relevant content.