Prepare to index content

A well-managed index should only contain items that are relevant to some or all of its intended audience. Before you start indexing anything, you should therefore take time to determine exactly what you need to index and which connectors your search project requires.

This article provides guidelines and examples to help you undertake those essential preliminary steps.

Locate content

Early in your search project, you must identify the various types of content your users need.

You also need to gather information to find the exact location of each of those types of content:

  • Is the content Cloud-accessible, or is it on-premises?

  • Is the content available through a specific application or platform (for example, Jive or SharePoint)?

    • If so, what’s the application or platform version (if applicable)?

    • If not, how would you describe the system through which the content is available?

  • Is the content secured, or is it publicly available?

Doing this will help you pinpoint the various connectors that your project requires.

Example content list

You carry out interviews and shadowing sessions with potential users to learn what content they expect to find in their search results.

After some investigation, you put together the following content list:

Content                 | Cloud-accessible | System                  | Secured
------------------------|------------------|-------------------------|--------
How-to videos           | Yes              | YouTube                 | No
Knowledge base articles | Yes              | Salesforce              | Yes
Support cases           | Yes              | Salesforce              | Yes
Q&A forum discussions   | Yes              | Khoros Community        | No
Internal documentation  | No               | Confluence Server 6.0.1 | Yes
Public documentation    | Yes              | Static site             | No
Project management data | Yes              | Basecamp                | Yes
Employee data           | No               | Custom intranet         | Yes

Select the appropriate connectors

The following diagram illustrates the process of determining what connector to use to index content from a given repository.

[Diagram: Determining which connector to use]

When you must index content residing in a system for which a specific connector is available, you should use that connector (for example, to index content from a Salesforce instance, you should use a Salesforce source).

Otherwise, consider using one of the following generic source connectors:

Sitemap or Web source

To index an HTML site, or data that can be exported to HTML, consider using a Sitemap source (if a sitemap is available) or a Web source. Both feature a powerful web scraping tool that lets you exclude unwanted HTML elements, extract metadata key-values, and generate sub-items.

We recommend using Sitemap sources because they support refresh, which is a faster and more efficient update operation. Web sources don’t support refresh, so they must perform a daily rescan which indexes reachable pages in random order.

If necessary, you can configure either of these connectors to rely on the Coveo Crawling Module. However, neither Web nor Sitemap sources support indexing permissions.

REST or GraphQL API source

To index data that’s accessible through an exposed API, consider using a REST API or GraphQL API source. While this connector is quite flexible, you must have a deep understanding of the API upon which you base its configuration.

The API sources use Coveo-hosted crawlers. They support incremental refresh and indexing permissions.

Database or File System source

To index data from an on-premises database or a Windows server, consider using a Database or File System source respectively.

These connectors rely on the Coveo Crawling Module. They both support incremental refresh and indexing permissions.

Note

Behind the scenes, the Coveo Crawling Module interacts with the Push API directly, as do Sitecore sources.

Push source

If none of the existing specific or generic connectors meet your needs, then you will have to use a Push source to index your content. It supports indexing permissions and relies on a custom process to crawl content.

To index your content with a Push source, you will have to write, host, and execute your own custom connector to retrieve data, which may include permission models and security identities. You will also have to convert this data to JSON.

You can interact with a Push source in the following ways:

  • Use the Coveo CLI source:push commands (add, delete, and new)

  • Use one of the open source SDKs (Python, NodeJS, or Java)

  • Use the exposed Push API services to get your content into the index
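As a rough illustration of the "convert this data to JSON" step, the following Python sketch turns a record retrieved by a custom connector into a JSON string. The record shape, the `to_push_payload` helper, and the payload keys (`documentId`, `title`, `data`, `allowedUsers`) are assumptions made for this example only. They aren't taken from the Push API schema, so check the Push API reference for the actual item format before sending anything.

```python
import json


def to_push_payload(record: dict) -> str:
    """Convert a source record into a JSON string (hypothetical payload shape)."""
    payload = {
        # Hypothetical keys: the real Push API item schema may differ.
        "documentId": record["url"],
        "title": record["title"],
        "data": record["body"],
        # Sort the users so that the output is deterministic.
        "allowedUsers": sorted(record.get("allowed_users", [])),
    }
    return json.dumps(payload, ensure_ascii=False)


# Hypothetical record, as a custom crawler might retrieve it from an intranet.
record = {
    "url": "https://intranet.example.com/employees/42",
    "title": "Jane Doe",
    "body": "Senior technical writer, Documentation team.",
    "allowed_users": {"hr@example.com", "jane.doe@example.com"},
}

payload_json = to_push_payload(record)
print(payload_json)
```

Keeping the conversion in one small function makes it easier to adjust the mapping once you know which metadata and permissions your index actually needs.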

Example

You use the content list that you drafted to figure out which connectors or sources you require for your search project.

  • You determine that you need YouTube, Salesforce, and Khoros Community sources to index how-to videos, knowledge base articles and support cases, and Q&A forum discussions respectively.

  • The Coveo Platform offers several Confluence connectors. After some investigation, you conclude that you need a Confluence (Server) source to index internal documentation from a Confluence Server 6.0.1.

  • The public documentation site contains a sitemap, so a Sitemap (Cloud) source seems like the best choice to index content from that location.

  • Basecamp exposes a REST API which can apparently retrieve all of the project management data your users need. Therefore, you will probably need a REST API source to index that content.

  • Employee data is available through a custom intranet which relies on a database. You could use either a Web (Crawling Module) source or a Database source to retrieve this content. However, since your users require item-level security (which a Web source doesn’t support), you opt for a Database source.

Content                 | Cloud-accessible | System                  | Secured | Required connector
------------------------|------------------|-------------------------|---------|--------------------
How-to videos           | Yes              | YouTube                 | No      | YouTube
Knowledge base articles | Yes              | Salesforce              | Yes     | Salesforce
Support cases           | Yes              | Salesforce              | Yes     | Salesforce
Q&A forum discussions   | Yes              | Khoros Community        | No      | Khoros Community
Internal documentation  | No               | Confluence Server 6.0.1 | Yes     | Confluence (Server)
Public documentation    | Yes              | Static site             | No      | Sitemap (Cloud)
Project management data | Yes              | Basecamp                | Yes     | REST API
Employee data           | No               | Custom intranet         | Yes     | Database
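The selection reasoning in this section can be loosely sketched in Python. This is a simplified approximation, not a reproduction of the diagram: the `ContentSource` fields, the `SPECIFIC_CONNECTORS` mapping, and the order of the checks are assumptions made for illustration, and a real project will usually involve judgment that a few `if` statements can't capture.

```python
from dataclasses import dataclass

# Systems assumed to have a dedicated connector (taken from the example above).
SPECIFIC_CONNECTORS = {
    "YouTube": "YouTube",
    "Salesforce": "Salesforce",
    "Khoros Community": "Khoros Community",
    "Confluence Server 6.0.1": "Confluence (Server)",
}


@dataclass
class ContentSource:
    """One row of the content list, plus hypothetical capability flags."""
    name: str
    system: str
    cloud_accessible: bool
    secured: bool
    has_sitemap: bool = False
    is_html_site: bool = False
    has_api: bool = False
    has_database: bool = False


def select_connector(src: ContentSource) -> str:
    # 1. Prefer a specific connector when one exists for the system.
    if src.system in SPECIFIC_CONNECTORS:
        return SPECIFIC_CONNECTORS[src.system]
    # 2. Web and Sitemap sources don't index permissions,
    #    so only consider them for content that isn't secured.
    if not src.secured and (src.has_sitemap or src.is_html_site):
        kind = "Sitemap" if src.has_sitemap else "Web"
        mode = "Cloud" if src.cloud_accessible else "Crawling Module"
        return f"{kind} ({mode})"
    # 3. An exposed API can be crawled with an API source.
    if src.has_api:
        return "REST API"
    # 4. On-premises database content.
    if src.has_database:
        return "Database"
    # 5. Nothing else fits: fall back to a custom Push source.
    return "Push"


sources = [
    ContentSource("Public documentation", "Static site", True, False,
                  has_sitemap=True, is_html_site=True),
    ContentSource("Project management data", "Basecamp", True, True,
                  has_api=True),
    ContentSource("Employee data", "Custom intranet", False, True,
                  is_html_site=True, has_database=True),
]

for src in sources:
    print(f"{src.name}: {select_connector(src)}")
```

Note how the employee data ends up with a Database source even though the intranet is an HTML site: because the content is secured, the sketch skips the Web and Sitemap options, mirroring the reasoning in the example.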

Identify content to include or exclude

Almost every repository from which you need to index content will hold content that serves no purpose for your users.

Indexing irrelevant items would force you to write and maintain query-time filtering or demoting rules, which you should typically avoid. Populating hundreds of fields with inconsequential metadata would also be a questionable indexing strategy, as your index only supports a limited number of fields.

There are various ways to refine and enhance content before and during the indexing process. Rather than immediately indexing a content repository, you should therefore first determine what your users actually need.

Important

Consider indexing content such as:

  • Database records

  • Knowledge base articles

  • MS Office files (Word, Excel, PowerPoint, etc.)

  • PDF files

  • Support case records

  • Technical documentation

  • User profiles

Warning

Avoid indexing content such as:

  • Archived data

  • Foreign key tables containing only IDs

  • Dump files

  • Log files

Example

Your users expect to be able to find up-to-date internal documentation in their search results.

However, you notice that the Confluence Server hosting this information also contains several archived spaces which no longer seem relevant. You confirm that none of your users are likely to search for that content, and therefore decide to ensure that it doesn’t land in your index at all.
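The idea behind this decision can be illustrated with a short Python sketch that keeps only pages outside a set of excluded spaces. The page dictionaries, the space keys, and the `should_index` helper are hypothetical. In practice, you would express this kind of exclusion through whatever configuration your connector offers rather than through custom code.

```python
def should_index(page: dict, excluded_spaces: set[str]) -> bool:
    """Return True when the page doesn't belong to an excluded space."""
    return page["space"] not in excluded_spaces


# Hypothetical pages, as a crawler might list them.
pages = [
    {"title": "Onboarding guide", "space": "HR"},
    {"title": "2015 release plan", "space": "ARCHIVE"},
    {"title": "API style guide", "space": "ENG"},
]

# Hypothetical key of an archived space that users no longer need.
excluded = {"ARCHIVE"}

to_index = [page for page in pages if should_index(page, excluded)]
print([page["title"] for page in to_index])
```

Filtering before indexing, rather than at query time, keeps the archived items out of the index entirely, which is the outcome the example aims for.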

What’s next?

The Apply indexing techniques article explains how you can ensure that your index only contains relevant content.