Preparing for Indexing

A well-managed index should only contain items that are relevant to some or all of its intended audience. Before you start indexing anything, you should therefore take time to determine exactly what you need to index and what connectors your search project requires.

This article provides guidelines and examples to help you undertake those essential preliminary steps.

Locate Content

Early in your search project, you must identify the various types of content your end users need (see Know Your End-User Expectations).

You also need to gather information to find the exact location of each of those types of content:

  • Is the content Cloud-accessible, or is it on-premises?

  • Is the content available through a specific software/platform (e.g., Jive, OneDrive, SharePoint)?

    • If so, what’s the software/platform version (if applicable)?

    • Otherwise, how would you describe the system through which the content is available?

  • Is the content secured, or is it publicly available?

Doing so will help you pinpoint the various connectors your project requires (see Select Appropriate Connectors).

You carry out interviews and shadowing sessions with potential end users to learn what content they expect to find in their search results.

After some investigation, you manage to locate this content.

Content                  | Cloud-accessible | System                  | Secured
How-to videos            | Yes              | YouTube                 | No
Knowledge base articles  | Yes              | Salesforce              | Yes
Support cases            | Yes              | Salesforce              | Yes
Q&A forum discussions    | Yes              | Khoros Community        | No
Internal documentation   | No               | Confluence server 6.0.1 | Yes
Public documentation     | Yes              | Static site             | No
Project management data  | Yes              | Basecamp                | Yes
Employee data            | No               | Custom intranet         | Yes
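
An inventory like this one can also be kept as structured data, which makes it easy to answer follow-up questions such as "which content is on-premises?" or "which content is secured?". The following is a minimal sketch only: the `ContentType` class and the helper functions are hypothetical names chosen for illustration and aren't part of any Coveo tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentType:
    """One row of the content inventory (hypothetical structure)."""
    name: str
    cloud_accessible: bool
    system: str
    secured: bool


# The rows of the example inventory above.
INVENTORY = [
    ContentType("How-to videos", True, "YouTube", False),
    ContentType("Knowledge base articles", True, "Salesforce", True),
    ContentType("Support cases", True, "Salesforce", True),
    ContentType("Q&A forum discussions", True, "Khoros Community", False),
    ContentType("Internal documentation", False, "Confluence server 6.0.1", True),
    ContentType("Public documentation", True, "Static site", False),
    ContentType("Project management data", True, "Basecamp", True),
    ContentType("Employee data", False, "Custom intranet", True),
]


def on_premises(inventory):
    """Names of content types that aren't Cloud-accessible."""
    return [c.name for c in inventory if not c.cloud_accessible]


def secured(inventory):
    """Names of content types that aren't publicly available."""
    return [c.name for c in inventory if c.secured]


def systems(inventory):
    """Distinct systems, in the order they first appear."""
    return list(dict.fromkeys(c.system for c in inventory))
```

Calling `on_premises(INVENTORY)` lists the content types that aren't Cloud-accessible, and `systems(INVENTORY)` gives the distinct repositories for which you need to find a connector.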

Select Appropriate Connectors

The following diagram illustrates the process of determining what connector to use to index content from a given repository.

[Diagram: Determining what connector to use]

When you must index content residing in a system for which a specific connector is available, you should use that connector (e.g., to index content from a Salesforce instance, you should use a Salesforce source).

Otherwise, consider using one of the following generic source connectors:

Sitemap or Web

To index an HTML site, or data that can be exported to HTML, consider using a Sitemap source (if a sitemap is available) or a Web source. Both sources feature a powerful web scraping tool allowing you to exclude unwanted HTML elements, extract metadata key-values, and generate sub-items.

A Sitemap source is generally preferred over a Web source since it supports refresh, which is a faster and more efficient update type. A Web source doesn’t support refresh; it therefore performs a daily rescan and indexes reachable pages in random order.

If necessary, you can configure a Sitemap or Web source to rely on the Coveo On-Premises Crawling Module. However, neither Web nor Sitemap sources support indexing permissions.


Generic REST API

To index data accessible through an exposed REST API, consider using a Generic REST API source. While fairly flexible, a Generic REST API source requires you to have a deep understanding of the REST API you base its configuration on (see Generic REST API Source Tutorial).

The Generic REST API connector uses Coveo Cloud-hosted crawlers. This connector type supports incremental refresh as well as indexing permissions.


Database or File Server

To index data from an on-premises ODBC database or from a Windows server, consider using a Database source or a File System source, respectively.

These connectors both rely on the Coveo On-Premises Crawling Module. They both support incremental refresh, as well as indexing permissions.


Finally, when none of the aforementioned connectors suits your needs, use a Push source.

A Push source supports indexing permissions and relies on a custom process to crawl content. This means that you must write, host, and execute your own custom connector to retrieve data (possibly including permission models and security identities), convert this data to JSON, and interact with the Push API directly to get this content into the index.

Behind the scenes, the Coveo On-Premises Crawling Module interacts with the Push API directly, as do Sitecore sources.
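
To give a rough idea of the "convert this data to JSON" step of a custom connector, here is a minimal sketch. Everything in it is an assumption made for illustration: the `to_json_item` function, the shape of the input record, and the output keys (`id`, `title`, `body`, `allowed_users`) are hypothetical and don't reflect the actual Push API schema, which you should take from the Push API reference.

```python
import json


def to_json_item(record):
    """Serialize one source record (a dict) as a JSON string.

    The keys below are hypothetical placeholders, not Push API field names.
    """
    item = {
        "id": record["url"],        # a stable, unique identifier
        "title": record["title"],
        "body": record["body"],
        # Who may see the item; empty when the record carries no permissions.
        "allowed_users": sorted(record.get("allowed_users", [])),
    }
    return json.dumps(item, ensure_ascii=False)


# Example record, as a custom crawler might produce it (hypothetical data).
example = to_json_item({
    "url": "https://intranet.example.com/employees/42",
    "title": "Employee 42",
    "body": "Profile text",
    "allowed_users": ["hr@example.com", "admin@example.com"],
})
```

A real connector would then send each serialized item to the Push API and handle authentication, batching, and errors, none of which this sketch covers.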

Using the list you have drafted before (see Locate Content), you move on to determine what connectors/sources are required for your search project.

  • You quickly determine that you need YouTube, Salesforce, and Khoros Community sources to index how-to videos, knowledge base articles/support cases, and Q&A forum discussions respectively.

  • The Coveo Platform offers several Confluence connectors. After some investigation, you conclude that you need a Confluence (Server) source to index internal documentation from a Confluence server 6.0.1.

  • The public documentation site contains a sitemap. Therefore, a Sitemap (Cloud) source seems like the best choice to index content from that location.

  • Basecamp exposes a REST API which can apparently retrieve all of the project management data your end users need. Therefore, you will probably need a Generic REST API source to index that content.

  • Employee data is available through a custom intranet relying on an ODBC database. You could use either a Web (Crawling Module) source or a Database source to retrieve this content. However, since your end users require item-level security (which a Web source doesn’t support), you opt for a Database source.

Content                  | Cloud-accessible | System                  | Secured | Required connector
How-to videos            | Yes              | YouTube                 | No      | YouTube
Knowledge base articles  | Yes              | Salesforce              | Yes     | Salesforce
Support cases            | Yes              | Salesforce              | Yes     | Salesforce
Q&A forum discussions    | Yes              | Khoros Community        | No      | Khoros Community
Internal documentation   | No               | Confluence server 6.0.1 | Yes     | Confluence (Server)
Public documentation     | Yes              | Static site             | No      | Sitemap (Cloud)
Project management data  | Yes              | Basecamp                | Yes     | Generic REST API
Employee data            | No               | Custom intranet         | Yes     | Database
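
The selection guidelines in this section can be summarized as a small decision function. This is a sketch under stated assumptions rather than an official algorithm: the `choose_connector` function and its parameters are hypothetical, and the order in which the generic connectors are checked simply follows the order in which this article presents them.

```python
def choose_connector(
    dedicated=None,
    is_html=False,
    has_sitemap=False,
    has_rest_api=False,
    is_odbc_database=False,
    is_windows_file_server=False,
    needs_permissions=False,
):
    """Return a connector name for one repository (hypothetical helper)."""
    # A connector built for the system always wins.
    if dedicated:
        return dedicated
    # Sitemap and Web sources don't index permissions, so skip them
    # when the content must stay secured.
    if is_html and not needs_permissions:
        return "Sitemap" if has_sitemap else "Web"
    if has_rest_api:
        return "Generic REST API"
    if is_odbc_database:
        return "Database"
    if is_windows_file_server:
        return "File System"
    # Nothing else fits: write a custom connector on top of the Push API.
    return "Push"


# Three repositories from the example scenario.
public_docs = choose_connector(is_html=True, has_sitemap=True)
basecamp = choose_connector(has_rest_api=True, needs_permissions=True)
employee_data = choose_connector(
    is_html=True, is_odbc_database=True, needs_permissions=True
)
```

With these inputs, the public documentation maps to a Sitemap source, Basecamp to a Generic REST API source, and the employee data to a Database source, matching the table above.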

Identify Content to Include/Exclude

Almost every repository from which you need to index content will hold content that serves no purpose for your end users.

Indexing irrelevant items would force you to write and maintain query-time filtering/demoting rules, which you should typically avoid. Populating hundreds of fields with inconsequential metadata would also be a questionable indexing strategy, as your index only supports a limited number of fields.

There are various ways to refine/enhance content before and during the indexing process (see Indexing Techniques). Rather than immediately indexing a content repository, you should therefore first determine what your end users actually need.

Consider indexing content such as:

  • Database records

  • Knowledge base articles

  • MS Office files (Word, Excel, PowerPoint, etc.)

  • PDF files

  • Support case records

  • Technical documentation

  • User profiles

Typically avoid indexing content such as:

  • Archived data

  • Foreign key tables containing only IDs

  • Dump files

  • Log files

Your end users expect to be able to find up-to-date internal documentation in their search results.

However, you notice that the Confluence server hosting this information also contains several archived spaces which no longer seem relevant. You confirm that none of your end users are likely to search for that content, and therefore decide to ensure that it doesn’t land in your index at all.
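
When drawing up your inclusion/exclusion list, it can help to write the rules down in a form you can test against sample item paths before you configure anything. The sketch below is only one way to do so: the `should_index` function, the patterns, and the sample paths are all hypothetical, and the way exclusions are actually configured depends on the source type.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion rules: archived spaces, log files, and dump files.
EXCLUDE_PATTERNS = ["*/archive/*", "*.log", "*.dump"]


def should_index(path, exclude_patterns=EXCLUDE_PATTERNS):
    """Return True unless the path matches one of the exclusion patterns.

    The path is lower-cased first so that matching is case-insensitive.
    """
    lowered = path.lower()
    return not any(fnmatchcase(lowered, p) for p in exclude_patterns)


# Hypothetical sample paths.
samples = [
    "/spaces/docs/install-guide",
    "/spaces/ARCHIVE/2015-release-notes",
    "/exports/server.log",
]
kept = [p for p in samples if should_index(p)]
```

Here, only the first sample path is kept; the archived page and the log file are excluded.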

What’s Next?

The next article in this section covers the various indexing techniques you can apply to ensure that your index only contains relevant content (see Applying Indexing Techniques).
