Preparing for Indexing
A well-managed index should only contain items that are relevant to some or all of its intended audience. Before you start indexing anything, you should therefore take the time to determine exactly what you need to index and which connectors your search project requires.
This article provides guidelines and examples to help you undertake those essential preliminary steps.
Locate Content
Early in your search project, you must identify the various types of content your end users need (see Know Your End-User Expectations).
You also need to gather information to find the exact location of each of those types of content:
Is the content Cloud-accessible, or is it on-premises?
Is the content available through a specific software/platform (e.g., Jive, OneDrive, or SharePoint)?
If so, what’s the software/platform version (if applicable)?
Otherwise, how would you describe the system through which the content is available?
Is the content secured, or is it publicly available?
Doing so will help you pinpoint the various connectors your project requires (see Select Appropriate Connectors).
You carry out interviews and shadowing sessions with potential end users to learn what content they expect to find in their search results.
After some investigation, you manage to locate this content.
|Content type||Cloud-accessible||Software/platform||Secured|
|Knowledge base articles||Yes||Salesforce||Yes|
|Q&A forum discussions||Yes||Khoros Community||No|
|Internal documentation||No||Confluence server 6.0.1||Yes|
|Public documentation||Yes||Static site||No|
|Project management data||Yes||Basecamp||Yes|
|Employee data||No||Custom intranet||Yes|
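The inventory above can also be kept as structured data, which makes it easy to query later. The following is a minimal sketch, assuming the four columns stand for the content type, whether the content is cloud-accessible, the system hosting it, and whether it is secured. The `ContentLocation` class and its field names are hypothetical and are not part of any Coveo API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentLocation:
    """One row of the content inventory (field names are illustrative)."""
    content_type: str
    cloud_accessible: bool
    system: str
    secured: bool


# Rows copied from the example inventory above.
INVENTORY = [
    ContentLocation("Knowledge base articles", True, "Salesforce", True),
    ContentLocation("Q&A forum discussions", True, "Khoros Community", False),
    ContentLocation("Internal documentation", False, "Confluence server 6.0.1", True),
    ContentLocation("Public documentation", True, "Static site", False),
    ContentLocation("Project management data", True, "Basecamp", True),
    ContentLocation("Employee data", False, "Custom intranet", True),
]


def on_premises(inventory):
    """Return the content types that are not cloud-accessible."""
    return [row.content_type for row in inventory if not row.cloud_accessible]


def secured(inventory):
    """Return the content types whose permissions must be respected."""
    return [row.content_type for row in inventory if row.secured]
```

Listing the on-premises and secured rows early makes it easier to spot which repositories are likely to constrain the choice of connector.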
Select Appropriate Connectors
The following diagram illustrates the process of determining what connector to use to index content from a given repository.
When you must index content residing in a system for which a specific connector is available, you should use that connector (e.g., to index content from a Salesforce instance, you should use a Salesforce source).
Otherwise, consider using one of the following generic source connectors:
Sitemap or Web
To index an HTML site, or data that can be exported to HTML, consider using a Sitemap source (if a sitemap is available) or a Web source. Both sources feature a powerful web scraping tool allowing you to exclude unwanted HTML elements, extract metadata key-values, and generate sub-items.
A Sitemap source is generally preferred over a Web source because it supports refresh, which is a faster and more efficient update type. A Web source doesn’t support refresh; it therefore performs a daily rescan and indexes reachable pages in random order.
Generic REST API
To index data accessible through an exposed REST API, consider using a Generic REST API source. While fairly flexible, a Generic REST API source requires you to have a deep understanding of the REST API you base its configuration on (see Generic REST API Source Tutorial).
The Generic REST API connector uses Coveo Cloud-hosted crawlers. This connector type supports incremental refresh as well as indexing permissions.
Database or File Server
To index data from an on-premises ODBC database or from a Windows server, consider using a Database source or a File System source, respectively.
These connectors both rely on the Coveo On-Premises Crawling Module. They both support incremental refresh, as well as indexing permissions.
Finally, when none of the aforementioned connectors suits your needs, use a Push source.
A Push source supports indexing permissions and relies on a custom process to crawl content. This means that you must write, host, and execute your own custom connector to retrieve data (possibly including permission models and security identities), convert this data to JSON, and interact with the Push API directly to get this content into the index.
Behind the scenes, the Coveo On-Premises Crawling Module interacts with the Push API directly, as do Sitecore sources.
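The selection process described in this section can be summarized as a small decision function. This is only a sketch under stated assumptions: the `Repository` class, its flags, and the `DEDICATED_CONNECTORS` set are hypothetical and list just a few of the sources named in this article. The rule that skips Sitemap and Web sources when item-level security is required is stated in this article for Web sources only; applying it to Sitemap sources as well is an assumption.

```python
from dataclasses import dataclass

# Illustrative subset of systems that have a dedicated source (assumption).
DEDICATED_CONNECTORS = {"Salesforce", "Khoros Community", "YouTube"}


@dataclass(frozen=True)
class Repository:
    """Hypothetical description of a content repository."""
    system: str
    is_html_site: bool = False
    has_sitemap: bool = False
    has_rest_api: bool = False
    is_odbc_database: bool = False
    is_windows_file_server: bool = False
    needs_item_level_security: bool = False


def choose_source(repo):
    """Return the name of the source type suggested by the guidelines."""
    # 1. A dedicated connector always wins when one exists.
    if repo.system in DEDICATED_CONNECTORS:
        return repo.system
    # 2. HTML content: prefer Sitemap over Web, unless permissions are needed.
    if repo.is_html_site and not repo.needs_item_level_security:
        return "Sitemap" if repo.has_sitemap else "Web"
    # 3. Data exposed through a REST API.
    if repo.has_rest_api:
        return "Generic REST API"
    # 4. On-premises data reached through the Crawling Module.
    if repo.is_odbc_database:
        return "Database"
    if repo.is_windows_file_server:
        return "File System"
    # 5. Nothing else fits: write a custom process on top of the Push API.
    return "Push"
```

For example, `choose_source(Repository("Basecamp", has_rest_api=True))` returns `"Generic REST API"`, and a repository that matches none of the checks falls through to `"Push"`.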
Using the list you drafted earlier (see Locate Content), you move on to determine which connectors/sources your search project requires.
You quickly determine that you need YouTube, Salesforce, and Khoros Community sources to index how-to videos, knowledge base articles/support cases, and Q&A forum discussions respectively.
The Coveo Platform offers several Confluence connectors. After some investigation, you conclude that you need a Confluence (Server) source to index internal documentation from a Confluence server 6.0.1.
The public documentation site contains a sitemap. Therefore, a Sitemap (Cloud) source seems like the best choice to index content from that location.
Basecamp exposes a REST API which can apparently retrieve all of the project management data your end users need. Therefore, you will probably need a Generic REST API source to index that content.
Employee data is available through a custom intranet relying on an ODBC database. You could use either a Web (Crawling Module) source or a Database source to retrieve this content. However, since your end users require item-level security (which a Web source doesn’t support), you opt for a Database source.
|Content type||Cloud-accessible||Software/platform||Secured||Required source|
|Knowledge base articles||Yes||Salesforce||Yes||Salesforce|
|Q&A forum discussions||Yes||Khoros Community||No||Khoros Community|
|Internal documentation||No||Confluence server 6.0.1||Yes||Confluence (Server)|
|Public documentation||Yes||Static site||No||Sitemap (Cloud)|
|Project management data||Yes||Basecamp||Yes||Generic REST API|
|Employee data||No||Custom intranet||Yes||Database|
Identify Content to Include/Exclude
Almost every repository from which you need to index content will hold content that serves no purpose for your end users.
Indexing irrelevant items would force you to write and maintain query-time filtering/demoting rules, which you should typically avoid. Populating hundreds of fields with inconsequential metadata would also be a questionable indexing strategy, as your index only supports a limited number of fields.
There are various ways to refine/enhance content before and during the indexing process (see Indexing Techniques). Rather than immediately indexing a content repository, you should therefore first determine what your end users actually need.
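One way to picture the point about metadata is an allow-list that keeps only the keys your end users actually need. The sketch below is illustrative only: the `ALLOWED_METADATA` keys and the `select_metadata` helper are hypothetical, and this is not how the Coveo Platform itself maps metadata to fields.

```python
# Hypothetical set of metadata keys worth mapping to index fields.
ALLOWED_METADATA = {"title", "author", "product", "lastmodified"}


def select_metadata(raw, allowed=ALLOWED_METADATA):
    """Keep only the metadata entries whose key is on the allow-list.

    Keys are compared case-insensitively; original key casing is preserved.
    """
    return {key: value for key, value in raw.items() if key.lower() in allowed}
```

Deciding on such a list before indexing keeps the number of populated fields small and avoids cleaning up after the fact.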
Consider indexing content such as:
Knowledge base articles
MS Office files (Word, Excel, PowerPoint, etc.)
Support case records
Typically avoid indexing content such as:
Foreign key tables containing only IDs
Your end users expect to be able to find up-to-date internal documentation in their search results.
However, you notice that the Confluence server hosting this information also contains several archived spaces which no longer seem relevant. You confirm that none of your end users are likely to search for that content, and therefore decide to ensure that it doesn’t land in your index at all.
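The exclusion decision in this example boils down to a simple rule, sketched below. The `Item` class and the space keys in `ARCHIVED_SPACES` are hypothetical. How you actually keep such content out of the index depends on the source configuration and on the indexing techniques covered in the next article, so treat this as an illustration of the rule rather than an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical representation of a page retrieved from a repository."""
    title: str
    space: str


# Hypothetical keys of the archived spaces identified during the review.
ARCHIVED_SPACES = frozenset({"OLD-PRODUCT", "ARCHIVE-2015"})


def should_index(item, excluded_spaces=ARCHIVED_SPACES):
    """Return True when the item does not belong to an excluded space."""
    return item.space not in excluded_spaces


def filter_items(items, excluded_spaces=ARCHIVED_SPACES):
    """Keep only the items that should reach the index."""
    return [item for item in items if should_index(item, excluded_spaces)]
```

Writing the rule down explicitly, even informally, makes it easier to confirm with end users that nothing they need is being left out.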
The next article in this section covers the various indexing techniques you can apply to ensure that your index only contains relevant content (see Applying Indexing Techniques).