Prepare to index content
A well-managed index should only contain items that are relevant to some or all of its intended audience. Before you start indexing anything, take time to determine exactly what you need to index and which connectors your search project requires.
This article provides guidelines and examples to help you undertake those essential preliminary steps.
Locate content
Early in your search project, you must identify the various types of content your users need.
You also need to gather information to find the exact location of each of those types of content:
- Is the content Cloud-accessible, or is it on-premises?
- Is the content available through a specific application or platform (for example, Jive or SharePoint)?
  - If so, what’s the application or platform version (if applicable)?
  - If not, how would you describe the system through which the content is available?
- Is the content secured, or is it publicly available?
Doing this will help you pinpoint the various connectors that your project requires.
Example content list
You carry out interviews and shadowing sessions with potential users to learn what content they expect to find in their search results.
After some investigation, you put together the following content list:
Content | Cloud-accessible | System | Secured |
---|---|---|---|
How-to videos | Yes | YouTube | No |
Knowledge base articles | Yes | Salesforce | Yes |
Support cases | Yes | Salesforce | Yes |
Q&A forum discussions | Yes | Khoros Community | No |
Internal documentation | No | Confluence Server 6.0.1 | Yes |
Public documentation | Yes | Static site | No |
Project management data | Yes | Basecamp | Yes |
Employee data | No | Custom intranet | Yes |
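If you prefer to keep the content list in a machine-readable form, the following minimal sketch records the same rows as data. The `ContentEntry` class and the `secured_on_premises` helper are hypothetical names introduced for illustration only; they aren't part of any Coveo tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentEntry:
    """One row of the example content list (hypothetical structure)."""
    content: str
    cloud_accessible: bool
    system: str
    secured: bool


CONTENT_LIST = [
    ContentEntry("How-to videos", True, "YouTube", False),
    ContentEntry("Knowledge base articles", True, "Salesforce", True),
    ContentEntry("Support cases", True, "Salesforce", True),
    ContentEntry("Q&A forum discussions", True, "Khoros Community", False),
    ContentEntry("Internal documentation", False, "Confluence Server 6.0.1", True),
    ContentEntry("Public documentation", True, "Static site", False),
    ContentEntry("Project management data", True, "Basecamp", True),
    ContentEntry("Employee data", False, "Custom intranet", True),
]


def secured_on_premises(entries):
    """Return the entries that are both secured and hosted on-premises."""
    return [e for e in entries if e.secured and not e.cloud_accessible]


for entry in secured_on_premises(CONTENT_LIST):
    print(f"{entry.content}: {entry.system}")
```

Listing the secured, on-premises entries early is useful because those are the ones that need both permission indexing and local access to the repository.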
Select the appropriate connectors
The following diagram illustrates the process of determining what connector to use to index content from a given repository.
When you must index content residing in a system for which a specific connector is available, you should use that connector (for example, to index content from a Salesforce instance, you should use a Salesforce source).
Otherwise, consider using one of the following generic source connectors:
Sitemap or Web source
To index an HTML site, or data that can be exported to HTML, consider using a Sitemap source (if a sitemap is available) or a Web source. Both feature a powerful web scraping tool that lets you exclude unwanted HTML elements, extract metadata key-values, and generate sub-items.
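To get a feel for which metadata key-values a page exposes before you configure a source, you can inspect its meta tags yourself. The sketch below is a standalone illustration that uses only the Python standard library; it isn't the web scraping tool of the Web or Sitemap source, and `MetaTagExtractor` and `extract_meta` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip meta tags without a name/content pair, such as <meta charset="...">.
        if name and content is not None:
            self.metadata[name] = content


def extract_meta(html):
    """Return a dict of meta tag names mapped to their content."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return parser.metadata


sample_page = """
<html><head>
  <meta charset="utf-8">
  <meta name="author" content="Jane Doe">
  <meta name="product" content="Widget">
</head><body><p>Hello</p></body></html>
"""
print(extract_meta(sample_page))
```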
We recommend using Sitemap sources because they support refresh, which is a faster and more efficient update operation than a rescan. Web sources don’t support refresh, so they must perform a daily rescan, which indexes reachable pages in random order.
If necessary, you can configure either of these connectors to rely on the Coveo Crawling Module. However, neither Web nor Sitemap sources support indexing permissions.
See also:
- (Sitemap source) Index XML sitemap metadata
- (Web source) Populate a field with meta tag content
REST or GraphQL API source
To index data that’s accessible through an exposed API, consider using a REST API or GraphQL API source. While this connector is quite flexible, you must have a deep understanding of the API upon which you base its configuration.
The API sources use Coveo-hosted crawlers. They support incremental refresh and indexing permissions.
See also:
- For the REST API source:
- For the GraphQL API source:
Database or File System source
To index data from an on-premises database or a Windows server, consider using a Database or a File System source, respectively.
These connectors rely on the Coveo Crawling Module. They both support incremental refresh and indexing permissions.
Note
Behind the scenes, the Coveo Crawling Module interacts with the Push API directly, as do Sitecore sources.
See also:
Push source
If none of the existing specific or generic connectors meet your needs, then you will have to use a Push source to index your content. It supports indexing permissions and relies on a custom process to crawl content.
To index your content with a Push source, you will have to write, host, and execute your own custom connector to retrieve data, which may include permission models and security identities. You will also have to convert this data to JSON.
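As a rough illustration of the conversion step, the sketch below turns one record retrieved by a custom connector into a JSON item that carries permission information. The record shape and the `to_push_item` function are hypothetical, and the JSON property names (`title`, `data`, `permissions`, `allowAnonymous`, `allowedPermissions`, `identity`, `identityType`) are assumptions to verify against the Push API reference before use. The sketch doesn't send anything to the Push API.

```python
import json


def to_push_item(record):
    """Convert a source record (hypothetical shape) into a JSON item body."""
    item = {
        "title": record["title"],
        "data": record["body"],
        # Assumed permission model: only the listed users may see the item.
        "permissions": [
            {
                "allowAnonymous": False,
                "allowedPermissions": [
                    {"identity": user, "identityType": "User"}
                    for user in record["allowed_users"]
                ],
            }
        ],
    }
    return json.dumps(item, indent=2)


example_record = {
    "title": "Onboarding checklist",
    "body": "Steps every new employee should follow.",
    "allowed_users": ["alice@example.com", "bob@example.com"],
}
print(to_push_item(example_record))
```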
You can interact with a Push source in the following ways:
Example
You use the content list that you drafted to figure out which connectors or sources you require for your search project.
- You determine that you need YouTube, Salesforce, and Khoros Community sources to index how-to videos, knowledge base articles and support cases, and Q&A forum discussions, respectively.
- The Coveo Platform offers several Confluence connectors. After some investigation, you conclude that you need a Confluence (Server) source to index internal documentation from a Confluence Server 6.0.1 instance.
- The public documentation site contains a sitemap, so a Sitemap (Cloud) source seems like the best choice to index content from that location.
- Basecamp exposes a REST API which can apparently retrieve all of the project management data your users need. Therefore, you will probably need a REST API source to index that content.
- Employee data is available through a custom intranet which relies on a database. You could use either a Web (Crawling Module) source or a Database source to retrieve this content. However, since your users require item-level security (which a Web source doesn’t support), you opt for a Database source.
Content | Cloud-accessible | System | Secured | Required connector |
---|---|---|---|---|
How-to videos | Yes | YouTube | No | YouTube |
Knowledge base articles | Yes | Salesforce | Yes | Salesforce |
Support cases | Yes | Salesforce | Yes | Salesforce |
Q&A forum discussions | Yes | Khoros Community | No | Khoros Community |
Internal documentation | No | Confluence Server 6.0.1 | Yes | Confluence (Server) |
Public documentation | Yes | Static site | No | Sitemap (Cloud) |
Project management data | Yes | Basecamp | Yes | REST API |
Employee data | No | Custom intranet | Yes | Database |
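The connector selection reasoning described in this section can be sketched as a small decision function. This is a simplified illustration under stated assumptions: the `Repository` class, its attributes, and `select_connector` are hypothetical names, and `SPECIFIC_CONNECTORS` only lists the systems mentioned in this article, not every available connector.

```python
from dataclasses import dataclass

# Systems from this article that have a dedicated connector (not exhaustive).
SPECIFIC_CONNECTORS = {
    "YouTube": "YouTube",
    "Salesforce": "Salesforce",
    "Khoros Community": "Khoros Community",
    "Confluence Server": "Confluence (Server)",
}


@dataclass(frozen=True)
class Repository:
    """Hypothetical description of a content repository."""
    system: str
    is_html_site: bool = False
    has_sitemap: bool = False
    has_rest_api: bool = False
    has_graphql_api: bool = False
    is_database: bool = False
    is_file_server: bool = False
    needs_permissions: bool = False


def select_connector(repo):
    """Pick a connector by following the order described in this article."""
    # 1. Prefer a specific connector when one exists for the system.
    if repo.system in SPECIFIC_CONNECTORS:
        return SPECIFIC_CONNECTORS[repo.system]
    # 2. Web and Sitemap sources don't support indexing permissions.
    if repo.is_html_site and not repo.needs_permissions:
        return "Sitemap" if repo.has_sitemap else "Web"
    # 3. Exposed APIs.
    if repo.has_rest_api:
        return "REST API"
    if repo.has_graphql_api:
        return "GraphQL API"
    # 4. On-premises databases and file servers (through the Crawling Module).
    if repo.is_database:
        return "Database"
    if repo.is_file_server:
        return "File System"
    # 5. Nothing fits: write a custom connector for a Push source.
    return "Push"


intranet = Repository("Custom intranet", is_html_site=True,
                      is_database=True, needs_permissions=True)
print(select_connector(intranet))
```

In the employee data case, the intranet is an HTML site, but the permission requirement rules out a Web source, so the function falls through to the Database source, as in the example above.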
Identify content to include or exclude
Almost every repository from which you need to index content will hold content that serves no purpose for your users.
Indexing irrelevant items would force you to write and maintain query-time filtering or demoting rules, which you should typically avoid. Populating hundreds of fields with inconsequential metadata would also be a questionable indexing strategy, as your index only supports a limited number of fields.
There are various ways to refine and enhance content before and during the indexing process. Rather than immediately indexing a content repository, you should therefore first determine what your users actually need.
Consider indexing content such as:
Avoid indexing content such as:
Your users expect to be able to find up-to-date internal documentation in their search results.
However, you notice that the Confluence Server hosting this information also contains several archived spaces which no longer seem relevant. You confirm that none of your users are likely to search for that content, and therefore decide to ensure that it doesn’t land in your index at all.
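The decision in this scenario amounts to an exclusion rule applied before content reaches the index. The sketch below illustrates that rule on a hypothetical list of items; the item shape, the space names, and the `keep_item` function are assumptions made for illustration, and in practice you would express the exclusion through the source configuration rather than in your own code.

```python
# Hypothetical names of the archived spaces you decided to exclude.
EXCLUDED_SPACES = {"Archive 2015", "Legacy product"}


def keep_item(item, excluded_spaces=EXCLUDED_SPACES):
    """Return True when the item belongs to a space that should be indexed."""
    return item["space"] not in excluded_spaces


candidate_items = [
    {"title": "Deployment guide", "space": "Engineering"},
    {"title": "Old release notes", "space": "Archive 2015"},
    {"title": "Vacation policy", "space": "Human Resources"},
]

to_index = [item for item in candidate_items if keep_item(item)]
print([item["title"] for item in to_index])
```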
What’s next?
The Apply indexing techniques article explains how you can ensure that your index only contains relevant content.