Enable optical character recognition

When optical character recognition (OCR) is enabled for a source, Coveo extracts text from image files or PDF files containing images (that is, scanned or illustrated documents) as they go through the indexing pipeline. OCR-extracted text is processed as item data, meaning that it’s searchable and appears in the item Quick view.

Examples
  • Your Dropbox account contains a picture of a city skyline in which there’s a billboard reading Hoover Dam and Grand Canyon Tours. Since you’ve enabled OCR for this source, this image is analyzed by the OCR feature during the indexing process.

    As a result, when you type Grand Canyon in a Coveo-powered search interface, this picture is part of your search results. This item’s Quick view contains the text extracted from the picture: Hoover Dam and Grand Canyon Tours.

  • Your company is digitizing its archives. Hundreds of pages of documents are therefore scanned and saved as PDF files. To make these documents searchable, Coveo needs to access the text they contain. You enable OCR for this source so that when an end user enters keywords in a search interface, any matching PDF file is returned in the search results.

Supported file formats

OCR can be enabled for sources that are likely to index images and PDF files.

The supported file formats are the following: JPG, PNG, BMP, SVG, and PDF.
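
To get a rough idea of which files in a repository would be OCR candidates before enabling the feature, a quick pre-check against the format list above can help. The following is a minimal sketch only: the helper name `is_ocr_candidate` and the sample file names are hypothetical, and matching on file extension is an assumption made for illustration, not a description of how Coveo detects file formats.

```python
from pathlib import Path

# Formats listed in this article as supported by OCR.
OCR_SUPPORTED_EXTENSIONS = {".jpg", ".png", ".bmp", ".svg", ".pdf"}


def is_ocr_candidate(filename: str) -> bool:
    """Return True if the file extension matches a supported OCR format.

    Hypothetical helper: matching on the extension is an assumption for
    illustration, not Coveo's actual format detection.
    """
    return Path(filename).suffix.lower() in OCR_SUPPORTED_EXTENSIONS


# Hypothetical sample file names.
files = ["skyline.JPG", "archive-1987.pdf", "notes.docx", "logo.svg"]
candidates = [name for name in files if is_ocr_candidate(name)]
print(candidates)
```

With the sample names above, only the JPG, PDF, and SVG files are kept; the DOCX file would skip the OCR stage.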

Limitations

  • All characters used in English, French, Spanish, and German are supported. Characters used in other languages may not be recognized properly.

  • Handwriting isn’t supported, although very clear handwriting may yield acceptable results.

Enable optical character recognition

Since the OCR feature is available at an extra charge, you must first contact Coveo Sales to add this feature to your organization license. You can then enable it for the desired source.

The OCR options are located in a source’s addition or modification panel, in the Configuration tab. Check the instructions specific to the desired source for details on their location.

Review OCR logs

Once you’ve enabled OCR for your source, you can use the Log Browser (platform-ca | platform-eu | platform-au) to review a list of the items that went through the OCR stage. In the Log Browser Resources facet, select OCR_EXTENSION to show logs for the OCR stage only. The possible results are the following:

Result       Description
Completed    OCR has been performed on this item. Expand the entry to show the process duration.
Skipped      The item skipped the OCR stage because its format isn’t supported.
Error        An error occurred during the OCR process. You can report it to the Coveo Support team for investigation.

See Review item logs for details on the Log Browser.

Impact on performance

Only images and PDF files go through the OCR stage of the indexing pipeline. For these items, expect an additional indexing time of 10 to 15 seconds per image or page. To minimize this impact on performance, keep OCR disabled for sources for which it’s not relevant.

Items of other types skip the OCR stage, and indexing performance is therefore not affected by these items.
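
The per-item figure above can be turned into a back-of-the-envelope estimate for a whole source. The sketch below is an illustration only: the function name `estimate_ocr_overhead` and the sample counts are hypothetical, and it assumes that the overhead simply adds up across images and PDF pages, which may not reflect how the indexing pipeline actually schedules work.

```python
# Overhead range stated in this article, in seconds per image or PDF page.
MIN_SECONDS_PER_UNIT = 10
MAX_SECONDS_PER_UNIT = 15


def estimate_ocr_overhead(image_count: int, pdf_page_count: int) -> tuple[int, int]:
    """Return a (low, high) estimate of the extra indexing time in seconds.

    Assumption: the overhead adds up linearly across images and PDF pages.
    """
    if image_count < 0 or pdf_page_count < 0:
        raise ValueError("counts must be non-negative")
    units = image_count + pdf_page_count
    return units * MIN_SECONDS_PER_UNIT, units * MAX_SECONDS_PER_UNIT


# Hypothetical source: 200 images and 1,000 scanned PDF pages.
low, high = estimate_ocr_overhead(image_count=200, pdf_page_count=1000)
print(f"Estimated extra indexing time: {low / 3600:.1f} to {high / 3600:.1f} hours")
```

Under these assumptions, 1,200 images and pages add between 12,000 and 18,000 seconds, that is, roughly 3.3 to 5 hours of indexing time.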