Enable Optical Character Recognition

When optical character recognition (OCR) is enabled for a source, Coveo Cloud extracts text from image files and/or PDF files containing images as they go through the indexing pipeline (see Optical Character Recognition and Coveo Cloud V2 Indexing Pipeline). OCR-extracted text is processed as item data, meaning that it is fully searchable and will appear in the item Quick View (see Search Result Quick View).

  • Your Dropbox account contains a picture of a city skyline in which there is a large billboard reading Hoover Dam and Grand Canyon Tours. Since you have enabled OCR for this source, this image is analyzed by the OCR feature during the indexing process.

    As a result, when you type Grand Canyon in a Coveo Cloud search interface, this picture appears in your search results. The item's Quick View contains the text extracted from the picture: Hoover Dam and Grand Canyon Tours.

  • Your company is digitizing its archives. Hundreds of pages of documents are therefore scanned and saved as PDF files. To make these documents searchable, Coveo needs to access the text they contain. You enable OCR for this source so that when an end user enters keywords in a search interface, any matching PDF file is returned in the search results.

Limitations

  • All characters used in English, French, Spanish, and German are supported. Characters used in other languages may not be recognized properly.
  • Only image files and PDF files (e.g., scanned or illustrated documents) are supported.
  • Handwriting is not supported, although very clear handwriting may yield acceptable results.

Enable Optical Character Recognition

Since the OCR feature is available at an extra charge, you must first contact Coveo Sales to add this feature to your organization license. You can then enable it for the desired source.

  1. Open the panel used to add or edit the desired source (see Available Coveo Cloud V2 Connectors).
  2. Under Optical character recognition (OCR), check the Make text found in images and PDF files searchable box, and then select the file types to analyze:
    • Select All non-text files if you want the OCR feature to analyze image files as well as PDF files that contain images.
    • Select Image files only if you want the OCR feature to analyze only files with a .jpg, .png, .bmp, or .svg extension.
    • Select PDF files with images only if you want the OCR feature to analyze scanned or illustrated PDF files only.
  3. Add or save your source.
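
The effect of the three options in step 2 can be pictured with a minimal sketch. This is a hypothetical model for illustration only, not Coveo's implementation: the function name, the option keys, and the pdf_has_images flag are assumptions, and the sketch assumes that "image files" means only the four extensions listed above.

```python
from pathlib import PurePath

# Extensions listed for the "Image files only" option in step 2.
IMAGE_EXTENSIONS = {".jpg", ".png", ".bmp", ".svg"}


def would_be_analyzed(filename: str, option: str, pdf_has_images: bool = True) -> bool:
    """Hypothetical model of the OCR file type options (not Coveo code).

    option is one of the assumed keys "all_non_text", "images_only",
    or "pdf_with_images_only". pdf_has_images stands in for whatever
    check decides that a PDF is scanned or illustrated.
    """
    ext = PurePath(filename).suffix.lower()
    is_image = ext in IMAGE_EXTENSIONS
    is_image_pdf = ext == ".pdf" and pdf_has_images

    if option == "all_non_text":
        return is_image or is_image_pdf
    if option == "images_only":
        return is_image
    if option == "pdf_with_images_only":
        return is_image_pdf
    raise ValueError(f"unknown option: {option}")
```

For example, under these assumptions would_be_analyzed("archive.pdf", "images_only") is False, which corresponds to an item reported as Skipped in the OCR stage.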

Once you have enabled OCR for your source, you can use the Log Browser to review the items that went through the OCR stage: in the Log Browser Resources facet, select OCR_EXTENSION to show logs for the OCR stage only (see Review Item Logs). The possible results are the following:

Result      Description
Completed   OCR has been performed on this item. Expand the entry to show the process duration.
Skipped     The item skipped the OCR stage because it is not of an allowed type (see step 2).
Error       An error occurred during the OCR process. You can report it to the Coveo Support team for investigation.

Impact on Performance

Only images and PDF files go through the OCR stage of the indexing pipeline. For these items, expect an additional indexing time of 10 to 15 seconds per page or image. To minimize this impact on performance, keep OCR disabled for sources for which it is not relevant.
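
As a rough worked example of the figure above, the following sketch multiplies a page or image count by the 10 and 15 second bounds. The function name and the 500-page count are illustrative assumptions; actual indexing times depend on your content.

```python
from typing import Tuple


def estimate_ocr_overhead_seconds(pages_or_images: int) -> Tuple[int, int]:
    """Return the (low, high) additional indexing time in seconds,
    using the 10 to 15 seconds per page or image stated above."""
    if pages_or_images < 0:
        raise ValueError("pages_or_images must not be negative")
    return pages_or_images * 10, pages_or_images * 15


# Hypothetical source containing 500 scanned pages.
low, high = estimate_ocr_overhead_seconds(500)
print(f"{low / 60:.0f} to {high / 60:.0f} minutes")  # prints "83 to 125 minutes"
```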

Items of other types skip the OCR stage, so they do not affect indexing performance.