Enable Optical Character Recognition
When optical character recognition (OCR) is enabled for a source, Coveo Cloud extracts text from image files or PDF files containing images as they go through the indexing pipeline (see Optical Character Recognition and Coveo Cloud V2 Indexing Pipeline). OCR-extracted text is processed as item data, meaning that it is fully searchable and will appear in the item Quick View (see Search Result Quick View).
Your Dropbox account contains a picture of a city skyline in which there is a large billboard reading "Hoover Dam and Grand Canyon Tours". Since you have enabled OCR for this source, this image is analyzed by the OCR feature during the indexing process.
As a result, when you type "Grand Canyon" in a Coveo Cloud search interface, this picture is part of your search results. The Quick View for this item contains the text extracted from the picture: "Hoover Dam and Grand Canyon Tours".
Your company is digitizing its archives. Hundreds of pages of documents are therefore scanned and saved as PDF files. To make these documents searchable, Coveo needs to access the text they contain. You enable OCR for this source so that when an end user enters keywords in a search interface, any matching PDF file is returned in the search results.
- All characters used in English, French, Spanish, and German are supported. Characters used in other languages may not be recognized properly.
- Only image files and PDF files (e.g., scanned or illustrated documents) are supported.
- Handwriting is not supported, although very clear handwriting may yield acceptable results.
Enable Optical Character Recognition
- Open the panel for adding or editing the desired source.
- Under Optical character recognition (OCR), check the Make text found in images and PDF files searchable box, and then select the file types to analyze:
- Select All non-text files if you want the OCR feature to analyze image files as well as PDF files that contain images.
- Select Image files only if you want the OCR feature to analyze only files with an
- Select PDF files with images only if you want the OCR feature to analyze scanned or illustrated PDF files only.
- Add or save your source.
Once you have enabled OCR for your source, you can use the Log Browser to review a list of the items that went through the OCR stage: in the Log Browser Resources facet, select OCR_EXTENSION to show logs for the OCR stage only (see Review Item Logs). The possible results are the following:
|Completed|OCR has been performed on this item. Expand the entry to show the process duration.|
|Skipped|The item skipped the OCR stage because it is not of an allowed type (see step 2).|
|Error|An error occurred during the OCR process. You can report it to the Coveo Support team for investigation.|
Impact on Performance
Only images and PDF files go through the OCR stage of the indexing pipeline. For these items, expect an additional indexing time of 10 to 15 seconds per page or image. To minimize this impact on performance, keep OCR disabled for sources for which it is not relevant.
Items of other types skip the OCR stage, and indexing performance is therefore not affected by these items.
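To get a feel for this overhead before enabling OCR on a large source, the 10 to 15 seconds per page or image figure above can be turned into a rough estimate. The following is only a back-of-the-envelope sketch: the function name and the 600-page archive are hypothetical, and actual indexing times depend on your content.

```python
def estimate_ocr_overhead_seconds(pages_or_images: int) -> tuple[int, int]:
    """Return a (low, high) estimate, in seconds, of the extra indexing time.

    Assumes the documented range of 10 to 15 seconds per page or image.
    Items of other types skip the OCR stage, so they are not counted here.
    """
    if pages_or_images < 0:
        raise ValueError("pages_or_images must not be negative")
    return pages_or_images * 10, pages_or_images * 15


# Hypothetical example: a scanned archive totalling 600 PDF pages.
low, high = estimate_ocr_overhead_seconds(600)
print(f"Additional indexing time: {low / 3600:.1f} to {high / 3600:.1f} hours")
```

For a source that holds few images or scanned PDF files, the estimate stays small. For a large scanned archive, it can help you decide whether to enable OCR only for the sources where it is relevant.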