About field decomposition
Field decomposition breaks down a field value, such as a product identifier, into multiple partial variations (substrings) stored in a separate field. This lets users search for complex string values, like model numbers or SKUs, by entering only part of the value while still retrieving accurate results.
In B2B commerce scenarios, visitors often search for products using product identifiers.
Product identifiers often consist of a combination of letters, numbers, and special characters, making them difficult to remember or type correctly.
For example, a product identifier might look like this: ABC123DEF456.
By decomposing the product identifier, you can generate additional search terms from it, such as:
123DEF456 ; 23DEF456 ; 3DEF456 ; 456 ; ABC ; ABC1 ; ABC12 ; ABC123 ; ABC123D ; ABC123DE ; ABC123DEF ; ABC123DEF4 ; ABC123DEF45 ; ABC123DEF456 ; BC123DEF456 ; C123DEF456 ; DEF456 ; EF456 ; F456
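The list above can be reproduced with a few lines of Python. This is a minimal sketch that only generates leading and trailing substrings of at least three characters; the function name prefix_suffix_terms and the min_length parameter are illustrative and not part of any Coveo API.

```python
def prefix_suffix_terms(identifier, min_length=3):
    """Return the sorted leading and trailing substrings of `identifier`
    that contain at least `min_length` characters."""
    terms = set()
    for length in range(min_length, len(identifier) + 1):
        terms.add(identifier[:length])   # leading substring, e.g. ABC, ABC1, ...
        terms.add(identifier[-length:])  # trailing substring, e.g. 456, F456, ...
    return sorted(terms)


print(" ; ".join(prefix_suffix_terms("ABC123DEF456")))
```

Using a set removes the duplicate that appears when the leading and trailing substrings both reach the full identifier.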
Leveraging field decomposition
You can implement field decomposition either in your own systems before sending data to Coveo, or by using a Coveo indexing pipeline extension (IPE). If you prefer to handle decomposition upstream in your data processing workflows, you can generate the decomposed values and send them to a dedicated field in Coveo.
Alternatively, you can create a Coveo indexing pipeline extension (IPE) to decompose fields during the indexing process. Using this feature, you can run a custom script to transform data as it’s being indexed.
Whichever approach you choose, the decomposition best practices and algorithms outlined in this article apply. The example Python script provided later can serve as a reference for implementing field decomposition either in your own systems or within a Coveo IPE.
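If you take the upstream route, the enrichment step can be as small as the following sketch. The record shape, the enrich_record helper, and the stand-in decomposition function are assumptions made for illustration. The field names reuse the examples from this article (ec_product_id and product_id_decomposed), and how you send the result to Coveo depends on your own pipeline.

```python
import json


def enrich_record(record, decompose,
                  source_field="ec_product_id",
                  target_field="product_id_decomposed"):
    """Return a copy of `record` with decomposed values stored in `target_field`.

    `decompose` is any function that maps an identifier string to a list of terms.
    """
    enriched = dict(record)  # shallow copy so the input record is left untouched
    identifier = record.get(source_field)
    if identifier:
        # Semicolon-separated values, the same format the example script in this article uses
        enriched[target_field] = ";".join(decompose(identifier))
    return enriched


def leading_substrings(identifier, min_length=3):
    """Stand-in decomposition (hypothetical): leading substrings only."""
    return [identifier[:n] for n in range(min_length, len(identifier) + 1)]


# Hypothetical catalog record; real records will carry more fields.
product = {"ec_name": "Sample pump", "ec_product_id": "AB12C"}
print(json.dumps(enrich_record(product, leading_substrings), indent=2))
```

Passing the decomposition function as an argument keeps the enrichment step independent of the algorithm, so the same helper could wrap a fuller implementation.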
Step 1: Create a decomposition field
The first step is to create a field that will store the decomposed values.
- Access the Fields (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console.
- Create a new field that will store the decomposed values (for example, product_id_decomposed). Make sure that this field has the Multi-value facet and Free text search options enabled.
- You should also ensure that you have a field that stores the original field value (for example, ec_product_id).
Step 2: Create the Coveo indexing pipeline extension
After creating the field that will store the decomposed values, create a Coveo indexing pipeline extension (IPE) that will run the field decomposition script. You’ll then need to apply this IPE to the source that stores the items you want to decompose.
To create the Coveo indexing pipeline extension:
- Access the Extensions (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console.
- Create a new indexing pipeline extension. This is where you’ll write the script that will decompose the field values. See Use indexing pipeline extensions for detailed instructions.
Step 3: Write the field decomposition script
When creating the Coveo IPE, you’ll be required to write a script.
Consider the following best practices when writing a field decomposition script. These practices minimize index noise while maintaining effective searchability. Note that an example script that includes these best practices is provided later in this article.
- Cleaning input data:
  Remove any non-alphanumeric characters (except for specific allowed characters, like hyphens) from the field values to standardize them for processing.
- Handling hyphen-separated values:
  Split the field value into separate parts at each hyphen. For example, the product identifier XYZ-789-DEF would be split into XYZ, 789, and DEF.
- Generating meaningful incremental variations:
  Create incremental substring variations with a minimum length of 3 to 4 characters to avoid index noise from overly short terms. For example, the product identifier ABC123DEF456 would generate the following variations:
  ABC, ABC1, ABC12, ABC123, ABC123D, ABC123DE, ABC123DEF, ABC123DEF4, ABC123DEF45, ABC123DEF456
  This approach avoids generating one- or two-character variations such as A, B, or 12, which can create noise and reduce query precision.
- Hyphen-progressive concatenation:
  For hyphenated identifiers, progressively add characters across hyphen boundaries to generate incremental variants that reflect how users type. For example, the product identifier XYZ-789-DEF would generate the following variations:
  - Progressive hyphenated: XYZ-, XYZ-7, XYZ-78, XYZ-789, XYZ-789-, XYZ-789-D, XYZ-789-DE, XYZ-789-DEF
  - Fully concatenated: XYZ789DEF
  - Cumulative concatenated parts: XYZ, XYZ789, XYZ789DEF
- Normalization of separators:
  Create variations that replace hyphens with spaces, and provide the fully concatenated version. For example, the product identifier ABC123-GHI789 would generate:
  - Space-separated: ABC123 GHI789
  - Fully concatenated: ABC123GHI789
- Edge trimming for flexibility:
  Generate variations that remove one or two leading or trailing characters to account for partial recall. For example, the product identifier ABC123DEF456 would generate:
  - Without leading characters: BC123DEF456, C123DEF456
  - Without trailing characters: ABC123DEF45, ABC123DEF4
- Combining all variations:
  Combine all the variations generated in the previous steps into a single list of decomposed values. Ensure that there are no duplicates, and sort the list alphabetically. For example, the product identifier ABC123DEF456 would generate the following optimized list:
  123DEF456 ; 23DEF456 ; 3DEF456 ; 456 ; ABC ; ABC1 ; ABC12 ; ABC123 ; ABC123D ; ABC123DE ; ABC123DEF ; ABC123DEF4 ; ABC123DEF45 ; ABC123DEF456 ; BC123DEF456 ; C123DEF456 ; DEF456 ; EF456 ; F456
  Note how this approach eliminates one- and two-character noise while maintaining comprehensive searchability.
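The hyphen-related practices above can be sketched as follows. This is a simplified illustration rather than the full script: hyphen_variations and MIN_LENGTH are hypothetical names, and the progressive variants are taken as leading substrings of the cleaned identifier, so XYZ also appears alongside the hyphenated values.

```python
import re

MIN_LENGTH = 3  # assumed minimum variation length


def hyphen_variations(identifier):
    """Return the hyphen-related variations of `identifier`, grouped by practice."""
    # Cleaning: keep only letters, digits, and hyphens
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "", identifier)
    # Hyphen handling: split into individual parts
    parts = cleaned.split("-")
    # Progressive concatenation: grow the identifier one character at a time,
    # hyphens included, starting at the minimum length
    progressive = [cleaned[:n] for n in range(MIN_LENGTH, len(cleaned) + 1)]
    return {
        "parts": parts,
        "progressive": progressive,
        "spaced": " ".join(parts),        # separators normalized to spaces
        "concatenated": "".join(parts),   # separators removed
    }


for name, value in hyphen_variations("XYZ-789-DEF").items():
    print(name, "->", value)
```

Grouping the results by practice is only for readability here; in a real field value they would be merged into one deduplicated list.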
Step 4: Apply the indexing pipeline extension to a source
When you’re done creating your IPE, you’ll need to apply it to the source that contains the items you want to decompose.
To apply the IPE to a source:
- Access the Sources (platform-ca | platform-eu | platform-au) page of the Coveo Administration Console.
- Click the source for which you want to apply the indexing pipeline extension.
- In the Action bar, click More > Edit extensions.
- Click Add > Extension, and then select the indexing pipeline extension you created.
- Under Stage, select Post-conversion.
- Under Action on error, select Skip extension.
- Click Apply extension. You’ll need to rebuild the source for the IPE to apply to the items.
Step 5: Validate the field decomposition
After applying the IPE to the source, you should validate that the field decomposition is working as expected. You can use the Content Browser (platform-ca | platform-eu | platform-au) to view the indexed items and verify that the decomposed field exists and contains the expected values.
If the field decomposition isn’t working as expected, you can review the Log Browser (platform-ca | platform-eu | platform-au) to identify any errors or issues that may have occurred during the indexing process and adjust the field decomposition script accordingly.
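When checking values by hand becomes tedious, a small helper can flag common problems in a decomposed value copied from the Content Browser. The following is a sketch under assumptions: check_decomposed_value is a hypothetical helper, it expects semicolon-separated values, and it should be given the cleaned identifier (special characters removed), since that is the form the decomposition stores.

```python
def check_decomposed_value(original, decomposed, min_length=3):
    """Return a list of human-readable problems; an empty list means none were found."""
    # Split on semicolons and ignore surrounding whitespace and empty entries
    values = [v.strip() for v in decomposed.split(";") if v.strip()]
    problems = []
    if original not in values:
        problems.append(f"original identifier {original!r} is missing")
    if len(values) != len(set(values)):
        problems.append("duplicate values found")
    too_short = [v for v in values if len(v) < min_length and v != original]
    if too_short:
        problems.append(f"values shorter than {min_length} characters: {too_short}")
    return problems


# Example with a deliberately faulty value
for problem in check_decomposed_value("ABC123", "AB ; ABC ; ABC"):
    print(problem)
```

An empty result doesn't prove the decomposition is complete; it only rules out the three problems the helper looks for.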
Example field decomposition script
The following example Python script decomposes a product identifier field value according to the best practices outlined in the previous section. Note that this script considers the product identifier field to be a single value. If your product identifier field contains multiple values, you’ll need to adjust the script accordingly.
import re


def get_safe_meta_data(meta_data_name):
    meta_data_value = document.get_meta_data_value(meta_data_name)
    return list(meta_data_value)


def generate_variations(product_identifier):
    variations = set()
    MIN_VARIATION_LENGTH = 3  # Minimum decomposition length to avoid index noise

    # Remove non-alphanumeric characters (excluding hyphens) from the product identifier
    cleaned_product_identifier = re.sub(r'[^A-Za-z0-9-]+', '', product_identifier)

    # Handle hyphen-separated parts
    parts = cleaned_product_identifier.split('-')
    concatenated_id = ''.join(parts)

    # 1. Generate incremental variations starting from minimum length for concatenated version
    # This avoids single/double character noise in the index
    for i in range(MIN_VARIATION_LENGTH, len(concatenated_id) + 1):
        variations.add(concatenated_id[:i])

    # 2. Generate suffix variations (from the end) for better searchability
    for i in range(MIN_VARIATION_LENGTH, len(concatenated_id) + 1):
        suffix = concatenated_id[-i:]
        if len(suffix) >= MIN_VARIATION_LENGTH:
            variations.add(suffix)

    # 3. Hyphen-progressive concatenation for hyphenated identifiers
    if len(parts) > 1:
        current_progressive = ""
        for part_idx, part in enumerate(parts):
            if part_idx > 0:
                current_progressive += "-"
                # Add variation with trailing hyphen if it meets minimum length
                if len(current_progressive) >= MIN_VARIATION_LENGTH:
                    variations.add(current_progressive)
            # Add each character progressively within the current part
            for char_idx in range(len(part)):
                current_progressive += part[char_idx]
                if len(current_progressive) >= MIN_VARIATION_LENGTH:
                    variations.add(current_progressive)

    # 4. Generate variations for individual parts that meet minimum length
    for part in parts:
        if len(part) >= MIN_VARIATION_LENGTH:
            variations.add(part)
            # Add incremental variations for parts longer than minimum
            for i in range(MIN_VARIATION_LENGTH, len(part)):
                variations.add(part[:i])

    # 5. Edge trimming: remove leading/trailing characters for flexibility
    if len(concatenated_id) > MIN_VARIATION_LENGTH + 1:
        # Remove 1-2 leading characters
        for trim_count in range(1, min(3, len(concatenated_id) - MIN_VARIATION_LENGTH + 1)):
            trimmed = concatenated_id[trim_count:]
            if len(trimmed) >= MIN_VARIATION_LENGTH:
                variations.add(trimmed)
        # Remove 1-2 trailing characters
        for trim_count in range(1, min(3, len(concatenated_id) - MIN_VARIATION_LENGTH + 1)):
            trimmed = concatenated_id[:-trim_count]
            if len(trimmed) >= MIN_VARIATION_LENGTH:
                variations.add(trimmed)

    # 6. Always include the original cleaned identifier
    variations.add(cleaned_product_identifier)

    # 7. Add space-separated version for better matching
    spaced_version = cleaned_product_identifier.replace('-', ' ')
    variations.add(spaced_version)

    # 8. Filter to ensure minimum length requirement (except for original forms)
    filtered_variations = []
    for v in variations:
        if (len(v) >= MIN_VARIATION_LENGTH or
                v == cleaned_product_identifier or
                v == spaced_version):
            filtered_variations.append(v)

    return sorted(list(set(filtered_variations)))


def main():
    product_identifier_meta_field = '<MY_PRODUCT_IDENTIFIER_FIELD>'
    decomposed_meta_field = '<MY_DECOMPOSED_FIELD>'

    product_identifiers = get_safe_meta_data(product_identifier_meta_field)
    decomposed_product_identifiers = []

    for product_identifier in product_identifiers:
        log(f"Processing Product Identifier: {product_identifier}")
        variations = generate_variations(product_identifier)
        # Properly join variations with a semicolon
        decomposed_product_identifier = ';'.join(variations)
        log(f"Decomposed Product Identifier: {decomposed_product_identifier}")
        decomposed_product_identifiers.append(decomposed_product_identifier)

    if decomposed_product_identifiers:
        # Add metadata as a semicolon-separated string
        document.add_meta_data({decomposed_meta_field: decomposed_product_identifiers})


main()
- Replace <MY_PRODUCT_IDENTIFIER_FIELD> with the name of the field that holds the product identifiers (for example, ec_product_id).
- Replace <MY_DECOMPOSED_FIELD> with the name of the field that will store the decomposed values (for example, product_id_decomposed).
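If the identifier field holds several values, one possible adjustment is to decompose each value and merge the results into a single deduplicated list. The sketch below assumes the values arrive as a list of strings, where any string may itself contain semicolon-separated identifiers. decompose_all is a hypothetical helper, and its generate_variations argument stands for the function defined in the script above.

```python
def decompose_all(identifiers, generate_variations):
    """Decompose every identifier and return one sorted, deduplicated,
    semicolon-separated string."""
    merged = set()
    for entry in identifiers:
        # An entry may itself hold several identifiers separated by semicolons
        for identifier in entry.split(";"):
            identifier = identifier.strip()
            if identifier:
                merged.update(generate_variations(identifier))
    return ";".join(sorted(merged))


def demo_variations(identifier):
    """Stand-in for generate_variations so this sketch runs on its own."""
    return [identifier[:n] for n in range(3, len(identifier) + 1)]


print(decompose_all(["AB12;AB34", "CD56"], demo_variations))
```

Merging into a set before joining means that a variation shared by two identifiers is stored only once.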