XML Configuration Further Customization

To index database content, you must provide an XML configuration defining the content that Coveo should retrieve and how it should index it.

In addition, you can optionally customize your XML configuration to enable additional functionalities that will make your source and the indexing process more efficient.

Defining Items to Delete

If you want to index certain database items depending on a characteristic that may vary, your configuration must include instructions as to which items are no longer relevant and can be deleted from the index.

You index MESSAGE items with a message_id that’s less than 100. The configuration is the following for your MS-SQL database:

  <?xml version="1.0" encoding="utf-8" ?>
  <ODBC>
  <CommonMapping>
  <AllowedUsers>
      <AllowedUser type="Windows" allowed="true">
      <Name>everyone</Name>
      <Server></Server>
      </AllowedUser>
  </AllowedUsers>
  </CommonMapping>
  <Mapping type="MESSAGE">
  <Accessor type="query"
              IncrementalRefreshFieldName="date">
  SELECT * FROM (
      SELECT ROW_NUMBER() OVER (ORDER BY mid) AS mid, -- message.mid,
      message.sender,
      message.date,
      message.message_id,
      message.subject,
      Cast(@LastRefresh as nvarchar(4000)) as FunkyDate,
      message.body,
      message.folder
      FROM message
  )  AS T
  WHERE mid &lt;= 100
  </Accessor>
  <Fields>
      <Uri>https://www.coveo.com/Emails/details.aspx?Id=%[mid]</Uri>
      <ClickableUri>https://www.coveo.com</ClickableUri>
      <FileName>Message_%[mid].txt</FileName>
      <Title>Message_%[mid] - %[FunkyDate]</Title>
      <ModifiedDate>%[date]</ModifiedDate>
      <Body>%[body]</Body>
      <CustomFields>
      <CustomField name="sysAuthor">%[sender]</CustomField>
      </CustomFields>
  </Fields>
  <AllowedUsers>
      <AllowedUser type="CustomGroup" allowed="true">
      <Name>everyone</Name>
      <Server></Server>
      </AllowedUser>
  </AllowedUsers>
  </Mapping>
  </ODBC>

However, if, among the items with a message_id inferior to 100, you want to index items of a certain age only, you need to add an instruction for Coveo Cloud to delete the items that don’t satisfy your age condition. Since the age of an item changes constantly, each of the indexed items will eventually become irrelevant and will need to be removed from your index.

You want items that are less than a month old only to be searchable. With the following instruction inserted before <Fields> in your XML configuration, every time your Database source is refreshed, items that no longer meet your age criterion are deleted from the index:

<AccessorForItemsToDelete type="query">
  SELECT message.mid
  FROM message
  WHERE message.date &lt; DATEADD(month, -1, GETDATE())
    AND message.mid &lt; 100
</AccessorForItemsToDelete>

Enabling Refresh on a Database Source

A refresh operation keeps items up to date by scanning repositories and re-indexing modified items at short intervals. For this to be possible with a Database source, each of the database items to retrieve, such as tables or views, must have a Date type field indicating their latest modification date. This date must also be updated whenever the record is modified. The name of this field is irrelevant to Coveo, but it’s crucial that it contains a date, as Coveo uses this information to determine whether the content of a database item has changed since the last update operation. If so, the item must be re-indexed, so that the content searchable in your Coveo-powered search page reflects your actual database content.

The refresh takes into account deleted items when the AccessorForItemsToDelete accessor is configured in the XML configuration. Otherwise, a source rescan or rebuild is required.

In the SQL query, the SELECT statement must have a WHERE clause with a criterion on the last modification date field.

The following example should work with common database engines such as Microsoft SQL Server 2012, PostgreSQL, and MySQL. The [PARAMETER] field is sent by the crawler to the query to indicate when the last refresh was performed.

  • With a MSSQL or SQLServer database, one must select the SQL Client driver and, in the XML, replace [PARAMETER] with @LastRefresh.

  • With a database of a different type (e.g., a NorthWind database), one must select the ODBC driver and, in the XML, replace [PARAMETER] with ?.

However, regardless of your database type, the value of the OrderByFieldName and IncrementalRefreshFieldName parameters must be the name of the latest modification date field in your database.

<Accessor type="query"
  OrderByFieldName="dateModifiedField"
  OrderByFieldType="DateTime"
  IncrementalRefreshFieldName="dateModifiedField">
  <![CDATA[
    Select
    id,
    title,
    dateModified,
    content,
    author
    FROM blog
    WHERE dateModified>=[PARAMETER]
    order by dateModified
    OFFSET @startRow ROWS FETCH NEXT (@endRow-@startRow) ROWS ONLY;
  ]]>
</Accessor>

The example also includes support for pagination (see OFFSET FETCH Clause (SQL Server Compact)).

Enable Pausing/Resuming Update Operations

By default, Database sources don’t offer the option to pause/resume their update operations. To enable these actions, you must alter your source configuration.

  1. In the SQL query, add an ORDER BY with the same chronological fields as in the SELECT statement.

  2. In your source XML configuration, add XML attributes on the Accessor element:

  • OrderByFieldName

    Specifies the name of the column on which the ORDER BY is applied. This attribute must be present to enable the Pause/Resume options. It can be, for example, the name of the latest modification date field in your database.

  • OrderByFieldType

    Specifies the .NET data type of that column. This attribute isn’t normally required. The source automatically tries to determine the data type by preparing the SQL query - without however executing it - and looking at the schema of the results. However, if a specific DBMS doesn’t handle that process correctly, you can manually specify the data type with this attribute. The allowed types are the following:

    • short (16-bit signed integer)

    • ushort (16-bit unsigned integer)

    • int (32-bit signed integer)

    • uint (32-bit unsigned integer)

    • long (64-bit signed integer)

    • ulong (64-bit unsigned integer)

    • float (single-precision floating point number)

    • double (double-precision floating point number)

    • string (String)

  • IncrementalRefreshFieldName

    To support both Refresh and Pause/Resume options, the value of the IncrementalRefreshFieldName parameter must be the same as that of the OrderByFieldName parameter.

    See this example for an excerpt of a configuration file for a source with Pause/Resume and Refresh options enabled:

    You must replace <PARAMETER> by @LastRefresh in an SqlClient scenario or by ? otherwise.

    <Mapping type="Orders">
      <Accessor type="query"
                OrderByFieldName="OrderDate"
                OrderByFieldType="DateTime"
                IncrementalRefreshFieldName="OrderDate">
       SELECT Shippers.CompanyName AS ShipperName,
              Orders.OrderID AS ID,
              Orders.CustomerID,
              Orders.OrderDate,
              Orders.RequiredDate,
              Orders.ShippedDate,
              Customers.CompanyName,
              Employees.LastName,
              Employees.FirstName
       FROM   Orders, Shippers, Customers, Employees
       WHERE  Orders.ShipVia = Shippers.ShipperID AND
              Orders.CustomerID = Customers.CustomerID AND
              Orders.EmployeeID = Employees.EmployeeID AND
              Orders.OrderDate >= <PARAMETER>
        ORDER BY Orders.OrderDate
      </Accessor>
    </Mapping>
    

Complement Information Retrieval Using Subqueries

The Database source acquires information about each indexed item through a query performed against a database. For each query, it’s possible to associate one or more subqueries to be executed and used to complement information.

You can run a main query, and for each row, run a subquery that crawls more/different information. All the results of a single row from the main query, along with everything from the subquery, are merged into a single item.

The source requires a mapping configuration to execute properly. For each mapping type, it’s necessary to specify an Accessor representing the SQL query to execute.

Specifying Subqueries

To specify subqueries, you must set the type of the Accessor to query.

<Accessor type="query">

Following the Accessor definition, add an AccessorSubQueries node with all subqueries:

The master key (value following SELECT) in the AccessorSubQuery node must match exactly the one returned by the server. The key you include must also have the same casing. The following error is thrown when the key couldn’t be found:

Unable to index document : There's a formatting error in a sub query. Cannot find master key %[key].

<AccessorSubQueries>
  <AccessorSubQuery name="FirstNameLastName" separator=";" behaviorOnMultiRows="join" allowDuplicates = "false">
    SELECT firstName,
    lastName
    FROM employeelist
    WHERE Email_id = %[sender]
  </AccessorSubQuery>
</AccessorSubQueries>

Subquery Attributes

  • name

    Subquery name referred to in the Fields section of the mapping.

  • separator

    Separator used when concatenating many rows.

  • behaviorOnMultiRows

    Action to take when a subquery returns more than one row. The only supported behavior is join, which concatenates values with the provided separator.

  • allowDuplicates (optional)

    This attribute is mainly used when your subquery returns many rows. If set to false, duplicates in the results are ignored in the concatenation of the results. If set to true, duplicates are present.

  • singleQuoteEscapeSequence (optional)

    When the value of the returned field contains single quotes, these single quotes must be escaped. By default when you omit this attribute, the source escapes the single quotes by doubling them (e.g., ''). Usually, this escaping mechanism should work. However, some database types require a different escaping sequence for single quotes. In such cases, use this attribute to specify the single quote escape sequence.

    For the MySQL database, the single quote escaping sequence is \'. In this case, in the AccessorSubQuery tag, include the singleQuoteEscapeSequence attribute as follows:

      <AccessorSubQuery name="FirstNameLastName" separator=";" behaviorOnMultiRows="join" allowDuplicates = "false" singleQuoteEscapeSequence="\'">
    

Subquery Master Key

In a subquery, a master key used in the WHERE clause must respect the format %[fieldName], which corresponds to metadata acquired from the main accessor. The master key is used to make the join between the main query and subqueries.

Specifying Subquery Metadata for Fields

The <Fields> section of the mapping configuration is used to specify the metadata to use for indexing.

See the following example for a typical <Fields> section of a mapping configuration:

<Fields>
  <Uri>https://www.coveo.com/Emails/details.aspx?Id=%[mid]</Uri>
  <ClickableUri>https://www.coveo.com</ClickableUri>
  <FileName>Message_%[mid].txt</FileName>
  <Title>Message_%[mid]</Title>
  <ModifiedDate>%[date]</ModifiedDate>
  <Body>%[body]</Body>
  <CustomFields>
    <CustomField name="sysAuthor">%[sender]</CustomField>
    <CustomField name="firstName">%[FirstNameLastName.firstName]</CustomField>
    <CustomField name="lastName">%[FirstNameLastName.lastName]</CustomField>
  </CustomFields>
</Fields>

The metadata of a subquery can be specified for a field or a custom field. The way to specify is similar to the way it’s done when referring a field coming from the main accessor: %[subQueryName.fieldName]. In the above example, custom field firstName is referring subquery named FirstNameLastName and uses the firstName metadata.

See the following for a complete mapping configuration, used in Coveo unit tests:

<?xml version="1.0" encoding="utf-8" ?>
<ODBC>
  <CommonMapping excludedItems="employeelist">
    <AllowedUsers>
      <AllowedUser type="Windows" allowed="true">
        <Name>everyone</Name>
        <Server></Server>
      </AllowedUser>
    </AllowedUsers>
  </CommonMapping>
  <Mapping type="message">
    <Accessor type="query">
      SELECT message.mid,
      message.sender,
      message.date,
      message.message_id,
      message.subject,
      message.body,
      message.folder
      FROM message
      WHERE DATE like '2001-04-07%'
    </Accessor>
      <AccessorSubQueries>
        <AccessorSubQuery name="FirstNameLastName" separator=";" behaviorOnMultiRows="join">
          SELECT firstName, lastName
          FROM employeelist
          WHERE Email_id = %[sender]
        </AccessorSubQuery>
      </AccessorSubQueries>
    <Fields>
      <Uri>https://www.coveo.com/Emails/details.aspx?Id=%[mid]</Uri>
      <ClickableUri>https://www.coveo.com</ClickableUri>
      <FileName>Message_%[mid].txt</FileName>
      <Title>Message_%[mid]</Title>
      <ModifiedDate>%[date]</ModifiedDate>
      <Body>%[body]</Body>
      <CustomFields>
        <CustomField name="sysAuthor">%[sender]</CustomField>
        <CustomField name="firstName">%[FirstNameLastName.firstName]</CustomField>
        <CustomField name="lastName">%[FirstNameLastName.lastName]</CustomField>
      </CustomFields>
    </Fields>
    <AllowedUsers>
      <AllowedUser type="CustomGroup" allowed="true">
        <Name>everyone</Name>
        <Server></Server>
      </AllowedUser>
    </AllowedUsers>
  </Mapping>
</ODBC>
Recommended Articles