Web Scraping Configuration

Web and Sitemap sources can include a web scraping configuration, a powerful tool allowing members of the Administrators and Content Managers built-in groups to precisely select the web page content to index, exclude specific parts, extract content to create metadata, and create sub-items.

As a member of one of these groups, you must add the web scraping configuration to your Web or Sitemap source in JSON format, which requires some technical skill (see Add or Edit a Web Source or Adding a Sitemap Source).

Consider using the Coveo Labs Web Scraper Helper, available on the Chrome Web Store and on GitHub (web-scraper-helper project), to create and especially to test web scraping configurations more easily.

Application Examples

  • Exclude parts of the page from being indexed.

    For 3-pane web pages, you exclude the repetitive header, footer, and left navigation panel to only index the original content of each page.

  • Create metadata with values extracted from the page.

    On a question-and-answer site, no meta element provides the number of votes for a question, but the value appears on the page. You use a CSS selector rule to extract this value as a string and index it as question metadata.

  • Extract part of an item to create sub-items that become independent items in the index.

    In a blog post, you extract comments as sub-items.

  • Set different web scraping rules for different item types or sub-items.

    In the same blog post where comments are extracted as sub-items, you can extract different metadata for the parent item and for the comments.

For Stack Overflow website pages, you want to split the question and each answer into separate index items and later use result folding in the search interface to wrap the answers under the corresponding question item (see Understanding Result Folding).

You can do that with the following web scraping configuration, which:

  • Excludes non-content sections (header, herobox, advertisement, sidebar, footer).

  • Extracts some question metadata.

  • Defines answer sub-items.

  • Extracts some answer metadata.

[
  {
    "for": {
      "urls": [".*"]
    },
    "exclude": [
      {
        "type": "CSS",
        "path": "body header"
      },
      {
        "type": "CSS",
        "path": "#herobox"
      },
      {
        "type": "CSS",
        "path": "#mainbar .everyonelovesstackoverflow"
      },
      {
        "type": "CSS",
        "path": "#sidebar"
      },
      {
        "type": "CSS",
        "path": "#footer"
      },
      {
        "type": "CSS",
        "path": "#answers"
      }
    ],
    "metadata": {
      "askeddate":{
        "type": "CSS",
        "path": "div#sidebar table#qinfo p::attr(title)"
      },
      "upvotecount": {
        "type": "XPATH",
        "path": "//div[@id='question'] //span[@itemprop='upvoteCount']/text()"
      },
      "author":{
        "type": "CSS",
        "path": "td.post-signature.owner div.user-details a::text"
      }
    },
    "subItems": {
      "answer": {
        "type": "CSS",
        "path": "#answers div.answer"
      }
    }
  }, {
    "for": {
      "types": ["answer"]
    },
    "metadata": {
      "upvotecount": {
        "type": "XPATH",
        "path": "//span[@itemprop='upvoteCount']/text()"
      },
      "author": {
        "type": "CSS",
        "path": "td.post-signature:last-of-type div.user-details a::text"
      }
    }
  }
]

Web Scraping Leading Practices

When a web scraping JSON configuration causes an error, the Coveo Cloud administration console currently does not return error details, which makes troubleshooting hard. The following practices may help you develop a working web scraping configuration.

  • Use a test source that includes only a few typical pages to test your web scraping configuration as you develop it. Rebuilding the source will be quick.

  • Use a text editor to create your web scraping JSON configuration (see JSON Configuration Schema).

  • Incrementally add web scraping elements to your JSON configuration (see Filters (for), SubItems, Exclusion, Metadata).

  • Ensure that the syntax of your JSON configuration is valid. You can use online tools such as JSONLint.

  • In the source configuration panel, paste the valid JSON in the Web scraping configuration box.

  • Click Save and Rebuild to make sure that changes perform as expected.

    When an error is returned, roll back your changes to return to a functional configuration to help identify changes that are causing the errors. Contact Coveo Support if you face a configuration problem that you cannot solve.

  • Use the Content Browser to validate your changes (see Inspect Search Result Items):

    • Use the debug window to view fields and their values for one item.

    • Use the Export to Excel option to easily compare field values for many items.

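  • When your web scraping extracts metadata, map each piece of metadata to an appropriate field, creating one when none exists (see Adding and Managing Source Mappings).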

  • Once the configuration works as desired for your test source, apply it to more items or all of them and validate the results.

  • Apply the configuration to your production source.
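
The syntax check recommended in the list above can also be run locally before you paste the configuration. The following Python sketch uses only the standard json module. The helper name check_json_syntax is hypothetical and is not part of any Coveo tool; it only confirms that the text parses as JSON and that the top-level value is an array, as the schema in this article requires.

```python
import json


def check_json_syntax(text):
    """Return None when text parses as a JSON array, else an error message."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the first character the parser rejected.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    if not isinstance(parsed, list):
        return "top-level value must be a JSON array of setting objects"
    return None


good = '[{"for": {"urls": [".*"]}}]'
bad = '[{"for": {"urls": [".*"]},}]'  # a trailing comma is invalid JSON

print(check_json_syntax(good))
print(check_json_syntax(bad))
```

A check like this catches only syntax problems. A configuration that parses correctly can still be rejected by the source or produce unexpected results, so the rebuild and Content Browser steps remain necessary.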

JSON Configuration Schema

The configuration consists of a JSON array of setting objects. The JSON structure is defined and commented in the following schema.

[                               -> Array of settings.
                                   The first matching "for" setting is applied.
  {
    "for": {                    -> Filters which pages have these settings applied.
      "urls": string[]          -> An array of REGEX to match specific URLs.
      "types": string[]         -> An array of types when you want to match specific subItems.
    },
    "exclude": [                -> Array of selector objects to remove elements from the page.
                                   These selectors cannot be Boolean or absolute.
      {
        "type": string,         -> "CSS" or "XPATH".
        "path": string          -> The actual selector value.
      }
    ],
    "metadata": {               -> Map of selector objects.
                                   The key represents the name of the piece of metadata.
      "meta1": {
        "type": string,         -> "CSS" or "XPATH"
        "path": string,         -> The actual selector value.
        "isBoolean": boolean,   -> Whether to evaluate this selector as a Boolean.
                                   If the selected element is found, the returned value is "true".
        "isAbsolute": boolean   -> Whether to retrieve the metadata from the parent page instead of the current subItem.
      }
    },
    "subItems": {               -> Map of selectors.
                                   The key represents the type of subItem to retrieve.
                                   These selectors cannot be Boolean or absolute.
      "post": {
        "type": string,         -> "CSS" or "XPATH"
        "path": string          -> The actual selector value.
      }
    }
  },
  {...}
]

You can copy and paste the following web scraper configuration JSON sample, and then tailor the filter, exclusions, metadata, and sub-items to your needs.

[{
  "for": {
    "urls": [".*"]
  },
  "exclude": [{
    "type": "CSS|XPATH",
    "path": "css|xpath selector"
  }, {
    "type": "CSS|XPATH",
    "path": "css|xpath selector"
  }],
  "metadata": {
    "meta1": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    }
  },
  "subItems": {
    "MySubItemName": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    }
  }
}, {
  "for": {
    "types": ["MySubItemName"]
  },
  "metadata": {
    "meta1": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    },
    "title": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    },
    "uri": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    },
    "meta2": {
      "type": "CSS|XPATH",
      "path": "css|xpath selector"
    }
  }
}]
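
Because the administration console does not report error details, it can help to check a configuration against the schema above before saving it. The following Python sketch is one possible structural check, written from the schema as described in this article. The function names are hypothetical, the check is not an official validator, and it compares the selector type without regard to letter case because the article does not state whether the service is case-sensitive.

```python
ALLOWED_TYPES = {"CSS", "XPATH"}


def selector_problems(selector, where, allow_flags):
    """List the problems found in one selector object."""
    if not isinstance(selector, dict):
        return [f"{where}: selector must be an object"]
    problems = []
    if not isinstance(selector.get("path"), str):
        problems.append(f"{where}: 'path' must be a string")
    # "type" may be omitted; the article says CSS is then assumed.
    sel_type = selector.get("type", "CSS")
    if not isinstance(sel_type, str) or sel_type.upper() not in ALLOWED_TYPES:
        problems.append(f"{where}: 'type' must be CSS or XPATH")
    if not allow_flags:
        # Per the schema, only metadata selectors can be Boolean or absolute.
        for flag in ("isBoolean", "isAbsolute"):
            if flag in selector:
                problems.append(f"{where}: '{flag}' is only valid under metadata")
    return problems


def find_schema_problems(config):
    """Return a list of human-readable problems; empty when none are found."""
    if not isinstance(config, list):
        return ["configuration must be an array of setting objects"]
    problems = []
    for i, setting in enumerate(config):
        where = f"setting {i}"
        if not isinstance(setting, dict):
            problems.append(f"{where}: must be an object")
            continue
        target = setting.get("for")
        if not isinstance(target, dict) or not (
            target.get("urls") or target.get("types")
        ):
            problems.append(f"{where}: 'for' needs a non-empty 'urls' or 'types' array")
        for j, selector in enumerate(setting.get("exclude", [])):
            problems += selector_problems(selector, f"{where} exclude[{j}]", False)
        for name, selector in setting.get("metadata", {}).items():
            problems += selector_problems(selector, f"{where} metadata.{name}", True)
        for name, selector in setting.get("subItems", {}).items():
            problems += selector_problems(selector, f"{where} subItems.{name}", False)
    return problems
```

Run against the unedited sample above, this sketch reports every "CSS|XPATH" placeholder, which is a quick way to confirm that all placeholders have been replaced.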

Filters (for)

The for element restricts the current setting object, that is, it defines the pages or sub-item types to which the settings apply. This element is mandatory. At least one for element must match URLs; otherwise, the settings are not applied to any page.

For each page, only the first matching for element is applied. When you include more than one for element, the order in which the elements appear may impact the scraping outcome. You typically start with the most specific filters.

A for element specifies a matching value for either urls (pages) or types (sub-items) (see SubItems). The value can be a regular expression.

Simply copying and pasting the following examples on their own returns an error. You must integrate them into a complete web scraping configuration (see JSON Configuration Schema).

  • You can apply a setting object to all site items with the .* REGEX:

      "for": {
        "urls": [".*"]
      }
    
  • You can apply a setting object only to HTML pages:

      "for": {
        "urls": ["\\.html"]
      }
    
  • You can apply a setting object only to sub-items of the post type:

      "for": {
        "types": ["post"]
      }
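
The first-match rule described above can be pictured with a short Python sketch. It is an illustration only, not Coveo's implementation: the function name pick_setting and the sample configuration are hypothetical, and the sketch assumes that a pattern matches when it is found anywhere in the URL or type name (re.search), which the article does not confirm.

```python
import re


def pick_setting(config, url=None, item_type=None):
    """Return the first setting whose "for" filter matches, or None."""
    for setting in config:
        target = setting.get("for", {})
        # Assumption: a pattern matches if it occurs anywhere in the value.
        if url is not None and any(
            re.search(pattern, url) for pattern in target.get("urls", [])
        ):
            return setting
        if item_type is not None and any(
            re.search(pattern, item_type) for pattern in target.get("types", [])
        ):
            return setting
    return None


# Hypothetical configuration: the most specific filter comes first.
config = [
    {"for": {"urls": ["/blog/"]}, "metadata": {"kind": {"path": "h1::text"}}},
    {"for": {"urls": [".*"]}, "exclude": [{"path": "#footer"}]},
    {"for": {"types": ["post"]}, "metadata": {}},
]

blog_setting = pick_setting(config, url="https://example.com/blog/post-1")
other_setting = pick_setting(config, url="https://example.com/about")
post_setting = pick_setting(config, item_type="post")
```

If the ".*" setting were listed first in this sample, it would match every URL and the /blog/ setting would never be selected, which is why the most specific filters usually come first.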
    

Selectors

The web scraping configuration supports XPath and CSS (jQuery-style) selector types. Selectors allow you to precisely select the HTML page elements (or their text content) that you want to include in or exclude from your source for a given page.

You should know the following about selectors in a web scraping configuration:

  • You can use either XPath or CSS or both types in the same web scraping configuration.

  • When no type is specified, Coveo Cloud assumes CSS.

  • By default, if a selector matches many elements, they are returned as a multi-value metadata (an array of strings).

  • If the selector path matches DOM elements, the elements are returned.

  • If the selector matches text nodes, such as when you use “text()” in an XPath, only the text values are returned.

  • You cannot chain selectors.

You can use the developer tools of your browser, such as those of Google Chrome:

  • To inspect page elements and easily get CSS or XPath selector expressions by right-clicking the desired element, selecting Copy, and then respectively Copy selector or Copy XPath.
  • To test the selector in the Elements tab search box and see how many elements match your selector.

CSS Selectors

CSS selectors are the most commonly used web selectors (see CSS Selector Reference). They are used extensively with jQuery (see Category: Selectors). CSS selectors rely on DOM element names, classes, IDs, and their hierarchy in HTML pages to isolate specific elements.

The following CSS selector selects the element with the class content that is a direct child of a span element, which is itself a direct child of a div element.

div > span > .content

The web scraping configuration supports the following CSS selector pseudo-elements to easily retrieve element text or attribute values.

  • Text

    Add ::text at the end of a CSS selector to select the inner text of an HTML element.

    The following expression selects the text of a span element with a class title that is under a div element with a class post.

    div.post > span.title::text

  • Attribute

    Add ::attr(<attributeName>) at the end of a CSS selector to select an HTML element attribute value, where <attributeName> is the name of the attribute you want to extract.

    • You want to get the URL from a post title link:

      div.post > a.title::attr(href)

    • For a Stack Overflow website page, you want to extract the asked date that appears in the sidebar, but you want the date and time from the title attribute, not the displayed text.

        <p class="label-key" title="2012-06-27 13:51:36Z">
          <b>4 years ago</b>
        </p>
      

      The following expression selects the value of the title attribute of the p element.

      div#sidebar table#qinfo p::attr(title)
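
To make the role of the ::text and ::attr() suffixes concrete, the following Python sketch separates a selector into its base CSS selector and the kind of value to extract. It is a reading aid under stated assumptions, not Coveo's parser: the function name split_pseudo_element and the mode labels ("element", "text", "attr") are hypothetical.

```python
import re

# Matches a trailing ::text or ::attr(name) pseudo-element.
_SUFFIX = re.compile(r"::(?:text|attr\(([^)]+)\))\s*$")


def split_pseudo_element(selector):
    """Split a selector into (base selector, mode, attribute name)."""
    match = _SUFFIX.search(selector)
    if match is None:
        # No suffix: the selector designates the elements themselves.
        return selector.strip(), "element", None
    base = selector[: match.start()].strip()
    if match.group(1) is None:
        return base, "text", None
    return base, "attr", match.group(1).strip()


print(split_pseudo_element("div.post > span.title::text"))
print(split_pseudo_element("div#sidebar table#qinfo p::attr(title)"))
print(split_pseudo_element("div > span > .content"))
```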

XPath Selectors

XPath allows you to select nodes in an XML item in a tree-like fashion using URI-style expressions (see XPath Syntax). While XPath is an older technology and more verbose than CSS selectors, it offers features not available with CSS selectors, such as selecting an element containing a specific value or that has an attribute with a specific value.

  • The following expression returns the value of the content attribute for the meta element (under the head element) that has an attribute property with the og:url value.

    //head/meta[@property="og:url"]/@content

  • The following expression returns the class of the paragraph that contains the sentence.

    //p[text()='This is some content I want to match the element']/@class

An advantage of XPath over CSS is that you can use common XPath functions such as boolean(), count(), contains(), and substring() to evaluate things that are not available using CSS.

  • You want to get a date string from a title attribute in a <strong> element that can only be uniquely identified by the parent element that contains the text question asked.

      <p>question asked:
       <strong title="Dec. 15, 2016, 12:18 p.m.">15 Dec, 12:18</strong>
      </p>
    

    You can take advantage of the contains() function in the following XPath selector to get the attribute value:

    //p[contains(.,'question asked')]/strong/@title

  • You want to extract the number of answers in a Q&A page. Each answer is in a <div class="answer">. You can take advantage of the count() function to get the number of answers in the page:

    count(//div[@class='answer'])

The XPath selector must be compatible with XPath 1.0.
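
You can try the logic of such expressions locally before adding them to a configuration. The following Python sketch uses the standard xml.etree.ElementTree module on a small hypothetical page. This module supports only a subset of XPath 1.0 (no contains() or count(), and no attribute step such as /@content), so the sketch completes those parts in plain Python. It also requires well-formed XML, which real HTML pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
PAGE = """
<html>
  <head>
    <meta property="og:url" content="http://example.com/questions/1"/>
  </head>
  <body>
    <p>question asked:
      <strong title="Dec. 15, 2016, 12:18 p.m.">15 Dec, 12:18</strong>
    </p>
    <div class="answer">First answer</div>
    <div class="answer">Second answer</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Counterpart of //head/meta[@property="og:url"]/@content
og_url = root.find(".//head/meta[@property='og:url']").get("content")

# Counterpart of //p[contains(.,'question asked')]/strong/@title
# ElementTree has no contains(), so the text test is done in Python.
asked = [
    strong.get("title")
    for p in root.iter("p")
    if "question asked" in "".join(p.itertext())
    for strong in p.findall("strong")
]

# Counterpart of counting the div elements that have the class answer.
answer_count = len(root.findall(".//div[@class='answer']"))

print(og_url, asked, answer_count)
```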

Exclusion

You can use the exclude element to remove one or more parts of the page body, meaning their content will not be indexed. However, links in excluded parts will be followed, so you can exclude navigation sections such as a table of contents (TOC), but the source will still discover the pages linked by the TOC.

You want to index Stack Overflow site pages such as this one:

http://stackoverflow.com/questions/11227809/

You want to index only the title, the question, and the answers, so you want to remove the top bar, the header, the top advertisement, the Google ad below the title, and the sidebar on the right. Your exclude element uses selectors and could be:

"exclude": [
  {
    "type": "CSS",
    "path": "div.topbar"
  },
  {
    "type": "CSS",
    "path": "#header"
  },
  {
    "type": "CSS",
    "path": "#herobox"
  },
  {
    "type": "CSS",
    "path": "div.everyonelovesstackoverflow"
  },
  {
    "type": "CSS",
    "path": "#sidebar"
  }
]

Simply copying and pasting this example will return an error. You must integrate the example in the complete web scraping configuration schema (see JSON Configuration Schema).

Excluding sections may affect processing performance because the page is reloaded after the exclusion. The performance hit is likely to be noticeable only when you crawl at full speed and the website responds quickly (>200,000 items/hour).

Metadata

The metadata element defines how to retrieve specific metadata for the current page or sub-item. It is a map of selectors, each key representing the metadata name.

In this section, each selector can be set to return a Boolean instead of values by using the isBoolean property. The selector then returns true when it matches any element on the page, and false otherwise.

When extracting metadata from a sub-item, selectors are evaluated on the sub-item body only by default. They can also be made absolute with the isAbsolute property, in which case they are evaluated from the parent page instead of the current element. Use this property only in sub-item settings, when you want to retrieve metadata that is available only in the parent page.

From the Stack Overflow site pages such as this one:

http://stackoverflow.com/questions/11227809/

You want to set metadata for:

  • The number of votes for the question.

  • The date and time the question was asked.

  • Whether there is at least one answer for the question.

"metadata": {
  "questionVotes": {
    "type": "XPATH",
    "path": "//*[@id='question']//div[@class='vote']/span/text()"
  },
  "questionAskedDate": {
    "type": "CSS",
    "path": "#question div.user-action-time > span::attr(title)"
  },
  "questionHasAnswer": {
    "type": "CSS",
    "path": "div.answer",
    "isBoolean": true
  }
}

  • You must map the metadata to an appropriate field, creating one when none exists, and then rebuild the source (see Adding and Managing Source Mappings).

  • Simply copying and pasting this example will return an error. You must integrate the example in the complete web scraping configuration schema (see JSON Configuration Schema).
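
The effect of the isBoolean and isAbsolute properties can be pictured with the following Python sketch. It is a simplified model built on a hypothetical page, not Coveo's implementation. The function name extract is hypothetical, and the path values are ElementTree paths standing in for the CSS or XPath selectors a real configuration would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical parent page with two answer sub-items.
PAGE = ET.fromstring(
    "<body>"
    "<h1 class='title'>How do I sort a list?</h1>"
    "<div class='answer'><span class='votes'>12</span></div>"
    "<div class='answer'><span class='votes'>3</span></div>"
    "</body>"
)


def extract(selector, sub_item, parent_page):
    """Evaluate one metadata selector for a sub-item."""
    # isAbsolute switches the evaluation scope to the parent page.
    scope = parent_page if selector.get("isAbsolute") else sub_item
    matches = [element.text for element in scope.findall(selector["path"])]
    if selector.get("isBoolean"):
        # isBoolean reports only whether anything matched.
        return len(matches) > 0
    # Default: all matches are kept as a multi-value result.
    return matches


first_answer = PAGE.findall(".//div[@class='answer']")[0]

votes = extract({"path": ".//span[@class='votes']"}, first_answer, PAGE)
title = extract(
    {"path": ".//h1[@class='title']", "isAbsolute": True}, first_answer, PAGE
)
has_title = extract({"path": ".//h1", "isBoolean": True}, first_answer, PAGE)

print(votes, title, has_title)
```

In this model, the title is found only because the selector is absolute: the h1 element is in the parent page, outside the answer sub-item, so the same lookup without isAbsolute finds nothing.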

SubItems

The Sitemap source type does not support the subItems element. Such a configuration will simply be ignored.

The subItems element defines how to retrieve sub-items from a single page, when you want to create several source items from a single web page.

  • A web forum page contains many posts. You want to create a source item for each post in the page to make them individually searchable.

  • On a Q&A site, each page contains a question and several answers. In your Coveo Cloud organization source, you want one item for the question part, and one item for each answer.

The subItems element is a map of selectors with the keys representing the sub-item types.

  • In the subItems section:

    • Define one or more sub-item types.

    • You can choose any sub-item type name, but since this value will be mapped to the @documenttype field, you should enter a value that fits well with the other field values.

    In a Q&A website, a page may contain a question, answers, and comments. You could define answer and comment sub-items, while the main item would contain only the question.

  • When you want to perform web scraping tasks on matching parts of a sub-item type:

    • Add a section starting with a for statement, and use the types parameter to match the sub-item name.

    • Include the desired web scraping tasks (exclude, metadata).

[{
  "for": {
    "urls": [".*"]
  },
  "exclude": [{}],
  "metadata": {},
  "subItems": {
    "mySubItemName": {
      "type": "CSS",
      "path": "css selector"
    }
  }
}, {
  "for": {
    "types": ["mySubItemName"]
  },
  "exclude": [{}],
  "metadata": {
    "meta1": {
      "type": "CSS",
      "path": "css selector"
    },
    "meta2": {
      "type": "CSS",
      "path": "css selector"
    }
  }
}]

  • The full page is still indexed (taking into account applicable web scraping tasks) and becomes a source item.

    A specific Q&A website page contains 5 answers and 3 comments. You defined answer and comment sub-items. For this website page, your Coveo Cloud source will contain 9 items (1 for the page itself, 5 answer sub-items and 3 comment sub-items).

  • The sub-item indexing does not include the CSS, so the Quick View of sub-items shows their content without the formatting.

  • You may want to create a specific mapping rule that only applies to the sub-items. You can do so by creating a specific mapping on the sub-item type (e.g., mySubItemName), as you would for a regular item type (see Adding and Managing Source Mappings).
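
A sub-item type name must be spelled identically where it is declared under subItems and where it is used in a for filter, and a mismatch is easy to miss. The following Python sketch compares the two sets of names. The function name unmatched_sub_item_types is hypothetical, and the sketch assumes that the types values are literal names rather than regular expressions.

```python
def unmatched_sub_item_types(config):
    """Compare types declared in "subItems" with types used in "for"."""
    declared = set()
    filtered = set()
    for setting in config:
        # Keys of the subItems map are the declared sub-item types.
        declared.update(setting.get("subItems", {}))
        # Assumption: each "types" entry is a literal name, not a pattern.
        filtered.update(setting.get("for", {}).get("types", []))
    return {
        "declared_without_settings": sorted(declared - filtered),
        "settings_without_declaration": sorted(filtered - declared),
    }


# Hypothetical configuration with one unused and one undeclared type.
sample = [
    {
        "for": {"urls": [".*"]},
        "subItems": {
            "answer": {"type": "CSS", "path": "div.answer"},
            "comment": {"type": "CSS", "path": "div.comment"},
        },
    },
    {"for": {"types": ["answer"]}, "metadata": {}},
    {"for": {"types": ["reply"]}, "metadata": {}},
]

print(unmatched_sub_item_types(sample))
```

A type listed under declared_without_settings is not necessarily a problem, since dedicated settings for a sub-item type are optional. A type listed under settings_without_declaration usually indicates a typo, because those settings can never match a sub-item.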
