Index an inSided Community

You can index an inSided community using a Web source and some Python indexing pipeline extensions (IPEs) (see Add or edit a Web source and Manage Extensions).

Similar to a community forum, each post in an inSided community is considered a topic, and each topic is classified in a category. If the topic is a question, it has either the Question or the Solved status. A topic can also be an announcement from a community manager.

Once you have made your inSided community searchable, you can implement result folding so that, in your search page, the answers to a topic appear under the topic search result (see About Result Folding).

  1. Create a Web source.

    1. Enter a Domain to use as the starting point of the indexing process.
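
      Example

      Enter https://community.example.com, for instance, if your community is hosted at this address (this placeholder domain is also used in the examples that follow).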

    2. Use exclusion rules to exclude duplicate web pages, unwanted categories, and the community member list.

      Example
      • https://community.example.com/news-and-announcements-*

      • https://community.example.com/off-topic-*

      • https://community.example.com/search?*

      • https://community.example.com/members/*

    3. Similarly, add query parameters to ignore so that other duplicates are excluded.

      Example

      Enter sort to ignore URLs representing alternative ways to sort the community content.

    4. In the Web scraping subtab, click Edit with JSON, and then enter a custom JSON configuration to ignore unwanted web page parts and to index the desired topic metadata. If you intend to implement result folding, define topic comments as subitems.

      Example
        [
          {
            "name": "myconfig",
            "for": {
              "urls": [
                ".*"
              ]
            },
            "exclude": [
              {
                "type": "CSS",
                "path": ".ssi-header"
              },
              {
                "type": "CSS",
                "path": ".qa-main-navigation"
              },
              {
                "type": "CSS",
                "path": ".breadcrumb-container"
              },
              {
                "type": "CSS",
                "path": ".qa-brand-hero"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-stats"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-featured"
              },
              {
                "type": "CSS",
                "path": ".Sidebar"
              },
              {
                "type": "CSS",
                "path": ".Template-footer"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-footer"
              }
            ],
            "metadata": {
              "status": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-thread-status::text"
              },
              "sticky": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-sticky",
                "isBoolean": true
              },
              "category": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text"
              },
              "questiondate": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)"
              },
              "replies": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-replies::text"
              },
              "question": {
                "type": "CSS",
                "path": ".qa-topic-header",
                "isBoolean": true
              }
            },
            "subItems": {
              "reply": {
                "type": "CSS",
                "path": "#comments .qa-topic-post-box"
              }
            }
          },
          {
            "for": {
              "types": [
                "reply"
              ]
            },
            "metadata": {
              "bestanswer": {
                "type": "CSS",
                "path": ".post--bestanswer",
                "isBoolean": true
              },
              "content": {
                "type": "CSS",
                "path": ".qa-topic-post-content::text"
              }
            }
          }
        ]
  2. Map the fields you configured in the Web scraping section.
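
    Example

    Assuming you created fields named after the scraped metadata (such as status, category, and questiondate), you would typically map each field to its corresponding metadata (for example, with a mapping rule such as %[status]).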

  3. Add fields to use in indexing pipeline extensions (IPEs) (see Manage fields).
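
    Example

    The extension examples in the next step expect fields such as foldfoldingfield, foldparentfield, and foldchildfield (used for result folding) and questionyear (used to store the publication year of a topic).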

  4. Add indexing pipeline extensions (see Manage Extensions) to:

    • Add CSS for subitem Quick view in result folding, as by default there’s no Quick view for subitems (see About Result Folding).

      Example
         try:
             # Only modify reply subitems; parent topic items keep their original Quick view.
             if document.uri.find("SubItem:") != -1:
                 # Read the existing body_html data stream, dropping empty lines.
                 extracted_html = [x.strip('\r\n\t') for x in document.get_data_stream('body_html').readlines() if x.strip('\r\n\t')]
                 # Prepend a link to the stylesheet to use for the subitem Quick view.
                 new_html = "<link rel='stylesheet' type='text/css' href='https://mycsslink.css'>"
                 for line in extracted_html:
                     new_html += line
                 # Overwrite the body_html data stream with the modified HTML.
                 html = document.DataStream('body_html')
                 html.write(new_html)
                 document.add_data_stream(html)
         except Exception as e:
             log(str(e))
    • Populate the fields needed to fold answers under topics.

      Example
         import re
         try:
             # Build a common value from the last segment of the item URL.
             clickableuri = document.get_meta_data_value('clickableuri')[0]
             common_field = clickableuri.rsplit('/', 1)[-1]
             # Keep only alphanumeric characters and limit the length of the value.
             common_field = re.sub('[^0-9a-zA-Z]+', '', common_field)[:49]
             if document.uri.find("SubItem:") == -1:
                 # Topic (parent item)
                 document.add_meta_data({'foldfoldingfield': common_field})
                 document.add_meta_data({'foldparentfield': common_field})
             else:
                 # Reply (subitem)
                 document.add_meta_data({'foldfoldingfield': common_field})
                 document.add_meta_data({'foldchildfield': common_field})
         except Exception as e:
             log(str(e))
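
      In this example, the topic and its replies share the same foldfoldingfield value (built from the last segment of the clickableuri), the topic alone gets that value in foldparentfield, and each reply gets it in foldchildfield, which allows the replies to be folded under their parent topic.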
    • Exclude .html pages causing duplicates.

      Example
         import re
         try:
             filename = document.get_meta_data_value("filename")[0]
             # Reject items whose file name matches index*.html, which duplicate other pages.
             if re.search(r"index.*\.html", filename) is not None:
                 document.reject()
         except Exception as e:
             log(str(e))
    • Process information collected on a web page.

      Example

      To get the year in which a topic (question) was published:

         try:
             # Only process topics (parent items), not reply subitems.
             if document.uri.find("SubItem:") == -1:
                 # questiondate holds the time element datetime attribute; its first four characters are the year.
                 date = document.get_meta_data_value("questiondate")[0]
                 document.add_meta_data({'questionyear': date[0:4]})
         except Exception as e:
             log(str(e))