Index an inSided Community

You can index an inSided community using a Web source and some Python indexing pipeline extensions (IPEs) (see Add or Edit a Web Source and Manage Extensions).

As in a community forum, each post in an inSided community is considered a topic, and each topic is classified in a category. If a topic is a question, it has either the Question or the Solved status. A topic can also be an announcement from a community manager.

Once you have made your inSided community searchable, you can implement result folding so that, on your search page, the answers to a topic appear under the topic's search result (see Understanding Result Folding).

  1. Create a Web source as follows (see Add or Edit a Web Source):

    1. Enter a Site URL to use as the indexing process starting point.

    2. In the Content to Include section, use exclusion filters to exclude web page duplicates, unwanted categories, and the community member list, for example:

      • https://community.example.com/news-and-announcements-*

      • https://community.example.com/off-topic-*

      • https://community.example.com/search?*

      • https://community.example.com/members/*

    3. Similarly, add Query parameters to ignore to exclude other duplicates.

      Enter sort to ignore URLs representing alternative ways to sort the community content.
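      Two URLs that differ only by the sort parameter point to the same topic. A minimal sketch (standard library only; the URLs below are hypothetical) of how ignoring that parameter collapses such duplicates:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_ignored_params(url, ignored=("sort",)):
    """Remove ignored query parameters so duplicate URLs normalize to one."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(query)))

a = strip_ignored_params("https://community.example.com/topic/widgets-123?sort=dateline.desc")
b = strip_ignored_params("https://community.example.com/topic/widgets-123")
# Both normalize to the same URL, so the crawler indexes only one copy.
```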

    4. In the Web Scraping section, enter a custom JSON script to ignore unwanted web page parts, index the desired topic metadata, and, if you intend to implement result folding, define topic comments as subitems (see Web Scraping Configuration and Understanding Result Folding).

        [
          {
            "for": {
              "urls": [
                ".*"
              ]
            },
            "exclude": [
              {
                "type": "CSS",
                "path": ".ssi-header"
              },
              {
                "type": "CSS",
                "path": ".qa-main-navigation"
              },
              {
                "type": "CSS",
                "path": ".breadcrumb-container"
              },
              {
                "type": "CSS",
                "path": ".qa-brand-hero"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-stats"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-featured"
              },
              {
                "type": "CSS",
                "path": ".Sidebar"
              },
              {
                "type": "CSS",
                "path": ".Template-footer"
              },
              {
                "type": "CSS",
                "path": ".Template-brand-footer"
              }
            ],
            "metadata": {
              "status": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-thread-status::text"
              },
              "sticky": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-sticky",
                "isBoolean": true
              },
              "category": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text"
              },
              "questiondate": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)"
              },
              "replies": {
                "type": "CSS",
                "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-replies::text"
              },
              "question": {
                "type": "CSS",
                "path": ".qa-topic-header",
                "isBoolean": true
              }
            },
            "subItems": {
              "reply": {
                "type": "CSS",
                "path": "#comments .qa-topic-post-box"
              }
            }
          },
          {
            "for": {
              "types": [
                "reply"
              ]
            },
            "metadata": {
              "bestanswer": {
                "type": "CSS",
                "path": ".post--bestanswer",
                "isBoolean": true
              },
              "content": {
                "type": "CSS",
                "path": ".qa-topic-post-content::text"
              }
            }
          }
        ]
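      Before pasting the script into the source configuration, you can sanity-check its shape locally; a minimal sketch (standard library only; the abbreviated rules below merely mirror the structure of the script above):

```python
import json

# The scraping configuration is a JSON array of rules. Each rule targets pages
# via its "for" clause (URL patterns or subitem types) and may define
# "exclude", "metadata", and "subItems" sections.
config = """
[
  {"for": {"urls": [".*"]},
   "metadata": {"status": {"type": "CSS", "path": ".qa-thread-status::text"}},
   "subItems": {"reply": {"type": "CSS", "path": "#comments .qa-topic-post-box"}}},
  {"for": {"types": ["reply"]},
   "metadata": {"content": {"type": "CSS", "path": ".qa-topic-post-content::text"}}}
]
"""

rules = json.loads(config)
for rule in rules:
    # Every rule needs a "for" matcher with either URL patterns or subitem types.
    assert "urls" in rule["for"] or "types" in rule["for"]
```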
      
  2. Map the fields you configured in the Web Scraping section (see Manage Source Mappings).

  3. Add fields to use in indexing pipeline extensions (IPEs) (see Manage Fields).

  4. Add indexing pipeline extensions (see Manage Extensions) to:

    • Add CSS for subitem quick view in result folding, as by default there’s no quick view for subitems (see Understanding Result Folding)
         try:
             # Only subitems (replies) need this; parent items already have a quick view.
             if (document.uri.find("SubItem:") != -1):
                 # Keep the non-empty lines of the original HTML body.
                 extracted_html = [x.strip('\r\n\t') for x in document.get_data_stream('body_html').readlines() if x.strip('\r\n\t')]
                 # Prepend a stylesheet link (the URL below is a placeholder).
                 new_html = "<link rel='stylesheet' type='text/css' href='https://mycsslink.css'>"
                 for line in extracted_html:
                     new_html += line
                 # Overwrite the body_html data stream with the styled HTML.
                 html = document.DataStream('body_html')
                 html.write(new_html)
                 document.add_data_stream(html)
         except Exception as e:
             log(str(e))
      
    • Populate the fields needed to fold answers under topics
         import re
         try:
             # Derive a common key from the last segment of the clickable URI,
             # stripped of non-alphanumeric characters and truncated to 49 characters.
             clickableuri = document.get_meta_data_value('clickableuri')[0]
             common_field = clickableuri.rsplit('/', 1)[-1]
             common_field = re.sub('[^0-9a-zA-Z]+', '', common_field)[:49]
             if (document.uri.find("SubItem:") == -1):
                 # Topic (parent) item.
                 document.add_meta_data({'foldfoldingfield': common_field})
                 document.add_meta_data({'foldparentfield': common_field})
             else:
                 # Reply (child) item.
                 document.add_meta_data({'foldfoldingfield': common_field})
                 document.add_meta_data({'foldchildfield': common_field})
         except Exception as e:
             log(str(e))
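      The folding key is simply the last URL segment with non-alphanumeric characters stripped; for example (the URI below is hypothetical):

```python
import re

def folding_key(clickable_uri):
    """Derive the common folding field value, as in the extension above."""
    last_segment = clickable_uri.rsplit('/', 1)[-1]
    return re.sub('[^0-9a-zA-Z]+', '', last_segment)[:49]

key = folding_key("https://community.example.com/widget-setup-question-123")
# The topic and each of its replies share this key, so replies fold under the topic.
```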
      
    • Exclude .html pages causing duplicates
         import re
         try:
             # Reject index pages (e.g., paginated listings), which duplicate
             # content that is reachable elsewhere.
             filename = document.get_meta_data_value("filename")[0]
             if (re.search(r"index.*\.html", filename) is not None):
                 document.reject()
         except Exception as e:
             log(str(e))
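      The rejection pattern matches any filename containing index followed by .html; a quick check of its behavior on hypothetical filenames:

```python
import re

pattern = r"index.*\.html"

# Listing pages such as index.html and index-2.html match and would be
# rejected; a regular topic page does not match and is kept.
rejected = [f for f in ("index.html", "index-2.html", "topic-123.html")
            if re.search(pattern, f)]
```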
      
    • Process information collected on a web page

      To get the year in which a topic (question) was published:

         try:
             # Only topics (parent items) carry the questiondate metadata.
             if (document.uri.find("SubItem:") == -1):
                 date = document.get_meta_data_value("questiondate")[0]
                 # The datetime value starts with the year (first four characters).
                 document.add_meta_data({'questionyear': date[0:4]})
         except Exception as e:
             log(str(e))
      