Index an inSided Community

You can index an inSided community using a Web source and some Python indexing pipeline extensions (IPEs) (see Manage Extensions).

As in a community forum, each post in an inSided community is considered a topic, and each topic is classified in a category. If the topic is a question, its status is either Question or Solved. A topic can also be an announcement from a community manager.

Once you have made your inSided community searchable, you can implement result folding so that, in your search page, the answers to a topic appear under the topic search result (see About Result Folding).

Add a Web source.

Enter a Domain to use as the indexing process starting point.
Use exclusion rules to exclude web page duplicates, unwanted categories, and/or the community member list.
Example
- https://community.example.com/news-and-announcements-*
- https://community.example.com/off-topic-*
- https://community.example.com/search?*
- https://community.example.com/members/*
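To preview which URLs a set of exclusion rules would drop, the wildcard patterns can be approximated offline. The following is a minimal sketch, assuming that `*` matches any run of characters and that every other character is literal; the actual matching semantics of the Web source may differ, and `is_excluded` is a hypothetical helper, not part of the product.

```python
import re

# Exclusion patterns copied from the example above.
EXCLUSION_PATTERNS = [
    "https://community.example.com/news-and-announcements-*",
    "https://community.example.com/off-topic-*",
    "https://community.example.com/search?*",
    "https://community.example.com/members/*",
]


def wildcard_to_regex(pattern):
    """Escape the whole pattern, then turn each escaped "*" into "match anything"."""
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


def is_excluded(url, patterns=EXCLUSION_PATTERNS):
    """Return True when the URL fully matches at least one exclusion pattern."""
    return any(wildcard_to_regex(p).fullmatch(url) for p in patterns)


if __name__ == "__main__":
    for url in (
        "https://community.example.com/members/1234",
        "https://community.example.com/product-questions-5/how-do-i-fold-results-42",
    ):
        print(url, "->", "excluded" if is_excluded(url) else "kept")
```

The second sample URL is made up for illustration; substitute URLs from your own community to check your rules.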
Similarly, add query parameters to ignore to exclude other duplicates.

Example

Enter sort to ignore URLs representing alternative ways to sort the community content.
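To illustrate why ignoring `sort` removes duplicates, the sketch below normalizes URLs by dropping that parameter, so that differently sorted views of the same page collapse to one address. It assumes the crawler treats URLs that differ only by an ignored parameter as the same page; `drop_ignored_params` is a hypothetical helper written for this illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_ignored_params(url, ignored=("sort",)):
    """Return the URL without the query parameters listed in `ignored`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    # Rebuild the URL; an empty query string leaves no trailing "?".
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Hypothetical category page requested with two different sort orders.
    a = drop_ignored_params("https://community.example.com/product-questions-5?sort=newest")
    b = drop_ignored_params("https://community.example.com/product-questions-5?sort=oldest")
    print(a == b)
```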

In the Web scraping subtab, click Edit with JSON, and then enter a custom JSON configuration to ignore unwanted web page parts and to index the desired topic metadata. If you intend to implement result folding, define topic comments as subitems.

Example

  [
    {
      "name": "myconfig",
      "for": {
        "urls": [
          ".*"
        ]
      },
      "exclude": [
        {
          "type": "CSS",
          "path": ".ssi-header"
        },
        {
          "type": "CSS",
          "path": ".qa-main-navigation"
        },
        {
          "type": "CSS",
          "path": ".breadcrumb-container"
        },
        {
          "type": "CSS",
          "path": ".qa-brand-hero"
        },
        {
          "type": "CSS",
          "path": ".Template-brand-stats"
        },
        {
          "type": "CSS",
          "path": ".Template-brand-featured"
        },
        {
          "type": "CSS",
          "path": ".Sidebar"
        },
        {
          "type": "CSS",
          "path": ".Template-footer"
        },
        {
          "type": "CSS",
          "path": ".Template-brand-footer"
        }
      ],
      "metadata": {
        "status": {
          "type": "CSS",
          "path": ".qa-topic-header > .qa-thread-status::text"
        },
        "sticky": {
          "type": "CSS",
          "path": ".qa-topic-header > .qa-topic-sticky",
          "isBoolean": true
        },
        "category": {
          "type": "CSS",
          "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text"
        },
        "questiondate": {
          "type": "CSS",
          "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)"
        },
        "replies": {
          "type": "CSS",
          "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-replies::text"
        },
        "question": {
          "type": "CSS",
          "path": ".qa-topic-header",
          "isBoolean": true
        }
      },
      "subItems": {
        "reply": {
          "type": "CSS",
          "path": "#comments .qa-topic-post-box"
        }
      }
    },
    {
      "for": {
        "types": [
          "reply"
        ]
      },
      "metadata": {
        "bestanswer": {
          "type": "CSS",
          "path": ".post--bestanswer",
          "isBoolean": true
        },
        "content": {
          "type": "CSS",
          "path": ".qa-topic-post-content::text"
        }
      }
    }
  ]

Map the fields you configured in the Web scraping section.
Add the fields to use in your IPEs (see Manage Fields).
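When mapping fields, it can help to list every metadata name that the web scraping configuration produces. The following is a rough sketch that reads a configuration shaped like the example above and collects those names; `SAMPLE_CONFIG` is a trimmed-down stand-in for the full configuration, and `metadata_names` is a hypothetical helper.

```python
import json

# Trimmed-down copy of the scraping configuration, kept short for illustration.
SAMPLE_CONFIG = """
[
  {
    "name": "myconfig",
    "for": {"urls": [".*"]},
    "metadata": {
      "status": {"type": "CSS", "path": ".qa-topic-header > .qa-thread-status::text"},
      "questiondate": {"type": "CSS", "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)"}
    },
    "subItems": {"reply": {"type": "CSS", "path": "#comments .qa-topic-post-box"}}
  },
  {
    "for": {"types": ["reply"]},
    "metadata": {
      "bestanswer": {"type": "CSS", "path": ".post--bestanswer", "isBoolean": true}
    }
  }
]
"""


def metadata_names(config_json):
    """Collect the metadata names of every rule, in order, without duplicates."""
    names = []
    for rule in json.loads(config_json):
        for name in rule.get("metadata", {}):
            if name not in names:
                names.append(name)
    return names


if __name__ == "__main__":
    print(metadata_names(SAMPLE_CONFIG))
```

Each name returned is a candidate for a field mapping, alongside the fields that the extensions populate themselves.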

Add indexing pipeline extensions to (see Manage Extensions):

Add CSS for subitem Quick view in result folding, as by default there’s no Quick view for subitems (see About Result Folding).

Example

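   try:
       # Only subitems (replies) need the stylesheet; their URIs contain "SubItem:".
       if (document.uri.find("SubItem:") != -1):
           # Read the extracted HTML, dropping empty lines and surrounding line breaks and tabs.
           extracted_html = [x.strip('\r\n\t') for x in document.get_data_stream('body_html').readlines() if x.strip('\r\n\t')]
           # Replace the href value with the URL of your own stylesheet.
           new_html = "<link rel='stylesheet' type='text/css' href='https://mycsslink.css'>"
           for line in extracted_html:
               new_html += line
           # Write the modified HTML to a new body_html data stream.
           html = document.DataStream('body_html')
           html.write(new_html)
           document.add_data_stream(html)
   except Exception as e:
       log(str(e))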
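The string handling in this extension can be tried outside the indexing pipeline. The sketch below reproduces only that part as a standalone function; `prepend_stylesheet` and the sample stylesheet URL are hypothetical, and the `document` object that the pipeline provides is deliberately left out.

```python
def prepend_stylesheet(html_lines, css_href):
    """Join the non-empty HTML lines and put a stylesheet link in front of them."""
    stripped = [line.strip("\r\n\t") for line in html_lines]
    kept = [line for line in stripped if line]
    link = "<link rel='stylesheet' type='text/css' href='{}'>".format(css_href)
    return link + "".join(kept)


if __name__ == "__main__":
    sample = ["<p>First reply line</p>\r\n", "\n", "\t<p>Second reply line</p>\n"]
    print(prepend_stylesheet(sample, "https://example.com/quickview.css"))
```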

Populate the fields needed to fold answers under topics.

Example

   import re
   try:
       # Build a key shared by a topic and its replies from the last URL segment.
       clickableuri = document.get_meta_data_value('clickableuri')[0]
       common_field = clickableuri.rsplit('/', 1)[-1]
       # Keep alphanumeric characters only, up to 49 characters.
       common_field = re.sub('[^0-9a-zA-Z]+', '', common_field)[:49]
       if (document.uri.find("SubItem:") == -1):
           # Topic page: this item is the parent.
           document.add_meta_data({'foldfoldingfield': common_field})
           document.add_meta_data({'foldparentfield': common_field})
       else:
           # Reply (subitem): this item is a child of the topic.
           document.add_meta_data({'foldfoldingfield': common_field})
           document.add_meta_data({'foldchildfield': common_field})
   except Exception as e:
       log(str(e))
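To check what folding key a given topic URL yields, the key derivation can be isolated in a small function. This sketch mirrors the string operations of the extension above; `folding_key` and the sample URL are hypothetical, and the extension appears to rely on replies carrying the same `clickableuri` as their parent topic.

```python
import re


def folding_key(clickable_uri):
    """Derive the shared folding value from the last segment of a topic URL."""
    last_segment = clickable_uri.rsplit("/", 1)[-1]
    # Remove every non-alphanumeric character, then cap the length at 49.
    return re.sub(r"[^0-9a-zA-Z]+", "", last_segment)[:49]


if __name__ == "__main__":
    print(folding_key(
        "https://community.example.com/product-questions-5/how-do-i-fold-results-42"
    ))
```

Note that a URL ending with a slash produces an empty key with this logic, so it may be worth confirming that your topic URLs have no trailing slash.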

Exclude .html pages causing duplicates.

Example

   import re
   try:
       filename = document.get_meta_data_value("filename")[0]
       # Reject pages whose file name contains "index" followed later by ".html".
       if (re.search(r"index.*\.html", filename) is not None):
           document.reject()
   except Exception as e:
       log(str(e))
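Before enabling this extension, you may want to confirm which file names the pattern rejects. The following is a small sketch that applies the same regular expression to sample names; `is_duplicate_page` and the file names are made up for illustration.

```python
import re

# Same pattern as in the extension above.
INDEX_PAGE = re.compile(r"index.*\.html")


def is_duplicate_page(filename):
    """Return True when the extension above would reject this file name."""
    return INDEX_PAGE.search(filename) is not None


if __name__ == "__main__":
    for name in ("index.html", "index2.html", "how-do-i-fold-results-42"):
        print(name, "->", "rejected" if is_duplicate_page(name) else "kept")
```

Because `re.search` is not anchored, a name such as `reindex-tips.html` would also be rejected; anchor the pattern if that is too broad for your community.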

Process information collected on a web page.

Example

To get the year in which a topic (question) was published:

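   try:
       # Only topics (not replies) carry the questiondate metadata.
       if (document.uri.find("SubItem:") == -1):
           date = document.get_meta_data_value("questiondate")[0]
           # The first four characters of the datetime value are the year.
           document.add_meta_data({'questionyear': date[0:4]})
   except Exception as e:
       log(str(e))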
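The year extraction relies on the scraped value starting with a four-digit year. The sketch below makes that assumption explicit and fails loudly when it does not hold; `question_year` is a hypothetical helper, and the ISO 8601 sample value is an assumption about what the `datetime` attribute contains.

```python
def question_year(scraped_datetime):
    """Return the leading four-digit year of a scraped datetime string."""
    year = scraped_datetime[0:4]
    if len(year) != 4 or not year.isdigit():
        raise ValueError("value does not start with a four-digit year: %r" % scraped_datetime)
    return year


if __name__ == "__main__":
    print(question_year("2019-06-03T14:25:00+00:00"))
```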
