Index an InSided Community
You can index an inSided community using a Web source and some Python indexing pipeline extensions (IPEs) (see Add or edit a Web source and Manage Extensions).
As in a community forum, each post in an inSided community is considered a topic, and each topic is classified in a category.
If it’s a question, the topic has either the Question or the Solved status.
A topic can also be an announcement from a community manager.
Once you have made your inSided community searchable, you can implement result folding so that, in your search page, the answers to a topic appear under the topic search result (see About Result Folding).
- Enter a Domain to use as the indexing process starting point.
- Use exclusion rules to exclude web page duplicates, unwanted categories, and/or the community member list.
  Example:
  - https://community.example.com/news-and-announcements-*
  - https://community.example.com/off-topic-*
  - https://community.example.com/search?*
  - https://community.example.com/members/*
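Before saving exclusion rules like the ones above, it can help to check them against a few known URLs. The following is a minimal local sketch, not the crawler's own matching logic: it assumes `*` is the only wildcard and that every other character, including `?`, is matched literally. The helper names and sample URLs are hypothetical.

```python
import re

# Patterns copied from the exclusion rule examples above.
EXCLUSION_PATTERNS = [
    "https://community.example.com/news-and-announcements-*",
    "https://community.example.com/off-topic-*",
    "https://community.example.com/search?*",
    "https://community.example.com/members/*",
]


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a compiled, fully anchored regex.

    Assumption: `*` stands for any run of characters and is the only
    wildcard; everything else is escaped and matched literally.
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile("^" + ".*".join(parts) + "$")


def is_excluded(url, patterns=EXCLUSION_PATTERNS):
    """Return True when the URL matches at least one exclusion pattern."""
    return any(wildcard_to_regex(p).match(url) for p in patterns)


# Hypothetical URLs, used only to exercise the patterns.
sample_urls = [
    "https://community.example.com/off-topic-12/weekend-plans-345",
    "https://community.example.com/members/jane-doe-42",
    "https://community.example.com/product-help-7/how-do-i-reset-my-password-99",
]
for url in sample_urls:
    print(url, "->", "excluded" if is_excluded(url) else "indexed")
```

Only the last sample URL should be reported as indexed; the first two fall under the off-topic and member list rules.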
- Similarly, add query parameters to ignore to exclude other duplicates.
  Example: Enter sort to ignore URLs representing alternative ways to sort the community content.
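To see why ignoring a query parameter removes duplicates, consider how differently sorted views of the same page collapse to one URL once the parameter is dropped. This sketch only illustrates the effect with the Python standard library; it does not reproduce how the source normalizes URLs internally, and the sample URLs and sort values are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter name taken from the example above.
IGNORED_PARAMETERS = {"sort"}


def strip_ignored_parameters(url, ignored=IGNORED_PARAMETERS):
    """Return the URL without the query parameters listed in `ignored`."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))


# Hypothetical sorted views of the same category page.
views = [
    "https://community.example.com/product-help-7?sort=newest",
    "https://community.example.com/product-help-7?sort=oldest",
]
print({strip_ignored_parameters(view) for view in views})
```

Both views reduce to the same address, so only one copy of the page would need to be indexed.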
- In the Web scraping section, enter a custom JSON script to ignore unwanted web page parts, index the desired topic metadata, and, if you intend to implement result folding, define topic comments as subitems (see About Result Folding).
  Example:

      [
        {
          "name": "myconfig",
          "for": { "urls": [ ".*" ] },
          "exclude": [
            { "type": "CSS", "path": ".ssi-header" },
            { "type": "CSS", "path": ".qa-main-navigation" },
            { "type": "CSS", "path": ".breadcrumb-container" },
            { "type": "CSS", "path": ".qa-brand-hero" },
            { "type": "CSS", "path": ".Template-brand-stats" },
            { "type": "CSS", "path": ".Template-brand-featured" },
            { "type": "CSS", "path": ".Sidebar" },
            { "type": "CSS", "path": ".Template-footer" },
            { "type": "CSS", "path": ".Template-brand-footer" }
          ],
          "metadata": {
            "status": { "type": "CSS", "path": ".qa-topic-header > .qa-thread-status::text" },
            "sticky": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-sticky", "isBoolean": true },
            "category": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text" },
            "questiondate": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)" },
            "replies": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-replies::text" },
            "question": { "type": "CSS", "path": ".qa-topic-header", "isBoolean": true }
          },
          "subItems": {
            "reply": { "type": "CSS", "path": "#comments .qa-topic-post-box" }
          }
        },
        {
          "for": { "types": [ "reply" ] },
          "metadata": {
            "bestanswer": { "type": "CSS", "path": ".post--bestanswer", "isBoolean": true },
            "content": { "type": "CSS", "path": ".qa-topic-post-content::text" }
          }
        }
      ]
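A web scraping configuration is easy to break with a stray comma, and every metadata name it declares needs a matching field mapping afterward. As a rough local check, the sketch below parses an abridged copy of the example configuration and lists the metadata names it declares. The `metadata_names` helper is hypothetical and is not part of any Coveo tooling.

```python
import json

# Abridged copy of the example configuration (exclusions and some
# metadata entries omitted to keep the sketch short).
WEB_SCRAPING_CONFIG = """
[
  {
    "name": "myconfig",
    "for": { "urls": [ ".*" ] },
    "metadata": {
      "status": { "type": "CSS", "path": ".qa-topic-header > .qa-thread-status::text" },
      "questiondate": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)" }
    },
    "subItems": {
      "reply": { "type": "CSS", "path": "#comments .qa-topic-post-box" }
    }
  },
  {
    "for": { "types": [ "reply" ] },
    "metadata": {
      "bestanswer": { "type": "CSS", "path": ".post--bestanswer", "isBoolean": true }
    }
  }
]
"""


def metadata_names(config_text):
    """Return the sorted metadata names declared across all config entries."""
    config = json.loads(config_text)  # raises ValueError on malformed JSON
    names = set()
    for entry in config:
        names.update(entry.get("metadata", {}))
    return sorted(names)


print(metadata_names(WEB_SCRAPING_CONFIG))
```

Running the same helper on the full configuration gives the complete list of names to map.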
- Map the fields you configured in the Web scraping section (see Manage source mappings).
- Add fields to use in indexing pipeline extensions (IPEs) (see Manage fields).
- Add indexing pipeline extensions (see Manage Extensions) to:
  - Add CSS for subitem quick view in result folding, as by default there’s no quick view for subitems (see About Result Folding).
    Example:

        try:
            # Only subitems (topic replies) need the stylesheet.
            if document.uri.find("SubItem:") != -1:
                # Read the HTML body, dropping blank lines and surrounding CR/LF/tab characters.
                extracted_html = [x.strip('\r\n\t') for x in document.get_data_stream('body_html').readlines() if x.strip('\r\n\t')]
                # Placeholder URL: replace it with the address of your own stylesheet.
                new_html = "<link rel='stylesheet' type='text/css' href='https://mycsslink.css'>"
                for line in extracted_html:
                    new_html += line
                # Write the modified HTML to a new body_html data stream.
                html = document.DataStream('body_html')
                html.write(new_html)
                document.add_data_stream(html)
        except Exception as e:
            log(str(e))
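The string handling in this extension can be tried outside the indexing pipeline, which makes it easier to confirm what the quick view HTML will look like. The sketch below isolates that logic in a plain function; `prepend_stylesheet` and the stylesheet URL are hypothetical, and the `document` object available inside an extension is deliberately left out.

```python
# Hypothetical stylesheet URL, for illustration only.
STYLESHEET_LINK = (
    "<link rel='stylesheet' type='text/css' "
    "href='https://example.com/subitem.css'>"
)


def prepend_stylesheet(html_lines, link=STYLESHEET_LINK):
    """Drop blank lines, strip CR/LF/tab characters from each line's ends,
    and return the remaining markup with the stylesheet link in front."""
    kept = [line.strip("\r\n\t") for line in html_lines if line.strip("\r\n\t")]
    return link + "".join(kept)


# Hypothetical subitem body, split into lines as readlines() would return it.
body = ["<div class='reply'>\r\n", "\r\n", "\t<p>Try clearing the cache.</p>\n", "</div>\n"]
print(prepend_stylesheet(body))
```

Note that the kept lines are joined without separators, so the resulting HTML is a single line.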
  - Populate the fields needed to fold answers under topics.
    Example:

        import re

        try:
            clickableuri = document.get_meta_data_value('clickableuri')[0]
            # Use the last URL segment, reduced to at most 49 alphanumeric characters,
            # as the value shared by a topic and its replies.
            common_field = clickableuri.rsplit('/', 1)[-1]
            common_field = re.sub('[^0-9a-zA-Z]+', '', common_field)[:49]
            if document.uri.find("SubItem:") == -1:
                # Topic: acts as the parent in result folding.
                document.add_meta_data({'foldfoldingfield': common_field})
                document.add_meta_data({'foldparentfield': common_field})
            else:
                # Reply (subitem): acts as a child in result folding.
                document.add_meta_data({'foldfoldingfield': common_field})
                document.add_meta_data({'foldchildfield': common_field})
        except Exception as e:
            log(str(e))
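To make the folding key concrete, the sketch below applies the same string operations to sample values outside the pipeline. The function names and URLs are hypothetical. In particular, the exact shape of a subitem URI is an assumption; the extension above only relies on it containing `SubItem:`.

```python
import re


def folding_key(clickable_uri):
    """Last URL segment, reduced to at most 49 alphanumeric characters."""
    last_segment = clickable_uri.rsplit("/", 1)[-1]
    return re.sub("[^0-9a-zA-Z]+", "", last_segment)[:49]


def folding_fields(uri, clickable_uri):
    """Return the metadata a topic or a reply would receive for folding."""
    key = folding_key(clickable_uri)
    role = "foldchildfield" if "SubItem:" in uri else "foldparentfield"
    return {"foldfoldingfield": key, role: key}


# Hypothetical topic URL; a reply is assumed to share its clickable URI.
topic_url = "https://community.example.com/product-help-7/how-do-i-reset-my-password-99"
print(folding_fields(topic_url, topic_url))
print(folding_fields(topic_url + "#SubItem:1", topic_url))
```

Both calls produce the same `foldfoldingfield` value, which is what lets replies fold under their topic. One thing to watch for: a clickable URI ending with `/` has an empty last segment, so the key would be an empty string.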
  - Exclude .html pages causing duplicates.
    Example:

        import re

        try:
            filename = document.get_meta_data_value("filename")[0]
            # Reject items whose file name contains "index" followed by ".html".
            if re.search(r"index.*\.html", filename) is not None:
                document.reject()
        except Exception as e:
            log(str(e))
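Because a rejected item never reaches the index, it is worth checking which file names the pattern actually catches before deploying the extension. A minimal sketch, using hypothetical file names:

```python
import re

# Same pattern as in the extension above.
INDEX_PAGE = re.compile(r"index.*\.html")


def is_duplicate_index_page(filename):
    """Return True when the file name would be rejected by the extension."""
    return INDEX_PAGE.search(filename) is not None


# Hypothetical file names, used only to exercise the pattern.
for name in ["index.html", "index2.html", "how-do-i-reset-my-password-99", "reindexing-tips.html"]:
    print(name, "->", "rejected" if is_duplicate_index_page(name) else "kept")
```

Since `re.search` matches anywhere in the string, a name such as `reindexing-tips.html` is rejected too. If that is too broad for your community, anchoring the pattern (for example with `^index`) is one option to consider.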
  - Process information collected on a web page.
    Example: To get the year in which a topic (question) was published:

        try:
            # Only topics carry the questiondate metadata; skip replies (subitems).
            if document.uri.find("SubItem:") == -1:
                date = document.get_meta_data_value("questiondate")[0]
                # Keep the first four characters of the date string as the year.
                document.add_meta_data({'questionyear': date[0:4]})
        except Exception as e:
            log(str(e))
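Slicing the first four characters works as long as the scraped value starts with a four-digit year. The sketch below is a slightly more defensive variant that can be tried locally. It assumes the `datetime` attribute holds an ISO 8601 style value beginning with `YYYY-MM-DD`, which the source does not confirm, and the `question_year` helper is hypothetical.

```python
from datetime import datetime


def question_year(questiondate):
    """Return the year as a string, or None when the value doesn't
    start with a valid YYYY-MM-DD date."""
    try:
        return str(datetime.strptime(questiondate[:10], "%Y-%m-%d").year)
    except (TypeError, ValueError):
        return None


# Hypothetical scraped values.
print(question_year("2021-03-15T09:30:00+00:00"))
print(question_year("yesterday"))
```

Returning `None` for unexpected values lets the caller skip the `questionyear` metadata rather than index a meaningless four-character string.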