Index an InSided Community
You can index an inSided community using a Web source and some Python indexing pipeline extensions (IPEs) (see Manage Extensions).
As in a community forum, each post in an inSided community is considered a topic, and each topic is classified in a category.
If it’s a question, the topic has either the Question or the Solved status.
A topic can also be an announcement from a community manager.
Once you have made your inSided community searchable, you can implement result folding so that, in your search page, the answers to a topic appear under the topic search result (see About Result Folding).
-
-
Enter a Domain to use as the indexing process starting point.
-
Use exclusion rules to exclude web page duplicates, unwanted categories, and/or the community member list.
Example
- https://community.example.com/news-and-announcements-*
- https://community.example.com/off-topic-*
- https://community.example.com/search?*
- https://community.example.com/members/*
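To check a set of exclusion rules before saving the source, the wildcard matching can be approximated locally. The following is a minimal sketch, assuming that `*` matches any run of characters and that every other character is literal. The helper names (`wildcard_to_regex`, `is_excluded`) and the sample topic URL are hypothetical, and the matching actually performed by the Web source may differ.

```python
import re

# Exclusion rules copied from the example above.
EXCLUSION_RULES = [
    "https://community.example.com/news-and-announcements-*",
    "https://community.example.com/off-topic-*",
    "https://community.example.com/search?*",
    "https://community.example.com/members/*",
]


def wildcard_to_regex(rule):
    """Translate a rule to a regex: '*' becomes '.*', the rest is literal (assumed semantics)."""
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))


def is_excluded(url, rules=EXCLUSION_RULES):
    """Return True when the whole URL matches at least one exclusion rule."""
    return any(wildcard_to_regex(rule).fullmatch(url) for rule in rules)


# Hypothetical URLs used only to exercise the rules.
print(is_excluded("https://community.example.com/members/1234"))
print(is_excluded("https://community.example.com/search?q=folding"))
print(is_excluded("https://community.example.com/product-questions-5/reset-password-42"))
```

Under these assumptions, the member page and the search page are excluded, while the topic page is kept.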
-
-
Similarly, add query parameters to ignore to exclude other duplicates.
Example
Enter sort to ignore URLs representing alternative ways to sort the community content.
-
In the Web scraping subtab, click Edit with JSON, and then enter a custom JSON configuration to ignore unwanted web page parts and to index the desired topic metadata. If you intend to implement result folding, define topic comments as subitems.
Example
    [
      {
        "name": "myconfig",
        "for": { "urls": [ ".*" ] },
        "exclude": [
          { "type": "CSS", "path": ".ssi-header" },
          { "type": "CSS", "path": ".qa-main-navigation" },
          { "type": "CSS", "path": ".breadcrumb-container" },
          { "type": "CSS", "path": ".qa-brand-hero" },
          { "type": "CSS", "path": ".Template-brand-stats" },
          { "type": "CSS", "path": ".Template-brand-featured" },
          { "type": "CSS", "path": ".Sidebar" },
          { "type": "CSS", "path": ".Template-footer" },
          { "type": "CSS", "path": ".Template-brand-footer" }
        ],
        "metadata": {
          "status": { "type": "CSS", "path": ".qa-topic-header > .qa-thread-status::text" },
          "sticky": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-sticky", "isBoolean": true },
          "category": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text" },
          "questiondate": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta time::attr(datetime)" },
          "replies": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-replies::text" },
          "question": { "type": "CSS", "path": ".qa-topic-header", "isBoolean": true }
        },
        "subItems": {
          "reply": { "type": "CSS", "path": "#comments .qa-topic-post-box" }
        }
      },
      {
        "for": { "types": [ "reply" ] },
        "metadata": {
          "bestanswer": { "type": "CSS", "path": ".post--bestanswer", "isBoolean": true },
          "content": { "type": "CSS", "path": ".qa-topic-post-content::text" }
        }
      }
    ]
-
-
Map the fields you configured in the Web scraping section.
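Before creating mappings, it can help to list every metadata name that the web scraping configuration produces, since each of them needs a field to land in. The following is a minimal sketch that parses a shortened copy of the configuration with the standard `json` module; the `metadata_names` helper is hypothetical and only reads the JSON, it doesn't call any Coveo API.

```python
import json

# Shortened copy of the web scraping configuration (only a few metadata kept).
SCRAPING_CONFIG = """
[
  {
    "name": "myconfig",
    "for": { "urls": [ ".*" ] },
    "metadata": {
      "status": { "type": "CSS", "path": ".qa-topic-header > .qa-thread-status::text" },
      "category": { "type": "CSS", "path": ".qa-topic-header > .qa-topic-meta .qa-link-to-forum::text" }
    },
    "subItems": { "reply": { "type": "CSS", "path": "#comments .qa-topic-post-box" } }
  },
  {
    "for": { "types": [ "reply" ] },
    "metadata": {
      "bestanswer": { "type": "CSS", "path": ".post--bestanswer", "isBoolean": true }
    }
  }
]
"""


def metadata_names(config_text):
    """Collect the metadata names declared by every rule, in order of appearance."""
    names = []
    for rule in json.loads(config_text):
        for name in rule.get("metadata", {}):
            if name not in names:
                names.append(name)
    return names


print(metadata_names(SCRAPING_CONFIG))  # ['status', 'category', 'bestanswer']
```

Running the same helper on the full configuration gives the complete list of metadata to map.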
-
Add the fields to use in your IPEs (see Manage fields).
-
Add indexing pipeline extensions (see Manage Extensions) to:
-
Add CSS for subitem Quick view in result folding, as by default there’s no Quick view for subitems (see About Result Folding).
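Example
    try:
        # Only subitems (replies) need the stylesheet link.
        if (document.uri.find("SubItem:") != -1):
            # Read the HTML body, dropping empty lines and surrounding whitespace.
            extracted_html = [x.strip('\r\n\t') for x in document.get_data_stream('body_html').readlines() if x.strip('\r\n\t')]
            # Start the new body with the stylesheet link, then append the original HTML.
            new_html = "<link rel='stylesheet' type='text/css' href='https://mycsslink.css'>"
            for line in extracted_html:
                new_html += line
            # Replace the body_html data stream with the new content.
            html = document.DataStream('body_html')
            html.write(new_html)
            document.add_data_stream(html)
    except Exception as e:
        log(str(e))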
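The `document` object is only available inside the indexing pipeline, so the string handling of this extension can't be run as is on a workstation. The following is a minimal sketch that isolates that logic in a plain function to try it on sample HTML; the `inject_stylesheet` name, the sample lines, and the stylesheet URL are hypothetical.

```python
def inject_stylesheet(html_lines, href):
    """Prepend a stylesheet link to the HTML lines, skipping lines that are empty once stripped."""
    link = "<link rel='stylesheet' type='text/css' href='{}'>".format(href)
    kept = [line.strip("\r\n\t") for line in html_lines if line.strip("\r\n\t")]
    return link + "".join(kept)


# Hypothetical reply body, as it might be read line by line from the body_html stream.
sample_lines = ["<div class='post'>\r\n", "\t\n", "<p>Answer text</p>\n"]
print(inject_stylesheet(sample_lines, "https://example.com/quickview.css"))
```

The printed string starts with the `<link>` element, followed by the reply HTML without its blank lines.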
-
Populate the fields needed to fold answers under topics.
Example
    import re

    try:
        # Build a common key from the last URL segment, keeping only alphanumeric characters.
        clickableuri = document.get_meta_data_value('clickableuri')[0]
        common_field = clickableuri.rsplit('/', 1)[-1]
        common_field = re.sub('[^0-9a-zA-Z]+', '', common_field)[:49]
        if (document.uri.find("SubItem:") == -1):
            # Topic: populate the folding field and the parent field.
            document.add_meta_data({'foldfoldingfield': common_field})
            document.add_meta_data({'foldparentfield': common_field})
        else:
            # Reply (subitem): populate the folding field and the child field.
            document.add_meta_data({'foldfoldingfield': common_field})
            document.add_meta_data({'foldchildfield': common_field})
    except Exception as e:
        log(str(e))
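To see which value ends up in the folding fields, the key computation can be reproduced outside the pipeline. The following is a minimal sketch, assuming that a reply subitem carries the same `clickableuri` as its topic; the `folding_key` and `folding_metadata` helpers and the sample URL are hypothetical.

```python
import re


def folding_key(clickableuri):
    """Keep the last URL segment, strip non-alphanumeric characters, cap at 49 characters."""
    last_segment = clickableuri.rsplit("/", 1)[-1]
    return re.sub("[^0-9a-zA-Z]+", "", last_segment)[:49]


def folding_metadata(clickableuri, is_subitem):
    """Return the metadata the extension would add for a topic or for a reply."""
    key = folding_key(clickableuri)
    relation_field = "foldchildfield" if is_subitem else "foldparentfield"
    return {"foldfoldingfield": key, relation_field: key}


# Hypothetical topic URL.
url = "https://community.example.com/product-questions-5/how-do-i-reset-my-password-42"
print(folding_metadata(url, is_subitem=False))
print(folding_metadata(url, is_subitem=True))
```

Both calls yield the key `howdoiresetmypassword42`, which is what lets the replies fold under their topic.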
-
Exclude .html pages causing duplicates.
Example
    import re

    try:
        filename = document.get_meta_data_value("filename")[0]
        # Reject any item whose file name contains "index" followed by ".html".
        if (re.search(r"index.*\.html", filename) is not None):
            document.reject()
    except Exception as e:
        log(str(e))
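Because `re.search` looks for the pattern anywhere in the file name, it's worth checking which names the rule actually rejects before deploying the extension. The following is a minimal sketch that applies the same pattern to a few hypothetical file names; the `is_duplicate_index_page` helper is not part of any Coveo API.

```python
import re

# Same pattern as in the extension above.
INDEX_PAGE = re.compile(r"index.*\.html")


def is_duplicate_index_page(filename):
    """Return True when the extension would reject an item with this file name."""
    return INDEX_PAGE.search(filename) is not None


# Hypothetical file names.
for name in ["index.html", "index2.html", "reindex-guide.html", "how-to-fold-results-42"]:
    print(name, is_duplicate_index_page(name))
```

Note that `reindex-guide.html` is also rejected, since the pattern isn't anchored. If only names starting with `index` should be rejected, one option is to anchor the pattern with `^`.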
-
Process information collected on a web page.
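Example
To get the year in which a topic (question) was published:
    try:
        # Only topics (not reply subitems) carry the questiondate metadata.
        if (document.uri.find("SubItem:") == -1):
            date = document.get_meta_data_value("questiondate")[0]
            # The year is the first four characters of the date string.
            document.add_meta_data({'questionyear': date[0:4]})
    except Exception as e:
        log(str(e))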
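Slicing the first four characters only yields a year when the scraped `datetime` attribute starts with one. The following is a minimal sketch of a slightly more defensive variant, assuming the attribute holds an ISO 8601 style value such as the sample below; the `question_year` helper and the sample value are hypothetical.

```python
def question_year(questiondate):
    """Return the four-digit year at the start of the date string, or None if absent."""
    year = questiondate[0:4]
    if len(year) == 4 and year.isdigit():
        return year
    return None


print(question_year("2019-05-14T09:30:00+00:00"))  # 2019
print(question_year(""))  # None
```

In an extension, the `questionyear` metadata would then be added only when the helper returns a value.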
-