User feedback and recommendations play a crucial role in customer purchasing decisions. Research shows that customers place considerable trust in user reviews and product recommendations. Automatically identifying and extracting such comments from Web pages, however, is far from trivial. This thesis focuses on an approach for automatically identifying customer feedback and forum discussions on Web sites using site style trees (SST). The SST algorithm detects recurring sections on Web pages by comparing the DOM trees of different pages with one another, and thereby identifies the HTML elements that contain user reviews or forum discussions.
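The idea of comparing DOM trees across pages can be illustrated with a much simplified sketch. It is not the SST algorithm itself; it only counts, for every tag path, on how many pages the path occurs and how many distinct texts appear under it. Paths that recur on all pages but carry varying text are treated as candidate content, while paths with identical text everywhere are treated as template parts. The names `PathTextCollector`, `path_statistics`, `candidate_paths` and the `min_share` parameter are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags without a closing tag; they never open a new tree level.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextCollector(HTMLParser):
    """Collects (tag path, text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.items = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def path_statistics(pages):
    """Map each tag path to (pages containing it, distinct texts seen)."""
    page_count = defaultdict(int)
    texts = defaultdict(set)
    for html in pages:
        collector = PathTextCollector()
        collector.feed(html)
        collector.close()
        for path in {p for p, _ in collector.items}:
            page_count[path] += 1
        for path, text in collector.items:
            texts[path].add(text)
    return {p: (page_count[p], len(texts[p])) for p in page_count}


def candidate_paths(pages, min_share=1.0):
    """Assumed heuristic: paths on enough pages whose text is not constant."""
    stats = path_statistics(pages)
    return sorted(
        path
        for path, (n_pages, n_texts) in stats.items()
        if n_pages >= min_share * len(pages) and n_texts > 1
    )


page_a = "<html><body><div><h1>Hotel Reviews</h1></div><ul><li>Great stay</li></ul></body></html>"
page_b = "<html><body><div><h1>Hotel Reviews</h1></div><ul><li>Too noisy</li></ul></body></html>"
print(candidate_paths([page_a, page_b]))
```

In this toy input the heading text is identical on both pages and is therefore ignored, while the list item path is reported because its text differs from page to page. A real implementation would also have to take attributes, presentation styles and sibling order into account.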
The goals of this thesis are
- to implement Java- or Python-based software that identifies relevant content using site style trees,
- to assemble a test corpus for evaluating content extraction methods and
- to carry out an evaluation based on the programmed software.
Table of Contents
- Importance of User Feedback and Word of Mouth in Marketing
- Web Page Parsing Approaches
- Style Trees
- Detect User Comments based on Style Trees
- Test Corpus (TripAdvisor, …)
- Metrics (precision, recall, F1)
- Outlook and Conclusions
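The metrics named in the outline can be sketched as follows, assuming that extracted comments and gold-standard comments are compared as sets of identifiers. The function name `precision_recall_f1` and the set-based matching are assumptions made for illustration; the thesis may well match comments differently (for example by text overlap).

```python
def precision_recall_f1(extracted, gold):
    """Return (precision, recall, F1) for two collections of item identifiers."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four extracted elements, two of them real comments.
p, r, f1 = precision_recall_f1({"c1", "c2", "x1", "x2"}, {"c1", "c2"})
print(p, r, round(f1, 3))
```

Here half of the extracted elements are correct (precision 0.5) and every gold comment was found (recall 1.0), so F1 is the harmonic mean of the two values.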
Prerequisites
- A profound knowledge of HTML.
- Good Java or Python skills. The implementation is an integral part of this thesis.
Implementation Notes
- merge non-structural elements (a, br, b, p, …)
- remove signatures (lines starting with “___” or “–”)
- observation: a complete comment can be seen in one subtree
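These notes can be sketched as a small preprocessing step. The sketch assumes reasonably well-formed HTML, folds the text of the non-structural tags named above into their enclosing structural element (so that one subtree yields one text block), and treats everything from the first signature marker line onward as signature. That last choice, as well as the names `MergingParser`, `strip_signature` and `extract_comment_texts`, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags taken from the note above; the source list is open-ended.
NON_STRUCTURAL = {"a", "br", "b", "p"}
# Void tags that never get a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"img", "hr", "input", "meta", "link"}
# Signature markers exactly as given in the note above.
SIGNATURE_MARKERS = ("___", "–")


class MergingParser(HTMLParser):
    """Collects one text block per structural element."""

    def __init__(self):
        super().__init__()
        self.stack = [[]]  # one text buffer per open structural element
        self.blocks = []   # finished text blocks, innermost first

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p"):
            self.stack[-1].append("\n")  # keep line breaks for signature detection
        if tag not in NON_STRUCTURAL and tag not in VOID_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in NON_STRUCTURAL or tag in VOID_TAGS or len(self.stack) == 1:
            return
        text = "".join(self.stack.pop()).strip()
        if text:
            self.blocks.append(text)

    def handle_data(self, data):
        self.stack[-1].append(data)


def strip_signature(text, markers=SIGNATURE_MARKERS):
    """Drop the first line starting with a marker and everything after it."""
    kept = []
    for line in text.splitlines():
        if line.strip().startswith(markers):
            break
        kept.append(line)
    return "\n".join(kept).strip()


def extract_comment_texts(html):
    """Return the cleaned text blocks of all structural elements."""
    parser = MergingParser()
    parser.feed(html)
    parser.close()
    cleaned = (strip_signature(block) for block in parser.blocks)
    return [text for text in cleaned if text]


post = ('<div class="post">Nice <b>hotel</b>, see <a href="#">photos</a>'
        '<br>___<br>Sent from my phone</div>')
print(extract_comment_texts(post))
```

Because the inline tags do not open a new level, the bold word and the link text end up in the same block as the surrounding sentence, and the signature is cut off afterwards.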
References
- Bank, Mathias and Mattes, Michael (2009). Automatic User Comment Detection in Flat Internet Fora, Twentieth International Workshop on Database and Expert Systems Application (DEXA 2009); Sixth International Workshop on Text-Based Information Retrieval (TIR 2009), pages 373–377
- Gottron, T. (2008). Content extraction: Identifying the main content in HTML documents, PhD Thesis, Johannes-Gutenberg University Mainz
- Yi, Lan, Liu, Bing and Li, Xiaoli (2003). Eliminating noisy information in Web pages for data mining, KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ISBN: 1-58113-737-0, ACM, pages 296–305