Identifying User Comments in Web sites using Site Style Trees


Introduction

User feedback and recommendations play a crucial role in customer purchasing decisions. Research shows that customers place considerable trust in user reviews and product recommendations. Automatically identifying and extracting such comments from Web pages, however, is far from trivial. This thesis focuses on an approach for automatically identifying customer feedback and forum discussions on Web sites using site style trees (SST). The SST algorithm detects recurring sections by comparing the DOM trees of different pages with each other, and thereby identifies the HTML elements that contain user reviews or forum discussions.
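
The idea of comparing DOM trees across pages can be illustrated with a much simplified Python sketch. It is not the SST algorithm from the literature: instead of building a style tree, it only counts tag paths and reports those that occur on every page and repeat within each page, which is where comment lists would be expected. The names PathCollector, tag_paths, recurring_paths and min_repeats are hypothetical, and the set of void tags is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without a closing tag (assumed subset; extend as needed).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Count the tag path (e.g. 'html/body/div') of every element on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    return collector.paths


def recurring_paths(pages, min_repeats=2):
    """Return tag paths present on all pages that repeat within each page."""
    if not pages:
        return []
    per_page = [tag_paths(page) for page in pages]
    shared = set(per_page[0])
    for paths in per_page[1:]:
        shared &= set(paths)
    return sorted(path for path in shared
                  if all(paths[path] >= min_repeats for paths in per_page))
```

For two pages that share a navigation block and each contain a list of several comments, only the path of the repeated list items would be returned. A full implementation would additionally need to take attributes and the presentation style of each node into account.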

The goals of this thesis are

to implement Java- or Python-based software that identifies relevant content using site style trees,
to assemble a test corpus for evaluating content extraction methods, and
to carry out an evaluation based on the implemented software.

Table of Contents

Introduction
Importance of User Feedback and Word of Mouth in Marketing
Web Page Parsing Approaches
- Style Trees
Implementation
- Detect User Comments based on Style Trees
Evaluation
- Test Corpus (TripAdvisor, …)
- Metrics (precision, recall, F1)
Outlook and Conclusions
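
The metrics named in the evaluation chapter could be computed along the following lines. This is only a sketch under the assumption that extracted and gold-standard comments are compared as exact matches; the function name precision_recall_f1 and its parameters are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted comments with gold-standard comments (exact match)."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With four extracted items of which three are correct, and five comments in the gold standard, this gives a precision of 0.75 and a recall of 0.6.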

Student Profile

A profound knowledge of HTML.
Good Java or Python skills. The implementation is an integral part of this thesis.

Hints

merge non-structural elements (a, br, b, p, …)
remove signatures (lines starting with “___” or “–”)
observation: a complete comment is contained in a single subtree
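
The first two hints could be sketched in Python as follows. Several points are assumptions rather than part of the hints: the tag list contains only the four tags named above, "--" is added as a further signature prefix, everything from the first signature line onward is discarded, and the function names are hypothetical. Stripping tags with a regular expression is a shortcut for illustration; a real implementation would operate on the parsed DOM tree.

```python
import re

# Tags named in the hint; the original list is open-ended, so extend as needed.
NON_STRUCTURAL_TAGS = ("a", "br", "b", "p")
# "___" and "–" come from the hint; "--" is an added assumption.
SIGNATURE_PREFIXES = ("___", "–", "--")


def merge_non_structural(html, tags=NON_STRUCTURAL_TAGS):
    """Drop opening and closing tags of inline elements, keeping their text."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(map(re.escape, tags))
    return re.sub(pattern, " ", html, flags=re.IGNORECASE)


def remove_signature(text, prefixes=SIGNATURE_PREFIXES):
    """Cut the text at the first line that starts with a signature prefix."""
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(prefixes):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

After merging, the text of a comment sits directly inside its structural parent (for example a div), which makes it easier to compare subtrees across pages.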

Literature

Bank, Mathias and Mattes, Michael (2009). Automatic User Comment Detection in Flat Internet Fora, Twentieth International Workshop on Database and Expert Systems Application (DEXA 2009); Sixth International Workshop on Text-Based Information Retrieval TIR 2009, pages 373–377
Gottron, T. (2008). Content extraction: Identifying the main content in HTML documents, PhD Thesis, Johannes-Gutenberg University Mainz
Yi, Lan, Liu, Bing and Li, Xiaoli (2003). Eliminating noisy information in Web pages for data mining, KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ISBN: 1-58113-737-0, ACM, pages 296–305

Albert Weichselbraun