Web Content Extraction

Introduction

The amount of information available on the Internet increases continuously, opening up interesting research options in many fields, including informatics and the social sciences.

Many of the methods used in this research are text-corpus-based and therefore require an accurate textual representation of the investigated Web pages. Advertisements, navigational elements and, more recently, Web 2.0 techniques considerably complicate the task of extracting a Web page’s content.

Web content extraction is concerned with extracting the relevant text from Web pages by removing unrelated textual noise such as advertisements, navigational elements, contact details and copyright notices.
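
As a rough illustration of what removing such noise can mean in practice, the following minimal Python sketch drops all text nested inside elements that commonly hold navigation, scripts or footers, and keeps the rest. The tag list NOISE_TAGS and the names ContentExtractor and extract_content are assumptions made for this example only. It is a naive baseline, not one of the techniques from the literature listed below, and real pages will usually need something more robust.

```python
from html.parser import HTMLParser

# Assumption: these tags typically hold noise rather than main content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # number of noise elements currently open
        self.chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when no noise element encloses it.
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_content(html):
    """Return the text of `html` with noise elements removed (hypothetical helper)."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><nav>Home | About</nav>"
        "<p>Main article text.</p>"
        "<footer>Copyright 2008</footer></body></html>"
    )
    print(extract_content(page))
```

A tag-based filter like this only works when pages mark up their noise with suitable elements, which is one reason the requirements and the more elaborate techniques are worth investigating.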

The goals of this thesis are to

define requirements for content extraction techniques applicable to a search and Web analysis tool,
investigate state-of-the-art content extraction techniques, and
implement and test one of those techniques.

Structure

introduction and problem definition
requirement analysis
content extraction techniques
implementation
    1. justification of the chosen technique
    2. system architecture and implementation details
testing
outlook and conclusions

Literature

Juffinger, A., Neidhart, T., Granitzer, M., Kern, R., Weichselbraun, A., Wohlgenannt, G. and Scharl, A. (2007). Distributed Web2.0 Crawling for Ontology Evolution, Proceedings of the Second International Conference on Digital Information Management (ICDIM’07)
Gottron, Thomas (2008). Content Code Blurring: A New Approach to Content Extraction, Nineteenth International Workshop on Database and Expert Systems Application (DEXA 2008); Fifth International Workshop on Text-based Information Retrieval (TIR’08), ISBN: 978-0-7695-3299-8, IEEE Computer Society Press, pages 29–33
Weninger, Tim and Hsu, William H. (2008). Text Extraction from the Web via Text-to-Tag Ratio, Nineteenth International Workshop on Database and Expert Systems Application (DEXA 2008); Fifth International Workshop on Text-based Information Retrieval (TIR’08), ISBN: 978-0-7695-3299-8, IEEE Computer Society Press, pages 23–28

Albert Weichselbraun
