Web Content Extraction
Introduction
The amount of information available on the Internet increases continuously, providing interesting research options in many fields, including informatics and the social sciences.
Many of the methods used in this research are based on text corpora and therefore require an accurate textual representation of the Web pages under investigation. Advertisements, navigational elements and, more recently, Web 2.0 techniques considerably complicate the task of extracting a Web page’s content.
Web content extraction is concerned with extracting the relevant text from Web pages by removing unrelated textual noise such as advertisements, navigational elements, and contact and copyright notices.
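To make the idea of noise removal concrete, the following is a minimal sketch loosely in the spirit of the text-to-tag ratio idea named in the literature list. It is not the published method, which is more elaborate. The function names, the fixed threshold of 10 and the sample page are assumptions made purely for illustration.

```python
import re

# Matches anything that looks like an HTML tag; good enough for a sketch,
# not a replacement for a real HTML parser.
TAG = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Number of non-tag characters divided by the tag count (floored at 1)."""
    tags = TAG.findall(line)
    text = TAG.sub("", line).strip()
    return len(text) / max(len(tags), 1)


def extract_content(html: str, threshold: float = 10.0) -> str:
    """Keep the text of lines whose ratio reaches the (assumed) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG.sub("", line).strip())
    return "\n".join(kept)


# Hypothetical sample page: navigation, one content paragraph, footer.
page = """<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<p>Web content extraction removes navigation and advertisements from a page.</p>
<div class="footer"><span>Copyright</span> <a href="/contact">Contact</a></div>"""

print(extract_content(page))
```

In this sample, the navigation and footer lines contain many tags and little text, so their ratios fall well below the threshold, while the paragraph line is kept. A fixed threshold is the crudest possible choice and would need tuning or replacing for real pages.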
The goals of this thesis are to
- define requirements for content extraction techniques applicable to a search and Web analysis tool,
- investigate state-of-the-art content extraction techniques, and
- implement and test one of those techniques.
Structure
- introduction and problem definition
- requirement analysis
- content extraction techniques
- implementation
- justification of the chosen technique
- system architecture and implementation details
- testing
- outlook and conclusions
Literature
- Juffinger, A., Neidhart, T., Granitzer, M., Kern, R., Weichselbraun, A., Wohlgenannt, G. and Scharl, A. (2007). Distributed Web2.0 Crawling for Ontology Evolution, Proceedings of the Second International Conference on Digital Information Management (ICDIM’07)
- Gottron, Thomas (2008). Content Code Blurring: A New Approach to Content Extraction, Nineteenth International Workshop on Database and Expert Systems Applications (DEXA 2008); Fifth International Workshop on Text-based Information Retrieval (TIR’08), ISBN: 978-0-7695-3299-8, IEEE Computer Society Press, pages 29–33
- Weninger, Tim and Hsu, William H. (2008). Text Extraction from the Web via Text-to-Tag Ratio, Nineteenth International Workshop on Database and Expert Systems Applications (DEXA 2008); Fifth International Workshop on Text-based Information Retrieval (TIR’08), ISBN: 978-0-7695-3299-8, IEEE Computer Society Press, pages 23–28