Intelligent Mirroring of Web 2.0 Sites


Motivation

The HTTP protocol provides mechanisms for determining whether a Web page has changed since the last request: conditional requests validated against the ETag and Last-Modified headers. Web spiders and proxy servers that use these validators download only modified resources, which considerably improves the performance of Web spider architectures. Mirroring and analyzing sites built on Web 2.0 technologies such as AJAX and dynamic HTML pose a further challenge for current Web spider architectures: because page content is modified at runtime by JavaScript and similar technologies, simple HTML-to-text converters such as lynx and html2text do not deliver usable text representations of these documents. New approaches based on more capable rendering architectures are required to tackle this problem.
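A minimal sketch of the conditional-request mechanism in Python (an illustration, not part of the proposed httrack extension): the client stores the ETag and Last-Modified validators from a previous response and sends them back as If-None-Match and If-Modified-Since; if the resource is unchanged, the server answers with 304 Not Modified instead of the full body. The URL is a placeholder.

    import urllib.request
    import urllib.error

    def fetch_if_modified(url, etag=None, last_modified=None):
        """Fetch url only if it changed; returns (body, etag, last_modified).
        body is None when the server answers 304 Not Modified."""
        req = urllib.request.Request(url)
        if etag:
            req.add_header("If-None-Match", etag)               # ETag validator
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)  # date validator
        try:
            with urllib.request.urlopen(req) as resp:
                return (resp.read(),
                        resp.headers.get("ETag"),
                        resp.headers.get("Last-Modified"))
        except urllib.error.HTTPError as err:
            if err.code == 304:  # cached copy is still valid, nothing downloaded
                return None, etag, last_modified
            raise

    # The first call downloads the page; the second is answered with a
    # cheap 304 if the server supports these validators.
    body, etag, lm = fetch_if_modified("http://example.org/")
    body2, _, _ = fetch_if_modified("http://example.org/", etag, lm)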

Tasks

  • extend httrack to make use of ETags and Last-Modified headers
  • implement, test, and compare crawling and archiving strategies for httrack
  • literature review: conversion of Web 2.0 sites to text (see the rendering sketch after this list)
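As a starting point for the last task, the sketch below shows the rendering-based alternative to static conversion: instead of transforming the raw HTML source, the page is loaded in a headless browser so that its JavaScript has executed before the text is extracted. Selenium with headless Firefox and the URL are illustrative assumptions, not tools prescribed by the task.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # render without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        # Load the page and let its JavaScript build the DOM; dynamically
        # inserted content is invisible to lynx/html2text, which only ever
        # see the raw HTML source. A real crawler would additionally wait
        # for pending AJAX requests to settle before extracting text.
        driver.get("http://example.org/ajax-page")
        print(driver.find_element(By.TAG_NAME, "body").text)
    finally:
        driver.quit()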

Literature to start with

  • RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt)
  • httrack (http://www.httrack.com)
  • Jansen, B.J., Mullen, T., Spink, A. and Pedersen, J. (2006). “Automated Gathering of Web Information: An In-depth Examination of Agents Interacting with Search Engines”, ACM Transactions on Internet Technology, 6(4): 442-464.
  • Wolf, J.L., Squillante, M.S., et al. (2002). “Optimal Crawling Strategies for Web Search Engines”, 11th International Conference on World Wide Web. Honolulu, USA. 136-147.
  • Broder, A.Z., Najork, M. and Wiener, J.L. (2003). “Efficient URL Caching for World Wide Web Crawling”, 12th International World Wide Web Conference. Budapest, Hungary. 679-689.
