Indexing dynamic Web 2.0 sites and AJAX applications


Motivation

Mirroring and analyzing sites that use Web 2.0 technologies such as AJAX and dynamic HTML pose a serious challenge for current web spider architectures. Because JavaScript and similar technologies modify page elements after the document has loaded, simple HTML-to-text conversion tools such as lynx and html2text, which operate on the static HTML source, do not deliver usable text representations of such Web documents. New approaches based on more advanced rendering architectures are required to tackle this problem. This thesis shall investigate to what extent web browsers can be used to extract relevant content from already rendered Web pages.

Tasks

  • literature review
    • problem
    • conversion of Web 2.0 sites to text
    • text conversion metrics
  • choose or develop your own metric(s)
  • apply the metrics to the output of different HTML-to-text conversion methods
  • evaluate the results with human experts (questionnaire)
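
To make the idea of a text conversion metric more concrete, here is a minimal sketch in Python. It is not the metric this thesis is expected to produce; it assumes a simple bag-of-words comparison between a converter's output and a manually prepared reference text, and the names `tokenize` and `token_overlap_scores` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def token_overlap_scores(candidate, reference):
    """Return (precision, recall, f1) of candidate tokens against reference tokens.

    Assumption: the reference is a human-made "gold" text version of the page.
    Token counts are compared as multisets, so repeated words are counted.
    """
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Converter output that still contains navigation text the reference lacks.
    print(token_overlap_scores("Home Login Hello world", "Hello world"))
```

Low precision would indicate that the converter kept boilerplate such as menus, while low recall would indicate that content (for example, text inserted by JavaScript) is missing. Scores of this kind could then be compared with the judgments collected from the human experts.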

Tools

  • Greasemonkey (a Firefox extension)
  • html2text
  • lynx
  • w3m
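
The command-line converters listed above could be run side by side from a small script, as sketched below in Python. The command lines are assumptions and may need adjusting for the installed versions; `build_command` and `convert_all` are hypothetical helper names. Greasemonkey is left out because it runs inside the browser rather than on the command line.

```python
import shutil
import subprocess

# Assumed invocations; check each tool's manual page for the installed version.
CONVERTERS = {
    "lynx": ["lynx", "-dump", "-nolist"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
    "html2text": ["html2text"],
}


def build_command(tool, html_path):
    """Return the argument list for running one converter on a local HTML file."""
    return CONVERTERS[tool] + [html_path]


def convert_all(html_path):
    """Run every installed converter and return {tool name: text output}."""
    results = {}
    for tool in CONVERTERS:
        if shutil.which(tool) is None:
            continue  # skip tools that are not installed
        proc = subprocess.run(
            build_command(tool, html_path),
            capture_output=True,
            text=True,
            check=False,
        )
        results[tool] = proc.stdout
    return results
```

Calling `convert_all` on a saved copy of a page returns one text version per installed tool, which can then be passed to whatever metric is chosen.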

Literature

  • Motivation: Does Google Index Dynamic JavaScript Content
  • Dive into Greasemonkey (http://diveintogreasemonkey.org/)
  • Levering, R. and Cutler, M. 2006. The portrait of a common HTML web page. In Proceedings of the 2006 ACM Symposium on Document Engineering (Amsterdam, The Netherlands, October 10 - 13, 2006). DocEng ‘06. ACM Press, New York, NY, 198-204.
  • Related:
    • Tang, J., Li, H., Cao, Y., and Tang, Z. 2005. Email data cleaning. In Proceeding of the Eleventh ACM SIGKDD international Conference on Knowledge Discovery in Data Mining (Chicago, Illinois, USA, August 21 - 24, 2005). KDD ‘05. ACM Press, New York, NY, 489-498. http://doi.acm.org/10.1145/1081870.1081926
    • Thiessen, P. and Chen, C. 2007. Ajax live regions: chat as a case example. In Proceedings of the 2007 international Cross-Disciplinary Conference on Web Accessibility (W4A) (Banff, Canada, May 07 - 08, 2007). W4A ‘07, vol. 225. ACM Press, New York, NY, 7-14. http://doi.acm.org/10.1145/1243441.1243450
    • Data Quality Tools
