Website Data Extraction and Annotation

Motivation

New Web standards like Microformats enable Web designers to enrich their documents with semantic markup. Web authors are starting to adopt Microformats such as hCal and hReview, thereby enabling Web browsers to automatically extract and store the annotations. Microformats are embedded into the content by means of XHTML; their extraction is computationally expensive, which poses technical challenges when gathering them from large samples of Web sites. A tool capable of extracting microformats and storing them in a relational database would help overcome these problems and significantly accelerate reasoning over and searching of annotated Web data.

todo

write a tool capable of
- extracting various microformats from Web documents
- extracting favicons (if present) and _one_ representative image associated with the Web page
- storing the annotations in a relational database (this includes designing the database schema)
- putting it all together ;)
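
To give an idea of what such a tool could look like, here is a minimal sketch that uses only the Python standard library. It is an illustration under several assumptions, not a reference implementation: it handles only hCalendar events (class="vevent") with a handful of properties, takes the first <link rel="icon"> as the favicon, and skips the selection of a representative image entirely. The names HCalParser, store, HCAL_PROPERTIES and the two-table schema (page, annotation) are hypothetical choices made for this example.

```python
import sqlite3
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Assumption: only these hCalendar properties are of interest here.
HCAL_PROPERTIES = {"summary", "dtstart", "dtend", "location"}


class HCalParser(HTMLParser):
    """Collects hCalendar events (class="vevent") and the favicon URL."""

    def __init__(self):
        super().__init__()
        self.events = []           # one dict per vevent
        self.favicon = None        # href of the first <link rel="... icon">
        self._depth = 0            # current element nesting depth
        self._event_depth = None   # depth at which the open vevent started
        self._prop = None          # (property, depth) whose text is collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "icon" in rel:
            self.favicon = self.favicon or attrs.get("href")
        if tag in VOID_TAGS:
            return
        self._depth += 1
        classes = set((attrs.get("class") or "").split())
        if "vevent" in classes and self._event_depth is None:
            self._event_depth = self._depth
            self.events.append({})
        elif self._event_depth is not None and self._prop is None:
            names = sorted(classes & HCAL_PROPERTIES)
            if not names:
                return
            if tag == "abbr" and attrs.get("title"):
                # abbr pattern: the machine-readable value sits in title
                self.events[-1][names[0]] = attrs["title"]
            else:
                self._prop = (names[0], self._depth)
                self.events[-1][names[0]] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._prop and self._prop[1] == self._depth:
            name = self._prop[0]
            self.events[-1][name] = self.events[-1][name].strip()
            self._prop = None
        if self._event_depth == self._depth:
            self._event_depth = None
        self._depth -= 1

    def handle_data(self, data):
        if self._prop:
            self.events[-1][self._prop[0]] += data


def store(db, url, parser):
    """Writes the page and its annotations into a (hypothetical) schema."""
    db.executescript("""
        CREATE TABLE IF NOT EXISTS page (
            id INTEGER PRIMARY KEY, url TEXT UNIQUE, favicon TEXT);
        CREATE TABLE IF NOT EXISTS annotation (
            page_id INTEGER REFERENCES page(id),
            item INTEGER,       -- index of the microformat item on the page
            format TEXT,        -- e.g. 'hCalendar'
            property TEXT,
            value TEXT);
    """)
    cur = db.execute("INSERT INTO page (url, favicon) VALUES (?, ?)",
                     (url, parser.favicon))
    rows = [(cur.lastrowid, i, prop, value)
            for i, event in enumerate(parser.events)
            for prop, value in event.items()]
    db.executemany(
        "INSERT INTO annotation VALUES (?, ?, 'hCalendar', ?, ?)", rows)
    db.commit()


SAMPLE = """<html><head><link rel="shortcut icon" href="/favicon.ico" /></head>
<body><div class="vevent">
  <span class="summary">Web <b>Conference</b></span>:
  <abbr class="dtstart" title="2009-05-01">May 1</abbr>,
  <span class="location">Vienna</span>
</div></body></html>"""

parser = HCalParser()
parser.feed(SAMPLE)
db = sqlite3.connect(":memory:")
store(db, "http://example.org/", parser)
```

Storing each annotation as one (item, property, value) row keeps the schema independent of any particular microformat, so hReview or other formats could share the same table; dedicated tables per format would be the alternative if typed columns matter more than flexibility.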

starting points

Albert Weichselbraun
