Website Data Extraction and Annotation


New Web standards such as Microformats enable Web designers to enrich their documents with semantic markup. Web authors have started adopting Microformats such as hCal and hReview, thereby enabling Web browsers to automatically extract and store the annotations. Microformats are embedded into the content by means of XHTML; extracting them is computationally expensive, which poses technical challenges when gathering them from large samples of Web sites. A tool capable of extracting microformats and storing them in a relational database would help overcome these problems and significantly accelerate reasoning over and searching of annotated Web data.
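
To make the idea of markup embedded in XHTML concrete, here is a minimal sketch of reading hCal-style events with Python's standard html.parser module. It is only an illustration under stated assumptions: the class name HCalendarParser is hypothetical, only a small subset of properties (summary, dtstart, dtend, location) is handled, the first occurrence of a property wins, and a real tool would more likely build on a full microformats parser.

```python
from html.parser import HTMLParser

# Assumption: only this subset of hCalendar property class names is handled.
HCAL_PROPERTIES = {"summary", "dtstart", "dtend", "location"}
# Elements without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCalendarParser(HTMLParser):
    """Collects one dict per element that carries the class "vevent"."""

    def __init__(self):
        super().__init__()
        self.events = []    # finished events
        self._event = None  # event currently being read, if any
        self._depth = 0     # nesting depth inside the current vevent
        self._open = []     # (property name, depth) of open property elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._event is None:
            if "vevent" in classes:
                self._event, self._depth = {}, 1
            return
        if tag in VOID_TAGS:
            return
        self._depth += 1
        for name in classes:
            if name in HCAL_PROPERTIES and name not in self._event:
                if attrs.get("title"):
                    # e.g. <abbr class="dtstart" title="..."> carries the value
                    self._event[name] = attrs["title"]
                else:
                    self._event[name] = ""
                    self._open.append((name, self._depth))

    def handle_data(self, data):
        # Text belongs to every property element that is currently open.
        for name, _ in self._open:
            self._event[name] += data

    def handle_endtag(self, tag):
        if self._event is None or tag in VOID_TAGS:
            return
        # Close the property elements opened at the current depth.
        self._open = [(n, d) for n, d in self._open if d != self._depth]
        self._depth -= 1
        if self._depth == 0:
            self.events.append({k: v.strip() for k, v in self._event.items()})
            self._event, self._open = None, []
```

Fed a made-up snippet such as `<div class="vevent"><span class="summary">Meetup</span></div>`, the parser's `events` list would hold one dict with a `summary` entry. Walking every element of every page like this is what makes extraction costly on large samples of Web sites.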


  • write a tool capable of
    • extracting various microformats from Web documents
    • extracting favicons (if present) and _one_ representative image associated with the Web page
  • design a database schema for storing annotations
  • put it all together ;)
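
One possible shape for the database schema, sketched with Python's built-in sqlite3 module. Everything here is an assumption for illustration rather than a prescribed design: the table names (page, annotation, property), the column names, and the helper functions open_store, store_page and find_pages are all hypothetical. The sketch stores each extracted item as a set of name/value rows so that different microformats can share one schema.

```python
import sqlite3

# Hypothetical schema: one row per page, one per extracted item,
# and one per property of that item.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page (
    id          INTEGER PRIMARY KEY,
    url         TEXT NOT NULL UNIQUE,
    favicon_url TEXT,          -- NULL if the page has no favicon
    image_url   TEXT           -- the one representative image, if any
);
CREATE TABLE IF NOT EXISTS annotation (
    id      INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES page(id),
    format  TEXT NOT NULL      -- e.g. 'hcalendar' or 'hreview'
);
CREATE TABLE IF NOT EXISTS property (
    annotation_id INTEGER NOT NULL REFERENCES annotation(id),
    name          TEXT NOT NULL,
    value         TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS property_name_value ON property (name, value);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_page(conn, url, items, favicon_url=None, image_url=None):
    """Store one page; items is a list of (format, {name: value}) pairs."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO page (url, favicon_url, image_url) VALUES (?, ?, ?)",
            (url, favicon_url, image_url),
        )
        page_id = cur.lastrowid
        for fmt, props in items:
            cur = conn.execute(
                "INSERT INTO annotation (page_id, format) VALUES (?, ?)",
                (page_id, fmt),
            )
            annotation_id = cur.lastrowid
            conn.executemany(
                "INSERT INTO property (annotation_id, name, value) VALUES (?, ?, ?)",
                [(annotation_id, n, v) for n, v in props.items()],
            )
    return page_id


def find_pages(conn, fmt, name, value):
    """Return the URLs of pages with an item of the given format and property."""
    rows = conn.execute(
        """SELECT DISTINCT page.url
           FROM page
           JOIN annotation ON annotation.page_id = page.id
           JOIN property ON property.annotation_id = annotation.id
           WHERE annotation.format = ? AND property.name = ? AND property.value = ?
           ORDER BY page.url""",
        (fmt, name, value),
    )
    return [row[0] for row in rows]
```

With annotations stored this way, a search such as "all pages with an hCal event at a given location" becomes a single indexed query instead of a fresh pass over the markup. The generic name/value layout trades some query convenience for not needing a new table per microformat; a per-format table design would be an equally reasonable choice.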

starting points