The Media Watch on Climate Change aggregates, annotates, and visualizes environmental articles from 150 Anglo-American news media sites. From 300,000 news media articles gathered in weekly intervals, the system selects, annotates and indexes about 10,000 articles focusing on environmental issues, before storing them in a central knowledge repository, which can be navigated along multiple dimensions like time, topics, keywords, and geographic proximity.
Despite innovative browsing aids like ontologies and semantic and geographic maps, most users still rely on full text search to navigate this vast document collection. Efficient full text search and ranking algorithms therefore play a crucial role in providing a user-friendly interface for document repositories.
Currently, the Media Watch on Climate Change uses Tsearch2, a search framework tightly integrated with its PostgreSQL database. Comparing this framework with other open source solutions like Lucene, the Full Text Search Engine, and with popular search algorithms like Google PageRank, HITS, and TrustRank shall pave the way for improving the user’s search experience in terms of speed and accuracy, and for providing more advanced search options with minimal overhead.
The goals of this thesis are:
- a literature and web review evaluating:
  - approaches towards full text search and ranking (algorithms, …)
  - open source solutions and the techniques employed by them
  - evaluation techniques for full text searches (results, ranking, …)
- the design of test cases for evaluating search results
- the implementation and evaluation of 1-3 full text search algorithms within PostgreSQL
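To make the evaluation goal more concrete, the following minimal sketch shows how a test case could score a ranked result list against a hand-labelled set of relevant documents. It uses standard information retrieval measures (precision at k, recall, and average precision). The function names, document identifiers, and relevance labels are hypothetical and are not part of the Media Watch on Climate Change code base; the thesis may well settle on different measures.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of all relevant documents that appear in the result list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical test case: one query, its ranked results, and the labelled relevant set.
ranked_results = ["a", "b", "c", "d"]
relevant_docs = {"a", "c", "e"}

p_at_2 = precision_at_k(ranked_results, relevant_docs, 2)
rec = recall(ranked_results, relevant_docs)
avg_prec = average_precision(ranked_results, relevant_docs)
```

Running the same set of labelled queries against each candidate search implementation and comparing these scores would give one simple basis for the comparison; query response time would have to be measured separately.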
The sources and the completed theses can be downloaded from the svn repository: https://svn.semanticlab.net/svn/oss/thesis/fts/trunk