Term Extraction from Web Corpora


In recent years, natural language processing techniques and statistical methods for extracting relevant terms and keywords from Web corpora (large text collections of several gigabytes) have gained importance in many research fields. The goal of term extraction is to automatically extract relevant terms from a given corpus. For example, with a nouns-only option, an extraction method returns terms such as bicycle and landscape from the sentence "Experiencing wonderful landscapes with a bicycle". Ontology learning, a subtask of information extraction that aims at automatically extracting relevant concepts from a large and structured set of texts, makes use of such methods; the approach described by Liu et al. (2005) is one example.
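The nouns-only example can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the part-of-speech lexicon `TOY_POS_LEXICON` is hand-written for this one sentence, and `naive_singular` is a hypothetical stand-in for a real lemmatizer. An actual extraction method would rely on a trained tagger and a proper morphological component.

```python
import re

# Hypothetical toy lexicon covering only the example sentence.
# A real system would use a trained part-of-speech tagger instead.
TOY_POS_LEXICON = {
    "experiencing": "VERB",
    "wonderful": "ADJ",
    "landscapes": "NOUN",
    "with": "ADP",
    "a": "DET",
    "bicycle": "NOUN",
}


def naive_singular(word):
    """Strip a trailing plural 's'; a rough assumption, not a real lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def extract_noun_terms(sentence, lexicon=TOY_POS_LEXICON):
    """Return the distinct nouns of a sentence in order of first appearance."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    terms = []
    for token in tokens:
        if lexicon.get(token) == "NOUN":
            term = naive_singular(token)
            if term not in terms:
                terms.append(term)
    return terms


print(extract_noun_terms("Experiencing wonderful landscapes with a bicycle"))
# prints ['landscape', 'bicycle']
```

Words missing from the lexicon are silently skipped, which is exactly the gap that statistical and tagger-based methods are meant to close on large corpora.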

The aims of this thesis are:

  1. to provide an overview of state-of-the-art term extraction methods applicable to Web corpora,
  2. to implement some of the most promising methods for extracting the most relevant content from web sites (for example, retrieving only the web content without advertisements or other non-content elements), and
  3. to compare these methods with each other.
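To make the idea of retrieving only the content of a web page more concrete, the following minimal sketch drops text found inside tags that commonly hold non-content elements. It is not one of the methods to be evaluated in the thesis. The tag set `SKIPPED_TAGS`, the class `ContentExtractor`, the function `extract_content` and the sample page are all assumptions made for illustration, and the sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: these tags usually wrap navigation, advertising or code,
# not the main text. Real pages often need more robust heuristics.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class ContentExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text fragments kept as content

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content(html):
    """Return the visible text of a page without the skipped elements."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_html = (
    "<html><body><nav>Home | Shop</nav>"
    "<p>Cycling through alpine landscapes.</p>"
    "<aside>Buy now!</aside><script>track();</script>"
    "</body></html>"
)
print(extract_content(sample_html))
# prints Cycling through alpine landscapes.
```

A tag-based filter like this misses advertisements embedded in ordinary `div` elements, which is one reason why more elaborate content extraction methods are worth comparing.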


The thesis is structured as follows:

  1. introduction
  2. term extraction methods in social sciences
  3. criteria for term extraction methods
  4. state of the art
  5. implementation
    • justification of the choice
    • system architecture and implementation details
  6. performance tests
  7. outlook and conclusions


  1. Liu, Wei, Weichselbraun, Albert, Scharl, Arno and Chang, Elizabeth (2005). Semi-Automatic Ontology Extension Using Spreading Activation. Journal of Universal Knowledge Management, 0(1), pages 50–58.
  2. Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic Recognition of Multi-Word Terms: The C-value/NC-value Method. International Journal on Digital Libraries, 3(2), pages 115–130.
  3. Abramowicz, Witold and Wisniewski, Marek (2008). Proximity Window Context Method for Term Extraction in Ontology Learning from Text. Nineteenth International Workshop on Database and Expert Systems Application (DEXA 2008); Seventh International Workshop on Web Semantics (WebS’08), IEEE Computer Society Press, ISBN 978-0-7695-3299-8, pages 215–219.