Term Extraction from Web Corpora


Motivation

In recent years, natural language processing techniques and statistical methods for extracting relevant terms and keywords from Web corpora (large text collections of several gigabytes) have gained importance in many research fields. The goal of term extraction is to automatically extract relevant terms from a given corpus. For example, if the extraction method uses the nouns-only option, it extracts terms like "bicycle" and "landscape" from the sentence "Experiencing wonderful landscapes with a bicycle." Ontology learning, a subtask of information extraction that aims to automatically extract relevant concepts from a large, structured set of texts, makes use of such methods. The method described by Liu et al. (2005) is one example.
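The nouns-only example above can be illustrated with a minimal Python sketch. It is not the method of any cited paper: the names `NOUN_LEXICON`, `normalize` and `extract_noun_terms` are hypothetical, and a tiny hand-made noun lexicon stands in for the part-of-speech tagger that a real system would use.

```python
import re
from collections import Counter

# Assumption: a toy lexicon replaces a real part-of-speech tagger.
NOUN_LEXICON = {"bicycle", "landscape", "corpus", "term"}


def normalize(token: str) -> str:
    """Lowercase a token and drop a plural 's' if the singular is a known noun."""
    token = token.lower()
    if token.endswith("s") and token[:-1] in NOUN_LEXICON:
        return token[:-1]
    return token


def extract_noun_terms(text: str) -> Counter:
    """Count the known nouns in the text (the 'nouns-only' option)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    normalized = (normalize(t) for t in tokens)
    return Counter(t for t in normalized if t in NOUN_LEXICON)


sentence = "Experiencing wonderful landscapes with a bicycle."
terms = extract_noun_terms(sentence)
print(sorted(terms))  # ['bicycle', 'landscape']
```

On a corpus of several gigabytes, the frequency counts returned here would typically be only the first step, with a statistical relevance measure applied afterwards to rank the candidate terms.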

The aims of this thesis are

to provide an overview of state-of-the-art term extraction methods applicable to web corpora,
to implement some of the most promising methods for extracting the most relevant content from web sites (for example, retrieving only the web content without advertisements or other non-content elements), and
to compare these methods with each other.
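As a rough illustration of the second aim, the sketch below keeps only the text outside typical non-content elements, using Python's standard `html.parser` module. The tag list `NON_CONTENT_TAGS` and the names `ContentExtractor` and `extract_content` are assumptions made for this example; real web pages usually require more robust heuristics than tag names alone.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as non-content.
NON_CONTENT_TAGS = {"script", "style", "nav", "aside", "footer", "header"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside a non-content tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a non-content element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NON_CONTENT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_CONTENT_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_content(html: str) -> str:
    """Return the page text with non-content elements removed."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The extracted text could then be passed on to a term extraction step, so that advertisements and navigation labels do not distort the term statistics.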

Tasks

introduction
term extraction methods in social sciences
criteria for term extraction methods
state of the art
implementation
- justification of the choice
- system architecture and implementation details
performance tests
outlook and conclusions

Literature

Liu, Wei, Weichselbraun, Albert, Scharl, Arno and Chang, Elizabeth (2005). Semi-Automatic Ontology Extension Using Spreading Activation, Journal of Universal Knowledge Management, pages 50–58, 0(1)
Frantzi, K., Ananiadou, S. and Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method, International Journal on Digital Libraries, pages 115–130, 3(2)
Abramowicz, Witold and Wisniewski, Marek (2008). Proximity Window Context Method for Term Extraction in Ontology Learning from Text, Nineteenth International Workshop on Database and Expert Systems Application (DEXA 2008); Seventh International Workshop on Web Semantics (WebS’08), ISBN: 978-0-7695-3299-8, IEEE Computer Society Press, pages 215–219


Albert Weichselbraun
