Parallelizing Text Processing

Thesis

Albert Weichselbraun

The goal of this work is to parallelize important text processing tasks, such as text preprocessing and co-occurrence analysis, using the Hadoop Map-/Reduce framework.

Tasks

Familiarize yourself with the Hadoop Map-/Reduce framework
Create a Hadoop hello-world application
Transfer the text cleanup and preprocessing components to Map-/Reduce
Transfer the co-occurrence components to Map-/Reduce
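
To make the last two tasks more concrete, the following is a minimal sketch of how co-occurrence counting could be phrased as a map step and a reduce step. It is an illustration only and not the implementation used in this work: it emulates the map, shuffle and reduce phases in plain Python without any Hadoop API, and the function names (preprocess, map_cooccurrences, shuffle, reduce_counts, cooccurrence_counts), the sentence-level co-occurrence window, and the simple lowercase tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(text):
    """Hypothetical cleanup step: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def map_cooccurrences(sentence):
    """Map step: emit ((term_a, term_b), 1) for every unordered pair of
    distinct terms that occur in the same sentence (assumed window)."""
    terms = sorted(set(preprocess(sentence)))
    for pair in combinations(terms, 2):
        yield pair, 1


def shuffle(mapped):
    """Group the emitted values by key, emulating what a Map-/Reduce
    framework does between the map and the reduce phase."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the partial counts collected for one term pair."""
    return key, sum(values)


def cooccurrence_counts(sentences):
    """Run map, shuffle and reduce in sequence on a list of sentences."""
    mapped = (kv for sentence in sentences for kv in map_cooccurrences(sentence))
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    sentences = [
        "Hadoop runs map and reduce jobs.",
        "Map and reduce scale well.",
    ]
    counts = cooccurrence_counts(sentences)
    print(counts[("map", "reduce")])  # 2: the pair occurs in both sentences
```

Because each sentence is mapped independently and each term pair is reduced independently, both steps could in principle be distributed across worker nodes, which is the property the tasks above aim to exploit.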

Table of Contents

Introduction
Theoretical Background
- Map-/Reduce
- Natural Language Detection
  - Text Preprocessing
  - Co-occurrence analysis
Method
Implementation
Evaluation
Outlook and Conclusions
