Parallelizing Text Processing


The goal of this work is to parallelize important text processing tasks, such as text preprocessing and co-occurrence analysis, using the Hadoop Map-/Reduce framework.


  • familiarize yourself with the Hadoop Map-/Reduce framework
  • create a Hadoop hello-world application
  • transfer the text cleanup & pre-processing components to Map-/Reduce
  • transfer the co-occurrence components to Map-/Reduce
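
To make the last two tasks more concrete, the following is a minimal sketch of how preprocessing and co-occurrence counting might be split into map and reduce steps. It is plain Python that imitates the map, shuffle, and reduce phases in memory; it does not use Hadoop itself. The function names (`preprocess`, `map_cooccurrence`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical, and the choice of one input line as the co-occurrence window is an assumption, not something this project prescribes.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(line):
    """Cleanup step (assumed): lowercase and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", line.lower())


def map_cooccurrence(line):
    """Map step: emit ((word_a, word_b), 1) for every unordered pair of
    distinct words that appear together in one line."""
    tokens = sorted(set(preprocess(line)))
    for pair in combinations(tokens, 2):
        yield pair, 1


def shuffle(pairs):
    """Stand-in for the framework's shuffle phase: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one word pair."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce sequentially in a single process."""
    mapped = (kv for line in lines for kv in map_cooccurrence(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = run_job(["The cat sat", "the cat ran"])
    for pair, count in sorted(counts.items()):
        print(pair, count)
```

Because each line is mapped independently and each key is reduced independently, both steps could in principle be distributed across workers, which is the property the Map-/Reduce framework relies on.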

Table of Contents

  • Introduction
  • Theoretical Background
    • Map-/Reduce
    • Natural Language Detection
      • Text Preprocessing
      • Co-occurrence analysis
  • Method
  • Implementation
  • Evaluation
  • Outlook and Conclusions