Use Hadoop MapReduce to solve the word counting task


Test if Java code compiles

Test if java is available:
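
For example (assuming a JDK is installed and on your PATH):

    java -version
    javac -version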

Create a directory to save the Java test program:
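
For example (the directory name java-test is only an illustration):

    mkdir -p ~/java-test
    cd ~/java-test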

Paste the following code:
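
(A minimal sketch; the class name HelloWorld is an arbitrary choice, saved as HelloWorld.java.)

    // HelloWorld.java -- used only to verify that the JDK compiles and runs code
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello, Hadoop!");
        }
    }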

Compile and execute:
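
Assuming the file above is named HelloWorld.java:

    javac HelloWorld.java
    java HelloWorld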


Writing first MapReduce program


The Mapper class

For our own mapper implementations, we subclass this base class and override the map method as follows:
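
(A sketch of a word-count mapper; the class name WordCountMapper is an illustrative choice, not the original listing.)

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token found in each input line.
    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }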

There are three additional methods that may sometimes need to be overridden:


  • setup: This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

  • cleanup: This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

  • run: This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method (see the sketch after this list).
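
The default run implementation looks approximately like this (a sketch of the new-API Mapper.run; the exact code in your Hadoop version may differ slightly):

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            // hand each key/value pair of the input split to map()
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }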

Compare these materials:


The Reducer class

The Reducer base class works very similarly to the Mapper class, and usually requires subclasses to override only a single reduce method. Here is the cut-down class definition:
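
(A sketch of the new-API class, not a verbatim copy of the Hadoop source; the Context inner class and other details are omitted.)

    public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        // Called once at the start of the task; does nothing by default.
        protected void setup(Context context)
                throws IOException, InterruptedException { }

        // Called once per key with the iterable of all values collected for that key.
        // The default implementation is an identity reduce: it writes every value back out.
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
                throws IOException, InterruptedException {
            for (VALUEIN value : values) {
                context.write((KEYOUT) key, (VALUEOUT) value);
            }
        }

        // Called once at the end of the task; does nothing by default.
        protected void cleanup(Context context)
                throws IOException, InterruptedException { }

        // Drives the task: setup, one reduce call per key, then cleanup.
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            while (context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);
        }
    }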

This class also has setup, run, and cleanup methods, with default implementations similar to those of the Mapper class, which can optionally be overridden:


  • setup: This method is called once before any key/lists of values are presented to the reduce method. The default implementation does nothing.

  • cleanup: This method is called once after all key/lists of values have been presented to the reduce method. The default implementation does nothing.

  • run: This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once, then calls the reduce method once for each key and its list of values provided to the Reducer class, and finally calls the cleanup method.


The Driver class

Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is a variety of other configuration options that can be set, some of which you will see later.

There is no default parent Driver class to subclass; the driver logic usually lives in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver:
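
(A sketch; the class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative, not the original listing.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);       // which JAR carries the job classes
            // WordCountMapper and WordCountReducer are your own classes
            // (see the mapper sketch above for an example).
            job.setMapperClass(WordCountMapper.class);      // which Mapper to use
            job.setReducerClass(WordCountReducer.class);    // which Reducer to use
            job.setOutputKeyClass(Text.class);              // output key type
            job.setOutputValueClass(IntWritable.class);     // output value type
            FileInputFormat.addInputPath(job, new Path(args[0]));   // where the input lives (HDFS)
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // where to write the output (HDFS)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }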

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution.


Implementing WordCount

The WordCount program in the Hadoop ecosystem is the equivalent of the HelloWorld program you can find in almost any programming language -- the simplest piece of code that does something useful.

You can also follow the very detailed example given in the MapReduce Tutorial.

  1. Create a local (not HDFS) directory where you can save your code, and create a file WordCount.java there:
  2. Open WordCount.java in your favourite editor:

    implement word count and save this file (a complete sketch of WordCount.java is given after this list):
  3. Verify whether and where hadoop, hadoop-common, and hadoop-mapreduce-client-core are available:
  4. Compile the code (see the command sketch after this list):
  5. Build a JAR file
    Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
  6. Get some data you can work on
  7. Copy all required files to HDFS:

    If it's not running yet, start Hadoop. Do this as a Hadoop superuser (hadoop in my case):

    If Hadoop is running, you can copy the files:

    This part is only to show you how you can rename a directory with the -mv command:

    Now you can continue:

  8. Run WordCount on a Hadoop cluster

    Make a call of the following form (this is only an example; see the command sketch after this list):

    There are four arguments in this call:

    • The name of the JAR file.
    • The name of the driver class within the JAR file.
    • The location, on HDFS, of the input file (a relative reference to the user's home folder).
    • The desired location of the output folder (again, a relative path).

  9. Check the output
    If successful, the output file should be as follows:
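
For reference, a complete WordCount.java covering the steps above might look like this. It is a sketch modelled on the official MapReduce Tutorial, not a verbatim copy of the original listing; the inner class names TokenizerMapper and IntSumReducer are conventional but arbitrary.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures and submits the job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The command-line steps (compile, build the JAR, copy data to HDFS, run, inspect) could look roughly like this; the file names, HDFS paths, and classpath handling are illustrative assumptions and will differ on your installation.

    # compile against the Hadoop client classes ('hadoop classpath' prints the required JARs)
    javac -classpath "$(hadoop classpath)" WordCount.java

    # collect the class files into a single JAR
    jar cf wordcount.jar WordCount*.class

    # copy the input data to HDFS (input.txt is a placeholder name)
    hdfs dfs -mkdir -p input
    hdfs dfs -put input.txt input

    # run the job; the output directory must not exist yet
    hadoop jar wordcount.jar WordCount input output

    # inspect the result
    hdfs dfs -cat output/part-r-00000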


Implementing WordCount with predefined mapper and reducer

You don't always have to write your own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in your jobs. If you don't override any of the methods in the Mapper and Reducer classes in the new API, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in the org.apache.hadoop.mapreduce.lib.map package, and include the following:

  • InverseMapper: This outputs (value, key).
  • RegexMapper: A Mapper that extracts text matching a regular expression.
  • TokenCounterMapper: This counts the number of discrete tokens in each line of input.

The reducers are found in the org.apache.hadoop.mapreduce.lib.reduce package, and currently include the following:

  • IntSumReducer: This outputs the sum of the list of integer values per key.
  • LongSumReducer: This outputs the sum of the list of long values per key.

Using a predefined mapper and reducer, you can make the word count program much simpler.

  1. Create a local (not HDFS) directory where you can save your code.
  2. Use a text editor:

    and replace the contents of the WordCountPredefined.java file with the following code (a sketch is given after this list):
  3. Compile the code by executing the following command:
  4. Build a JAR file
    Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
  5. Prepare HDFS working directory
  6. Run WordCountPredefined on a Hadoop cluster
  7. Check the output
    If successful, the output file should be as follows:
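
A version built from the predefined classes might look roughly like this (a sketch; the class name WordCountPredefined follows the file name used in the steps above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountPredefined {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count (predefined classes)");
            job.setJarByClass(WordCountPredefined.class);
            job.setMapperClass(TokenCounterMapper.class);   // emits (token, 1) per token
            job.setCombinerClass(IntSumReducer.class);      // pre-sums counts on the map side
            job.setReducerClass(IntSumReducer.class);       // sums the counts per word
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The compile, JAR, and run steps are the same as in the previous section, with WordCountPredefined in place of WordCount.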




Implementing WordCount using Streaming


Writing a Hadoop MapReduce program in Python

With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface but is by definition Java specific.

Hadoop Streaming takes a different approach. With Streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.

Any program that reads from standard input and writes to standard output can be used in Streaming, such as a compiled binary, a Unix shell script, or a program written in a dynamic language such as Ruby or Python.

Note that in Java you know that the map() method is invoked once for each input key/value pair and the reduce() method is invoked once for each key and its set of values. With Streaming you no longer have the concept of the map or reduce methods; instead, you have scripts that process streams of received data. This changes how you need to write your reducer. In Java the grouping of values for each key was performed by Hadoop; each invocation of the reduce method would receive a single key and all its values. In Streaming, each instance of the reduce task is given the individual ungathered values one at a time. Hadoop Streaming does, however, sort the keys; for example, if a mapper emitted the following data:
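
(A purely hypothetical illustration; each line is a tab-separated word/count pair.)

    apple   1
    banana  1
    apple   1
    cherry  1
    banana  1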

The Streaming reducer would receive this data in the following order:
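
(The same hypothetical data, now in sorted key order, still one pair per line.)

    apple   1
    apple   1
    banana  1
    banana  1
    cherry  1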

Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API.

  1. Create a local (not HDFS) directory where you can save your code:
  2. Do the map step in Python

    You will implement a simple map approach to the word count task: read data from STDIN, split it into words, and output to STDOUT a list of lines mapping words to their (intermediate) counts. The map script will not compute an (intermediate) sum of a word's occurrences, though. Instead, it will output WORD 1 pairs immediately -- even though a specific word might occur multiple times in the input.

    Create a mapper.py file:

    and paste the following code into this file (a sketch is given after this list):

  3. Make a test

  4. Do the reduce step in Python -- save this code in a reducer.py file (a sketch is given after this list):

  5. Make a test:

  6. Test all components together:

  7. Prepare HDFS working directory

  8. Run as a Hadoop Streaming job (see the example command after this list):
  9. Check the output
    If successful, the output file should be as follows:
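
For reference, hedged sketches of the scripts and commands used in the steps above follow. A mapper.py along these lines works, assuming Python 3 is available on every node:

    #!/usr/bin/env python3
    """mapper.py: read lines from STDIN and emit a tab-separated 'word 1' pair per word."""
    import sys

    for line in sys.stdin:
        # split each input line into whitespace-separated words
        for word in line.strip().split():
            # emit an intermediate (word, 1) pair; summing happens in the reducer
            print(f"{word}\t1")

A matching reducer.py sketch relies on the fact that Streaming delivers the mapper output sorted by key:

    #!/usr/bin/env python3
    """reducer.py: sum the counts per word; input arrives sorted by key."""
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, _, count = line.strip().partition("\t")
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # emit the final word, if any
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A local test and an example Streaming invocation might look like this; the test sentence, HDFS paths, and the location of the hadoop-streaming JAR are illustrative assumptions that depend on your installation.

    chmod +x mapper.py reducer.py

    # local test of the whole pipeline, without Hadoop
    echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py

    # run on the cluster; 'input' and 'output-streaming' are HDFS paths
    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -mapper mapper.py \
        -reducer reducer.py \
        -input input \
        -output output-streaming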