Use Hadoop MapReduce to solve the word counting task


Test if Java code compiles

Test if java is available:
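
For example (assuming a JDK is installed and on your PATH):

    java -version
    javac -version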

Create a directory to save the Java test program:
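
For example (the directory name java-test is only an illustration):

    mkdir -p ~/java-test
    cd ~/java-test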

Paste the following code:
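
(A minimal sketch; the class name HelloWorld is an arbitrary choice, saved as HelloWorld.java.)

    // HelloWorld.java -- used only to verify that the JDK compiles and runs code
    public class HelloWorld {
        public static void main(String[] args) {
            System.out.println("Hello, Hadoop!");
        }
    }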

Compile and execute:
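
Assuming the file above is named HelloWorld.java:

    javac HelloWorld.java
    java HelloWorld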


Writing first MapReduce program


The Mapper class

For our own mapper implementations, we subclass this base class and override the map method as follows:
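
(A sketch of a word-count mapper; the class name WordCountMapper is an illustrative choice, not the original listing.)

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every token found in each input line.
    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }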

There are three additional methods that may sometimes need to be overridden:


  • setup: This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.

  • cleanup: This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.

  • run: This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method (see the sketch after this list).
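
The default run implementation looks approximately like this (a sketch of the new-API Mapper.run; the exact code in your Hadoop version may differ slightly):

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            // hand each key/value pair of the input split to map()
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }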

Compare these materials:


The Reducer class

The Reducer base class works very similarly to the Mapper class, and usually requires subclasses to override only a single reduce method. Here is the cut-down class definition:
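
(A sketch of the new-API class, not a verbatim copy of the Hadoop source; the Context inner class and other details are omitted.)

    public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

        // Called once at the start of the task; does nothing by default.
        protected void setup(Context context)
                throws IOException, InterruptedException { }

        // Called once per key with the iterable of all values collected for that key.
        // The default implementation is an identity reduce: it writes every value back out.
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
                throws IOException, InterruptedException {
            for (VALUEIN value : values) {
                context.write((KEYOUT) key, (VALUEOUT) value);
            }
        }

        // Called once at the end of the task; does nothing by default.
        protected void cleanup(Context context)
                throws IOException, InterruptedException { }

        // Drives the task: setup, one reduce call per key, then cleanup.
        public void run(Context context) throws IOException, InterruptedException {
            setup(context);
            while (context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);
        }
    }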

This class also has setup, run, and cleanup methods, with default implementations similar to those of the Mapper class, which can optionally be overridden:


  • setup: This method is called once before any key/lists of values are presented to the reduce method. The default implementation does nothing.

  • cleanup: This method is called once after all key/lists of values have been presented to the reduce method. The default implementation does nothing.

  • run: This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once, then calls the reduce method once for each key and its list of values provided to the Reducer class, and finally calls the cleanup method.


The Driver class

Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. There is a variety of other configuration options that can be set, some of which you will see later.

There is no default parent Driver class to subclass; the driver logic usually lives in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver:
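
(A sketch; the class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative, not the original listing.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);       // which JAR carries the job classes
            // WordCountMapper and WordCountReducer are your own classes
            // (see the mapper sketch above for an example).
            job.setMapperClass(WordCountMapper.class);      // which Mapper to use
            job.setReducerClass(WordCountReducer.class);    // which Reducer to use
            job.setOutputKeyClass(Text.class);              // output key type
            job.setOutputValueClass(IntWritable.class);     // output value type
            FileInputFormat.addInputPath(job, new Path(args[0]));   // where the input lives (HDFS)
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // where to write the output (HDFS)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }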

A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This allows everything to be kept in a single file, which simplifies the code distribution.


Implementing WordCount

The WordCount program in the Hadoop ecosystem is the equivalent of the HelloWorld program you can find in almost any programming language -- the simplest piece of code that does something useful.

You can also follow the very detailed example given in the MapReduce Tutorial.

  1. Create a local (not HDFS) directory where you can save your code, and create a file WordCount.java there:
  2. Open WordCount.java in your favourite editor:

    implement word count and save this file (a complete sketch of WordCount.java is given after this list):
  3. Verify whether and where hadoop, hadoop-common, and hadoop-mapreduce-client-core are available:
  4. Compile the code (see the command sketch after this list):
  5. Build a JAR file
    Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
  6. Get some data you can work on
  7. Copy all required files to HDFS:

    If it's not running yet, start Hadoop. Do this as a Hadoop superuser (hadoop in my case):

    If Hadoop is running, you can copy the files:

    This part is only to show you how you can rename a directory with the -mv command:

    Now you can continue:

  8. Run WordCount on a Hadoop cluster

    Make a call of the following form (this is only an example; see the command sketch after this list):

    There are four arguments in this call:

    • The name of the JAR file.
    • The name of the driver class within the JAR file.
    • The location, on HDFS, of the input file (a relative reference to the user's home folder).
    • The desired location of the output folder (again, a relative path).

  9. Check the output
    If successful, the output file should be as follows:
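
For reference, a complete WordCount.java covering the steps above might look like this. It is a sketch modelled on the official MapReduce Tutorial, not a verbatim copy of the original listing; the inner class names TokenizerMapper and IntSumReducer are conventional but arbitrary.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in the input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures and submits the job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The command-line steps (compile, build the JAR, copy data to HDFS, run, inspect) could look roughly like this; the file names, HDFS paths, and classpath handling are illustrative assumptions and will differ on your installation.

    # compile against the Hadoop client classes ('hadoop classpath' prints the required JARs)
    javac -classpath "$(hadoop classpath)" WordCount.java

    # collect the class files into a single JAR
    jar cf wordcount.jar WordCount*.class

    # copy the input data to HDFS (input.txt is a placeholder name)
    hdfs dfs -mkdir -p input
    hdfs dfs -put input.txt input

    # run the job; the output directory must not exist yet
    hadoop jar wordcount.jar WordCount input output

    # inspect the result
    hdfs dfs -cat output/part-r-00000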


Implementing WordCount with predefined mapper and reducer

You don't always have to write your own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in your jobs. If you don't override any of the methods in the Mapper and Reducer classes in the new API, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.

The mappers are found in the org.apache.hadoop.mapreduce.lib.map package, and include the following:

  • InverseMapper: This outputs (value, key).
  • RegexMapper: A Mapper that extracts text matching a regular expression.
  • TokenCounterMapper: This counts the number of discrete tokens in each line of input.

The reducers are found in the org.apache.hadoop.mapreduce.lib.reduce package, and currently include the following:

  • IntSumReducer: This outputs the sum of the list of integer values per key.
  • LongSumReducer: This outputs the sum of the list of long values per key.

Using a predefined mapper and reducer, you can make the word count program much simpler.

  1. Create a local (not HDFS) directory where you can save your code.
  2. Use a text editor:

    and replace the contents of the WordCountPredefined.java file with the following code (a sketch is given after this list):
  3. Compile the code by executing the following command:
  4. Build a JAR file
    Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
  5. Prepare HDFS working directory
  6. Run WordCountPredefined on a Hadoop cluster
  7. Check the output
    If successful, the output file should be as follows:
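
A version built from the predefined classes might look roughly like this (a sketch; the class name WordCountPredefined follows the file name used in the steps above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountPredefined {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count (predefined classes)");
            job.setJarByClass(WordCountPredefined.class);
            job.setMapperClass(TokenCounterMapper.class);   // emits (token, 1) per token
            job.setCombinerClass(IntSumReducer.class);      // pre-sums counts on the map side
            job.setReducerClass(IntSumReducer.class);       // sums the counts per word
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The compile, JAR, and run steps are the same as in the previous section, with WordCountPredefined in place of WordCount.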




Implementing WordCount using Streaming


Writing a Hadoop MapReduce program in Python

With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface but is by definition Java specific.

Hadoop Streaming takes a different approach. With Streaming, you write a map task that reads its input from standard input, one line at a time, and gives the output of its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.

Any program that reads from standard input and writes to standard output can be used in Streaming, such as a compiled binary, a Unix shell script, or a program written in a dynamic language such as Ruby or Python.

Note that in Java you know that the map() method is invoked once for each input key/value pair and the reduce() method is invoked once for each key and its set of values. With Streaming you no longer have the concept of the map or reduce methods; instead, you have scripts that process streams of received data. This changes how you need to write your reducer. In Java the grouping of values for each key was performed by Hadoop; each invocation of the reduce method would receive a single key and all its values. In Streaming, each instance of the reduce task is given the individual ungathered values one at a time. Hadoop Streaming does, however, sort the keys; for example, if a mapper emitted the following data:
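
(A purely hypothetical illustration; each line is a tab-separated word/count pair.)

    apple   1
    banana  1
    apple   1
    cherry  1
    banana  1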

The Streaming reducer would receive this data in the following order:
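
(The same hypothetical data, now in sorted key order, still one pair per line.)

    apple   1
    apple   1
    banana  1
    banana  1
    cherry  1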

Hadoop still collects the values for each key and ensures that each key is passed only to a single reducer. In other words, a reducer gets all the values for a number of keys and they are grouped together; however, they are not packaged into individual executions of the reducer, that is, one per key, as with the Java API.

  1. Create a local (not HDFS) directory where you can save your code:
  2. Do the map step in Python

    You will implement a simple map approach to the word count task: read data from STDIN, split it into words, and output to STDOUT a list of lines mapping words to their (intermediate) counts. The map script will not compute an (intermediate) sum of a word's occurrences, though. Instead, it will output WORD 1 pairs immediately -- even though a specific word might occur multiple times in the input.

    Create a mapper.py file:

    and paste the following code into this file (a sketch is given after this list):

  3. Make a test

  4. Do the reduce step in Python -- save this code in a reducer.py file (a sketch is given after this list):

  5. Make a test:

  6. Test all components together:

  7. Prepare HDFS working directory

  8. Run as a Hadoop Streaming job (see the example command after this list):
  9. Check the output
    If successful, the output file should be as follows:
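
For reference, hedged sketches of the scripts and commands used in the steps above follow. A mapper.py along these lines works, assuming Python 3 is available on every node:

    #!/usr/bin/env python3
    """mapper.py: read lines from STDIN and emit a tab-separated 'word 1' pair per word."""
    import sys

    for line in sys.stdin:
        # split each input line into whitespace-separated words
        for word in line.strip().split():
            # emit an intermediate (word, 1) pair; summing happens in the reducer
            print(f"{word}\t1")

A matching reducer.py sketch relies on the fact that Streaming delivers the mapper output sorted by key:

    #!/usr/bin/env python3
    """reducer.py: sum the counts per word; input arrives sorted by key."""
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        word, _, count = line.strip().partition("\t")
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed lines
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # emit the final word, if any
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A local test and an example Streaming invocation might look like this; the test sentence, HDFS paths, and the location of the hadoop-streaming JAR are illustrative assumptions that depend on your installation.

    chmod +x mapper.py reducer.py

    # local test of the whole pipeline, without Hadoop
    echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py

    # run on the cluster; 'input' and 'output-streaming' are HDFS paths
    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -mapper mapper.py \
        -reducer reducer.py \
        -input input \
        -output output-streaming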