Test if Java is available:
```
nosql@nosql:~/Pulpit/nosql2$ pwd
/home/nosql/Pulpit/nosql2
nosql@nosql:~/Pulpit/nosql2$ java --version
openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment (build 11.0.12+7-Ubuntu-0ubuntu3)
OpenJDK 64-Bit Server VM (build 11.0.12+7-Ubuntu-0ubuntu3, mixed mode, sharing)
```
Create a directory to save the Java test program:
```
nosql@nosql:~/Pulpit/nosql2$ mkdir -p java/test
nosql@nosql:~/Pulpit/nosql2$ cd java/test/
nosql@nosql:~/Pulpit/nosql2/java/test$ touch HelloWorld.java
nosql@nosql:~/Pulpit/nosql2/java/test$ nano HelloWorld.java
```
Paste the following code:
```java
class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}
```
Compile and execute:
```
nosql@nosql:~/Pulpit/nosql2/java/test$ ls -l
total 4
-rw-r--r-- 1 nosql hadoopuser 118 Dec  9 11:45 HelloWorld.java
nosql@nosql:~/Pulpit/nosql2/java/test$ javac HelloWorld.java
nosql@nosql:~/Pulpit/nosql2/java/test$ ls -l
total 8
-rw-r--r-- 1 nosql hadoopuser 427 Dec  9 11:48 HelloWorld.class
-rw-r--r-- 1 nosql hadoopuser 118 Dec  9 11:45 HelloWorld.java
nosql@nosql:~/Pulpit/nosql2/java/test$ java HelloWorld
Hello, World!
```
For our own mapper implementations, we will subclass this base class and override the specified method as follows:
```java
class Mapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, Mapper.Context context)
            throws IOException, InterruptedException {
        [... PUT SOME CODE HERE ...]
    }
}
```
There are three additional methods that may sometimes need to be overridden:

- protected void setup(Mapper.Context context) throws IOException, InterruptedException
  This method is called once before any key/value pairs are presented to the map method. The default implementation does nothing.
- protected void cleanup(Mapper.Context context) throws IOException, InterruptedException
  This method is called once after all key/value pairs have been presented to the map method. The default implementation does nothing.
- protected void run(Mapper.Context context) throws IOException, InterruptedException
  This method controls the overall flow of task processing within a JVM. The default implementation calls the setup method once before repeatedly calling the map method for each key/value pair in the split, and then finally calls the cleanup method.
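The lifecycle above (setup once, map for every pair, cleanup once) can be sketched as a small, Hadoop-free analogy. MiniMapper and LifecycleDemo below are invented names for illustration, not Hadoop classes:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A simplified, Hadoop-free analogy of the Mapper lifecycle: run() calls
// setup() once, then map() for every key/value pair in the split, and
// finally cleanup() once.
class MiniMapper<K1, V1> {
    protected void setup() { /* default: do nothing */ }
    protected void map(K1 key, V1 value) { /* override in subclasses */ }
    protected void cleanup() { /* default: do nothing */ }

    // Mirrors the control flow of the default run() implementation.
    public void run(Iterable<? extends Map.Entry<K1, V1>> split) {
        setup();
        for (Map.Entry<K1, V1> kv : split) {
            map(kv.getKey(), kv.getValue());
        }
        cleanup();
    }
}

public class LifecycleDemo {
    public static List<String> demo() {
        List<String> events = new ArrayList<>();
        MiniMapper<Long, String> mapper = new MiniMapper<Long, String>() {
            @Override protected void setup() { events.add("setup"); }
            @Override protected void map(Long key, String value) { events.add("map:" + value); }
            @Override protected void cleanup() { events.add("cleanup"); }
        };
        // Two "input records", keyed by byte offset as in a text input split.
        mapper.run(List.of(
                new SimpleEntry<>(0L, "line one"),
                new SimpleEntry<>(9L, "line two")));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // [setup, map:line one, map:line two, cleanup]
    }
}
```

Overriding run itself is rarely needed; the usual customization points are setup, map, and cleanup.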
Compare these materials:
- What is the Mapper or Reducer setup() used for?
- What is the purpose of the org.apache.hadoop.mapreduce.Mapper.run() function in Hadoop?
The Reducer base class works very similarly to the Mapper class, and usually requires subclasses to override only a single reduce method. Here is the cut-down class definition:
```java
public class Reducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterable<V2> values, Reducer.Context context)
            throws IOException, InterruptedException {
        [... PUT SOME CODE HERE ...]
    }
}
```
This class also has setup, run, and cleanup methods with default implementations similar to those in the Mapper class, which can optionally be overridden:
- protected void setup(Reducer.Context context) throws IOException, InterruptedException
  This method is called once before any key/list-of-values pairs are presented to the reduce method. The default implementation does nothing.
- protected void cleanup(Reducer.Context context) throws IOException, InterruptedException
  This method is called once after all key/list-of-values pairs have been presented to the reduce method. The default implementation does nothing.
- protected void run(Reducer.Context context) throws IOException, InterruptedException
  This method controls the overall flow of processing the task within the JVM. The default implementation calls the setup method once before repeatedly calling the reduce method for each key and its list of values, and then finally calls the cleanup method.
Although our mapper and reducer implementations are all we need to perform the MapReduce job, there is one more piece of code required: the driver that communicates with the Hadoop framework and specifies the configuration elements needed to run a MapReduce job. This involves aspects such as telling Hadoop which Mapper and Reducer classes to use, where to find the input data and in what format, and where to place the output data and how to format it. A variety of other configuration options can also be set, some of which you will see later.
There is no default parent Driver class to subclass; the driver logic usually lives in the main method of the class written to encapsulate a MapReduce job. Take a look at the following code snippet as an example driver:
```java
public class ExampleDriver {
    ...
    public static void main(String[] args) throws Exception {
        // Create a Configuration object that is used to set other options
        Configuration conf = new Configuration();
        // Create the object representing the job
        Job job = Job.getInstance(conf, "ExampleJob");
        // Set the name of the main class in the job JAR file
        job.setJarByClass(ExampleDriver.class);
        // Set the mapper class
        job.setMapperClass(ExampleMapper.class);
        // Set the reducer class
        job.setReducerClass(ExampleReducer.class);
        // Set the types for the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set input and output file paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Execute the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
A common model for less complex MapReduce jobs is to have the Mapper and Reducer classes as inner classes within the driver. This keeps everything in a single file, which simplifies code distribution.
The WordCount program is the Hadoop ecosystem's equivalent of the HelloWorld program found in almost any programming language: the simplest piece of code that does something useful.
You can also follow the very detailed example given in the MapReduce Tutorial.
- Create a local (not HDFS) directory where you can save your code, and create a file WordCount.java there:

```
nosql@nosql:~/Pulpit/nosql2/java$ pwd
/home/nosql/Pulpit/nosql2/java
nosql@nosql:~/Pulpit/nosql2/java$ mkdir word_count
nosql@nosql:~/Pulpit/nosql2/java$ cd word_count/
nosql@nosql:~/Pulpit/nosql2/java/word_count$ touch WordCount.java
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 0
-rw-r--r-- 1 nosql hadoopuser 0 Dec  9 12:11 WordCount.java
```

- Open WordCount.java in your favourite editor:

```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ nano WordCount.java
```
Implement word count and save this file:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            for (String str : words) {
                word.set(str);
                context.write(word, ONE);
            }
        }
    }

    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            // each value emitted by the mapper is 1, so counting values
            // gives the number of occurrences of this word
            for (IntWritable val : values) {
                total++;
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

- Verify if and where the Hadoop JARs hadoop-common and hadoop-mapreduce-client-core are available:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l /usr/local/hadoop/share/hadoop/common/
total 8056
-rw-r--r-- 1 hadoop hadoop 4426299 Jun 15 07:14 hadoop-common-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop 3344681 Jun 15 07:14 hadoop-common-3.3.1-tests.jar
-rw-r--r-- 1 hadoop hadoop   96492 Jun 15 07:15 hadoop-kms-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop  166441 Jun 15 07:15 hadoop-nfs-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop  191537 Jun 15 07:15 hadoop-registry-3.3.1.jar
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:52 jdiff
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:15 lib
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:52 sources
drwxr-xr-x 3 hadoop hadoop    4096 Jun 15 07:52 webapps
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l /usr/local/hadoop/share/hadoop/mapreduce/
total 5300
-rw-r--r-- 1 hadoop hadoop  590696 Jun 15 07:39 hadoop-mapreduce-client-app-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop  805746 Jun 15 07:39 hadoop-mapreduce-client-common-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop 1636326 Jun 15 07:39 hadoop-mapreduce-client-core-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop  181630 Jun 15 07:39 hadoop-mapreduce-client-hs-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop    9963 Jun 15 07:39 hadoop-mapreduce-client-hs-plugins-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop   49779 Jun 15 07:39 hadoop-mapreduce-client-jobclient-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop 1658803 Jun 15 07:39 hadoop-mapreduce-client-jobclient-3.3.1-tests.jar
-rw-r--r-- 1 hadoop hadoop   90702 Jun 15 07:39 hadoop-mapreduce-client-nativetask-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop   62090 Jun 15 07:39 hadoop-mapreduce-client-shuffle-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop   22263 Jun 15 07:39 hadoop-mapreduce-client-uploader-3.3.1.jar
-rw-r--r-- 1 hadoop hadoop  280989 Jun 15 07:39 hadoop-mapreduce-examples-3.3.1.jar
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:52 jdiff
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:52 lib-examples
drwxr-xr-x 2 hadoop hadoop    4096 Jun 15 07:52 sources
```

- Compile the code:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ pwd
/home/nosql/Pulpit/nosql2/java/word_count
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 4
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 12:13 WordCount.java
nosql@nosql:~/Pulpit/nosql2/java/word_count$ javac WordCount.java -cp /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.1.jar
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 16
-rw-r--r-- 1 nosql hadoopuser 1836 Dec  9 12:33 'WordCount$WordCountMapper.class'
-rw-r--r-- 1 nosql hadoopuser 1634 Dec  9 12:33 'WordCount$WordCountReducer.class'
-rw-r--r-- 1 nosql hadoopuser 1465 Dec  9 12:33 WordCount.class
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 12:13 WordCount.java
```

- Build a JAR file
Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ jar cvf wordcount.jar WordCount*.class
added manifest
adding: WordCount$WordCountMapper.class(in = 1836) (out= 791)(deflated 56%)
adding: WordCount$WordCountReducer.class(in = 1634) (out= 676)(deflated 58%)
adding: WordCount.class(in = 1465) (out= 787)(deflated 46%)
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 20
-rw-r--r-- 1 nosql hadoopuser 1836 Dec  9 12:33 'WordCount$WordCountMapper.class'
-rw-r--r-- 1 nosql hadoopuser 1634 Dec  9 12:33 'WordCount$WordCountReducer.class'
-rw-r--r-- 1 nosql hadoopuser 1465 Dec  9 12:33 WordCount.class
-rw-r--r-- 1 nosql hadoopuser 3014 Dec  9 12:43 wordcount.jar
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 12:13 WordCount.java
```

- Get some data you can work on
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ echo 'a b c d e a b c d a b c a b a' > words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ cat words_to_count.txt
a b c d e a b c d a b c a b a
```

- Copy all required files to HDFS:
If it's not running yet, start Hadoop. Do this as the Hadoop superuser (hadoop in my case):

```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ su hadoop
Password:
hadoop@nosql:/home/nosql/Pulpit/nosql2/java/word_count$ start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [nosql]
hadoop@nosql:/home/nosql/Pulpit/nosql2/java/word_count$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
hadoop@nosql:/home/nosql/Pulpit/nosql2/java/word_count$ exit
exit
```

If Hadoop is running, you can copy the files:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls
Found 3 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -mkdir wordcount
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls
Found 4 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 12:53 wordcount
```

This part is only to show you how you can rename a directory with the -mv command:

```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls
Found 4 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 12:55 wordcount
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -mv wordcount word_count
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls
Found 4 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 12:55 word_count
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls word_count
Found 1 items
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
```

Now you can continue:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -copyFromLocal /home/nosql/Pulpit/nosql2/java/word_count/words_to_count.txt word_count
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls word_count
Found 1 items
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -cat word_count/words_to_count.txt
a b c d e a b c d a b c a b a
```

- Run WordCount on a Hadoop cluster

Make a call of the following form (this is only an example):

```
$ hadoop jar wordcount.jar WordCount test.txt wordcount
```

There are four arguments in this call:
- The name of the JAR file.
- The name of the driver class within the JAR file.
- The location, on HDFS, of the input file (a relative reference to the user's home folder).
- The desired location of the output folder (again, a relative path).
```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls word_count
Found 1 items
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hadoop jar wordcount.jar WordCount word_count/words_to_count.txt word_count/result
2021-12-09 13:45:31,151 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-09 13:45:31,585 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-09 13:45:31,619 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/nosql/.staging/job_1639050749973_0002
2021-12-09 13:45:31,944 INFO input.FileInputFormat: Total input files to process : 1
2021-12-09 13:45:32,042 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-09 13:45:32,414 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639050749973_0002
2021-12-09 13:45:32,414 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-09 13:45:32,656 INFO conf.Configuration: resource-types.xml not found
2021-12-09 13:45:32,657 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-09 13:45:33,069 INFO impl.YarnClientImpl: Submitted application application_1639050749973_0002
2021-12-09 13:45:33,119 INFO mapreduce.Job: The url to track the job: http://nosql:8088/proxy/application_1639050749973_0002/
2021-12-09 13:45:33,119 INFO mapreduce.Job: Running job: job_1639050749973_0002
2021-12-09 13:45:43,353 INFO mapreduce.Job: Job job_1639050749973_0002 running in uber mode : false
2021-12-09 13:45:43,355 INFO mapreduce.Job:  map 0% reduce 0%
2021-12-09 13:45:49,483 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-09 13:45:56,521 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-09 13:45:56,536 INFO mapreduce.Job: Job job_1639050749973_0002 completed successfully
2021-12-09 13:45:56,639 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=126
		FILE: Number of bytes written=545079
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=157
		HDFS: Number of bytes written=20
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4043
		Total time spent by all reduces in occupied slots (ms)=3793
		Total time spent by all map tasks (ms)=4043
		Total time spent by all reduce tasks (ms)=3793
		Total vcore-milliseconds taken by all map tasks=4043
		Total vcore-milliseconds taken by all reduce tasks=3793
		Total megabyte-milliseconds taken by all map tasks=4140032
		Total megabyte-milliseconds taken by all reduce tasks=3884032
	Map-Reduce Framework
		Map input records=1
		Map output records=15
		Map output bytes=90
		Map output materialized bytes=126
		Input split bytes=127
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=126
		Reduce input records=15
		Reduce output records=5
		Spilled Records=30
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=103
		CPU time spent (ms)=970
		Physical memory (bytes) snapshot=390184960
		Virtual memory (bytes) snapshot=5364494336
		Total committed heap usage (bytes)=230821888
		Peak Map Physical memory (bytes)=242601984
		Peak Map Virtual memory (bytes)=2675720192
		Peak Reduce Physical memory (bytes)=147582976
		Peak Reduce Virtual memory (bytes)=2688774144
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=30
	File Output Format Counters
		Bytes Written=20
```

- Check the output
If successful, the output file should be as follows:

```
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls word_count
Found 2 items
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count/result
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -ls word_count/result
Found 2 items
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 13:45 word_count/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         20 2021-12-09 13:45 word_count/result/part-r-00000
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -cat word_count/result/part-r-00000
a	5
b	4
c	3
d	2
e	1
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 24
-rw-r--r-- 1 nosql hadoopuser 1836 Dec  9 12:33 'WordCount$WordCountMapper.class'
-rw-r--r-- 1 nosql hadoopuser 1634 Dec  9 12:33 'WordCount$WordCountReducer.class'
-rw-r--r-- 1 nosql hadoopuser 1465 Dec  9 12:33 WordCount.class
-rw-r--r-- 1 nosql hadoopuser 3014 Dec  9 12:43 wordcount.jar
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 12:13 WordCount.java
-rw-r--r-- 1 nosql hadoopuser   30 Dec  9 12:44 words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ hdfs dfs -copyToLocal word_count/result/part-r-00000
nosql@nosql:~/Pulpit/nosql2/java/word_count$ ls -l
total 28
-rw-r--r-- 1 nosql hadoopuser   20 Dec  9 13:53 part-r-00000
-rw-r--r-- 1 nosql hadoopuser 1836 Dec  9 12:33 'WordCount$WordCountMapper.class'
-rw-r--r-- 1 nosql hadoopuser 1634 Dec  9 12:33 'WordCount$WordCountReducer.class'
-rw-r--r-- 1 nosql hadoopuser 1465 Dec  9 12:33 WordCount.class
-rw-r--r-- 1 nosql hadoopuser 3014 Dec  9 12:43 wordcount.jar
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 12:13 WordCount.java
-rw-r--r-- 1 nosql hadoopuser   30 Dec  9 12:44 words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count$ cat part-r-00000
a	5
b	4
c	3
d	2
e	1
```
You don't always have to write your own Mapper and Reducer classes from scratch. Hadoop provides several common Mapper and Reducer implementations that can be used in your jobs. If you don't override any of the methods in the Mapper and Reducer classes in the new API, the default implementations are the identity Mapper and Reducer classes, which simply output the input unchanged.
The mappers are found in org.apache.hadoop.mapreduce.lib.map, and include the following:

- InverseMapper: This outputs (value, key).
- RegexMapper: A mapper that extracts text matching a regular expression.
- TokenCounterMapper: This counts the number of discrete tokens in each line of input.
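To see what TokenCounterMapper does for each line of input, here is a Hadoop-free sketch of its core logic (the class and method names below are made up for the example): the line is broken into whitespace-separated tokens, and each token is emitted paired with a count of 1.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// A plain-Java sketch of TokenCounterMapper's per-line behaviour:
// tokenize the line on whitespace and emit a (token, 1) pair per token.
public class TokenCounterSketch {

    public static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("a b c a"));  // [a=1, b=1, c=1, a=1]
    }
}
```

Feeding these pairs through the shuffle and then into IntSumReducer is exactly how the predefined word count below works.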
The reducers are found in org.apache.hadoop.mapreduce.lib.reduce, and currently include the following:

- IntSumReducer: This outputs the sum of the list of integer values per key.
- LongSumReducer: This outputs the sum of the list of long values per key.
Using a predefined mapper and reducer, you can make the word count program much simpler.
- Create a local (not HDFS) directory where you can save your code.

```
nosql@nosql:~/Pulpit/nosql2/java$ pwd
/home/nosql/Pulpit/nosql2/java
nosql@nosql:~/Pulpit/nosql2/java$ ls -l
total 8
drwxr-xr-x 2 nosql hadoopuser 4096 Dec  9 11:48 test
drwxr-xr-x 2 nosql hadoopuser 4096 Dec  9 13:53 word_count
nosql@nosql:~/Pulpit/nosql2/java$ mkdir word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java$ ls -l
total 12
drwxr-xr-x 2 nosql hadoopuser 4096 Dec  9 11:48 test
drwxr-xr-x 2 nosql hadoopuser 4096 Dec  9 13:53 word_count
drwxr-xr-x 2 nosql hadoopuser 4096 Dec  9 14:11 word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java$ cp word_count/WordCount.java word_count_predefined/
nosql@nosql:~/Pulpit/nosql2/java$ mv word_count_predefined/WordCount.java word_count_predefined/WordCountPredefined.java
nosql@nosql:~/Pulpit/nosql2/java$ cp word_count/words_to_count.txt word_count_predefined/
nosql@nosql:~/Pulpit/nosql2/java$ cd word_count_predefined/
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ ls -l
total 8
-rw-r--r-- 1 nosql hadoopuser 2030 Dec  9 14:12 WordCountPredefined.java
-rw-r--r-- 1 nosql hadoopuser   30 Dec  9 14:13 words_to_count.txt
```

- Use a text editor:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ nano WordCountPredefined.java
```

and replace the contents of the WordCountPredefined.java file with the following code:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountPredefined {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WordCountPredefined");
        job.setJarByClass(WordCountPredefined.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

- Compile the code by executing the following command:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ javac WordCountPredefined.java -cp /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar:/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.1.jar
```

- Build a JAR file
Before you run your job in Hadoop, you must collect the required class files into a single JAR file that you will submit to the system.
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ jar cvf wordcountpredefined.jar WordCountPredefined*.class
added manifest
adding: WordCountPredefined.class(in = 1438) (out= 760)(deflated 47%)
```

- Prepare HDFS working directory
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -mkdir word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -ls
Found 5 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:25 word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -cp word_count/words_to_count.txt word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -ls -R
[... CUT ...]
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count/result
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 13:45 word_count/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         20 2021-12-09 13:45 word_count/result/part-r-00000
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:33 word_count_predefined
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 14:33 word_count_predefined/words_to_count.txt
```

- Run WordCountPredefined on a Hadoop cluster
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hadoop jar wordcountpredefined.jar WordCountPredefined word_count_predefined/words_to_count.txt word_count_predefined/result
2021-12-09 14:44:12,833 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-09 14:44:13,342 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-09 14:44:13,370 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/nosql/.staging/job_1639050749973_0004
2021-12-09 14:44:13,690 INFO input.FileInputFormat: Total input files to process : 1
2021-12-09 14:44:13,786 INFO mapreduce.JobSubmitter: number of splits:1
2021-12-09 14:44:14,154 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639050749973_0004
2021-12-09 14:44:14,154 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-09 14:44:14,396 INFO conf.Configuration: resource-types.xml not found
2021-12-09 14:44:14,396 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-09 14:44:14,502 INFO impl.YarnClientImpl: Submitted application application_1639050749973_0004
2021-12-09 14:44:14,564 INFO mapreduce.Job: The url to track the job: http://nosql:8088/proxy/application_1639050749973_0004/
2021-12-09 14:44:14,572 INFO mapreduce.Job: Running job: job_1639050749973_0004
2021-12-09 14:44:22,816 INFO mapreduce.Job: Job job_1639050749973_0004 running in uber mode : false
2021-12-09 14:44:22,817 INFO mapreduce.Job:  map 0% reduce 0%
2021-12-09 14:44:29,910 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-09 14:44:35,945 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-09 14:44:35,958 INFO mapreduce.Job: Job job_1639050749973_0004 completed successfully
2021-12-09 14:44:36,063 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=126
		FILE: Number of bytes written=545261
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=168
		HDFS: Number of bytes written=20
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=4251
		Total time spent by all reduces in occupied slots (ms)=3648
		Total time spent by all map tasks (ms)=4251
		Total time spent by all reduce tasks (ms)=3648
		Total vcore-milliseconds taken by all map tasks=4251
		Total vcore-milliseconds taken by all reduce tasks=3648
		Total megabyte-milliseconds taken by all map tasks=4353024
		Total megabyte-milliseconds taken by all reduce tasks=3735552
	Map-Reduce Framework
		Map input records=1
		Map output records=15
		Map output bytes=90
		Map output materialized bytes=126
		Input split bytes=138
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=126
		Reduce input records=15
		Reduce output records=5
		Spilled Records=30
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=111
		CPU time spent (ms)=1130
		Physical memory (bytes) snapshot=394354688
		Virtual memory (bytes) snapshot=5361606656
		Total committed heap usage (bytes)=230821888
		Peak Map Physical memory (bytes)=249917440
		Peak Map Virtual memory (bytes)=2676494336
		Peak Reduce Physical memory (bytes)=144437248
		Peak Reduce Virtual memory (bytes)=2685112320
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=30
	File Output Format Counters
		Bytes Written=20
```

- Check the output
If successful, the output file should be as follows:
```
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -ls word_count_predefined
Found 2 items
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined/result
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 14:33 word_count_predefined/words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -ls word_count_predefined/result
Found 2 items
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         20 2021-12-09 14:44 word_count_predefined/result/part-r-00000
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -cat word_count_predefined/result/part-r-00000
a	5
b	4
c	3
d	2
e	1
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ ls -l
total 16
-rw-r--r-- 1 nosql hadoopuser 1438 Dec  9 14:20 WordCountPredefined.class
-rw-r--r-- 1 nosql hadoopuser 1230 Dec  9 14:22 wordcountpredefined.jar
-rw-r--r-- 1 nosql hadoopuser 1183 Dec  9 14:15 WordCountPredefined.java
-rw-r--r-- 1 nosql hadoopuser   30 Dec  9 14:13 words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ hdfs dfs -copyToLocal word_count_predefined/result/part-r-00000
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ ls -l
total 20
-rw-r--r-- 1 nosql hadoopuser   20 Dec  9 14:50 part-r-00000
-rw-r--r-- 1 nosql hadoopuser 1438 Dec  9 14:20 WordCountPredefined.class
-rw-r--r-- 1 nosql hadoopuser 1230 Dec  9 14:22 wordcountpredefined.jar
-rw-r--r-- 1 nosql hadoopuser 1183 Dec  9 14:15 WordCountPredefined.java
-rw-r--r-- 1 nosql hadoopuser   30 Dec  9 14:13 words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count_predefined$ cat part-r-00000
a	5
b	4
c	3
d	2
e	1
```
With the MapReduce Java API, both map and reduce tasks provide implementations for methods that contain the task functionality. These methods receive the input to the task as method arguments and then output results via the Context object. This is a clear and type-safe interface but is by definition Java specific.
Hadoop Streaming takes a different approach. With Streaming, you write a map task that reads its input from standard input, one line at a time, and writes its results to standard output. The reduce task then does the same, again using only standard input and output for its data flow.
Any program that reads from standard input and writes to standard output can be used with Streaming: compiled binaries, Unix shell scripts, or programs written in a dynamic language such as Ruby or Python.
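For illustration, this stdin/stdout contract can be reduced to a small Python sketch (our own example, not one of the lab files): a streaming task is just a filter over text streams. The streams are simulated here with StringIO so the snippet is self-contained; in a real job the function would run over sys.stdin and sys.stdout.

```python
import io


def stream_process(in_stream, out_stream, transform):
    """Generic shape of a Streaming task: read a line, transform it,
    write the result -- no Hadoop-specific API involved."""
    for line in in_stream:
        line = line.rstrip('\n')
        if line:
            out_stream.write(transform(line) + '\n')


# In a real job: stream_process(sys.stdin, sys.stdout, str.upper)
out = io.StringIO()
stream_process(io.StringIO('hello\nworld\n'), out, str.upper)
print(out.getvalue(), end='')  # -> HELLO / WORLD, one per line
```

Any language that can read and write text streams can implement the same shape, which is exactly why Streaming is language-agnostic.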
Note that in Java, the map() method is invoked once for each input key/value pair, and the reduce() method is invoked once for each key together with its set of values. With Streaming there is no longer a map or reduce method; instead, your scripts process streams of received data. This changes how you need to write your reducer. In Java, the grouping of values by key was performed by Hadoop: each invocation of the reduce method received a single key and all of its values. In Streaming, each instance of the reduce task is given the individual, ungrouped values one at a time. Hadoop Streaming does, however, sort the keys. For example, if a mapper emitted the following data:
1 2 3 4 5 |
First 1
Word 1
Word 1
A 1
First 1
The Streaming reducer would receive this data in the following order:
1 2 3 4 5 |
A 1
First 1
First 1
Word 1
Word 1
Hadoop still collects the values for each key and ensures that each key is passed to only a single reducer. In other words, a reducer receives all the values for a number of keys, grouped together by the sort; however, they are not packaged into individual invocations of the reducer, one per key, as with the Java API.
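Because Streaming only guarantees that keys arrive sorted, a reducer can rebuild the per-key grouping itself. One way to do this -- an alternative sketch of our own, not the reducer used later in this lab -- is itertools.groupby, which works precisely because equal keys are adjacent in sorted input:

```python
import itertools


def reduce_sorted(lines):
    """Group sorted 'key count' lines by key and sum the counts,
    mimicking the grouping the Java API performs automatically."""
    pairs = (line.strip().split(' ', 1) for line in lines if line.strip())
    results = []
    # groupby only merges *consecutive* equal keys, hence the sort requirement
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        results.append((key, total))
    return results


sorted_input = ['A 1', 'First 1', 'First 1', 'Word 1', 'Word 1']
print(reduce_sorted(sorted_input))  # -> [('A', 1), ('First', 2), ('Word', 2)]
```

If the input were not sorted, groupby would emit the same key several times, which is exactly the failure mode a hand-written Streaming reducer must also guard against.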
- Create a local (not HDFS) directory where you can save your code:
nosql@nosql:~/Pulpit/nosql2/java$ pwd
/home/nosql/Pulpit/nosql2/java
nosql@nosql:~/Pulpit/nosql2/java$ ls -l
razem 12
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 11:48 test
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 13:53 word_count
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 14:50 word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java$ mkdir word_count_stream
nosql@nosql:~/Pulpit/nosql2/java$ ls -l
razem 16
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 11:48 test
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 13:53 word_count
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 14:50 word_count_predefined
drwxr-xr-x 2 nosql hadoopuser 4096 gru 9 15:08 word_count_stream
- Do the map step in Python
You will implement a simple map step for the word count task. It will read data from STDIN, split it into words, and write a list of lines mapping words to their (intermediate) counts to STDOUT. The map script will not compute an (intermediate) sum of a word's occurrences, though. Instead, it will output

WORD 1

pairs immediately, even though a specific word might occur multiple times in the input. Create a mapper.py file:

nosql@nosql:~/Pulpit/nosql2/java$ cd word_count_stream/
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ cp ../word_count/words_to_count.txt .
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ touch mapper.py
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ ls -l
razem 4
-rw-r--r-- 1 nosql hadoopuser  0 gru 9 15:13 mapper.py
-rw-r--r-- 1 nosql hadoopuser 30 gru 9 15:11 words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ nano mapper.py

and paste the following code into this file:

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('{} {}'.format(word, 1))

- Make a test
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ echo 'a b c a b a' | python3 mapper.py
a 1
b 1
c 1
a 1
b 1
a 1
- Do the reduce step in Python -- save this code in a reducer.py file:

nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ touch reducer.py
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ nano reducer.py

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split(' ', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('{} {}'.format(current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print('{} {}'.format(current_word, current_count))

- Make a test:
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ echo -e 'a 1\na 1\nb 2\nb 2' | python3 reducer.py
a 2
b 4
- Test all components together:
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ echo 'a b c a b a' | python3 mapper.py | sort | python3 reducer.py
a 3
b 2
c 1
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ echo 'c b a b a a' | python3 mapper.py | sort | python3 reducer.py
a 3
b 2
c 1
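The shell pipeline above can also be mimicked in pure Python, which is handy for unit-testing the logic before submitting a job. This sketch uses our own helper functions that mirror the logic of mapper.py and reducer.py, and simulates echo | mapper | sort | reducer:

```python
def map_words(text):
    # mapper.py logic: emit 'word 1' for every word on every line
    return ['{} 1'.format(w) for line in text.splitlines() for w in line.split()]


def reduce_pairs(sorted_lines):
    # reducer.py logic: sum runs of consecutive counts for each key,
    # relying on the input being sorted (the shuffle/sort stand-in)
    results = []
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, count = line.split(' ', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                results.append('{} {}'.format(current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        results.append('{} {}'.format(current_word, current_count))
    return results


# Simulate: echo 'a b c a b a' | mapper | sort | reducer
intermediate = sorted(map_words('a b c a b a'))
print(reduce_pairs(intermediate))  # -> ['a 3', 'b 2', 'c 1']
```

Note that `sorted(...)` plays the role of both `sort` in the shell test and the shuffle phase in the real cluster run; dropping it would break the reducer, just as it would in Streaming.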
- Prepare HDFS working directory
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -ls
Found 5 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -mkdir word_count_stream
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -ls
Found 6 items
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:29 fireball
-rw-r--r--   1 nosql hadoopuser       1124 2021-11-26 00:38 fireball_data.csv
drwxr-xr-x   - nosql hadoopuser          0 2021-11-26 00:34 fireball_single
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 15:21 word_count_stream
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -cp word_count/words_to_count.txt word_count_stream
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -ls -R
[... CUT ...]
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 13:45 word_count/result
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 13:45 word_count/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         20 2021-12-09 13:45 word_count/result/part-r-00000
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 12:55 word_count/words_to_count.txt
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined/result
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 14:44 word_count_predefined/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         20 2021-12-09 14:44 word_count_predefined/result/part-r-00000
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 14:33 word_count_predefined/words_to_count.txt
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 15:22 word_count_stream
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 15:22 word_count_stream/words_to_count.txt
- Run as a Hadoop stream:
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ mapred streaming -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input word_count_stream/words_to_count.txt -output word_count_stream/result
packageJobJar: [] [/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar] /tmp/streamjob17514098695461133703.jar tmpDir=null
2021-12-09 15:24:57,375 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-09 15:24:57,622 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-09 15:24:57,901 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/nosql/.staging/job_1639050749973_0005
2021-12-09 15:24:58,716 INFO mapred.FileInputFormat: Total input files to process : 1
2021-12-09 15:24:58,815 INFO mapreduce.JobSubmitter: number of splits:2
2021-12-09 15:24:59,128 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639050749973_0005
2021-12-09 15:24:59,128 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-09 15:24:59,356 INFO conf.Configuration: resource-types.xml not found
2021-12-09 15:24:59,356 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-09 15:24:59,436 INFO impl.YarnClientImpl: Submitted application application_1639050749973_0005
2021-12-09 15:24:59,497 INFO mapreduce.Job: The url to track the job: http://nosql:8088/proxy/application_1639050749973_0005/
2021-12-09 15:24:59,504 INFO mapreduce.Job: Running job: job_1639050749973_0005
2021-12-09 15:25:07,661 INFO mapreduce.Job: Job job_1639050749973_0005 running in uber mode : false
2021-12-09 15:25:07,663 INFO mapreduce.Job:  map 0% reduce 0%
2021-12-09 15:25:17,786 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-09 15:25:23,844 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-09 15:25:24,857 INFO mapreduce.Job: Job job_1639050749973_0005 completed successfully
2021-12-09 15:25:24,957 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=111
		FILE: Number of bytes written=829667
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=287
		HDFS: Number of bytes written=25
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=16315
		Total time spent by all reduces in occupied slots (ms)=3770
		Total time spent by all map tasks (ms)=16315
		Total time spent by all reduce tasks (ms)=3770
		Total vcore-milliseconds taken by all map tasks=16315
		Total vcore-milliseconds taken by all reduce tasks=3770
		Total megabyte-milliseconds taken by all map tasks=16706560
		Total megabyte-milliseconds taken by all reduce tasks=3860480
	Map-Reduce Framework
		Map input records=1
		Map output records=15
		Map output bytes=75
		Map output materialized bytes=117
		Input split bytes=242
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=117
		Reduce input records=15
		Reduce output records=5
		Spilled Records=30
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=293
		CPU time spent (ms)=1550
		Physical memory (bytes) snapshot=641064960
		Virtual memory (bytes) snapshot=8052236288
		Total committed heap usage (bytes)=398663680
		Peak Map Physical memory (bytes)=250089472
		Peak Map Virtual memory (bytes)=2684121088
		Peak Reduce Physical memory (bytes)=144310272
		Peak Reduce Virtual memory (bytes)=2688978944
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=45
	File Output Format Counters
		Bytes Written=25
2021-12-09 15:25:24,959 INFO streaming.StreamJob: Output directory: word_count_stream/result
- Check the output
If successful, the output file should be as follows:
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -ls -R word_count_stream
drwxr-xr-x   - nosql hadoopuser          0 2021-12-09 15:25 word_count_stream/result
-rw-r--r--   1 nosql hadoopuser          0 2021-12-09 15:25 word_count_stream/result/_SUCCESS
-rw-r--r--   1 nosql hadoopuser         25 2021-12-09 15:25 word_count_stream/result/part-00000
-rw-r--r--   1 nosql hadoopuser         30 2021-12-09 15:22 word_count_stream/words_to_count.txt
nosql@nosql:~/Pulpit/nosql2/java/word_count_stream$ hdfs dfs -cat word_count_stream/result/part-00000
a 5
b 4
c 3
d 2
e 1
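As a final sanity check, the expected counts can be reproduced locally with collections.Counter. The exact contents of words_to_count.txt are not shown in the lab, so the text below is a hypothetical reconstruction from the counts in part-00000 (five a's, four b's, three c's, two d's, one e):

```python
from collections import Counter

# Hypothetical stand-in for words_to_count.txt; only the multiset of
# words matters, not their order.
text = 'a a a a a b b b b c c c d d e'

counts = Counter(text.split())
for word, count in sorted(counts.items()):
    print(word, count)  # prints the same pairs as part-00000 above
```

If the printed pairs disagree with the HDFS output, either the input differs or one of the streaming scripts mishandled the data.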