Hands-on Hadoop
It is simple to get Hadoop up and running with its default configuration. Since Hadoop is a Java program, you need Java 8 or later for Hadoop 3.x.x to work. You should also set the JAVA_HOME environment variable.
Using Google Cloud Shell
If you are using Google Cloud Shell, then JAVA_HOME is already set.
To check the JAVA_HOME setting, run the command below from the command prompt:
echo $JAVA_HOME
If you see the path to your JDK, you are all set.
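If the command prints nothing, you will need to set JAVA_HOME yourself. The path below is only an example; point it at wherever your JDK is actually installed:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64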
Next, you can download a Hadoop binary release by following the steps below:
- First, fetch the Hadoop 3.3.1 release tarball:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
- Once the download is complete, extract it:
tar xzf hadoop-3.3.1.tar.gz
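To confirm that the extracted binaries work, you can print the Hadoop version (the path assumes you extracted the tarball in your home directory):
hadoop-3.3.1/bin/hadoop version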
You are now all set to submit Hadoop jobs. Try the script below to submit an example MapReduce job that comes bundled with Hadoop. From the directory where you extracted Hadoop (the parent of the hadoop-3.3.1 folder), execute the following commands:
mkdir my-hadoop-project
cd my-hadoop-project
mkdir input
cp ../hadoop-3.3.1/etc/hadoop/*.xml input
../hadoop-3.3.1/bin/hadoop jar ../hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input output 'dfs[a-z.]+'
cat output/*
By default, Hadoop is configured to run in non-distributed (standalone) mode, as a single Java process. In the example above, you first create the required folders and then copy the configuration XML files from etc/hadoop into the my-hadoop-project/input folder.
Then you submit the MapReduce job, which finds every match of the given regular expression. Output is written to the 'output' directory.
Feel free to change the content in the input folder and also the regex to extract different results.
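For example, you could search for property names starting with 'yarn' instead. Note that a MapReduce job will not run if its output directory already exists, so point the job at a fresh directory such as 'output2':
../hadoop-3.3.1/bin/hadoop jar ../hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar grep input output2 'yarn[a-z.]+'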
Word Count Program
Now let us run the famous 'Hello World' program of Big Data: the word count problem!
In this problem, you are required to count the number of times each unique word appears in a pile of documents. The premise is to use distributed computing: the files are split across multiple map and reduce tasks so the counts are produced quickly.
The Mapper breaks each document into tokens and emits the pair (word, 1) for every word it encounters. After the shuffle and sort phase, the Reducer receives each unique word together with the list of 1s emitted for it, and sums them.
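As a quick illustration (using a hypothetical one-line input, not one of the files used later), a line containing 'hello world hello' flows through the job roughly like this:
Mapper output: (hello, 1), (world, 1), (hello, 1)
Reducer input after shuffle and sort: hello -> [1, 1], world -> [1]
Reducer output: (hello, 2), (world, 1)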
Here is the program in full:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  // Mapper: tokenize each input line and emit (word, 1) for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sum the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Save this file as WordCount.java in a new folder called 'my-hadoop-program' (a sibling of the hadoop-3.3.1 folder), then build a jar by first compiling the class and then running the 'jar' tool from inside that folder, as shown below:
../hadoop-3.3.1/bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
This will create a jar file named wc.jar in the my-hadoop-program folder.
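If the compile step complains that it cannot find the Java compiler, the official MapReduce tutorial (see the references below) exports HADOOP_CLASSPATH before compiling; on a JDK 8 installation that looks like:
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
On newer JDKs there is no tools.jar, so an alternative is to compile with javac directly against the classpath that Hadoop reports:
javac -classpath "$(../hadoop-3.3.1/bin/hadoop classpath)" WordCount.java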
Files used as input: create a folder called 'input' in your home directory and drop a few text files into it, or just create the two files shown below.
Put the following text in a file called doc1.txt:
Hello World
hello hello there
And create another file called doc2.txt in the same input folder with the below contents:
Hi hi there
World
Now run the command below to submit this job to Hadoop:
../hadoop-3.3.1/bin/hadoop jar wc.jar WordCount ~/input ~/output
You will notice that it spawns mappers and reducers, and you will find a file in the output folder called 'part-r-00000'.
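You can inspect it with the command below; with the two sample files above, its contents should look like this:
cat ~/output/*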
Hello 1
Hi 1
World 2
hello 2
hi 1
there 2
References:
- https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
- https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Mapper.html
- https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/Writable.html
Exercises
- Change the program above to ensure it is not case sensitive while counting.
- Change the program so that it holds the counts in memory and writes the total count for each word at the end, instead of emitting a 1 for every token. Analyze the pros and cons of this change on performance.