How I was finally able to run the infamous Word Count example on Hadoop

Kemal Ardıl Gülez
·Dec 4, 2016

This is going to be a fairly technical post, because I want to spare other people the difficulties I experienced while trying to get started with Hadoop.

Ingredients:

  1. A working Linux machine, physical or virtual.
  2. Java Development Kit v1.8
  3. Latest stable binary release of Apache Hadoop. Not the Cloudera or MapR or Hortonworks or whatever distribution of it.
  4. IntelliJ IDEA Community Edition.
  5. Apache Maven for building the examples.

There is one catch here: You can’t use Gradle instead of Maven. For some reason, Gradle refuses to download all the dependencies.

The first thing to do is to install JDK 1.8. Fortunately, that is pretty easy on Ubuntu 16.04: you just need to install a few packages and you're ready to go.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default

After the installation, you can verify it with:

java -version

If done right, the result should be something like:

java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Now we need to download the Hadoop binaries. Go to the Apache Hadoop releases page and download the latest stable binary. Don't download the source; download the binary. The latest version, as I'm writing this, is 2.7.3, so download that one. You can download it from the browser if you have one available (i.e. you're running a desktop environment). If not, you can also download it from the command line: grab the download link and use wget. After downloading the archive, extract its contents and move them to /usr/local:

wget www-eu.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3 /usr/local

After the installation, you first need to set up a hadoop user. I tried to skip this step and keep using my own user, but I couldn't get some parts of Hadoop running. It may or may not be that Hadoop requires its processes to be run by a user that is a member of a group named "hadoop". I didn't try adding my own user to the hadoop group before running things, though; I just created another user and continued from there.

sudo useradd -m hduser
sudo passwd hduser
sudo usermod -a -G sudo hduser
sudo groupadd hadoop
sudo usermod -a -G hadoop hduser
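
If you want to double-check the result, id should now list hduser as a member of both sudo and hadoop:

id hduser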

Now we have created the user hduser and the group hadoop, and added hduser to that group. Next, rename the extracted Hadoop directory and give ownership of it to hduser:

sudo mv /usr/local/hadoop-2.7.3 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop

Because we're going to use Hadoop with hduser only, we gave ownership of the hadoop directory to hduser. Now we should switch to hduser and edit its bash profile:

sudo su hduser
sudo vim ~/.bashrc

We now have to add the following to the bash profile:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Save and exit the file, and source it by:

source ~/.bashrc
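
A quick sanity check that the new environment is in place (assuming Hadoop was moved to /usr/local/hadoop as above):

echo $HADOOP_HOME
hadoop version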

Now you should set up passwordless SSH for the newly created hduser. You can get away with not doing this, but it gets pretty annoying when you have to enter your password four times just to start up the (pseudo-)distributed file system. To set it up:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Now try to ssh into localhost. You shouldn't encounter any problems. If you do, do not proceed; solve those problems first.
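
In other words, this should drop you into a shell on localhost without asking for a password:

ssh localhost
exit

If the key is rejected, note that newer OpenSSH versions disable DSA keys by default; generating an RSA key instead (ssh-keygen -t rsa) usually solves it.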

Now, we should configure our hadoop installation for pseudo-distributed operation. In order to do that, we will have to edit a few files. They are all located in $HADOOP_HOME/etc/hadoop.

core-site.xml:

<configuration>

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>

</configuration>

hdfs-site.xml:

<configuration>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop_data/hduser/namenode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop_data/hduser/datanode</value>
    </property>

</configuration>

yarn-site.xml:

<configuration>

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>

</configuration>

mapred-site.xml:

<configuration>

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

</configuration>

Also, you need to create the data directories for the namenode and the datanode (the ones referenced in hdfs-site.xml) and make sure hduser owns them:

sudo mkdir -p /usr/local/hadoop_data/hduser/namenode
sudo mkdir -p /usr/local/hadoop_data/hduser/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop_data
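
One thing that sometimes trips people up: depending on how the daemons are started, the Hadoop scripts may not pick up JAVA_HOME from your .bashrc. If the startup scripts complain later that JAVA_HOME is not set, setting it explicitly in $HADOOP_HOME/etc/hadoop/hadoop-env.sh should fix it:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle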

After this, we should format the namenode with:

hdfs namenode -format

You should see something like "Namenode successfully formatted" in the logs. If you don't see such a message and see a bunch of Java exceptions instead, do not proceed; look for a solution first.

Now we should start the distributed file system and YARN:

start-dfs.sh
start-yarn.sh

After starting both, check the results with:

jps

You should be seeing something like:

9008 NodeManager
6242 SecondaryNameNode
9797 Jps
5865 NameNode
8619 ResourceManager
6028 DataNode

The numbers do not matter, but all of those process names should be there. If you have them all, it means you have configured hadoop correctly and you're ready to get to work.
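
As an extra sanity check (assuming the default ports of a Hadoop 2.x installation), the NameNode and ResourceManager web UIs should also respond, either in a browser or via curl if you have it installed:

curl -I http://localhost:50070
curl -I http://localhost:8088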

You should first check hdfs by running:

hadoop fs -ls

It will probably give you a warning stating that the directory '.' was not found. Don't worry, nothing is wrong. That command is the equivalent of:

hadoop fs -ls /user/hduser

We know that the /user/hduser directory doesn't exist in our hdfs, because nothing exists in our hdfs yet; we just formatted the distributed file system before starting. Let's add the folder for our user and, inside it, a folder for the word count example:

hadoop fs -mkdir /user
hadoop fs -mkdir /user/hduser
hadoop fs -mkdir /user/hduser/wordcount
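
If you prefer, the same thing can be done in a single command with the -p flag, which creates any missing parent directories along the way:

hadoop fs -mkdir -p /user/hduser/wordcount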

Just like that. Now let’s try and check our hdfs again:

hadoop fs -ls

This time, you should see the wordcount folder being listed. If it is, let’s proceed. If not, stop and look for solutions.

Also, we should install Apache Maven for building our mapreduce code. Fortunately this is very simple on Ubuntu and can be done with a single command:

sudo apt-get install maven

Now open IntelliJ IDEA and create a new Maven project with no archetype. It will ask you for a group id, an artifact id and a version. People writing tutorials tend to put their own nicknames in the group id, and when you change it, something always breaks with package names and the like. Fortunately for you guys, I'm kind enough to leave my own nickname out of the group id.

GroupId: org.hadoop.mapreduce-example
ArtifactId: wordcount-demo
Version: 1.0-SNAPSHOT

Hit next, define the project name and project folder, and hit finish. IntelliJ IDEA is kind enough to greet you with the project's pom.xml file. Add the following dependency to it:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.3</version>
</dependency>

Your pom.xml should now look like:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.hadoop.mapreduce-example</groupId>
    <artifactId>wordcount-demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
    </dependencies>

</project>

Now add the following classes.

WordCountMapper.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Created by ardilgulez on 27.11.2016.
 * Of course I would add my nickname somewhere
 */
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    public void map(Object key, Text input, Context context){
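        // Split the line into whitespace-separated tokens and emit (token, 1) for each one.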
        try {
            StringTokenizer tokenizer = new StringTokenizer(input.toString());
            while(tokenizer.hasMoreTokens()){
                Text word = new Text();
                word.set(tokenizer.nextToken());
                context.write(word, new IntWritable(1));
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

}

WordCountReducer.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Created by ardilgulez on 27.11.2016.
 * What's a better place to plug your name than class javadoc
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context){
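        // Sum up all the counts emitted for this word and write (word, total).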
        try {
            int resultNumber = 0;
            for(IntWritable value : values){
                resultNumber += value.get();
            }
            IntWritable result = new IntWritable();
            result.set(resultNumber);
            context.write(key, result);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

}

WordCountJob.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Created by ardilgulez on 27.11.2016.
 * I do not create package name conflicts
 * and still get to plug my name
 */
public class WordCountJob {

    public static void main(String[] args){
        try {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "WordCountJob");
            job.setJarByClass(WordCountJob.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
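            // args[0] is the HDFS input path, args[1] is the output path (which must not exist yet).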
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

After adding the classes to the project, open a new terminal, go to the project root directory (where the pom.xml we just edited resides), and run:

mvn clean install

The build should finish fairly quickly, and the .jar file we will use in the next step will be available at:

$(pwd)/target/wordcount-demo-1.0-SNAPSHOT.jar

If you did what I did and wrote the code under a different user (not hduser), you will have to copy the jar file to hduser's home directory and change its ownership:

sudo cp $(pwd)/target/wordcount-demo-1.0-SNAPSHOT.jar /home/hduser
sudo chown hduser:hadoop /home/hduser/wordcount-demo-1.0-SNAPSHOT.jar

Well, I might be getting paths wrong here, but that’s something you should be able to fix fairly easily on your side.

Now let’s create a fairly simple text file that has the content:

Hello World Hello Hadoop
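
One quick way to create it (I'm naming the file input.txt, to match the commands below):

echo "Hello World Hello Hadoop" > input.txt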

Now copy that text file into our hdfs:

hadoop fs -mkdir wordcount/input
hadoop fs -copyFromLocal input.txt wordcount/input
hadoop fs -ls wordcount/input

If the output of the last command is:

Found 1 items
-rw-r--r--   1 hduser supergroup         25 2016-12-04 01:52 wordcount/input/input.txt

Then you have successfully copied the text file into HDFS. Now let’s run the mapreduce job on that file.

hadoop jar wordcount-demo-1.0-SNAPSHOT.jar WordCountJob /user/hduser/wordcount/input /user/hduser/wordcount/output

If the job runs successfully, it will say something like this in the job running logs:

Job job_1480810819272_0001 completed successfully

Now let’s check the output files:

hadoop fs -ls wordcount

We should see the output folder now:

Found 2 items
drwxr-xr-x   - hduser supergroup          0 2016-12-04 02:21 wordcount/input
drwxr-xr-x   - hduser supergroup          0 2016-12-04 02:23 wordcount/output

Let’s copy the output folder to local file system and see the output of the mapreduce job:

hadoop fs -copyToLocal wordcount/output .

Check out the output folder in the local file system. There should be two files:

$ ls
part-r-00000  _SUCCESS

The result is contained in the part-r-00000 file; the other file is empty and its presence only indicates that the job completed successfully. Let's check the content of part-r-00000.
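
Assuming you copied the output folder into your current directory as above, a plain cat will do:

cat output/part-r-00000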

Hadoop	1
Hello	2
World	1

The first column contains the words (the keys) and the second column the number of occurrences of each word (the values). The results are correct and the job worked flawlessly (big surprise)!

Congratulations to everyone who read and followed along this far. You now have your foot in the big data door.