Writing a Simple Word Counter using Hadoop MapReduce

Sunitha Selvan

MapReduce is the heart of Apache Hadoop. The MapReduce algorithm has two phases: Map and Reduce. The Map phase processes a large dataset in parallel to generate <key, value> pairs. These <key, value> pairs are then fed to the Reduce phase, which combines the data tuples into a smaller set.

Word Count is one of the simplest applications of MapReduce. Imagine we have a huge dataset and we want to count how often each word occurs. We run Map on this dataset to generate <key, value> pairs.

In this case, each word will be the ‘key’ and the ‘value’ will be 1. A hash function is then used to assign each key to a Reduce task, so that all pairs for the same word end up at the same reducer. Each Reduce task adds up the values for its keys. Once the Reduce tasks have executed in parallel, we combine their results to get our output file. And voila! We have our output file with the word count.
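For example, on the single input line “the cat sat on the mat”, Map emits <the,1> <cat,1> <sat,1> <on,1> <the,1> <mat,1>, and Reduce collapses these into <the,2> <cat,1> <sat,1> <on,1> <mat,1>.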

Now, why is this so powerful?

Once we split our input file, we distribute the Map tasks among the nodes and run them in parallel. Since the Map tasks are independent of each other, we don’t have to worry about coordinating them. When it comes to Reduce tasks, we need to group the Map outputs in such a way that one Reduce task handles all the occurrences of a given word. The simplest solution to this is a hash function, as sketched below. Again, the Reduce tasks can be run in parallel.
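For the curious, Hadoop’s default HashPartitioner does essentially the following (a simplified sketch of its getPartition logic, not the full class):

```java
// Simplified sketch of Hadoop's default HashPartitioner:
// the bitmask keeps the hash non-negative, and the modulo ensures
// every occurrence of the same key lands on the same Reduce task.
int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
```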

As the data is being processed by multiple nodes in parallel, there is a huge reduction in processing time.

How to code a Word Counter using MapReduce?

We will need two classes: one for Map and the other for Reduce. We use IntWritable instead of Java’s Integer class in our code. IntWritable behaves like an int but implements Hadoop’s Writable interface, which gives Hadoop the efficient serialization it needs to ship values between nodes.
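As a quick illustration of the wrapper types (the class and variable names here are just for the example):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Serializable Hadoop wrappers around plain Java values.
        IntWritable one = new IntWritable(1); // wraps the int 1
        Text word = new Text("hadoop");       // Hadoop's serializable string
        System.out.println(word + " -> " + one.get()); // get() unwraps the int
    }
}
```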

It’s time to define the Map and Reduce classes. The Mapper’s main job is to generate <key, value> pairs, and we know that for a Word Counter the value is always going to be ‘1’.
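Here is a minimal sketch of such a Mapper, modelled on the standard Hadoop WordCount example (the class name TokenizerMapper is just illustrative):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1); // the constant value '1'
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit <word, 1> for each one.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
```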

Coming to the Reduce function, we don’t need to worry about hashing the Map outputs to Reduce tasks; Hadoop takes care of that. We just need to sum up the values corresponding to each key.
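The Reducer is correspondingly short. Again, a sketch in the style of the standard WordCount example (IntSumReducer is an illustrative name):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```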

Now that both Map and Reduce are defined, we need to set up the job configuration: the input/output paths and formats, the output key and value classes, and the Mapper and Reducer classes. We can do this in main().
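A driver along these lines, again following the standard WordCount example (the class name WordCount is illustrative), would look like:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);

        // Wire up the Mapper and Reducer defined above.
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);

        // The types of the final <key, value> output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, the job can then be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name here is just an example).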

This is one of the most straightforward applications of MapReduce. Google Maps has used MapReduce for geospatial query processing, such as finding all roads connecting to an intersection. MapReduce is also used for building a reverse web-link graph, counting URL access frequency, distributed grep, and more.
