Configure Hadoop and start the HDFS Cluster services using the Ansible playbook

Arjun Singh
Apr 10, 2021

This article shows a path to solving the challenge of configuring Hadoop through Ansible, and you can follow a similar approach to using Ansible in any of your projects.

The Task (Problem Statement): Configure Hadoop and start the cluster services using an Ansible playbook.

Ansible

Ansible is an automation tool developed in Python to automate configuration management operations. The reason it is more mature than plain scripts for configuration is that it leverages the power of a wide range of modules, each meant for configuring a particular product, through a declarative language.

The administrator doesn’t need deep knowledge of a product’s technical internals and can still configure the servers quite easily.

Hadoop

Hadoop is a distributed storage and computing product from the Apache community and can be used to solve many problems in the world of Big Data. Big Data, in simple terms, means an amount of data so huge that it creates challenges such as the need for a huge volume of storage, the need for high I/O speed (velocity), and many others.

What Hadoop does is provide a distributed architecture, a master–slave cluster of systems, to store data in a distributed fashion. Data gets stored and retrieved in parallel, so I/O is fast, and a practically unlimited number of systems can be attached to the cluster as slave nodes to increase the storage. In this way, Hadoop solves some of the problems of Big Data.

Let’s now start with our solution to configure Hadoop using the Ansible playbook.

I’ll use the following architecture of master, slave, and client nodes.

For simplicity, I’ll only explain the configuration and startup of the cluster services in this blog. But the cluster can be used by any other machine as a client; even the master and slave nodes can act as clients.

Step 1: Create an Ansible inventory of IP addresses using host groups for the master and slave nodes.

I’m going to use RHEL8 OS 1 as the master and OS 2 and OS 3 as the slave nodes.

Check their IP addresses and create the inventory.

Note: I’m using RHEL8 OS 1 as the controller node of Ansible as well.
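
A minimal inventory for this setup might look like the following sketch (the IP addresses are placeholders; substitute the ones you noted above):

[master]
192.168.1.101

[slave]
192.168.1.102
192.168.1.103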

Step 2: Transfer the required software to the Managed Nodes (Hadoop Cluster)

We require the Java JDK and the Hadoop software to create the HDFS cluster. So, we copy them from the controller node to the managed nodes.

We also need to install them on the managed nodes.

For installing, I’m using the yum module for the JDK installation and the command module for the Hadoop installation, because installing the Hadoop software requires the --force argument in the installation command.
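
A sketch of what configure_hadoop.yml could look like under these assumptions (the rpm file names, versions, and paths below are placeholders; adjust them to the software you actually copied):

# configure_hadoop.yml — a minimal sketch, not the exact playbook from this setup
- hosts: all
  tasks:
    # Copy the software from the controller node to every managed node
    - name: Copy the JDK rpm to the managed nodes
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/

    - name: Copy the Hadoop rpm to the managed nodes
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/

    # The yum module can install a local rpm file directly
    - name: Install the JDK
      yum:
        name: /root/jdk-8u171-linux-x64.rpm
        state: present
        disable_gpg_check: yes

    # Hadoop needs --force, which yum does not expose, hence the command module.
    # Note: this task is not idempotent; rerunning the playbook will rerun rpm.
    - name: Install Hadoop with --force
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force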

Run the playbook.

ansible-playbook configure_hadoop.yml

Step 3: Create directories /nn and /dn in the cluster nodes and configure the Hadoop configuration files

For this, we will use the file module of Ansible, and the copy and template modules to put the configuration files in place.

hdfs-site.xml for master (name node)
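
For reference, a typical Hadoop 1.x-style hdfs-site.xml for the name node looks like this, assuming /nn as the name node directory created above:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>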

hdfs-site.xml for slave (data node)
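
And the data-node counterpart, assuming /dn as the data node directory:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>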

core-site.xml for both master and slave
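
A core-site.xml of this shape can be shared by the master and the slaves, with the master IP injected by the template module. The Jinja2 variable name master_ip and the port 9001 are assumptions here; use whatever variable and port you configured:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ master_ip }}:9001</value>
  </property>
</configuration>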

variables file:
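
The variables file can be as small as a single entry (the file name and IP below are hypothetical):

# variables.yml — the IP is a placeholder for your master node’s address
master_ip: 192.168.1.101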

Playbook:
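
Putting it together, configure_hadoop_1.yml might look roughly like this sketch (the local source paths for the XML files are assumptions):

# configure_hadoop_1.yml — a sketch under the assumptions noted above
- hosts: master
  vars_files:
    - variables.yml
  tasks:
    - name: Create the name node directory
      file:
        path: /nn
        state: directory

    - name: Copy hdfs-site.xml for the name node
      copy:
        src: files/master/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    # The template module fills in {{ master_ip }} from the variables file
    - name: Place core-site.xml with the master IP filled in
      template:
        src: templates/core-site.xml
        dest: /etc/hadoop/core-site.xml

- hosts: slave
  vars_files:
    - variables.yml
  tasks:
    - name: Create the data node directory
      file:
        path: /dn
        state: directory

    - name: Copy hdfs-site.xml for the data node
      copy:
        src: files/slave/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    - name: Place core-site.xml with the master IP filled in
      template:
        src: templates/core-site.xml
        dest: /etc/hadoop/core-site.xml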

Run the playbook.

ansible-playbook configure_hadoop_1.yml

Step 4: Start the services on the name node and data nodes

We need to format the name node.

Note: The format command is interactive, so we use the shell module and answer the prompt with a pipe.

echo Y | hadoop namenode -format

Start the name node service

hadoop-daemon.sh start namenode

Start the data node service

hadoop-daemon.sh start datanode
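
Combining these three tasks, configure_hadoop_2.yml might look like this sketch:

# configure_hadoop_2.yml — a minimal sketch of the final step
- hosts: master
  tasks:
    # shell (not command) so the pipe can answer the interactive prompt
    - name: Format the name node
      shell: echo Y | hadoop namenode -format

    - name: Start the name node daemon
      command: hadoop-daemon.sh start namenode

- hosts: slave
  tasks:
    - name: Start the data node daemon
      command: hadoop-daemon.sh start datanode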

Run the playbook.

ansible-playbook configure_hadoop_2.yml

Cross-check in Hadoop whether the cluster has formed properly using these commands:

jps

hadoop dfsadmin -report

Hence, our Hadoop cluster is finally ready to be used by clients.

This was all about configuring Hadoop and starting the HDFS cluster services using an Ansible playbook.

Thank You.. :)
