Configure Hadoop and start the HDFS Cluster services using the Ansible playbook

Arjun Singh
Apr 10, 2021

This article shows a path to solving the challenge of configuring Hadoop through Ansible, and you can follow a similar approach to using Ansible in any of your projects.

The Task (Problem Statement): Configure Hadoop and start the cluster services using an Ansible playbook.

Ansible

Ansible is an automation tool developed in Python to automate configuration management operations. The reason it is more mature than plain scripts for configuration is that it leverages the power of a wide range of modules, each meant for configuring a particular product, through a declarative language.

The administrator doesn’t need deep knowledge of a product’s technical internals and can still configure the servers quite easily.

Hadoop

Hadoop is a distributed storage and computing product from the Apache community and can be used to solve many problems in the world of Big Data. Big Data, in simple terms, means an amount of data so huge that it creates challenges such as the need for a huge volume of storage, the need for high I/O speed (velocity), and many others.

What Hadoop does is provide a distributed architecture, a master–slave cluster of systems, to store data in a distributed fashion. Data gets stored and retrieved in parallel, so I/O is fast, and a practically unlimited number of systems can be attached to the cluster as slave nodes to increase the storage. In this way, Hadoop solves some of the problems of Big Data.

Let’s now start with our solution to configure Hadoop using the Ansible playbook.

I’ll use the following architecture of master, slave, and client nodes.

For simplicity, I’ll only explain the configuration and startup of the cluster services in this blog. But the cluster can be used by any other machine as a client; even the master and slave nodes can act as clients.

Step 1: Create an Ansible inventory of IP addresses using host groups for the master and slave nodes.

I’m going to use RHEL8 OS 1 as the master and OS 2 and OS 3 as the slave nodes.

Check their IP addresses and create the inventory.

Note: I’m using RHEL8 OS 1 as the controller node of Ansible as well.
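
A minimal inventory for this setup might look like the following sketch (the IP addresses are placeholders; substitute the ones you noted above):

[master]
192.168.1.101

[slave]
192.168.1.102
192.168.1.103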

Step 2: Transfer the required software to the Managed Nodes (Hadoop Cluster)

We require the Java JDK and the Hadoop software to create the HDFS cluster. So, we copy them from the controller node to the managed nodes.

We also need to install them on the managed nodes.

For installing, I’m using the yum module for the JDK installation and the command module for the Hadoop installation, because installing the Hadoop software requires the --force argument in the installation command.
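
A sketch of what configure_hadoop.yml could look like under these assumptions (the rpm file names, versions, and paths below are placeholders; adjust them to the software you actually copied):

# configure_hadoop.yml — a minimal sketch, not the exact playbook from this setup
- hosts: all
  tasks:
    # Copy the software from the controller node to every managed node
    - name: Copy the JDK rpm to the managed nodes
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/

    - name: Copy the Hadoop rpm to the managed nodes
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/

    # The yum module can install a local rpm file directly
    - name: Install the JDK
      yum:
        name: /root/jdk-8u171-linux-x64.rpm
        state: present
        disable_gpg_check: yes

    # Hadoop needs --force, which yum does not expose, hence the command module.
    # Note: this task is not idempotent; rerunning the playbook will rerun rpm.
    - name: Install Hadoop with --force
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force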

Run the playbook.

ansible-playbook configure_hadoop.yml

Step 3: Create directories /nn and /dn in the cluster nodes and configure the Hadoop configuration files

For this, we will use the file module of Ansible, and the copy and template modules to put the configuration files in place.

hdfs-site.xml for master (name node)
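
For reference, a typical Hadoop 1.x-style hdfs-site.xml for the name node looks like this, assuming /nn as the name node directory created above:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>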

hdfs-site.xml for slave (data node)
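
And the data-node counterpart, assuming /dn as the data node directory:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>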

core-site.xml for both master and slave
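
A core-site.xml of this shape can be shared by the master and the slaves, with the master IP injected by the template module. The Jinja2 variable name master_ip and the port 9001 are assumptions here; use whatever variable and port you configured:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ master_ip }}:9001</value>
  </property>
</configuration>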

variables file:
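
The variables file can be as small as a single entry (the file name and IP below are hypothetical):

# variables.yml — the IP is a placeholder for your master node’s address
master_ip: 192.168.1.101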

Playbook:
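
Putting it together, configure_hadoop_1.yml might look roughly like this sketch (the local source paths for the XML files are assumptions):

# configure_hadoop_1.yml — a sketch under the assumptions noted above
- hosts: master
  vars_files:
    - variables.yml
  tasks:
    - name: Create the name node directory
      file:
        path: /nn
        state: directory

    - name: Copy hdfs-site.xml for the name node
      copy:
        src: files/master/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    # The template module fills in {{ master_ip }} from the variables file
    - name: Place core-site.xml with the master IP filled in
      template:
        src: templates/core-site.xml
        dest: /etc/hadoop/core-site.xml

- hosts: slave
  vars_files:
    - variables.yml
  tasks:
    - name: Create the data node directory
      file:
        path: /dn
        state: directory

    - name: Copy hdfs-site.xml for the data node
      copy:
        src: files/slave/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    - name: Place core-site.xml with the master IP filled in
      template:
        src: templates/core-site.xml
        dest: /etc/hadoop/core-site.xml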

Run the playbook.

ansible-playbook configure_hadoop_1.yml

Step 4: Start the services on the name node and data nodes

We need to format the name node.

Note: The format command is interactive, so we use the shell module and answer the prompt with a pipe.

echo Y | hadoop namenode -format

Start the name node service

hadoop-daemon.sh start namenode

Start the data node service

hadoop-daemon.sh start datanode
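
Combining these three tasks, configure_hadoop_2.yml might look like this sketch:

# configure_hadoop_2.yml — a minimal sketch of the final step
- hosts: master
  tasks:
    # shell (not command) so the pipe can answer the interactive prompt
    - name: Format the name node
      shell: echo Y | hadoop namenode -format

    - name: Start the name node daemon
      command: hadoop-daemon.sh start namenode

- hosts: slave
  tasks:
    - name: Start the data node daemon
      command: hadoop-daemon.sh start datanode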

Run the playbook.

ansible-playbook configure_hadoop_2.yml

Cross-check in Hadoop whether the cluster has formed properly using these commands:

jps

hadoop dfsadmin -report

Hence, our Hadoop cluster is finally ready to be used by clients.

This was all about configuring Hadoop and starting the HDFS cluster services using an Ansible playbook.

Thank You.. :)
