Configuration of Hadoop ClusterUsing Ansible

3 min readDec 29, 2020

In this blog, I will configure the Hadoop cluster using Ansible automation.

What is Hadoop?

Hadoop is an open-source, Java-based programming framework that supports the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Some Basic Terminology:

Namenode: Also known as the master node is the main component of Hadoop CLuster. It stores the metadata about block locations, etc. This metadata is useful for file read and write operations.

Datanode: Also known as the Slave node, is someone who shares their own components with the master node. It is the final location for storing the files. There can be many data-nodes in one cluster.

Client: Generally, the client is the one who stores or uploads the data to the slave node.

What is Ansible?

Ansible is an Open-Source software that automates software provisioning, configuration management, and application deployment.

Some basic Terminology:

Controller Node: It is the OS on which the Ansible software is installed. or the commanding system from where the ansible-playbook is run.

Managed Node: The network devices( switches, routers) or the system that we want to configure is known as a Managed node.

Now we are going to write the ansible-playbook to configure my local VMs to Hadoop name node and data node.

First, we need to install the Ansible and python3 package in the name node. Now we have to create the inventory file consisting of all the IP’s of the system

Here I have created a file ip.txt consisting of all the ips of the name node as well as of data node. After creating the ip.txt we have to update the path of ip.txt in the ansible.cfg file.

NOTE: If you want to use the public key the syntax will be different.

Now, after updating the inventory it is advisable to check the connectivity of all the managed nodes using: