In this blog, I will configure a Hadoop cluster using Ansible automation.
What is Hadoop?
Hadoop is an open-source, Java-based programming framework that supports the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Some Basic Terminology:
Namenode: Also known as the master node, it is the main component of a Hadoop cluster. It stores metadata such as block locations, which is used during file read and write operations.
Datanode: Also known as the slave node, it contributes its own storage to the cluster and is the final location where file blocks are stored. A single cluster can have many datanodes.
Client: Generally, the client is the machine that uploads or reads data; it asks the namenode for block locations and then transfers the data to or from the datanodes.
What is Ansible?
Ansible is open-source software that automates software provisioning, configuration management, and application deployment.
Some basic Terminology:
Controller Node: The system on which Ansible is installed, i.e. the commanding system from which the ansible-playbook command is run.
Managed Node: The network devices (switches, routers) or systems that we want to configure are known as managed nodes.
Now we are going to write an Ansible playbook to configure my local VMs as the Hadoop namenode and datanodes.
First, we need to install the ansible and python3 packages on the controller node (here, the namenode system). Next, we create an inventory file containing the IPs of all the systems.
Here I have created a file ip.txt containing the IPs of the namenode as well as the datanodes. After creating ip.txt, we have to update its path in the ansible.cfg file.
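A minimal version of these two files might look like the following. The IPs, user, and password here are placeholders; replace them with your own values:

```ini
# ip.txt -- Ansible inventory (example IPs, replace with your own)
[namenode]
192.168.1.10  ansible_user=root  ansible_ssh_pass=yourpassword

[datanode]
192.168.1.11  ansible_user=root  ansible_ssh_pass=yourpassword
192.168.1.12  ansible_user=root  ansible_ssh_pass=yourpassword
```

```ini
# ansible.cfg -- point Ansible at the inventory file
[defaults]
inventory = /root/ip.txt
host_key_checking = False
```

Note that password-based SSH (ansible_ssh_pass) requires the sshpass package on the controller node.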
NOTE: If you want to use public-key authentication, the inventory syntax will be different.
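For example, with key-based authentication a host entry points at a private key file instead of a password (the user and key path below are placeholders):

```ini
[datanode]
192.168.1.11  ansible_user=ec2-user  ansible_ssh_private_key_file=/root/.ssh/mykey.pem
```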
Now, after updating the inventory, it is advisable to check the connectivity of all the managed nodes using:
ansible <namenode/datanode> -m ping
After performing all these steps, we are ready to write the Ansible script for configuring the Hadoop cluster.
Here, I have divided my script into 3 parts:
1. Tasks common to both the namenode and the datanodes.
2. Tasks only for the datanodes.
3. Tasks only for the namenode.
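The three parts above can be sketched as a playbook like the one below. This is an illustrative sketch, not a drop-in script: the package name, directory paths, template files, and the hadoop.yml filename are assumptions; adjust them to your Hadoop version and distribution:

```yaml
# hadoop.yml -- illustrative sketch of the three-part playbook

# Part 1: tasks common to the namenode and the datanodes
- hosts: namenode:datanode
  tasks:
    - name: Install Java (Hadoop is Java-based)
      package:
        name: java-1.8.0-openjdk
        state: present
    # Installing Hadoop itself (copying and extracting the tarball,
    # setting JAVA_HOME / HADOOP_HOME) would also go here.

# Part 2: tasks only for the datanodes
- hosts: datanode
  tasks:
    - name: Create the datanode storage directory
      file:
        path: /dn1
        state: directory
    - name: Copy hdfs-site.xml and core-site.xml for the datanode
      template:
        src: "{{ item }}.j2"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - hdfs-site.xml
        - core-site.xml
    - name: Start the datanode daemon
      shell: hadoop-daemon.sh start datanode

# Part 3: tasks only for the namenode
- hosts: namenode
  tasks:
    - name: Create the namenode metadata directory
      file:
        path: /nn
        state: directory
    - name: Copy hdfs-site.xml and core-site.xml for the namenode
      template:
        src: "{{ item }}.j2"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - hdfs-site.xml
        - core-site.xml
    - name: Format the namenode (first run only)
      shell: echo Y | hadoop namenode -format
    - name: Start the namenode daemon
      shell: hadoop-daemon.sh start namenode
```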
After writing the Ansible script we can run it using the following command:
ansible-playbook <playbook-file>.yml
If no errors come up, you will see that both the namenode and the datanodes are configured correctly.
We can also check whether the Hadoop services are running on each node with the jps command:
jps
If you want to know how many datanodes are live or dead, you can use the following command from the namenode:
hadoop dfsadmin -report
That’s all, guys!
Thank you for reading!!