Before we start with the Hadoop setup process on Ubuntu Linux for a single-node cluster, let us understand in brief what Hadoop is and why we use it.
What is Hadoop?
Apache Hadoop, as the name suggests, is part of the Apache project. It is a free, Java-based framework that is used to store and analyse data on commodity hardware in a distributed computing environment.
As mentioned in my previous blog, Introduction to Big Data, there are many solutions to store and analyse big data, for example MPP (Massively Parallel Processing) databases and NoSQL databases like MongoDB, Apache Cassandra etc.
Why use Hadoop?
Hadoop runs on commodity hardware, thereby making it cost-effective.
Hadoop also takes care of fault tolerance by replicating data so that it can be recovered in the event of failures.
Most importantly, Hadoop can manage structured, semi-structured as well as unstructured data, thus making it flexible.
Hadoop also minimises network usage and can handle very large data sets, thus making it scalable in the true sense.
Basic Linux Shell Commands (Linux users can skip this part)
Let us have an overview of some basic Linux shell commands which are required for the Hadoop installation process; this will be helpful for non-Linux users.
Please note that there are various options available with each command, and each command can be used in several other contexts, but I have listed only those options which are required for the Hadoop setup. The commands used in this guide are summarised below.
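Here is a quick reference to the commands used later in this guide:
sudo : run a command with superuser privileges.
apt-get update / apt-get install : refresh the package index / install a package.
addgroup / adduser : create a new group / a new user.
su - <user> : switch to another user with a login shell.
ssh-keygen : generate an SSH key pair.
cat <file1> >> <file2> : append the contents of one file to another.
tar xzf <archive> : extract a gzip-compressed tar archive.
chown -R <user>:<group> <dir> : recursively change the owner and group of a directory.
mkdir -p <dir> : create a directory along with any missing parent directories.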
Hadoop Architecture in brief
Though I will be discussing the Hadoop architecture in detail in my next post, the installation requires some basic knowledge of what type of installation we are performing and what the components of this architecture are.
The two important components of Hadoop are HDFS (the Hadoop Distributed File System, which handles storage) and MapReduce (the programming model that handles processing). In Hadoop 1.x the HDFS daemons are the NameNode, SecondaryNameNode and DataNodes, and the MapReduce daemons are the JobTracker and TaskTrackers.
Hadoop Installation on Ubuntu Linux
There are 3 modes of installing Hadoop:
1) Local Mode : Hadoop is by default configured to run in standalone mode as a single Java process. In this case there are no daemons running, which means there is only one JVM instance. Note that HDFS is not used here; only JAVA_HOME needs to be set for configuration. (A quick local-mode example follows this list.)
2) Pseudo-Distributed : The master (NameNode, JobTracker) and the slaves (DataNodes + TaskTrackers) all run on a single machine. The Hadoop daemons run on one machine to simulate a cluster on a small scale. This is also known as a single-node cluster and is generally useful for developers and testers.
3) Fully Distributed : All the Hadoop daemons run on separate machines, i.e. the master (NameNode, JobTracker) and the slaves (DataNodes + TaskTrackers) are configured on different machines.
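As a quick illustration of local mode (a minimal sketch, assuming Hadoop 1.0.3 is already unpacked in /usr/local/hadoop as described in section E), you can run one of the bundled example jobs without starting any daemons:
hduser@Latitude:/usr/local/hadoop$ mkdir input
hduser@Latitude:/usr/local/hadoop$ cp conf/*.xml input
hduser@Latitude:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+'
hduser@Latitude:/usr/local/hadoop$ cat output/*
This greps the copied config files for the given pattern and writes the matches to the output directory, all inside a single JVM.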
I will be guiding you through the steps to install a pseudo-distributed single-node Hadoop cluster.
A) Java Installation : Since the Apache Hadoop framework is written in the Java programming language, we need to install a JDK.
# Update the source list using the apt-get update command.
poonam@Latitude$ sudo apt-get update
# Install OpenJDK 7
poonam@Latitude$ sudo apt-get install openjdk-7-jdk
A.1) Quick Check : After installing the JDK, check the Java version
poonam@Latitude$ java -version
java version "1.7.0_95"
OpenJDK Runtime Environment (IcedTea 2.6.4) (7u95-2.6.4-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
B) Adding a dedicated user
Though this step is not compulsory, I follow it so as to keep my Hadoop installation separate from other software applications. We will create a new user "hduser" and a new group "hadoop", and add hduser to the hadoop group using the addgroup and adduser commands.
poonam@Latitude$ sudo addgroup hadoop
poonam@Latitude$ sudo adduser --ingroup hadoop hduser
Switch user
poonam@Latitude$ su - hduser
C) Installing and Configuring SSH
Hadoop uses SSH (Secure Shell) to communicate with the slave nodes. It requires a password-less SSH connection between the master and all the slaves; otherwise, in a fully distributed cluster, you would have to log on to each individual machine and start all the processes there manually. Even in the pseudo-distributed single-node cluster we need to configure SSH access and keep the connection password-less, or we will be prompted for the password very frequently. To make it password-less we need to configure SSH access to localhost for the hduser.
Install SSH using the following commands
hduser@Latitude$ sudo apt-get install ssh
hduser@Latitude$ sudo apt-get install openssh-server
hduser@Latitude$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory ‘/home/hduser/.ssh’.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
You will see something similar to this on the console; this is just an example and not an actual key fingerprint.
Example:
The key fingerprint is:
d2:65:43:bd:b0:cb:b1:c5:7e:39:f6:1d:1e:6e:a7:bd hduser@Latitude
The key's randomart image is:
+--[ RSA 2048]----+
|             ..  |
|            .... |
|           =+..  |
|             oo+ |
|        S. * .   |
|         + . =o  |
|            ooo+ |
|              =+ |
|             oE+ |
+-----------------+
This creates an RSA key pair with an empty password. That is generally not recommended, but as discussed above we need password-less SSH access to avoid entering the password every time Hadoop communicates with its nodes.
hduser@Latitude:$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
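Depending on your SSH configuration, you may also need to restrict the permissions of the authorized_keys file (sshd often refuses keys stored in a group- or world-writable file):
hduser@Latitude:$ chmod 0600 $HOME/.ssh/authorized_keys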
We can check if SSH is working properly:
hduser@Latitude:$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is c6:56:35:67:be:03:00:da:1c:95:2f:aa:33:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
D) Disabling IPv6
Why do we disable IPv6 for Hadoop?
1) As stated in the Apache Hadoop Wiki:
Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Some Linux releases default to being IPv6 only, which means that unless the systems are configured to re-enable IPv4, Hadoop will not work on them.
2) As we are setting up a pseudo-distributed single-node cluster, i.e. all the masters and slaves reside on one machine, we are not connecting over any IPv6 network, so there is no point in keeping IPv6 enabled here.
To disable IPv6 on Ubuntu, open /etc/sysctl.conf in any editor and add the following lines to the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot the machine in order to make these changes take effect. You can check whether IPv6 is disabled on your machine with the following command:
hduser@Latitude:$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
A return value of 1 means IPv6 is disabled.
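Alternatively, if you prefer not to disable IPv6 system-wide, you can disable it for Hadoop only by adding the following line to conf/hadoop-env.sh, which tells the JVM to prefer the IPv4 stack:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true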
E) Install Hadoop
Download Hadoop from the Apache download mirrors
The best practice is to keep the Hadoop installation in the /usr/local/hadoop directory, though this may vary as per your choice. Make sure to change the owner of all the files to the hduser user and hadoop group.
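For example, assuming version 1.0.3 (the version used in the rest of this post; pick any mirror or version you prefer), the tarball can be fetched straight into /usr/local from the Apache archive:
hduser@Latitude:$ sudo wget -P /usr/local http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz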
hduser@Latitude:$ cd /usr/local
hduser@Latitude:$ sudo tar xzf hadoop-1.0.3.tar.gz
hduser@Latitude:$ sudo mv hadoop-1.0.3 hadoop
hduser@Latitude:$ sudo chown -R hduser:hadoop hadoop
F) Setup Configuration Files for Pseudo Distributed Single Node cluster
The following files will have to be modified to complete the Hadoop setup:
1. Update ~/.bashrc:
Assuming that you are using the bash shell, add the following lines to the end of the $HOME/.bashrc file of user hduser.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
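Reload the file so that the changes take effect in your current shell:
hduser@Latitude:$ source $HOME/.bashrc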
2. hadoop-env.sh
To configure Hadoop we have to set the Java environment variable, i.e. JAVA_HOME, in conf/hadoop-env.sh.
The relevant lines of the hadoop-env.sh file will initially be
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
change them to point at the JDK we installed earlier (on 64-bit Ubuntu the OpenJDK 7 path is typically /usr/lib/jvm/java-7-openjdk-amd64; adjust it if your path differs):
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
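If you are unsure where the JDK lives on your machine, one way to find the right path is to resolve the real location of the javac binary:
hduser@Latitude:$ readlink -f /usr/bin/javac | sed "s:/bin/javac::"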
3. core-site.xml
The core-site.xml file contains configuration properties that Hadoop uses when starting up. Before we edit this file, we have to create a "tmp" directory at the path /app/hadoop/tmp and change the owner of this directory to hduser and its group to hadoop:
hduser@Latitude:$ sudo mkdir -p /app/hadoop/tmp
hduser@Latitude:$ sudo chown -R hduser:hadoop /app/hadoop/tmp
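Optionally, you can also tighten the permissions on this directory (a common hardening step; not strictly required for the setup to work):
hduser@Latitude:$ sudo chmod 750 /app/hadoop/tmp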
Now add the following code between the <configuration> and </configuration> tags of conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>
    The name of the default file system. A URI whose scheme and authority
    determine the FileSystem implementation. The uri's scheme determines
    the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the host,
    port, etc. for a filesystem.
  </description>
</property>
4. Add the following code between the <configuration> and </configuration> tags of conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>
    The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>
5. Add the following code between the <configuration> and </configuration> tags of conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>
    Default block replication. The actual number of replications can be
    specified when the file is created. The default is used if replication is
    not specified in create time. We set it to 1 because this is a single-node cluster.
  </description>
</property>
G) Formatting the HDFS filesystem via the NameNode
The first step in starting up your Hadoop installation is to format the HDFS (Hadoop Distributed File System), which sits on top of the local filesystem of your "cluster". You need to do this only the first time you set up a Hadoop cluster.
Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS)!
To format the HDFS :
hduser@Latitude:~$ /usr/local/hadoop/bin/hadoop namenode -format
H) Finally, Starting up your Pseudo-Distributed Single-Node Cluster
Run this command
hduser@Latitude:~$ start-all.sh
If you get the error "start-all.sh: command not found" (meaning $HADOOP_HOME/bin is not yet on your PATH), then run the command given below:
hduser@Latitude:~$ /usr/local/hadoop/bin/start-all.sh
To check if the installation was successful, type in the following command
hduser@Latitude:~$ jps
The output will look similar to this (the process IDs will differ):
2149 JobTracker
2085 SecondaryNameNode
1788 NameNode
1910 DataNode
2272 TaskTracker
2514 Jps
In all there will be 6 processes listed: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker and Jps (Jps itself is the JVM process-listing tool). If any of the Hadoop daemons is missing, then the installation has gone wrong somewhere and needs to be debugged.
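You can also verify the cluster from a browser; in Hadoop 1.x the web interfaces listen on the following default ports:
http://localhost:50070/ : NameNode web UI (HDFS health, browse the filesystem)
http://localhost:50030/ : JobTracker web UI (MapReduce job status)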
To stop the cluster
hduser@Latitude:~$ stop-all.sh
If you get the error "stop-all.sh: command not found", then run the command given below:
hduser@Latitude:~$ /usr/local/hadoop/bin/stop-all.sh
These are the steps required to set up your own single-node cluster.
In case of any queries, concerns or suggestions please comment or write to [email protected]