Running Hadoop on CentOS 6 (Multi-Node Cluster)


This demonstration has been tested with the following software versions:

  • Oracle VM VirtualBox 4.1.22 r80657
  • CentOS-6.3_x86_64 with Minimal Installation
  • Hadoop 1.0.3

Networking

hdp01 192.168.1.39 NameNode, JobTracker, DataNode, TaskTracker
hdp02 192.168.1.40 DataNode, TaskTracker
hdp03 192.168.1.41 DataNode, TaskTracker

Candidate CentOS VM [1]

After installing CentOS in VirtualBox, bring the system up to date by applying all updates, and install the wget tool for later use. Note that the following instructions are executed on hdp01.

[root@hdp01 ~]# yum -y update
[root@hdp01 ~]# yum -y install wget

Here we disable the iptables firewall on Red Hat/CentOS Linux to simplify the demonstration.

[root@hdp01 ~]# service iptables stop && chkconfig iptables off

Install the Oracle Java JDK and configure the system environment variables. Note that Oracle has recently disallowed direct downloads of Java from its servers; [2] provides a feasible workaround:

wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "<DOWNLOAD URL>" --output-document="<DOWNLOAD FILE NAME>"

For example, we can use the following command to obtain the JDK 1.7.0_07 RPM:

[root@hdp01 ~]# wget -c --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F" "http://download.oracle.com/otn-pub/java/jdk/7u7-b10/jdk-7u7-linux-x64.rpm" --output-document="jdk-7u7-linux-x64.rpm"
[root@hdp01 ~]# rpm -ivh jdk-7u7-linux-x64.rpm
[root@hdp01 ~]# vi /etc/profile

Add the following variables:

JAVA_HOME=/usr/java/jdk1.7.0_07
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar
export PATH JAVA_HOME CLASSPATH
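
Then reload the profile and confirm the JDK is on the PATH (java -version should report 1.7.0_07):

[root@hdp01 ~]# source /etc/profile
[root@hdp01 ~]# java -version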

Finally, we create a dedicated user hadoop to perform the Hadoop operations.

[root@hdp01 ~]# adduser hadoop
[root@hdp01 ~]# passwd hadoop
[root@hdp01 ~]# gpasswd -a hadoop root
[root@hdp01 ~]# grep "^root" /etc/group

OpenSSH Packages

The minimal installation of CentOS omits some packages, e.g., the scp command.

[root@hdp01 ~]# yum -y install openssh-server openssh-clients
[root@hdp01 ~]# chkconfig sshd on
[root@hdp01 ~]# service sshd start
[root@hdp01 ~]# yum -y install rsync
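
As a quick check, confirm that sshd is running and will start at boot:

[root@hdp01 ~]# service sshd status
[root@hdp01 ~]# chkconfig --list sshd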

FQDN Mapping

[root@hdp01 ~]# vi /etc/hosts

Add the following entries (on all machines) to make sure each host is reachable by name.

127.0.0.1 localhost
192.168.1.39 hdp01
192.168.1.40 hdp02
192.168.1.41 hdp03
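
To verify the mapping, ping each hostname once from every machine (assuming all three VMs are up):

[root@hdp01 ~]# for h in hdp01 hdp02 hdp03; do ping -c 1 $h; done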

VBoxManage Duplication

VBoxManage clonehd <source_file> <output_file>
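
For example, to create the disk image for a second node (the file names below are hypothetical; substitute your actual .vdi paths):

VBoxManage clonehd hdp01.vdi hdp02.vdi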

After clonehd, the cloned VM comes up with a networking issue: in the udev rules file, delete the stale eth0 line and rename the eth1 line to eth0 [3]. In addition, update the hostname value.

[root@hdp01 ~]# vi /etc/udev/rules.d/70-persistent-net.rules
[root@hdp01 ~]# vi /etc/sysconfig/network
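
After editing, the two files on a clone (here hdp02) should look roughly as follows; the MAC address shown is a placeholder for the one udev recorded for the new adapter. If /etc/sysconfig/network-scripts/ifcfg-eth0 pins a HWADDR, update it to the same MAC as well.

/etc/udev/rules.d/70-persistent-net.rules:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="08:00:27:xx:xx:xx", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

/etc/sysconfig/network:

NETWORKING=yes
HOSTNAME=hdp02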

SSH Access

[hadoop@hdp01 ~]$ ssh-keygen -t rsa -P ''
[hadoop@hdp01 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ chmod 600 ~/.ssh/authorized_keys
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp02:~/
[hadoop@hdp01 ~]$ scp -r ~/.ssh hdp03:~/
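
Passwordless login can then be verified from hdp01; each command should print the remote hostname without asking for a password:

[hadoop@hdp01 ~]$ ssh hdp02 hostname
[hadoop@hdp01 ~]$ ssh hdp03 hostname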

Hadoop Configuration

[hadoop@hdp01 ~]$ wget http://ftp.tc.edu.tw/pub/Apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3.tar.gz
[hadoop@hdp01 ~]$ tar zxvf hadoop-1.0.3.tar.gz


Configure the following files.

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hdp01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/hadoop-${user.name}</value>
  </property>
</configuration>

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
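
Note that dfs.replication controls how many DataNodes store a copy of each HDFS block. The value 1 above keeps a single copy, which is enough for this demonstration; with three DataNodes available you could instead use the Hadoop default of 3 for redundancy, e.g.:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>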

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdp01:9001</value>
  </property>
</configuration>

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.7.0_07
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
export HADOOP_HOME_WARN_SUPPRESS=1

Deployment

[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp02:~/
[hadoop@hdp01 ~]$ scp -r hadoop-1.0.3 hdp03:~/
 
# directory setup, on each machine (hdp*)
[root@hdp01 ~]# cp -r /home/hadoop/hadoop-1.0.3 /opt/hadoop
[root@hdp01 ~]# mkdir /var/hadoop
[root@hdp01 ~]# chown -R hadoop:hadoop /opt/hadoop
[root@hdp01 ~]# chown -R hadoop:hadoop /var/hadoop

conf/masters and conf/slaves (master only)

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/masters
 
hdp01

[hadoop@hdp01 ~]$ vi hadoop-1.0.3/conf/slaves
 
hdp01
hdp02
hdp03

Formatting the HDFS filesystem via the NameNode (master only)

[hadoop@hdp01 hadoop]$ bin/hadoop namenode -format

Now we can start the multi-node cluster via bin/start-all.sh and stop it via bin/stop-all.sh. Finally, we can put some material into HDFS and run the wordcount example on the master machine to verify the whole setup, as sketched below.
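
For example (a minimal sketch: the examples jar ships in the root of the 1.0.3 distribution, and input/output are arbitrary HDFS path names chosen here):

[hadoop@hdp01 hadoop]$ bin/start-all.sh
[hadoop@hdp01 hadoop]$ jps
[hadoop@hdp01 hadoop]$ bin/hadoop fs -mkdir input
[hadoop@hdp01 hadoop]$ bin/hadoop fs -put conf/*.xml input
[hadoop@hdp01 hadoop]$ bin/hadoop jar hadoop-examples-1.0.3.jar wordcount input output
[hadoop@hdp01 hadoop]$ bin/hadoop fs -cat 'output/part-*'
[hadoop@hdp01 hadoop]$ bin/stop-all.sh

With the configuration above, jps on hdp01 should list NameNode, SecondaryNameNode, JobTracker, DataNode, and TaskTracker, while hdp02 and hdp03 run only DataNode and TaskTracker.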

References

  1. Installing Hadoop in fully distributed mode on three VirtualBox VMs via simple cloning (in Chinese)
  2. wget for Oracle JDK is broken
  3. VirtualBox Clone Root HD / Ubuntu / Network issue