HBase is a powerful open-source, non-relational, distributed database designed for handling massive datasets. Modeled after Google’s BigTable, HBase is written in Java and integrates seamlessly with the Apache Hadoop ecosystem, using the Hadoop Distributed File System (HDFS) for storage to provide BigTable-like capabilities within your Hadoop environment. In essence, HBase offers a fault-tolerant way to store and retrieve large quantities of sparse data: situations where meaningful information sits amid vast amounts of empty or irrelevant data. Typical examples are finding the 50 largest items in a dataset of 2 billion records, or isolating the non-zero entries in a massive collection where they make up less than 0.1% of the total.

This guide provides a step-by-step walkthrough of setting up an HBase cluster on top of Hadoop (YARN) using Ubuntu 12.04 LTS. While the guide targets this specific environment, the principles can be adapted to other Linux distributions with minor modifications.

Hardware Configuration

This guide assumes a cluster comprising one master node and three slave nodes, each with the following specifications:

  • Server RAM: 16GB
  • Server CPU: [Virtual] Single Core 64-bit

Adapt these specifications based on your workload and data volume. More cores and RAM will improve performance significantly, especially for RegionServers.

Services Running on Each Server

The following services are distributed across the cluster nodes:

  1. Master Node [AHMD-HBASE-MASTER][16.114.26.95]: NameNode, HMaster, ResourceManager.
  2. Slave Node [AHMD-HBASE-RS01] [16.114.26.99]: NodeManager, RegionServer, DataNode, ZooKeeper, SecondaryNameNode (SNN). Note that running the SecondaryNameNode on a RegionServer is not ideal for production environments due to resource contention.
  3. Slave Node [AHMD-HBASE-RS02] [16.114.26.94]: NodeManager, RegionServer, DataNode, ZooKeeper.
  4. Slave Node [AHMD-HBASE-RS03] [16.114.22.192]: NodeManager, RegionServer, DataNode, ZooKeeper.

Important Considerations:

  • ZooKeeper: A dedicated ZooKeeper cluster (separate from the HBase RegionServers) is highly recommended for production environments to ensure stability and performance.
  • ResourceManager: For larger clusters, consider a dedicated ResourceManager node for optimal resource management.
  • SecondaryNameNode: Ideally, the SecondaryNameNode should be placed on a separate machine from the RegionServers to avoid resource conflicts.

Prerequisites - Before You Start

Before proceeding, ensure the following prerequisites are met:

Update All Servers with the /etc/hosts File

This step ensures proper name resolution within the cluster. Copy the following content into the /etc/hosts file on all servers. Verify there are no duplicate entries.

# HBASE Nodes
16.114.26.95    AHMD-HBASE-MASTER  # HMaster, NNode, RManager.
16.114.26.99    AHMD-HBASE-RS01      # NodeManager, RServer, DNode, ZKeeper, SNN.
16.114.26.94    AHMD-HBASE-RS02      # NodeManager, RServer, DNode, ZKeeper.
16.114.22.192   AHMD-HBASE-RS03      # NodeManager, RServer, DNode, ZKeeper.

Explanation:

  • Each line maps an IP address to a hostname.
  • Ensure the IP addresses and hostnames match your server configuration.
  • Comments provide context for each node’s role.
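
To confirm that name resolution works on each node, you can optionally resolve one of the hostnames after editing the file. A minimal check, assuming standard Ubuntu tooling:

getent hosts AHMD-HBASE-RS01    # should print 16.114.26.99  AHMD-HBASE-RS01
ping -c 1 AHMD-HBASE-MASTER     # should reach 16.114.26.95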

Passwordless SSH from the Master to All Slaves

Passwordless SSH is crucial for automated tasks and communication between nodes. Follow these steps to configure it:

  1. Generate SSH Key Pair on the Master Node:

    ahmed@AHMD-HBASE-MASTER:~> ssh-keygen -t rsa
    

    Accept the default file location and leave the passphrase empty; an empty passphrase is what allows non-interactive logins (if you do set one, you will need ssh-agent).

  2. Copy Public Key to Each Slave Node:

    ahmed@AHMD-HBASE-MASTER:~> ssh ahmed@AHMD-HBASE-RS01 mkdir -p .ssh
    ahmed@AHMD-HBASE-MASTER:~> cat ~/.ssh/id_rsa.pub | ssh ahmed@AHMD-HBASE-RS01 'cat >> .ssh/authorized_keys'
    
    ahmed@AHMD-HBASE-MASTER:~> ssh ahmed@AHMD-HBASE-RS02 mkdir -p .ssh
    ahmed@AHMD-HBASE-MASTER:~> cat ~/.ssh/id_rsa.pub | ssh ahmed@AHMD-HBASE-RS02 'cat >> .ssh/authorized_keys'
    
    ahmed@AHMD-HBASE-MASTER:~> ssh ahmed@AHMD-HBASE-RS03 mkdir -p .ssh
    ahmed@AHMD-HBASE-MASTER:~> cat ~/.ssh/id_rsa.pub | ssh ahmed@AHMD-HBASE-RS03 'cat >> .ssh/authorized_keys'
    

    Explanation:

    • The first command generates an RSA key pair.
    • The subsequent commands create the .ssh directory (if it doesn’t exist) and append the public key (id_rsa.pub) to the authorized_keys file on each slave node.
  3. Test Passwordless SSH:

    ahmed@AHMD-HBASE-MASTER:~> ssh AHMD-HBASE-RS01
    ahmed@AHMD-HBASE-RS01:~>
    

    If you can log in without being prompted for a password, passwordless SSH is configured correctly.
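
If ssh-copy-id is available on the master (it ships with OpenSSH on Ubuntu), steps 1 and 2 can be condensed into a single command per slave; this is an equivalent shortcut, not an additional requirement:

ssh-copy-id ahmed@AHMD-HBASE-RS01
ssh-copy-id ahmed@AHMD-HBASE-RS02
ssh-copy-id ahmed@AHMD-HBASE-RS03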

Setting Up the Hadoop Cluster (YARN)

This section details the installation and configuration of the Hadoop cluster.

Extracting and Creating Required Directories

Perform the following commands on all servers:

tar xvzf hadoop-2.3.0-cdh5.1.2.tar.gz -C /opt
ln -s /opt/hadoop-2.3.0-cdh5.1.2 /opt/hadoop
mkdir -p /data/hadoop/nn
mkdir -p /data/hadoop/dn
mkdir -p /data/zookeeper
mkdir -p /data/1/yarn/local
mkdir -p /data/2/yarn/local
mkdir -p /data/3/yarn/local
mkdir -p /data/1/yarn/logs
mkdir -p /data/2/yarn/logs
mkdir -p /data/3/yarn/logs

Explanation:

  • Extracts the Hadoop archive to the /opt directory.
  • Creates a symbolic link /opt/hadoop for easier access.
  • Creates directories for NameNode metadata (/data/hadoop/nn), DataNode blocks (/data/hadoop/dn), ZooKeeper data (/data/zookeeper), and YARN local and log directories (/data/1/yarn/local, /data/1/yarn/logs, etc.). In production, the NameNode metadata directory benefits from redundant storage (RAID or multiple dfs.namenode.name.dir entries), while DataNode and YARN directories are usually spread across independent disks (JBOD), since HDFS already replicates the data.
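
Since these commands must be run on every server, one option is to drive them from the master over the passwordless SSH configured earlier. A rough sketch, assuming the tarball has already been copied to each node's home directory, the remote shell is bash, and the user has write access to /opt and /data (run the same commands locally on the master as well):

for host in AHMD-HBASE-RS01 AHMD-HBASE-RS02 AHMD-HBASE-RS03; do
  ssh "$host" '
    tar xvzf hadoop-2.3.0-cdh5.1.2.tar.gz -C /opt
    ln -s /opt/hadoop-2.3.0-cdh5.1.2 /opt/hadoop
    mkdir -p /data/hadoop/nn /data/hadoop/dn /data/zookeeper
    for i in 1 2 3; do mkdir -p /data/$i/yarn/local /data/$i/yarn/logs; done
  '
done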

Setting Up HDFS

Configure HDFS on the AHMD-HBASE-MASTER node. Edit the configuration files located in /opt/hadoop/etc/hadoop/.

1. core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://AHMD-HBASE-MASTER:9000</value>
  </property>
</configuration>

Explanation:

  • fs.default.name: Specifies the URI (hostname and RPC port) of the NameNode. The port chosen here (9000) must match the port used in every other HDFS URI in this guide, such as hbase.rootdir and the YARN remote log directory.
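
Note that fs.default.name is the older name of this property; Hadoop 2.x still accepts it as a deprecated alias, but if you prefer the current name, the equivalent entry is:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://AHMD-HBASE-MASTER:9000</value>
</property>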

2. hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/dn</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>AHMD-HBASE-RS01:50090</value>
  </property>
</configuration>

Explanation:

  • dfs.replication: Sets the data replication factor to 3 (each data block is stored on three different DataNodes for fault tolerance).
  • dfs.namenode.name.dir: Specifies the directory where the NameNode stores its metadata.
  • dfs.datanode.data.dir: Specifies the directory where the DataNodes store data blocks.
  • dfs.namenode.secondary.http-address: Specifies the HTTP address of the SecondaryNameNode.

3. slaves:

This file lists the hostnames of the DataNodes. Add the following lines:

AHMD-HBASE-RS01
AHMD-HBASE-RS02
AHMD-HBASE-RS03

Setting Up YARN

Configure YARN on the AHMD-HBASE-MASTER node. Edit the configuration files located in /opt/hadoop/etc/hadoop/.

1. mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Explanation:

  • mapreduce.framework.name: Specifies that YARN is the MapReduce framework.

2. yarn-site.xml:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>AHMD-HBASE-MASTER</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <description>Where to aggregate logs</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>hdfs://AHMD-HBASE-MASTER:9000/var/log/hadoop-yarn/apps</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>6144</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
</configuration>

Explanation:

  • yarn.nodemanager.aux-services: Enables the MapReduce shuffle service.
  • yarn.nodemanager.aux-services.mapreduce.shuffle.class: Specifies the shuffle handler class.
  • yarn.resourcemanager.hostname: Specifies the hostname of the ResourceManager.
  • yarn.nodemanager.local-dirs: Specifies the local directories used by the NodeManagers for storing intermediate data. Use multiple directories on separate disks for improved I/O performance.
  • yarn.nodemanager.log-dirs: Specifies the directories where NodeManager logs are stored.
  • yarn.log-aggregation-enable: Enables aggregation of container logs into HDFS after an application finishes.
  • yarn.nodemanager.remote-app-log-dir: Specifies the HDFS directory where aggregated logs are stored.
  • yarn.nodemanager.resource.memory-mb: Specifies the total memory, in MB, that the NodeManager may hand out to containers (6144MB = 6GB). Adjust this based on the server’s RAM and the other services sharing the node (DataNode, RegionServer, ZooKeeper).
  • yarn.scheduler.minimum-allocation-mb: Specifies the minimum memory allocation unit for the YARN scheduler (1024MB = 1GB).
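
As a quick sanity check on these two values: with 6144 MB per NodeManager and a 1024 MB minimum allocation, each node can host at most six minimum-size containers. If you also want to prevent a single container from requesting more memory than one node can offer, an optional cap can be added; this property is not part of the original configuration and is shown here only as a suggested assumption:

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6144</value>
</property>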

Copy All Configuration Files to Slaves

Copy the configuration files from the master node to all slave nodes.

scp -r /opt/hadoop/etc/hadoop/* root@AHMD-HBASE-RS01:/opt/hadoop/etc/hadoop/
scp -r /opt/hadoop/etc/hadoop/* root@AHMD-HBASE-RS02:/opt/hadoop/etc/hadoop/
scp -r /opt/hadoop/etc/hadoop/* root@AHMD-HBASE-RS03:/opt/hadoop/etc/hadoop/

Note: This example uses root for copying files. In a production environment, consider creating a dedicated Hadoop user and granting it appropriate permissions.

Starting HDFS Services

1. Format HDFS (First Time Only):

On AHMD-HBASE-MASTER:

/opt/hadoop/bin/hdfs namenode -format

Warning: This command must be executed only once, when setting up the cluster for the first time. Formatting the NameNode will erase existing data.

2. Start the NameNode:

On AHMD-HBASE-MASTER:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode

3. Start the DataNodes:

On AHMD-HBASE-MASTER:

/opt/hadoop/sbin/hadoop-daemons.sh start datanode

This command starts the DataNode service on all nodes listed in the slaves file.

4. Start the SecondaryNameNode:

On AHMD-HBASE-RS01 (or the node specified in hdfs-site.xml):

/opt/hadoop/sbin/hadoop-daemon.sh start secondarynamenode

Note the singular hadoop-daemon.sh here; the plural hadoop-daemons.sh would start a SecondaryNameNode on every host listed in the slaves file.
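
At this point you can optionally confirm that HDFS is healthy. Two quick checks (jps ships with the JDK; run the dfsadmin report on the master):

jps                                     # each node should show NameNode, DataNode or SecondaryNameNode as appropriate
/opt/hadoop/bin/hdfs dfsadmin -report   # should report three live DataNodes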

Starting YARN Services

1. Start the ResourceManager:

On AHMD-HBASE-MASTER:

/opt/hadoop/sbin/yarn-daemon.sh start resourcemanager

2. Start the NodeManagers:

On AHMD-HBASE-MASTER:

/opt/hadoop/sbin/yarn-daemons.sh start nodemanager

This command starts the NodeManager service on all slave nodes.
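
As with HDFS, you can optionally confirm that all NodeManagers have registered with the ResourceManager before running any jobs:

/opt/hadoop/bin/yarn node -list         # should list AHMD-HBASE-RS01/02/03 in RUNNING state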

Testing Your Hadoop Cluster

Use the example programs included with Hadoop to verify the cluster’s functionality.

1. List Available Examples:

/opt/hadoop/bin/yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.2.jar

This command lists the available example programs.

2. Run the pi Example:

/opt/hadoop/bin/yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0-cdh5.1.2.jar pi 5 1000

Explanation:

  • pi: Specifies the “pi” example program.
  • 5: Specifies the number of maps to run.
  • 1000: Specifies the number of samples per map.

The output will show the progress of the job and the estimated value of Pi. You can monitor the job’s progress in the YARN ResourceManager web UI (typically accessible at http://AHMD-HBASE-MASTER:8088).
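
Because log aggregation was enabled in yarn-site.xml, the logs of a finished application can also be pulled back from HDFS on the command line; the application ID below is a placeholder, substitute the one printed by your job:

/opt/hadoop/bin/yarn logs -applicationId <application_id>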

Setting Up ZooKeeper on AHMD-HBASE-RS01, AHMD-HBASE-RS02, AHMD-HBASE-RS03

ZooKeeper is a crucial component for HBase, providing coordination and configuration management. Install and configure ZooKeeper on the slave nodes. An odd number of ZooKeeper nodes is recommended so that a strict majority (quorum) can always be formed.

Initial Setup on All ZooKeeper Servers

1. Extract ZooKeeper:

tar xvzf zookeeper-3.4.5-cdh5.1.2.tar.gz -C /opt/
ln -s /opt/zookeeper-3.4.5-cdh5.1.2 /opt/zookeeper

2. Create a Directory for ZooKeeper Data:

mkdir -p /data/zookeeper

3. Configure ZooKeeper (zoo.cfg):

cp /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg
sed -i -- 's/tmp\/zookeeper/data\/zookeeper/g' /opt/zookeeper/conf/zoo.cfg

This copies the sample configuration file and updates the dataDir path.

4. Add ZooKeeper Cluster Nodes to zoo.cfg:

echo "server.1=AHMD-HBASE-RS01:2888:3888" >> /opt/zookeeper/conf/zoo.cfg
echo "server.2=AHMD-HBASE-RS02:2888:3888" >> /opt/zookeeper/conf/zoo.cfg
echo "server.3=AHMD-HBASE-RS03:2888:3888" >> /opt/zookeeper/conf/zoo.cfg

Explanation:

  • server.X=hostname:port1:port2: Defines the ZooKeeper servers in the cluster.
    • X: The server ID (unique for each server).
    • hostname: The hostname of the ZooKeeper server.
    • port1: The port used for follower-to-leader communication.
    • port2: The port used for leader election.
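
After these edits, the relevant part of /opt/zookeeper/conf/zoo.cfg should look roughly like this (the tickTime/initLimit/syncLimit lines come from the sample file and are left at their defaults):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
server.1=AHMD-HBASE-RS01:2888:3888
server.2=AHMD-HBASE-RS02:2888:3888
server.3=AHMD-HBASE-RS03:2888:3888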

Assigning IDs to ZooKeeper Nodes

Create a myid file in the dataDir directory on each ZooKeeper server, containing the server’s ID.

  • AHMD-HBASE-RS01 (ID: 1):

    echo "1" > /data/zookeeper/myid
    
  • AHMD-HBASE-RS02 (ID: 2):

    echo "2" > /data/zookeeper/myid
    
  • AHMD-HBASE-RS03 (ID: 3):

    echo "3" > /data/zookeeper/myid
    

Starting Up ZooKeeper Server

On each ZooKeeper server:

/opt/zookeeper/bin/zkServer.sh start

The ZooKeeper startup logs will be written to zookeeper.out in the directory where the command was executed.
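
Once all three servers are up, each node can report its role in the ensemble (one leader, two followers); this is a quick way to confirm that the quorum formed:

/opt/zookeeper/bin/zkServer.sh status
# Mode: leader   (on exactly one node)
# Mode: follower (on the other two)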

Testing ZooKeeper

ZooKeeper runs on port 2181 by default. Use telnet to verify the service is running.

root@AHMD-HBASE-RS01:~# telnet AHMD-HBASE-RS02 2181
Trying 16.114.26.94...
Connected to AHMD-HBASE-RS02.
Escape character is '^]'.
^CConnection closed by foreign host.
root@AHMD-HBASE-RS01:~#

If you see “Connected to…”, the ZooKeeper port is reachable; close the connection with Ctrl+C or the telnet escape character (^]).
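
ZooKeeper also answers the four-letter “ruok” command on its client port, which is a slightly more direct health check than simply connecting. With netcat installed:

echo ruok | nc AHMD-HBASE-RS02 2181     # a healthy server replies "imok"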

Setting Up HBase

With Hadoop and ZooKeeper configured, you can now set up HBase.

On AHMD-HBASE-MASTER:

Extract the Archive and Update Configurations

tar xvzf hbase-0.98.1-cdh5.1.2.tar.gz -C /opt
ln -s /opt/hbase-0.98.1-cdh5.1.2 /opt/hbase

Navigate to the /opt/hbase/conf/ directory.

1. hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://AHMD-HBASE-MASTER:9000/hbase_jio</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>AHMD-HBASE-RS01,AHMD-HBASE-RS02,AHMD-HBASE-RS03</value>
  </property>
</configuration>

Explanation:

  • hbase.rootdir: Specifies the HDFS directory where HBase stores its data.
  • hbase.cluster.distributed: Indicates that this is a distributed HBase cluster.
  • hbase.zookeeper.quorum: Lists the ZooKeeper servers in the cluster, separated by commas.

2. hbase-env.sh:

export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"

export HBASE_MASTER_OPTS="-Xmx6g $HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"

export HBASE_REGIONSERVER_OPTS="-Xmx6g $HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"

Explanation:

  • HBASE_JMX_BASE: Enables JMX monitoring without SSL and authentication (for simplicity in this example). In production, enable SSL and authentication for security.
  • HBASE_MASTER_OPTS: Sets the maximum heap size for the HMaster process to 6GB and enables JMX monitoring on port 10101. Adjust the heap size based on your server’s RAM.
  • HBASE_REGIONSERVER_OPTS: Sets the maximum heap size for the RegionServer processes to 6GB and enables JMX monitoring on port 10102. Adjust the heap size based on your server’s RAM.
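
Because this guide runs ZooKeeper as its own service rather than letting HBase manage it, it is also common to add the following line to hbase-env.sh so that the bundled start/stop scripts (such as start-hbase.sh) never try to launch their own ZooKeeper. The per-daemon commands used later do not strictly require it, so treat this as an optional safeguard:

export HBASE_MANAGES_ZK=false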

3. regionservers:

This file lists the hostnames of the RegionServers.

AHMD-HBASE-RS01
AHMD-HBASE-RS02
AHMD-HBASE-RS03

Copy Configuration to All Servers

scp -r /opt/hbase/conf/* root@AHMD-HBASE-RS01:/opt/hbase/conf/
scp -r /opt/hbase/conf/* root@AHMD-HBASE-RS02:/opt/hbase/conf/
scp -r /opt/hbase/conf/* root@AHMD-HBASE-RS03:/opt/hbase/conf/

Starting HBase Services

1. Start the HMaster:

On AHMD-HBASE-MASTER:

/opt/hbase/bin/hbase-daemon.sh start master

2. Start the RegionServers:

On AHMD-HBASE-MASTER:

/opt/hbase/bin/hbase-daemons.sh start regionserver

This will start the RegionServer processes on all nodes listed in the regionservers file.
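
Before opening the shell, you can optionally confirm that the daemons are up and that HBase has created its root directory in HDFS:

jps                                     # HMaster on the master; HRegionServer on each slave
/opt/hadoop/bin/hdfs dfs -ls /hbase_jio # should list HBase's internal directories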

Testing HBase

1. Enter the HBase Shell:

/opt/hbase/bin/hbase shell

2. Execute HBase Commands:

status 'simple'
create 'table2', 'cf1'
put 'table2', 'row1', 'cf1:column1', 'value'
scan 'table2'

Explanation:

  • status 'simple': Displays a basic status of the HBase cluster.
  • create 'table2', 'cf1': Creates a table named ‘table2’ with a column family named ‘cf1’.
  • put 'table2', 'row1', 'cf1:column1', 'value': Adds a cell to ‘table2’ with row key ‘row1’, qualifier ‘column1’ in column family ‘cf1’, and the value ‘value’. Note that the column must be written as family:qualifier; a bare column name that does not match an existing column family will be rejected.
  • scan 'table2': Displays all rows in the ‘table2’ table.

If these commands execute successfully, your HBase cluster is set up correctly. You can also monitor the HBase Master and RegionServer web UIs in a browser; for HBase 0.98 these default to ports 60010 (Master) and 60030 (RegionServer). Remember to enable proper authentication and authorization in production environments.