Setting up Local HBase on Top of HDFS

First, let's set up the HBase configuration files. For pseudo-distributed mode, replace <server_ip_address> with 'localhost' wherever it appears.

HBase Configuration

--------------
hbase-site.xml
--------------
	<configuration>
	  <property>
		<name>hbase.rootdir</name>
		<value>hdfs://<server_ip_address>:9000/hbase</value>
	  </property>
	 <property>
	 <name>hbase.zookeeper.property.clientPort</name>
	 <value>2181</value>
	 </property>
	  <property>
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	  </property>
	  <property>
		  <name>hbase.zookeeper.quorum</name>
		  <value><server_ip_address></value>
	   </property>
	</configuration>

Explanation:

  • hbase.rootdir: Specifies the directory in HDFS where HBase stores its data. Make sure the /hbase directory exists in HDFS; if not, create it using hdfs dfs -mkdir /hbase (see the commands after this list).
  • hbase.zookeeper.property.clientPort: Defines the port on which ZooKeeper will listen for client connections. The default is 2181.
  • hbase.cluster.distributed: Set to true to indicate that HBase is running in distributed mode (even in pseudo-distributed mode).
  • hbase.zookeeper.quorum: Lists the ZooKeeper quorum servers. In pseudo-distributed mode, this is simply <server_ip_address> (i.e., localhost).
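A minimal sketch of the hbase.rootdir prerequisite (run it once HDFS is up, which happens in the start-up steps later in this guide, and assuming the hdfs command is on your PATH):

# Create the HBase root directory in HDFS and confirm it is listed
hdfs dfs -mkdir /hbase
hdfs dfs -ls /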

Managing ZooKeeper

HBase relies on ZooKeeper for coordination. You can either let HBase manage its own ZooKeeper instance or use an external one.

--------------
hbase-env.sh
--------------
	export HBASE_MANAGES_ZK=false

Explanation:

  • HBASE_MANAGES_ZK=false: If set to false, HBase will not start or manage its own ZooKeeper instance; you are responsible for starting and managing a separate one. If set to true, HBase starts its own ZooKeeper instance, but setting it to false is recommended when you want more control.

If HBASE_MANAGES_ZK=false, you’ll need to configure and start ZooKeeper independently.

------------
zoo.cfg
------------
	#  The number of milliseconds of each tick
	tickTime=2000
	#  The number of ticks that the initial
	#  synchronization phase can take
	initLimit=20
	#  The number of ticks that can pass between
	#  sending a request and getting an acknowledgement
	syncLimit=10
	#  the directory where the snapshot is stored.
	dataDir=/opt/mapr/zkdata
	#  the port at which the clients will connect
	clientPort=2181
	#  max number of client connections
	maxClientCnxns=100
	maxSessionTimeout=300000

Explanation:

  • tickTime: The basic time unit in milliseconds used by ZooKeeper.
  • initLimit: The maximum number of ticks the ZooKeeper followers have to connect and sync to the leader.
  • syncLimit: The maximum number of ticks that can pass between sending a request and receiving an acknowledgment.
  • dataDir: The directory where ZooKeeper stores its data, including the snapshot of the data tree and the transaction log. Important: Create this directory (/opt/mapr/zkdata in this example) with appropriate permissions; you may need to change this path to a location suitable for your environment (see the commands after this list).
  • clientPort: The port on which ZooKeeper listens for client connections. Make sure this matches the hbase.zookeeper.property.clientPort in hbase-site.xml.
  • maxClientCnxns: Limits the number of concurrent connections from a single client.
  • maxSessionTimeout: Maximum session timeout in milliseconds.
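Since dataDir must exist before ZooKeeper starts, here is a minimal sketch for preparing it. The path matches this example's zoo.cfg; the zookeeper user and group are assumptions, so substitute whatever account actually runs ZooKeeper:

# Create the ZooKeeper data directory and hand it to the ZooKeeper account
mkdir -p /opt/mapr/zkdata
chown -R zookeeper:zookeeper /opt/mapr/zkdata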

HDFS Configuration

HBase stores its data in HDFS, so proper HDFS configuration is crucial.

-------------
core-site.xml
-------------
	<configuration>
		<property>
			<name>fs.default.name</name>
			<value>hdfs://<server_ip_address>:9000</value>
		</property>
	</configuration>

Explanation:

  • fs.default.name: Specifies the URI for the NameNode (in newer Hadoop releases this property is named fs.defaultFS; fs.default.name is the deprecated alias). This tells Hadoop clients where to find the HDFS file system. Replace <server_ip_address> with the actual IP address or hostname of your NameNode.
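To confirm which NameNode URI the Hadoop tools actually resolve, you can query the configuration with hdfs getconf (fs.defaultFS is the non-deprecated key; fs.default.name resolves to the same value):

# Print the NameNode URI picked up from core-site.xml
hdfs getconf -confKey fs.defaultFS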


-------------
hdfs-site.xml
-------------

	<configuration>
		<property>
			<name>dfs.replication</name>
			<value>1</value>
		</property>

		<!--The path in this needs to be created first-->
		<property>
			<name>dfs.namenode.name.dir</name>
			<value>file:///root/hadoop-2.5.1/yarn_data/hdfs/namenode</value>
		</property>

		<!--The path in this needs to be created first-->
		<property>
			<name>dfs.datanode.data.dir</name>
			<value>file:///root/hadoop-2.5.1/yarn_data/hdfs/datanode</value>
		</property>
	</configuration>

Explanation:

  • dfs.replication: Specifies the number of replicas for each block of data. In pseudo-distributed mode, a replication factor of 1 is sufficient.
  • dfs.namenode.name.dir: The directory where the NameNode stores its metadata. Important: Create this directory (/root/hadoop-2.5.1/yarn_data/hdfs/namenode in this example) before starting HDFS. Ensure the user running the NameNode has write permissions to this directory.
  • dfs.datanode.data.dir: The directory where the DataNode stores data blocks. Important: Create this directory (/root/hadoop-2.5.1/yarn_data/hdfs/datanode in this example) before starting HDFS, and ensure the user running the DataNode has write permission to it (the mkdir commands for both directories are sketched after this list).
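A minimal sketch for preparing both directories, using the example paths above (substitute your own locations and make sure the user that runs HDFS ends up owning them):

# Create the NameNode and DataNode storage directories before the first start
mkdir -p /root/hadoop-2.5.1/yarn_data/hdfs/namenode
mkdir -p /root/hadoop-2.5.1/yarn_data/hdfs/datanode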


---------------
mapred-site.xml
---------------

	<configuration>
	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>
	</configuration>

Explanation:

  • mapreduce.framework.name: Specifies the execution framework for MapReduce jobs. Setting it to yarn indicates that MapReduce jobs will run on the YARN cluster.


-------------
yarn-site.xml
-------------

	<configuration>
	<!-- Site specific YARN configuration properties -->
	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>
	<property>
		<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
		<value>org.apache.hadoop.mapred.ShuffleHandler</value>
	</property>
	</configuration>

Explanation:

  • yarn.nodemanager.aux-services: Configures auxiliary services for the NodeManager. mapreduce_shuffle is required for MapReduce applications.
  • yarn.nodemanager.aux-services.mapreduce.shuffle.class: Specifies the class responsible for handling the shuffle process in MapReduce.

Configuring Slaves (Important for Fully Distributed Mode)

While not strictly required for pseudo-distributed mode, the slaves file (or workers file in newer Hadoop versions) is important if you plan to expand to a fully distributed cluster. Add the hostname or IP address of each DataNode in your cluster to this file, one entry per line. For pseudo-distributed mode, it often contains just localhost. This file is located in $HADOOP_HOME/etc/hadoop/slaves or $HADOOP_HOME/etc/hadoop/workers.
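For the pseudo-distributed setup described here, the file can be written in one line (the path assumes the Hadoop 2.x layout used elsewhere in this guide; use the workers file name on newer versions):

# Point the worker list at the local machine only
echo "localhost" > $HADOOP_HOME/etc/hadoop/slaves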

Start HDFS and YARN Services

Now that we have configured HDFS and YARN, let’s start the services. Before starting, format the NameNode:

hdfs namenode -format

Warning: Formatting the NameNode will erase all data in your HDFS. Only do this the first time you set up the cluster or if you want to completely wipe your HDFS data.

Now, start HDFS (the start scripts live in $HADOOP_HOME/sbin):

./start-dfs.sh

Start YARN:

./start-yarn.sh

Verify that the services have started successfully by running the jps command:

jps

You should see processes similar to the following:

43655 Jps
12018 Bootstrap
31585 NameNode
32114 SecondaryNameNode
31798 DataNode
32494 NodeManager
32277 ResourceManager

If any of these processes are missing, check the logs in the $HADOOP_HOME/logs directory for errors.
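For example, the NameNode and DataNode logs can be inspected like this (the exact file names include the user name and hostname, so the wildcards below are just a convenience):

# Show the tail of the NameNode and DataNode logs
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log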

HDFS Test

To verify that HDFS is working correctly, run a simple HDFS test:

./bin/hadoop jar \
            /root/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-tests.jar \
                TestDFSIO -write -nrFiles 10 -fileSize 100

This command writes 10 files, each of size 100MB, to HDFS. After the write operation completes, run the following command to clean up the test data:

./bin/hadoop jar \
            /root/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-tests.jar \
                TestDFSIO -clean

Start HBase Services

With HDFS running, we can now start the HBase services.

First, start ZooKeeper (if you are managing it separately; hbase-daemon.sh lives in $HBASE_HOME/bin):

./hbase-daemon.sh --config ../conf/zoo.cfg start zookeeper

Next, start the RegionServer:

./hbase-daemon.sh start regionserver

Finally, start the HBase Master:

./hbase-daemon.sh start master

Verify that the HBase services have started successfully using the jps command:

jps

You should see processes similar to this:

43655 Jps
12018 Bootstrap
40171 HQuorumPeer
40425 HRegionServer
31585 NameNode
32114 SecondaryNameNode
41509 HMaster
31798 DataNode
32494 NodeManager
32277 ResourceManager

If HQuorumPeer is not listed, ZooKeeper may be running as a separate QuorumPeerMain process (for example, when started with ZooKeeper's own scripts instead of hbase-daemon.sh); in that case its absence is expected.

If any HBase processes are missing, check the logs in the $HBASE_HOME/logs directory for errors.
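As with the Hadoop logs, the HBase Master and RegionServer logs follow a predictable naming pattern (user name and hostname vary, hence the wildcards):

# Show the tail of the HBase Master and RegionServer logs
tail -n 50 $HBASE_HOME/logs/hbase-*-master-*.log
tail -n 50 $HBASE_HOME/logs/hbase-*-regionserver-*.log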

HBase Shell Interaction

Now that HBase is running, you can interact with it using the HBase shell. Start the HBase shell with the following command:

hbase shell

You might see deprecation warnings, but you can safely ignore them for this tutorial. The HBase shell prompt will appear:

hbase(main):001:0>

Let’s try some basic HBase commands:
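The commands that follow assume a table named 'test' with a column family 'test_fam' already exists. If it does not, create it first from the shell:

create 'test', 'test_fam'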

  • list: Lists all the tables in HBase.
hbase(main):001:0> list
TABLE
test
1 row(s) in 2.3090 seconds

=> ["test"]
  • scan 'test': Scans the contents of the ‘test’ table.
hbase(main):002:0> scan 'test'
ROW                                  COLUMN+CELL
 row                                 column=test_fam:, timestamp=1419947982152, value=NewValue
1 row(s) in 0.4890 seconds
  • put 'test', 'row2', 'test_fam', 'SecondValue': Inserts a row into the ‘test’ table.
hbase(main):003:0> put 'test', 'row2', 'test_fam', 'SecondValue'
0 row(s) in 0.1110 seconds
  • scan 'test': Scans the contents of the ‘test’ table again to verify the insert.
hbase(main):004:0> scan 'test'
ROW                                  COLUMN+CELL
 row                                 column=test_fam:, timestamp=1419947982152, value=NewValue
 row2                                column=test_fam:, timestamp=1419950094363, value=SecondValue
2 row(s) in 0.0260 seconds
  • exit: Exits the HBase shell.
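A few other everyday shell commands fit the same workflow: get reads a single row, describe shows a table's column families, and disable followed by drop removes a table:

get 'test', 'row2'
describe 'test'
disable 'test'
drop 'test'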

Troubleshooting

  • Connection refused errors: Double-check that HDFS, ZooKeeper, and HBase are all running and that the IP addresses and ports in your configuration files are correct. Firewall issues can also cause connection refused errors.
  • Can't get master address from ZooKeeper errors: This usually indicates a problem with ZooKeeper. Make sure ZooKeeper is running and that the hbase.zookeeper.quorum property in hbase-site.xml is correctly configured (a quick connectivity check is sketched after this list).
  • Permissions issues: Ensure that the user running the HBase and HDFS processes has the necessary permissions to read and write to the data directories.
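A quick way to check ZooKeeper connectivity from the HBase host is ZooKeeper's built-in ruok four-letter command, which should answer imok (this assumes nc is installed; on recent ZooKeeper releases four-letter commands may need to be whitelisted first):

# Ask ZooKeeper whether it is healthy; expect the reply "imok"
echo ruok | nc <server_ip_address> 2181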

Conclusion

This guide provides a detailed, step-by-step approach to setting up HBase on top of HDFS in pseudo-distributed mode. By following these instructions, you can create a local HBase environment for development, testing, and exploration. Remember to consult the official Apache HBase documentation for more in-depth information and advanced configuration options.