Setting up Local HBase on Top of HDFS
First, let's set up the HBase configuration files. For pseudo-distributed mode, replace <server_ip_address> in the snippets below with localhost or your machine's IP address.
HBase Configuration
--------------
hbase-site.xml
--------------
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://<server_ip_address>:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value><server_ip_address></value>
</property>
</configuration>
Explanation:
- hbase.rootdir: Specifies the directory in HDFS where HBase stores its data. Make sure the /hbase directory exists in HDFS; if not, create it using hdfs dfs -mkdir /hbase (see the example after this list).
- hbase.zookeeper.property.clientPort: Defines the port on which ZooKeeper will listen for client connections. The default is 2181.
- hbase.cluster.distributed: Set to true to indicate that HBase is running in distributed mode (even in pseudo-distributed mode).
- hbase.zookeeper.quorum: Lists the ZooKeeper quorum servers. In pseudo-distributed mode, this is simply the server_ip_address (localhost).
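For example, once HDFS is up (it is started in the HDFS section below), you can create and verify the HBase root directory with:
hdfs dfs -mkdir /hbase
hdfs dfs -ls /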
Managing ZooKeeper
HBase relies on ZooKeeper for coordination. You can either let HBase manage its own ZooKeeper instance or use an external one.
--------------
hbase-env.sh
--------------
export HBASE_MANAGES_ZK=false
Explanation:
- HBASE_MANAGES_ZK=false: If set to false, HBase will not start or manage its own ZooKeeper instance; you are responsible for starting and managing a separate one. If set to true, HBase will start its own ZooKeeper instance, but for more control it is recommended to set this to false.
Setting up a Separate ZooKeeper (Optional but Recommended)
If HBASE_MANAGES_ZK=false, you’ll need to configure and start ZooKeeper independently.
------------
zoo.cfg
------------
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
# the directory where the snapshot is stored.
dataDir=/opt/mapr/zkdata
# the port at which the clients will connect
clientPort=2181
# max number of client connections
maxClientCnxns=100
maxSessionTimeout=300000
Explanation:
- tickTime: The basic time unit in milliseconds used by ZooKeeper.
- initLimit: The maximum number of ticks the ZooKeeper followers have to connect and sync to the leader.
- syncLimit: The maximum number of ticks that can pass between sending a request and receiving an acknowledgment.
- dataDir: The directory where ZooKeeper stores its data, including the snapshot of the data tree and the transaction log. Important: Create this directory (/opt/mapr/zkdata in this example) with appropriate permissions (see the example after this list). You may need to change this path to a location suitable for your environment.
- clientPort: The port on which ZooKeeper listens for client connections. Make sure this matches the hbase.zookeeper.property.clientPort in hbase-site.xml.
- maxClientCnxns: Limits the number of concurrent connections from a single client.
- maxSessionTimeout: Maximum session timeout in milliseconds.
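For example, you can create the data directory referenced above before starting ZooKeeper (adjust the path and ownership to your environment; running everything as the current user is an assumption here):
mkdir -p /opt/mapr/zkdata
chown -R $(whoami) /opt/mapr/zkdata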
HDFS Configuration
HBase stores its data in HDFS, so proper HDFS configuration is crucial.
-------------
core-site.xml
-------------
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://<server_ip_address>:9000</value>
</property>
</configuration>
Explanation:
- fs.default.name: Specifies the URI for the NameNode. This tells Hadoop clients where to find the HDFS file system. Replace <server_ip_address> with the actual IP address or hostname of your NameNode. (On newer Hadoop releases this property is called fs.defaultFS; fs.default.name is deprecated but still honored.)
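As a quick sanity check, you can ask Hadoop for the effective NameNode URI (this only reads the configuration files, so HDFS does not need to be running yet; on Hadoop 2.x the fs.default.name value is mapped to the fs.defaultFS key):
hdfs getconf -confKey fs.defaultFS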
-------------
hdfs-site.xml
-------------
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!--The path in this needs to be created first-->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hadoop-2.5.1/yarn_data/hdfs/namenode</value>
</property>
<!--The path in this needs to be created first-->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hadoop-2.5.1/yarn_data/hdfs/datanode</value>
</property>
</configuration>
Explanation:
- dfs.replication: Specifies the number of replicas for each block of data. In pseudo-distributed mode, a replication factor of 1 is sufficient.
- dfs.namenode.name.dir: The directory where the NameNode stores its metadata. Important: Create this directory (/root/hadoop-2.5.1/yarn_data/hdfs/namenode in this example) before starting HDFS. Ensure the user running the NameNode has write permissions to this directory.
- dfs.datanode.data.dir: The directory where the DataNode stores data blocks. Important: Create this directory (/root/hadoop-2.5.1/yarn_data/hdfs/datanode in this example) before starting HDFS (see the example after this list). Ensure the user running the DataNode has write permissions to this directory.
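For example, the directories referenced above can be created with:
mkdir -p /root/hadoop-2.5.1/yarn_data/hdfs/namenode
mkdir -p /root/hadoop-2.5.1/yarn_data/hdfs/datanode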
---------------
mapred-site.xml
---------------
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Explanation:
- mapreduce.framework.name: Specifies the execution framework for MapReduce jobs. Setting it to yarn indicates that MapReduce jobs will run on the YARN cluster.
-------------
yarn-site.xml
-------------
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Explanation:
- yarn.nodemanager.aux-services: Configures auxiliary services for the NodeManager. mapreduce_shuffle is required for MapReduce applications.
- yarn.nodemanager.aux-services.mapreduce.shuffle.class: Specifies the class responsible for handling the shuffle process in MapReduce.
Configuring Slaves (Important for Fully Distributed Mode)
While not strictly required for pseudo-distributed mode, the slaves file (or workers file in newer Hadoop versions) is important if you plan to expand to a fully distributed cluster. Add the hostname or IP address of each DataNode in your cluster to this file, one entry per line. For pseudo-distributed mode, it often contains just localhost. This file is located at $HADOOP_HOME/etc/hadoop/slaves or $HADOOP_HOME/etc/hadoop/workers.
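For example, in pseudo-distributed mode you can set and verify the file like this:
echo "localhost" > $HADOOP_HOME/etc/hadoop/slaves
cat $HADOOP_HOME/etc/hadoop/slaves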
Start HDFS and YARN Services
Now that we have configured HDFS and YARN, let’s start the services. Before starting, format the NameNode:
hdfs namenode -format
Warning: Formatting the NameNode will erase all data in your HDFS. Only do this the first time you set up the cluster or if you want to completely wipe your HDFS data.
Now, start HDFS:
./start-dfs.sh
Start YARN:
./start-yarn.sh
Verify that the services have started successfully by running the jps
command:
jps
You should see processes similar to the following:
43655 Jps
12018 Bootstrap
31585 NameNode
32114 SecondaryNameNode
31798 DataNode
32494 NodeManager
32277 ResourceManager
If any of these processes are missing, check the logs in the $HADOOP_HOME/logs
directory for errors.
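For example, to inspect the most recent NameNode log (the exact file name includes the user and hostname running the process, so adjust the wildcard to your setup):
tail -n 100 $HADOOP_HOME/logs/hadoop-*-namenode-*.log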
HDFS Test
To verify that HDFS is working correctly, run a simple HDFS test:
./bin/hadoop jar \
/root/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-tests.jar \
TestDFSIO -write -nrFiles 10 -fileSize 100
This command writes 10 files, each of size 100MB, to HDFS. After the write operation completes, run the following command to clean up the test data:
./bin/hadoop jar \
/root/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-tests.jar \
TestDFSIO -clean
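If you want to inspect the files TestDFSIO wrote, do so before running the clean step; by default they are placed under /benchmarks/TestDFSIO in HDFS:
hdfs dfs -ls /benchmarks/TestDFSIO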
Start HBase Services
With HDFS running, we can now start the HBase services.
First, start ZooKeeper (if you are managing it separately). Note that the --config option expects the configuration directory containing zoo.cfg, not the file itself:
./hbase-daemon.sh --config ../conf start zookeeper
Next, start the RegionServer:
./hbase-daemon.sh start regionserver
Finally, start the HBase Master:
./hbase-daemon.sh start master
Verify that the HBase services have started successfully using the jps
command:
jps
You should see processes similar to this:
43655 Jps
12018 Bootstrap
40171 HQuorumPeer
40425 HRegionServer
31585 NameNode
32114 SecondaryNameNode
41509 HMaster
31798 DataNode
32494 NodeManager
32277 ResourceManager
If HQuorumPeer is not listed, ZooKeeper is most likely running as a separate process outside of HBase (for example, QuorumPeerMain when started with ZooKeeper's own scripts); this is expected when HBASE_MANAGES_ZK=false.
If any HBase processes are missing, check the logs in the $HBASE_HOME/logs
directory for errors.
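For example, to check the Master log for errors (the file name pattern depends on the user and hostname running HBase):
tail -n 100 $HBASE_HOME/logs/hbase-*-master-*.log
You can also open the HBase Master web UI in a browser: port 60010 on older HBase releases, 16010 on HBase 1.0 and later.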
HBase Shell Interaction
Now that HBase is running, you can interact with it using the HBase shell. Start the HBase shell with the following command:
hbase shell
You might see deprecation warnings, but you can safely ignore them for this tutorial. The HBase shell prompt will appear:
hbase(main):001:0>
Let’s try some basic HBase commands:
list: Lists all the tables in HBase.
hbase(main):001:0> list
TABLE
test
1 row(s) in 2.3090 seconds
=> ["test"]
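If list returns an empty result on a fresh install, the test table shown here simply has not been created yet; you can create and populate it first (test_fam is the column family used by the scans below):
create 'test', 'test_fam'
put 'test', 'row', 'test_fam', 'NewValue'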
scan 'test': Scans the contents of the ‘test’ table.
hbase(main):002:0> scan 'test'
ROW COLUMN+CELL
row column=test_fam:, timestamp=1419947982152, value=NewValue
1 row(s) in 0.4890 seconds
put 'test', 'row2', 'test_fam', 'SecondValue': Inserts a row into the ‘test’ table.
hbase(main):003:0> put 'test', 'row2', 'test_fam', 'SecondValue'
0 row(s) in 0.1110 seconds
scan 'test': Scans the contents of the ‘test’ table again to verify the insert.
hbase(main):004:0> scan 'test'
ROW COLUMN+CELL
row column=test_fam:, timestamp=1419947982152, value=NewValue
row2 column=test_fam:, timestamp=1419950094363, value=SecondValue
2 row(s) in 0.0260 seconds
exit: Exits the HBase shell.
Troubleshooting
- Connection refused errors: Double-check that HDFS, ZooKeeper, and HBase are all running and that the IP addresses and ports in your configuration files are correct. Firewall issues can also cause connection refused errors.
- Can't get master address from ZooKeeper errors: This usually indicates a problem with ZooKeeper. Make sure ZooKeeper is running and that the hbase.zookeeper.quorum property in hbase-site.xml is correctly configured (see the check after this list).
- Permissions issues: Ensure that the user running the HBase and HDFS processes has the necessary permissions to read and write to the data directories.
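A quick way to check that ZooKeeper is reachable is the ruok four-letter command; a healthy server answers imok (older ZooKeeper releases allow this by default, newer ones may require whitelisting four-letter words):
echo ruok | nc <server_ip_address> 2181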
Conclusion
This guide provides a detailed, step-by-step approach to setting up HBase on top of HDFS in pseudo-distributed mode. By following these instructions, you can create a local HBase environment for development, testing, and exploration. Remember to consult the official Apache HBase documentation for more in-depth information and advanced configuration options.