Pentaho Data Integration (PDI) is part of the Pentaho suite of open source Business Intelligence (BI) products, which provides data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities. Here we set up the Pentaho server using the steps below. We will refer to Pentaho Data Integration as PDI from now on.
Installing Pentaho Server 5.4.1 Community Edition.
Installation is very straightforward: we get a pdi.zip which needs to be extracted to the required location. Here we will extract the file into the /opt directory on our RHEL ETL PDI server.
Extract and move the directory to /opt
[root@pentaho-server ~]# unzip pdi.zip
[root@pentaho-server ~]# mv data-integration /opt
Create soft links in the /opt directory; we have used kettle and pentaho as the soft links.
[root@pentaho-server ~]# cd /opt
[root@pentaho-server opt]# ln -s data-integration pentaho
[root@pentaho-server opt]# ln -s data-integration kettle
[root@pentaho-server opt]# ls -l
total 12
drwxr-xr-x 12 root root 4096 Sep 14 11:43 data-integration
lrwxrwxrwx 1 root root 16 Sep 7 14:37 kettle -> data-integration
lrwxrwxrwx 1 root root 16 Sep 15 10:17 pentaho -> data-integration
drwxr-xr-x. 2 root root 4096 May 17 2013 rh
[root@pentaho-server opt]#
That's it, our installation is complete. :)
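Since PDI is a Java application, it is worth a quick sanity check that a Java runtime is available and that the launcher scripts are reachable through the new soft link (this sketch assumes Java is already installed on the server):
[root@pentaho-server ~]# java -version
[root@pentaho-server ~]# ls -l /opt/pentaho/kitchen.sh /opt/pentaho/spoon.sh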
Create ETL (PDI) Server Gateway for hdfs, hive, yarn, hbase from Cloudera Manager.
We are doing this for two reasons.
- The PDI Hadoop (cloudera) configurations can be managed from the cloudera-manager.
- It also lets the PDI server communicate with the Hadoop cluster directly.
Before we add the gateway, this server should be part of the hosts on Cloudera Manager, so that all the packages are installed on this server.
Steps to add a host on Cloudera Manager.
NOTE : Before we start, make sure the /etc/hosts file is similar to what we have on the hadoop cluster and the ETL PDI server, so that we are able to communicate with the hadoop cluster from the ETL PDI server.
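For illustration only, a hypothetical /etc/hosts on the PDI server might look like this (all IP addresses and hostnames below are placeholders; copy the real entries from your cluster nodes):
[root@pentaho-server ~]# cat /etc/hosts
127.0.0.1      localhost localhost.localdomain
10.0.0.10      cm-node.example.com         cm-node         # Cloudera Manager host (placeholder)
10.0.0.21      hadoop-node1.example.com    hadoop-node1    # cluster node (placeholder)
10.0.0.22      hadoop-node2.example.com    hadoop-node2    # cluster node (placeholder)
10.0.0.30      pentaho-server.example.com  pentaho-server  # this ETL PDI server (placeholder)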
- Click on Hosts.
- Add a new host to the cluster.
- Follow the Add a Host to Cluster wizard.
- Enter the IP address or FQDN for the PDI server.
- Install both the JDK and the JCE Policy files, as we have a kerberos enabled cluster.
- Complete the wizard, and you will see the host added to Cloudera Manager. Just install all the components without assigning any roles to the host.
- You will see that the host has no roles assigned to it.
Now we are ready to make the PDI server a gateway.
Steps to create a Gateway in Cloudera Manager. We will take hdfs as an example, but the steps are identical for all other services.
NOTE : We can add all the roles to the host and deploy the client configuration all at once.
- Login as an Admin user.
- Go to the service we want to create a Gateway for; we can create a gateway for hdfs, yarn, hbase, hive, impala.
- Click hdfs > Instances tab > Add Role to Instance button > select Gateway and add the server in the text box. Continue and Finish; this will add the configuration information in Cloudera Manager, but next we need to deploy the client configuration.
- And we are done.
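After the client configuration has been deployed, a quick way to confirm it actually reached the PDI server is to look at the alternatives-managed Hadoop config directory (this sketch assumes the standard CDH client-configuration layout on RHEL):
[root@pentaho-server ~]# ls /etc/hadoop/conf
[root@pentaho-server ~]# alternatives --display hadoop-conf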
Testing Access to Hadoop from ETL PDI Server.
Now that we have made the server a gateway, let's test that the configuration works.
Log in as the hadoop-user.
[root@pentaho-server ~]# su hadoop-user
[hadoop-user@pentaho-server ~]$ cd ~
[hadoop-user@pentaho-server ~]$ pwd
/home/hadoop-user
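Because this is a kerberos enabled cluster, the hadoop-user will typically need a valid ticket before the HDFS commands below succeed. A minimal sketch, assuming a principal of hadoop-user in a placeholder realm EXAMPLE.COM:
[hadoop-user@pentaho-server ~]$ kinit hadoop-user@EXAMPLE.COM
[hadoop-user@pentaho-server ~]$ klist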
Check if we are able to access the HDFS directories. All good.
[hadoop-user@pentaho-server ~]$ hadoop fs -ls /
Found 6 items
drwxr-xr-x - hdfs supergroup 0 2015-05-29 15:32 /benchmarks
drwxr-xr-x - hbase hbase 0 2015-09-14 15:10 /hbase
drwxrwxr-x - solr solr 0 2015-05-29 11:49 /solr
drwxrwxrwx - hdfs supergroup 0 2015-09-14 17:12 /tmp
drwxr-xr-x - hdfs supergroup 0 2015-05-29 16:22 /use
drwxrwxr-x - hdfs supergroup 0 2015-09-15 16:30 /user
Let's run a test job; it should run successfully.
[hadoop-user@pentaho-server pentaho_test]$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 1000
...
verbose
...
MR JOB Completed successfully
Once we know that we are able to run the hadoop command and run a MR job, we are done.
Updating PDI Configuration to Communicate with the Hadoop Setup.
Before we go into this, we need to get the configuration from cloudera-manager.
- Go to the hdfs service.
- On the top right corner click Action > Download Client Configuration.
Do this for hdfs, hive, yarn. Then move these files to the PDI server so that we can configure it.
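As an illustration, assuming Cloudera Manager saved the client configurations as hdfs-clientconfig.zip, hive-clientconfig.zip and yarn-clientconfig.zip on your workstation (the filenames and the /root/client-configs target directory are assumptions), copying and unpacking them on the PDI server could look like this:
[root@pentaho-server ~]# mkdir -p /root/client-configs
$ scp hdfs-clientconfig.zip hive-clientconfig.zip yarn-clientconfig.zip root@pentaho-server:/root/client-configs/
[root@pentaho-server ~]# cd /root/client-configs
[root@pentaho-server client-configs]# unzip -o hdfs-clientconfig.zip
[root@pentaho-server client-configs]# unzip -o hive-clientconfig.zip
[root@pentaho-server client-configs]# unzip -o yarn-clientconfig.zip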
Now let's continue updating the config on the PDI server.
[root@pentaho-server opt]# ls -l
total 12
drwxr-xr-x 4 cloudera-scm cloudera-scm 4096 Sep 14 13:12 cloudera
drwxr-xr-x 12 root root 4096 Sep 14 11:43 data-integration
lrwxrwxrwx 1 root root 16 Sep 7 14:37 kettle -> data-integration
lrwxrwxrwx 1 root root 16 Sep 15 10:17 pentaho -> data-integration
drwxr-xr-x. 2 root root 4096 May 17 2013 rh
Change directory to /opt/pentaho/plugins/pentaho-big-data-plugin/
[root@pentaho-server opt]# cd /opt/pentaho/plugins/pentaho-big-data-plugin/
[root@pentaho-server pentaho-big-data-plugin]# ls -l
total 96640
drwxr-xr-x 7 root root 4096 Jun 14 14:35 hadoop-configurations
drwxr-xr-x 2 root root 4096 Jun 14 14:35 lib
-rw-r--r-- 1 root root 901809 Jun 14 12:55 pentaho-big-data-plugin-5.4.0.1-130.jar
-rw-r--r-- 1 root root 98034663 Jun 14 12:55 pentaho-mapreduce-libraries.zip
-rw-r--r-- 1 root root 2383 Sep 15 10:01 plugin.properties
drwxr-xr-x 2 root root 4096 Jun 14 12:55 plugins
Update plugin.properties with active.hadoop.configuration=cdh53 and leave the other parameters as they are.
We are using cdh53 because we are running cdh5.4 in our setup; the cdh53 shim works for both cdh5.3 and cdh5.4.
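If you prefer to make this change from the shell rather than an editor, a one-line sketch that rewrites the property in place is:
[root@pentaho-server pentaho-big-data-plugin]# sed -i 's/^active.hadoop.configuration=.*/active.hadoop.configuration=cdh53/' plugin.properties
The resulting file is shown below.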
[root@pentaho-server pentaho-big-data-plugin]# cat plugin.properties
# The Hadoop Configuration to use when communicating with a Hadoop cluster.
# This is used for all Hadoop client tools
# including HDFS, Hive, HBase, and Sqoop.
# For more configuration options specific to the Hadoop configuration choosen
# here see the config.properties file in that configuration's directory.
active.hadoop.configuration=cdh53
# Path to the directory that contains the available Hadoop configurations
hadoop.configurations.path=hadoop-configurations
# Version of Kettle to use from the Kettle HDFS installation directory.
# This can be set globally here or overridden per job
# as a User Defined property. If not set we will use the version of
# Kettle that is used to submit the Pentaho MapReduce job.
pmr.kettle.installation.id=
# Installation path in HDFS for the Pentaho MapReduce Hadoop Distribution
# The directory structure should follow this structure where {version}
# can be configured through the Pentaho MapReduce
# User Defined properties as kettle.runtime.version
#
pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce
# Enables the use of Hadoop's Distributed Cache to
# store the Kettle environment required to execute Pentaho MapReduce
# If this is disabled you must configure all TaskTracker
# nodes with the Pentaho for Hadoop Distribution
# @deprecated This is deprecated and is provided as a migration path for existing installations.
pmr.use.distributed.cache=true
# Pentaho MapReduce runtime archive to be preloaded into
# kettle.hdfs.install.dir/pmr.kettle.installation.id
pmr.libraries.archive.file=pentaho-mapreduce-libraries.zip
# Additional plugins to be copied when Pentaho MapReduce's
# Kettle Environment does not exist on DFS. This should be a comma-separated
# list of plugin folder names to copy.
# e.g. pmr.kettle.additional.plugins=my-test-plugin,steps/DummyPlugin
pmr.kettle.additional.plugins=
Now let's go into hadoop-configurations > cdh53 to update the configuration from cloudera.
[root@pentaho-server pentaho-big-data-plugin]# ls
hadoop-configurations lib pentaho-big-data-plugin-5.4.0.1-130.jar
pentaho-mapreduce-libraries.zip plugin.properties plugins
[root@pentaho-server pentaho-big-data-plugin]# cd hadoop-configurations/
[root@pentaho-server hadoop-configurations]# ls -l
total 20
drwxr-xr-x 3 root root 4096 Sep 16 11:46 cdh53
drwxr-xr-x 3 root root 4096 Jun 14 14:35 emr34
drwxr-xr-x 3 root root 4096 Sep 15 10:01 hadoop-20
drwxr-xr-x 3 root root 4096 Jun 14 14:35 hdp22
drwxr-xr-x 3 root root 4096 Jun 14 14:35 mapr401
[root@pentaho-server hadoop-configurations]# cd cdh53/
In cdh53, update all the configuration files with the cloudera-manager configurations.
[root@pentaho-server cdh53]# ls -l
total 212
-rw-r--r-- 1 root root 835 Jun 14 12:49 config.properties
-rwxr-xr-x 1 root root 3798 Sep 16 11:16 core-site.xml
-rwxr-xr-x 1 root root 2546 Sep 16 11:16 hadoop-env.sh
-rwxr-xr-x 1 root root 3568 Sep 16 11:16 hdfs-site.xml
-rwxr-xr-x 1 root root 1126 Sep 16 11:46 hive-env.sh
-rwxr-xr-x 1 root root 2785 Sep 16 11:46 hive-site.xml
drwxr-xr-x 4 root root 4096 Jun 14 14:35 lib
-rwxr-xr-x 1 root root 314 Sep 16 11:16 log4j.properties
-rwxr-xr-x 1 root root 4678 Sep 16 11:16 mapred-site.xml
-rw-r--r-- 1 root root 128746 Jun 14 12:49 pentaho-hadoop-shims-cdh53-54.2015.06.01.jar
-rw-r--r-- 1 root root 6871 Jun 14 12:49 pentaho-hadoop-shims-cdh53-hbase-comparators-54.jar
-rwxr-xr-x 1 root root 315 Sep 16 11:16 ssl-client.xml
-rwxr-xr-x 1 root root 1141 Sep 16 11:16 topology.map
-rwxr-xr-x 1 root root 1510 Sep 16 11:16 topology.py
-rwxr-xr-x 1 root root 6208 Sep 16 11:16 yarn-site.xml
[root@pentaho-server cdh53]#
Add the following files as below. You can get these files from cloudera-manager.
Example : On Cloudera Manager, Clusters -> Services -> hdfs -> Action (on the top right corner) -> Download Client Configuration.
Do the same for yarn and other services.
From the hdfs configuration:
core-site.xml
hadoop-env.sh
hdfs-site.xml
log4j.properties
ssl-client.xml
topology.map
topology.py
From the hive configuration:
hive-env.sh
hive-site.xml
From the yarn configuration:
mapred-site.xml
yarn-site.xml
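Assuming the client configurations were unpacked under /root/client-configs as in the earlier sketch (that path and the *-clientconfig directory names are assumptions; check what your zips actually extract to), copying the files into the shim directory could look like this:
[root@pentaho-server cdh53]# cp /root/client-configs/hdfs-clientconfig/{core-site.xml,hadoop-env.sh,hdfs-site.xml,log4j.properties,ssl-client.xml,topology.map,topology.py} .
[root@pentaho-server cdh53]# cp /root/client-configs/hive-clientconfig/{hive-env.sh,hive-site.xml} .
[root@pentaho-server cdh53]# cp /root/client-configs/yarn-clientconfig/{mapred-site.xml,yarn-site.xml} .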
We are done with the configuration, now let's test it.
Testing PDI Hadoop.
We will do a simple test by copying a file from the local drive to the Hadoop cluster.
load_hdfs job information: it is a simple job to test connectivity between PDI and the Hadoop cluster.
The important part of the file is where we give the locations for src and dest.
<field>
<source_filefolder>
file:///home/hadoop-user/pentaho_test/update_raw.txt
</source_filefolder>
<destination_filefolder>
hdfs://nameservice-ha/user/hadoop-user/weblogs/raw
</destination_filefolder>
<wildcard>^.*\.txt</wildcard>
</field>
Here is the complete file load_hdfs.kjb.
[hadoop-user@pentaho-server pentaho_test]$ cat load_hdfs.kjb
<?xml version="1.0" encoding="UTF-8"?>
<job>
<name>load_hdfs</name>
<description />
<extended_description />
<job_version />
<job_status>0</job_status>
<directory>/</directory>
<created_user>-</created_user>
<created_date>2011/12/19 13:48:36.419</created_date>
<modified_user>-</modified_user>
<modified_date>2011/12/19 13:48:36.419</modified_date>
<parameters />
<slaveservers />
<job-log-table>
<connection />
<schema />
<table />
<size_limit_lines />
<interval />
<timeout_days />
<field>
<id>ID_JOB</id>
<enabled>Y</enabled>
<name>ID_JOB</name>
</field>
<field>
<id>CHANNEL_ID</id>
<enabled>Y</enabled>
<name>CHANNEL_ID</name>
</field>
<field>
<id>JOBNAME</id>
<enabled>Y</enabled>
<name>JOBNAME</name>
</field>
<field>
<id>STATUS</id>
<enabled>Y</enabled>
<name>STATUS</name>
</field>
<field>
<id>LINES_READ</id>
<enabled>Y</enabled>
<name>LINES_READ</name>
</field>
<field>
<id>LINES_WRITTEN</id>
<enabled>Y</enabled>
<name>LINES_WRITTEN</name>
</field>
<field>
<id>LINES_UPDATED</id>
<enabled>Y</enabled>
<name>LINES_UPDATED</name>
</field>
<field>
<id>LINES_INPUT</id>
<enabled>Y</enabled>
<name>LINES_INPUT</name>
</field>
<field>
<id>LINES_OUTPUT</id>
<enabled>Y</enabled>
<name>LINES_OUTPUT</name>
</field>
<field>
<id>LINES_REJECTED</id>
<enabled>Y</enabled>
<name>LINES_REJECTED</name>
</field>
<field>
<id>ERRORS</id>
<enabled>Y</enabled>
<name>ERRORS</name>
</field>
<field>
<id>STARTDATE</id>
<enabled>Y</enabled>
<name>STARTDATE</name>
</field>
<field>
<id>ENDDATE</id>
<enabled>Y</enabled>
<name>ENDDATE</name>
</field>
<field>
<id>LOGDATE</id>
<enabled>Y</enabled>
<name>LOGDATE</name>
</field>
<field>
<id>DEPDATE</id>
<enabled>Y</enabled>
<name>DEPDATE</name>
</field>
<field>
<id>REPLAYDATE</id>
<enabled>Y</enabled>
<name>REPLAYDATE</name>
</field>
<field>
<id>LOG_FIELD</id>
<enabled>Y</enabled>
<name>LOG_FIELD</name>
</field>
</job-log-table>
<jobentry-log-table>
<connection />
<schema />
<table />
<timeout_days />
<field>
<id>ID_BATCH</id>
<enabled>Y</enabled>
<name>ID_BATCH</name>
</field>
<field>
<id>CHANNEL_ID</id>
<enabled>Y</enabled>
<name>CHANNEL_ID</name>
</field>
<field>
<id>LOG_DATE</id>
<enabled>Y</enabled>
<name>LOG_DATE</name>
</field>
<field>
<id>JOBNAME</id>
<enabled>Y</enabled>
<name>TRANSNAME</name>
</field>
<field>
<id>JOBENTRYNAME</id>
<enabled>Y</enabled>
<name>STEPNAME</name>
</field>
<field>
<id>LINES_READ</id>
<enabled>Y</enabled>
<name>LINES_READ</name>
</field>
<field>
<id>LINES_WRITTEN</id>
<enabled>Y</enabled>
<name>LINES_WRITTEN</name>
</field>
<field>
<id>LINES_UPDATED</id>
<enabled>Y</enabled>
<name>LINES_UPDATED</name>
</field>
<field>
<id>LINES_INPUT</id>
<enabled>Y</enabled>
<name>LINES_INPUT</name>
</field>
<field>
<id>LINES_OUTPUT</id>
<enabled>Y</enabled>
<name>LINES_OUTPUT</name>
</field>
<field>
<id>LINES_REJECTED</id>
<enabled>Y</enabled>
<name>LINES_REJECTED</name>
</field>
<field>
<id>ERRORS</id>
<enabled>Y</enabled>
<name>ERRORS</name>
</field>
<field>
<id>RESULT</id>
<enabled>Y</enabled>
<name>RESULT</name>
</field>
<field>
<id>NR_RESULT_ROWS</id>
<enabled>Y</enabled>
<name>NR_RESULT_ROWS</name>
</field>
<field>
<id>NR_RESULT_FILES</id>
<enabled>Y</enabled>
<name>NR_RESULT_FILES</name>
</field>
<field>
<id>LOG_FIELD</id>
<enabled>N</enabled>
<name>LOG_FIELD</name>
</field>
<field>
<id>COPY_NR</id>
<enabled>N</enabled>
<name>COPY_NR</name>
</field>
</jobentry-log-table>
<channel-log-table>
<connection />
<schema />
<table />
<timeout_days />
<field>
<id>ID_BATCH</id>
<enabled>Y</enabled>
<name>ID_BATCH</name>
</field>
<field>
<id>CHANNEL_ID</id>
<enabled>Y</enabled>
<name>CHANNEL_ID</name>
</field>
<field>
<id>LOG_DATE</id>
<enabled>Y</enabled>
<name>LOG_DATE</name>
</field>
<field>
<id>LOGGING_OBJECT_TYPE</id>
<enabled>Y</enabled>
<name>LOGGING_OBJECT_TYPE</name>
</field>
<field>
<id>OBJECT_NAME</id>
<enabled>Y</enabled>
<name>OBJECT_NAME</name>
</field>
<field>
<id>OBJECT_COPY</id>
<enabled>Y</enabled>
<name>OBJECT_COPY</name>
</field>
<field>
<id>REPOSITORY_DIRECTORY</id>
<enabled>Y</enabled>
<name>REPOSITORY_DIRECTORY</name>
</field>
<field>
<id>FILENAME</id>
<enabled>Y</enabled>
<name>FILENAME</name>
</field>
<field>
<id>OBJECT_ID</id>
<enabled>Y</enabled>
<name>OBJECT_ID</name>
</field>
<field>
<id>OBJECT_REVISION</id>
<enabled>Y</enabled>
<name>OBJECT_REVISION</name>
</field>
<field>
<id>PARENT_CHANNEL_ID</id>
<enabled>Y</enabled>
<name>PARENT_CHANNEL_ID</name>
</field>
<field>
<id>ROOT_CHANNEL_ID</id>
<enabled>Y</enabled>
<name>ROOT_CHANNEL_ID</name>
</field>
</channel-log-table>
<pass_batchid>N</pass_batchid>
<shared_objects_file />
<entries>
<entry>
<name>START</name>
<description />
<type>SPECIAL</type>
<start>Y</start>
<dummy>N</dummy>
<repeat>N</repeat>
<schedulerType>0</schedulerType>
<intervalSeconds>0</intervalSeconds>
<intervalMinutes>60</intervalMinutes>
<hour>12</hour>
<minutes>0</minutes>
<weekDay>1</weekDay>
<DayOfMonth>1</DayOfMonth>
<parallel>N</parallel>
<draw>Y</draw>
<nr>0</nr>
<xloc>68</xloc>
<yloc>88</yloc>
</entry>
<entry>
<name>Copy Files</name>
<description />
<type>COPY_FILES</type>
<copy_empty_folders>Y</copy_empty_folders>
<arg_from_previous>N</arg_from_previous>
<overwrite_files>N</overwrite_files>
<include_subfolders>N</include_subfolders>
<remove_source_files>N</remove_source_files>
<add_result_filesname>N</add_result_filesname>
<destination_is_a_file>N</destination_is_a_file>
<create_destination_folder>Y</create_destination_folder>
<fields>
<field>
<source_filefolder>
file:///home/hadoop-user/pentaho_test/update_raw.txt
</source_filefolder>
<destination_filefolder>
hdfs://nameservice-ha/user/hadoop-user/weblogs/raw
</destination_filefolder>
<wildcard>^.*\.txt</wildcard>
</field>
</fields>
<parallel>N</parallel>
<draw>Y</draw>
<nr>0</nr>
<xloc>191</xloc>
<yloc>88</yloc>
</entry>
</entries>
<hops>
<hop>
<from>START</from>
<to>Copy Files</to>
<from_nr>0</from_nr>
<to_nr>0</to_nr>
<enabled>Y</enabled>
<evaluation>Y</evaluation>
<unconditional>Y</unconditional>
</hop>
</hops>
<notepads />
</job>
[hadoop-user@pentaho-server pentaho_test]$
Let's create the HDFS location where the file will be put.
[hadoop-user@pentaho-server pentaho_test]$ hadoop fs -mkdir -p /user/hadoop-user/weblogs/raw
[hadoop-user@pentaho-server pentaho_test]$ ls -l
total 83124
-rw-r--r-- 1 hadoop-user hadoop-user 6160 Sep 16 11:26 load_hdfs.kjb
-rwxr-xr-x 1 hadoop-user hadoop-user 77908174 Jan 10 2012 weblogs_rebuild.txt
We have a kettle job which will load the weblogs_rebuild.txt file to the Hadoop cluster, into the location created earlier, /user/hadoop-user/weblogs/raw.
[hadoop-user@pentaho-server pentaho_test]$ sh /opt/pentaho/kitchen.sh -file=/home/hadoop-user/pentaho_test/load_hdfs.kjb
2015/09/16 11:26:37 - Kitchen - Start of run.
2015/09/16 11:26:38 - load_hdfs - Start of job execution
2015/09/16 11:26:38 - load_hdfs - Starting entry [Copy Files]
2015/09/16 11:26:38 - Copy Files - Starting ...
2015/09/16 11:26:38 - Copy Files - Processing row source
File/folder source : [file:///home/hadoop-user/pentaho_test/weblogs_rebuild.txt] ...
destination file/folder : [hdfs://nameservice-ha/user/hadoop-user/weblogs/raw]...
wildcard : [^.*\.txt]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/data-integration/launcher/../lib/slf4j-log4j12-1.7.5.jar!\
/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/data-integration/plugins/pentaho-big-data-plugin/\
lib/slf4j-log4j12-1.7.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2015/09/16 11:26:40 - load_hdfs - Finished job entry [Copy Files] (result=[true])
2015/09/16 11:26:40 - load_hdfs - Job execution finished
2015/09/16 11:26:40 - Kitchen - Finished!
2015/09/16 11:26:40 - Kitchen - Start=2015/09/16 11:26:37.578, Stop=2015/09/16 11:26:40.932
2015/09/16 11:26:40 - Kitchen - Processing ended after 3 seconds.
The job completed successfully, now let's check the hadoop location.
[hadoop-user@pentaho-server pentaho_test]$ hadoop fs -ls /user/hadoop-user/weblogs/raw
Found 1 items
-rw-r--r-- 3 hadoop-user hadoop-user 77908174 2015-09-16 11:26 \
/user/hadoop-user/weblogs/raw/weblogs_rebuild.txt
[hadoop-user@pentaho-server pentaho_test]$
Looks like we have the file copied to HDFS. So we are good.
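As an optional final check, you can compare the size of the local file with the copy on HDFS; the two byte counts should match:
[hadoop-user@pentaho-server pentaho_test]$ ls -l weblogs_rebuild.txt
[hadoop-user@pentaho-server pentaho_test]$ hadoop fs -du /user/hadoop-user/weblogs/raw/weblogs_rebuild.txt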