Introduction
WebSphere DataStage is one of the foremost leaders in the ETL (Extract, Transform, and Load) market. One of the tool's great advantages is its scalability: it is capable of parallel processing in SMP, MPP, or clustered environments. Although DataStage Enterprise Edition (DS/EE) provides many plug-in stages to connect to DB2, including DB2 API, DB2 Load, and Dynamic RDBMS, only the DB2 Enterprise Stage is designed to support parallel processing for maximum scalability and performance.
The DB2 Data Partitioning Feature (DPF) offers the scalability needed to distribute a large database over multiple partitions (logical or physical). ETL processing of large volumes of data across whole tables is very time-consuming with the traditional plug-in stages. The DB2 Enterprise Stage, however, provides a parallel execution engine that communicates directly with each database partition to achieve the best possible performance.
DB2 Enterprise Stage with DPF communication architecture
As you see in Figure 1, the DS/EE primary server can be separate from the DB2 coordinator node. A 32-bit DB2 client must still be installed on the DS/EE server, but its role differs from typical remote DB2 access, where the client is needed only for connectivity: here it is also used to pre-query the DB2 instance and determine the partitioning of the source or target table. On the DB2 server, every DB2 DPF partition must have the DS/EE engine installed. In addition, the DS/EE engine and libraries must be installed in the same location on all DS/EE servers and DB2 servers.
The following principles are important in understanding how this framework works:
Environment used in our example
In our example, we used two machines running Red Hat Enterprise Linux 3.0: one with 2 CPUs and 1 GB of memory for the DB2 server, and another with 1 CPU and 1 GB of memory for the DS/EE server. On the DB2 server we have two database partitions, configured via db2nodes.cfg, while on the DS/EE server the engine configuration file determines which nodes are used to execute DataStage jobs concurrently.
The following are the steps we followed to configure a remote DB2 instance for use with the DS/EE DB2 Enterprise Stage. We begin this exercise from scratch, covering both DB2 server configuration and DS/EE installation and configuration.
Installation and configuration steps for the DB2 server
If a DB2 DPF environment is already installed and configured, you can skip steps 1 and 3.
Step 1. Install DB2 Enterprise Server and create the DB2 instance on the stage164 node
Check your DB2 version before installing DB2 ESE on the stage164 node; for our example we used V8.1 Fix Pack 7. The DPF feature requires a separate license. Pay attention to the Linux kernel parameters, which can affect the DB2 installation, and follow the DB2 installation guide.
[root@stage164 home]# groupadd db2grp1
[root@stage164 home]# cd /opt/IBM/db2/V8.1/instance/
[root@stage164 instance]# su - db2inst1
[db2inst1@stage164 db2inst1]$ db2 get dbm cfg | grep -i svcename
Step 2. Configure remote shell (rsh) service and remote authority file.
In a DPF environment, DB2 needs a remote shell utility to communicate and execute commands between partitions. The rsh utility can be used for inter-partition communication; OpenSSH is another option that secures that communication, but for simplicity we do not cover it in this article.
[root@stage164 /]# rpm -qa | grep -i rsh
[root@stage164 /]# service xinetd start
stage164 db2inst1
[db2inst1@stage164 db2inst1]$ rsh stage164 date
Step 3. Create DPF partitions and create sample database
0 stage164 0
1 stage164 1
[db2inst1@stage164 db2inst1]$ db2stop force
[db2inst1@stage164 db2inst1]$ db2sampl
Step 4. Create DS/EE users and configure them to access the DB2 database
If DS/EE users and groups have already been created on the DS/EE node, create the same users and groups on the DB2 server node. In any case, make sure the same DS/EE users and groups exist on both machines.
[root@stage164 home]# groupadd -g 501 dsadmin
stage164 db2inst1
. /home/db2inst1/sqllib/db2profile
# su - dsadmin
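Putting this step together, the sketch below shows the commands run as root on the DB2 server; the UID 501 mirrors the GID shown above and is an assumption, and the db2profile line lets dsadmin pick up the DB2 environment at logon:

```
[root@stage164 home]# groupadd -g 501 dsadmin
[root@stage164 home]# useradd -u 501 -g dsadmin -m dsadmin
[root@stage164 home]# echo '. /home/db2inst1/sqllib/db2profile' >> /home/dsadmin/.bashrc
```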
Installation and configuration steps for the DS/EE node
Now let's walk through the process in detail.
Step 1. Install DataStage Enterprise Edition (DS/EE) and the DB2 client
First, the DS/EE user and group need to be created in advance; in this example, the user is dsadmin and the group is dsadmin. If DS/EE is not installed, follow the WebSphere DataStage installation guide. We assume the software is installed under the directory given by the DSHOME variable, which is /home/dsadmin/Ascential/DataStage/DSEngine. Then install the DB2 client and create a client instance on the DS/EE node.
Step 2. Add the DB2 library and instance home to the DS/EE configuration file
The dsenv configuration file, located in the DSHOME directory, is one of the most important configuration files in DS/EE. It contains environment variables and library paths. In this step, we add the DB2 library path to LD_LIBRARY_PATH so that the DS/EE engine can connect to DB2.
Note: the PXEngine library path should precede the DB2 library path in LD_LIBRARY_PATH.
Configure the dsenv file as follows:
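The original listing is missing from this copy of the article; the fragment below is a sketch of typical dsenv additions, assuming the DB2 client instance on the DS/EE node is owned by db2inst1 and DB2 is installed under /opt/IBM/db2/V8.1 (adjust the paths to your environment):

```
# Sketch only -- the paths below are assumptions for this environment
DB2DIR=/opt/IBM/db2/V8.1; export DB2DIR
DB2INSTANCE=db2inst1; export DB2INSTANCE
INSTHOME=/home/db2inst1; export INSTHOME

# The PXEngine library path must precede the DB2 client library path
LD_LIBRARY_PATH=/home/dsadmin/Ascential/DataStage/PXEngine/lib:$INSTHOME/sqllib/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
```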
You can source this dsenv file from the dsadmin user's .bashrc file (/home/dsadmin/.bashrc) to avoid executing it manually every time. All you then need to do is log out of the dsadmin account and log back on for it to take effect.
. /home/dsadmin/Ascential/DataStage/DSEngine/dsenv
Step 3. Catalog the remote sample database on the DS/EE node as dsadmin
[dsadmin@transfer dsadmin]$ db2 CATALOG TCPIP NODE stage164 REMOTE stage164 SERVER 50000
[dsadmin@transfer dsadmin]$ rsh stage164 date
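The article's listing stops at the node entry; a typical follow-on catalogs the remote database against that node and verifies the connection. The alias samp_02 here is an assumption, chosen to match the connect statements in step 6:

```
[dsadmin@transfer dsadmin]$ db2 CATALOG DATABASE sample AS samp_02 AT NODE stage164
[dsadmin@transfer dsadmin]$ db2 terminate
[dsadmin@transfer dsadmin]$ db2 connect to samp_02 user dsadmin using passw0rd
```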
Step 4. Copy db2nodes.cfg from the DB2 server to the DS/EE node and configure the environment variable
Copy the db2nodes.cfg file from the DB2 server to a directory on the DS/EE node. This file tells the DS/EE engine how many DB2 partitions exist on the DB2 server. Then create the environment variable APT_DB2INSTANCE_HOME in DataStage Administrator to point to that directory. This variable can be specified at the project level or the job level.
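As a concrete sketch (the directory is an assumption; the contents mirror the two-partition layout described earlier), the copied file and the variable would look like this:

```shell
# Assumed staging directory on the DS/EE node; any dsadmin-readable path works
mkdir -p /tmp/db2instance_home

# Contents mirroring the DB2 server's db2nodes.cfg (two logical partitions)
cat > /tmp/db2instance_home/db2nodes.cfg <<'EOF'
0 stage164 0
1 stage164 1
EOF

# APT_DB2INSTANCE_HOME points at the directory holding db2nodes.cfg
export APT_DB2INSTANCE_HOME=/tmp/db2instance_home
```

In practice you would copy the file from the DB2 server (for example with scp) rather than retype it, and set the variable through DataStage Administrator as described above.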
Step 5. NFS configuration, export /home/dsadmin/Ascential/
First, add both machine names to the /etc/hosts file on each node so that they can resolve one another's network name. Then share the whole DS/EE directory with the DB2 server so that each partition can communicate with DS/EE.
/home/dsadmin/Ascential stage164(rw,sync)
[root@transfer /]# service nfs start
[root@stage164 home]# mount -t nfs -o rw transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential
You can check mounted files as follows:
[root@stage164 home]# df -k
To avoid mounting it manually every time the machine restarts, you can add this entry to /etc/fstab to mount the directory automatically:
transfer:/home/dsadmin/Ascential /home/dsadmin/Ascential nfs defaults 0 0
Step 6. Verify DB2 operator library and execute DB2setup.sh and DB2grants.sh
db2 connect to samp_02 user dsadmin using passw0rd
db2 connect to samp_2 user dsadmin using passw0rd
Step 7. Create or modify DS/EE configuration file
DS/EE provides parallel engine configuration files. DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. The DataStage configuration file needs to contain the node on which DataStage and the DB2 client are installed and the nodes of the remote computer where the DB2 server is installed.
The following is one example. For more detailed information on the engine configuration file, please refer to the "Parallel job development guide."
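The configuration listing did not survive in this copy of the article; the sketch below is a hypothetical two-node file in the DS/EE configuration syntax, with a conductor node on the DS/EE machine and one node on the DB2 host. The host names match this article, while the paths and the "db2" pool name are assumptions:

```
{
  node "node0" {
    fastname "transfer"
    pools ""
    resource disk "/home/dsadmin/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/tmp" {pools ""}
  }
  node "node1" {
    fastname "stage164"
    pools "db2"
    resource disk "/home/dsadmin/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/tmp" {pools ""}
  }
}
```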
Step 8. Restart DS/EE server and test connectivity
At this point you have completed all configurations on both nodes. Restart DS/EE server by issuing the commands below:
[dsadmin@transfer bin]$ uv -admin -stop
[dsadmin@transfer bin]$ uv -admin -start
Note: after stopping the DS/EE engine, exit the dsadmin account and log on again so that the dsenv configuration file is executed. Also, be sure the interval between stop and start is longer than 30 seconds so that the changed configuration takes effect.
Next, we test remote connectivity using DataStage Designer. Choose to import a plug-in table definition; when the import window appears, click Next. If the import succeeds, the remote DB2 connectivity configuration is working.
Develop a DB2 Enterprise Stage job on DS/EE
In this part, we develop a parallel job with the DB2 Enterprise Stage using DataStage Designer. The job is deliberately simple: it just demonstrates how to extract the DB2 DEPARTMENT table to a sequential file.
Double-click the DB2 Enterprise Stage icon and set the following properties on the stage. For detailed information, refer to the "Parallel job developer's guide."
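The property table is missing from this copy of the article; the sketch below shows plausible settings for a remote read in the spirit of the example. The property names follow the DB2/UDB Enterprise stage, and the values are assumptions for this environment, not the authors' originals:

```
Read Method          = Table             (extract the whole table)
Table                = DEPARTMENT        (from the sample database)
Client Instance Name = db2inst1          (DB2 client instance on the DS/EE node)
Database             = samp_02           (cataloged alias; an assumption)
User / Password      = dsadmin / passw0rd
```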
Add the following two environment variables to this job via DataStage Manager: APT_DB2INSTANCE_HOME defines the db2nodes.cfg location, while APT_CONFIG_FILE specifies the engine configuration file.
Performance comparison between Enterprise Stage and API Stage
In this part, we execute the jobs developed above using DataStage Director and compare the performance of the DS/EE Enterprise Stage with that of the API Stage. The following is an equivalent job built with the DB2 API Stage.
To generate a quantity of test data, we created the following stored procedure:
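The procedure listing did not survive in this copy of the article. The sketch below is a hypothetical replacement in the same spirit: it writes an emp_resume.sql script containing a simple DB2 SQL PL procedure that inserts n dummy rows. The TESTDATA table and GEN_TESTDATA name are illustrative, not the authors' originals:

```shell
# Write a hypothetical test-data generator; table/procedure names are assumptions
cat > emp_resume.sql <<'EOF'
CREATE PROCEDURE GEN_TESTDATA (IN n INTEGER)
LANGUAGE SQL
BEGIN
  DECLARE i INTEGER DEFAULT 0;
  WHILE i < n DO
    INSERT INTO TESTDATA (ID, PAYLOAD) VALUES (i, 'generated row');
    SET i = i + 1;
  END WHILE;
END@

CALL GEN_TESTDATA(100000)@
EOF
```

The @ statement terminator matches the -td@ flag used when the script is executed.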
Execute the stored procedure:
db2 -td@ -f emp_resume.sql
Then we executed both jobs against 100,000, 1 million, and 5 million rows via DataStage Director and observed the results with the job monitor. The following screenshots show the test results for the DB2 Enterprise Stage and the DB2 API Stage.
In Figure 8, two nodes execute the ETL with the DB2 Enterprise Stage, while the DB2 API Stage runs on a single node. The Enterprise Stage processes more than twice as many rows per second as the API Stage. Furthermore, as the data volume grows, the Enterprise Stage's parallel execution gives it an even greater advantage.
Limitation
The DB2 Enterprise Stage offers great parallel performance compared with the other DB2 plug-in stages in a DB2 DPF environment; however, it requires that the hardware and operating system of the ETL server and the DB2 nodes be the same. Consequently, it is not a replacement for the other DB2 plug-in stages, especially in heterogeneous environments.
Conclusion
This article has described, with step-by-step instructions, how to configure remote connectivity for the DS/EE DB2 Enterprise Stage. In addition, we provided a performance comparison between the Enterprise Stage and the DB2 API Stage using two DataStage jobs.
Resources