pianzif

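DataStage Transformer stage in detail

Preface

In my previous work I only used the Transformer's simplest conversion features: applying a few functions, making some checks before output, or constructing a few columns. I never used its loop or stage variable features, so this post is mainly my study notes on them.

Review of basic Transformer features

Description:

An extremely powerful stage. It has one input link and multiple output links. It can transform columns, and it can use conditions to decide which output link a row is sent to. Drag and drop is available during development.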

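Constraint versus Derivation
A Constraint is a condition on an output link: only rows that satisfy it are written to that link.
A Derivation is an expression that computes the value of an output column.
Both Constraints and Derivations can use Job parameters and Stage Variables.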
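To make the difference concrete, here is a minimal Python sketch (not DataStage code). The function `run_output_link` and the sample rows are hypothetical names made up for illustration: the constraint acts as a row filter, and each derivation is an expression that computes one output column.

```python
# Hypothetical model of a single Transformer output link (illustration only).
def run_output_link(rows, constraint, derivations):
    """Apply the constraint (row filter), then the derivations (column expressions)."""
    output = []
    for row in rows:
        # Constraint: decides whether this row is written to the link at all.
        if not constraint(row):
            continue
        # Derivation: one expression per output column computes its value.
        output.append({col: expr(row) for col, expr in derivations.items()})
    return output


rows = [{"name": "jim", "price": 120}, {"name": "bob", "price": 80}]
expensive = run_output_link(
    rows,
    constraint=lambda r: r["price"] >= 100,
    derivations={"Name": lambda r: r["name"].upper(), "Price": lambda r: r["price"]},
)
print(expensive)
```

Only the first sample row passes the constraint, and its `Name` column is rewritten by the derivation.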

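Note: The Transformer stage is powerful, but that power comes at the cost of runtime speed. For simple transformations, copies, and similar operations, prefer the Modify, Copy, or Filter stage over the Transformer stage.

Using loops

Example 1

Looping in the Transformer lets a single input row produce multiple output rows. In this example, a record holds a company name and four sales revenue figures, one for each of four regions. A loop steps through each of those columns and outputs one row per region, so one input row produces four output rows.

If this is not entirely clear yet, don't worry; just read on.

Everything below is compiled from the official documentation.

Loops explained in practice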

Defining a loop condition

You specify that the Transformer stage loops when processing each input row by defining a loop condition. The loop continues to iterate while the condition is true.

About this task

You can use the @ITERATION system variable in your expression. @ITERATION holds a count of the number of times that the loop has been executed, starting at 1. @ITERATION is reset to one when a new input row is read.
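The behavior described above can be sketched in Python. This is only a model of the semantics, assuming the condition is checked before every iteration; `iterate_rows` and `loop_while` are hypothetical names, and the counter plays the role of @ITERATION.

```python
def iterate_rows(rows, loop_while):
    """Yield (row, iteration) pairs; `iteration` stands in for @ITERATION."""
    for row in rows:
        iteration = 1                      # reset to 1 for every new input row
        while loop_while(row, iteration):  # the loop runs while the condition is true
            yield row, iteration
            iteration += 1


# Loop condition "@ITERATION <= 2" applied to two input rows.
pairs = list(iterate_rows(["row1", "row2"], lambda row, it: it <= 2))
print(pairs)
```

Each input row is processed twice here, and the counter starts again at 1 for the second row.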

To define a loop condition:

Procedure

If required, open the Loop Condition grid by clicking the arrow on the title bar.
Double-click the Loop While condition, or type CTRL-D, to open the expression editor.
In the expression editor, specify the expression that controls your loop. The expression must return a result of true or false.

What to do next

It is possible to define a faulty loop condition that results in infinite looping, and yet still compiles successfully. To catch such events, you can specify a loop iteration warning threshold in the Loop Variable tab of the Stage Properties window. A warning is written to the job log when a loop has repeated the specified number of times, and the warning is repeated every time a multiple of that value is reached.

So, for example, if you specify a threshold of 100, warnings are written to the job log when the loop iterates 100 times, 200 times, 300 times, and so on. Setting the threshold to 0 specifies that no warnings are issued. The default threshold is 10000, which is a good starting value. You can set a limit for all jobs in your project by setting the environment variable APT_TRANSFORM_LOOP_WARNING_THRESHOLD to a threshold value.

The threshold applies to both loop iteration, and to the number of records held in the input row cache (the input row cache is used when aggregating values in input columns).
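As a quick check of the threshold arithmetic, the sketch below lists the iteration counts at which a warning would be written. It is a plain Python model of the rule quoted above, not of the DataStage runtime; `warning_points` is a hypothetical helper name.

```python
def warning_points(total_iterations, threshold=10000):
    """Return the iteration counts at which a warning would be logged."""
    if threshold == 0:
        return []  # a threshold of 0 means no warnings are issued
    # A warning is issued at every multiple of the threshold that is reached.
    return list(range(threshold, total_iterations + 1, threshold))


print(warning_points(350, threshold=100))
```

With a threshold of 100 and a loop that runs 350 times, the warnings fall on iterations 100, 200, and 300.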

Defining loop variables

You can declare and use loop variables within a Transformer stage. You can use the loop variables in expressions within the stage.

About this task

You can use loop variables when a loop condition is defined for the Transformer stage. When a loop is defined, the Transformer stage can output multiple rows for every row input to the stage. Loop variables are evaluated every time that the loop is iterated, and so can change their value for every output row. Such variables are accessible only from the Transformer stage in which they are declared. You cannot use a loop variable in a stage variable derivation.

Loop variables can be used as follows:

They can be assigned values by expressions.
They can be used in expressions which define an output column derivation.
Expressions evaluating a variable can include other loop variables or stage variables or the variable being evaluated itself.

Any loop variables you declare are shown in a table in the right pane of the links area. The table looks like the output link table and the stage variables table. You can maximize or minimize the table by clicking the arrow in the table title bar.

The table lists the loop variables together with the expressions that are used to derive their values. Link lines join the loop variables with input columns used in the expressions. Links from the right side of the table link the variables to the output columns that use them, or to the stage variables that they use.

To declare a loop variable:

Procedure

Select Loop Variable Properties from the loop variable pop-up menu.
In the grid on the Loop Variables tab, enter the variable name, initial value, SQL type, extended information (if variable contains Unicode data), precision, scale, and an optional description. Variable names must begin with an alphabetic character (a-z, A-Z) and can only contain alphanumeric characters (a-z, A-Z, 0-9).
Click OK. The new loop variable appears in the loop variable table in the links pane.
Note: You can also add a loop variable by selecting Insert New Loop Variable or Append New Loop Variable from the loop variable pop-up menu. A new variable is added to the loop variables table in the links pane. The first variable is given the default name LoopVar and default data type VarChar (255), subsequent loop variables are named LoopVar1, LoopVar2, and so on. You can edit the variables on the Loop Variables tab of the Stage Properties window.

Example

Figure 1. Example Transformer stage with loop variable defined

1: Loop example: converting a single row to multiple rows

You can use the Transformer stage to convert a single row for data with repeating columns to multiple output rows.

Input data with multiple repeating columns

When the input data contains rows with multiple columns containing repeating data, you can use the Transformer stage to produce multiple output rows: one for each of the repeating columns.

For example, if the input row contained the following data.

Col1	Col2	Name1	Name2	Name3
abc	def	Jim	Bob	Tom

You can use the Transformer stage to flatten the input data and create multiple output rows for each input row. The data now comprises the following columns.

Col1	Col2	Name
abc	def	Jim
abc	def	Bob
abc	def	Tom

To implement this scenario in the Transformer stage, make the following settings:

Loop condition

Enter the following expression as the loop condition.

@ITERATION <= 3

Because each input row has three columns containing names, you need to process each input row three times and create three separate output rows.

Loop variable

Define a loop variable to supply the value for the new column Name in your output rows. The value of LoopVar1 is set by the following expression:

IF (@ITERATION = 1) THEN inlink.Name1
ELSE IF (@ITERATION = 2) THEN inlink.Name2
ELSE inlink.Name3

Output link metadata and derivations

Define the output link columns and their derivations:

Col1 - inlink.col1
Col2 - inlink.col2
Name - LoopVar1

As I understand it, each loop iteration takes one of the repeating columns and outputs it as its own row.
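The settings for this first example can be mirrored in a few lines of Python. This is a sketch of the logic only; `flatten_names` is a hypothetical name, and the dictionary keys stand in for the inlink columns.

```python
def flatten_names(row):
    """Turn one input row with Name1..Name3 into three output rows."""
    output = []
    iteration = 1
    while iteration <= 3:  # loop condition: @ITERATION <= 3
        # Loop variable LoopVar1: pick the column that matches the iteration.
        if iteration == 1:
            loop_var1 = row["Name1"]
        elif iteration == 2:
            loop_var1 = row["Name2"]
        else:
            loop_var1 = row["Name3"]
        # Output link derivations: Col1, Col2, Name.
        output.append({"Col1": row["Col1"], "Col2": row["Col2"], "Name": loop_var1})
        iteration += 1
    return output


row = {"Col1": "abc", "Col2": "def", "Name1": "Jim", "Name2": "Bob", "Name3": "Tom"}
for out in flatten_names(row):
    print(out)
```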

2: Loop example: multiple repeating values in a single field

You can use the Transformer stage to convert a single row for data with repeating values in a single column to multiple output rows.

Input data with multiple repeating values in a single field

When you have data where a single column contains multiple repeating values that are separated by a delimiter, you can flatten the data to produce multiple output rows: one for each of the delimited values. You can also specify that certain values are filtered out, and not have a new row created.

For example, the input row contains the following data.

Col1	Col2	Names
abc	def	Jim/Bob/Tom

You want to flatten the name field so a new row is created for every new name indicated by the slash (/) character. You also want to filter out the name Jim and drop the column named Col2, so that the resulting output data for the example row produces two rows with two columns.

Col1	Name
abc	Bob
abc	Tom

To implement this scenario in the Transformer stage, make the following settings:

Stage variable

Define a stage variable to hold a count of the fields separated by the delimiter character. The value of StageVar1 is set by the following expression:

DCOUNT(inlink.Names, "/")

Loop condition

Enter the following expression as the loop condition:

@ITERATION <= StageVar1

The loop continues to iterate for the count in the Names column.

Loop variable

Define a loop variable to supply the value for the new column Name in your output rows. The value of LoopVar1 is set by the following expression:

FIELD(inlink.Names, "/", @ITERATION, 1)

This expression extracts the substrings delimited by the slash character (/) from the input column.

Output link constraint

Define an output link constraint to filter out the name Jim. Use the following expression to define the constraint:

LoopVar1 <> "Jim"

Output link metadata and derivations

Define the output link columns and their derivations. Drop the Col2 column by not including it in the metadata.

Col1 - inlink.col1
Name - LoopVar1
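Here is a Python sketch of the same flow. The helpers `dcount` and `field` are my own rough approximations of DCOUNT and FIELD for this one use (single-character delimiter, one field per call) and are not the DataStage implementations; `split_names` is a hypothetical name.

```python
def dcount(text, delimiter):
    """Approximate DCOUNT: number of delimited fields (0 for an empty string)."""
    return 0 if text == "" else text.count(delimiter) + 1


def field(text, delimiter, occurrence):
    """Approximate FIELD(text, delimiter, occurrence, 1); occurrence is 1-based."""
    parts = text.split(delimiter)
    return parts[occurrence - 1] if 1 <= occurrence <= len(parts) else ""


def split_names(row):
    stage_var1 = dcount(row["Names"], "/")           # stage variable StageVar1
    output = []
    iteration = 1
    while iteration <= stage_var1:                    # loop condition
        loop_var1 = field(row["Names"], "/", iteration)  # loop variable LoopVar1
        if loop_var1 != "Jim":                        # output link constraint
            # Col2 is dropped simply by not including it in the output.
            output.append({"Col1": row["Col1"], "Name": loop_var1})
        iteration += 1
    return output


print(split_names({"Col1": "abc", "Col2": "def", "Names": "Jim/Bob/Tom"}))
```

The loop still runs three times; the constraint just stops the first iteration from producing an output row.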

3: Loop example: generating new rows

You can use the Transformer stage to generate new rows, based on the value of a column in the input row.

Value in an input row column used to generate new output rows

You can use the Transformer stage to generate new rows, based on values held in an input column.

For example, you have an input column that contains a count, and want to generate output rows based on the value of the count. The following example column has a count value of 5.

Col1	Col2	MaxCount
abc	def	5

You can generate five output rows for this one input row based on the value in the MaxCount column.

Col1	Col2	EntryNumber
abc	def	1
abc	def	2
abc	def	3
abc	def	4
abc	def	5

To implement this scenario in the Transformer stage, make the following settings:

Loop condition

Enter the following expression as the loop condition:

@ITERATION <= inlink.MaxCount

For each input row, the loop iterates the number of times defined by the value of the MaxCount column.

Output link metadata and derivations

Define the output link columns and their derivations:

Col1 - inlink.Col1
Col2 - inlink.Col2
EntryNumber - @ITERATION
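In Python terms, this example amounts to a counted loop per input row. A minimal sketch, with `generate_rows` as a hypothetical name:

```python
def generate_rows(row):
    """Emit MaxCount output rows for one input row."""
    output = []
    iteration = 1
    while iteration <= row["MaxCount"]:  # loop condition: @ITERATION <= inlink.MaxCount
        output.append(
            {"Col1": row["Col1"], "Col2": row["Col2"], "EntryNumber": iteration}
        )
        iteration += 1
    return output


for out in generate_rows({"Col1": "abc", "Col2": "def", "MaxCount": 5}):
    print(out)
```

A row whose MaxCount is 0 produces no output rows at all, because the condition is already false on the first check.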

4: Loop example: aggregating data

You can use the Transformer stage to add aggregated information to output rows.

Aggregation operations make use of a cache that stores input rows. You can monitor the number of entries in the cache by setting a threshold level in the Loop Variable tab of the Stage Properties window. If the threshold is reached when the job runs, a warning is issued into the log, and the job continues to run.

Input row group aggregation included with input row data

You can save input rows to a cache area, so that you can process this data in a loop.

For example, you have input data that has a column holding a price value. You want to add a column to the output rows. The new column indicates what percentage the price value is of the total value for prices in all rows in that group. The value for the new Percentage column is calculated by the following expression.

(price * 100)/sum of all prices in group

In the example, the data is sorted and is grouped on the value in Col1.

Col1	Col2	Price
1000	abc	100.00
1000	def	20.00
1000	ghi	60.00
1000	jkl	20.00
2000	zyx	120.00
2000	wvu	110.00
2000	tsr	170.00

The percentage for each row in the group where Col1 = 1000 is calculated by the following expression.

(price * 100)/200

The percentage for each row in the group where Col1 = 2000 is calculated by the following expression.

(price * 100)/400

The output is shown in the following table.

Col1	Col2	Price	Percentage
1000	abc	100.00	50.00
1000	def	20.00	10.00
1000	ghi	60.00	30.00
1000	jkl	20.00	10.00
2000	zyx	120.00	30.00
2000	wvu	110.00	27.50
2000	tsr	170.00	42.50

This scenario uses key break facilities that are available on the Transformer stage. You can use these facilities to detect when the value of an input column changes, and so group rows as you process them.

This scenario is implemented by storing the grouped rows in an input row cache and processing them when the value in a key column changes. In the example, the grouped rows are processed when the value in the column named Col1 changes from 1000 to 2000. Two functions, SaveInputRecord() and GetSavedInputRecord(), are used to add input rows to the cache and retrieve them. SaveInputRecord() is called when a stage variable is evaluated, and returns the count of rows in the cache (starting at 1 when the first row is added). GetSavedInputRecord() is called when a loop variable is evaluated.

To implement this scenario in the Transformer stage, make the following settings:

Stage variable

Define the following stage variables:

NumSavedRows: SaveInputRecord()
IsBreak: LastRowInGroup(inlink.Col1)
TotalPrice: IF IsBreak THEN SummingPrice + inlink.Price ELSE 0
SummingPrice: IF IsBreak THEN 0 ELSE SummingPrice + inlink.Price
NumRows: IF IsBreak THEN NumSavedRows ELSE 0

Loop condition

Enter the following expression as the loop condition:

@ITERATION <= NumRows

The loop continues to iterate for the count specified in the NumRows variable.

Loop variables

Define the following loop variable:

SavedRowIndex: GetSavedInputRecord()

Output link metadata and derivations

Define the output link columns and their derivations:

Col1 - inlink.Col1
Col2 - inlink.Col2
Price - inlink.Price
Percentage - (inlink.Price * 100)/TotalPrice

SaveInputRecord() is called in the first Stage Variable (NumSavedRows). SaveInputRecord() saves the current input row in the cache, and returns the count of records currently in the cache. Each input row in a group is saved until the break value is reached. At the last value in the group, NumRows is set to the number of rows stored in the input cache. The Loop Condition then loops round the number of times specified by NumRows, calling GetSavedInputRecord() each time to make the next saved input row current before re-processing each input row to create each output row. The usage of the inlink columns in the output link refers to their values in the currently retrieved input row, so will change on each output loop.
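The order of evaluation is the hard part of this example, so here is a Python simulation of it. It is a sketch under assumptions: a list stands in for the input row cache, a look-ahead on the next row stands in for LastRowInGroup(), and `percentage_of_group` is a hypothetical name. The stage variables are evaluated top to bottom for each input row, exactly in the order listed above.

```python
def percentage_of_group(rows):
    """rows must be sorted by Col1. Returns rows with a Percentage column added."""
    cache = []           # stands in for the input row cache
    summing_price = 0.0  # stage variable SummingPrice keeps its value between rows
    output = []
    for i, row in enumerate(rows):
        # --- stage variables ---
        cache.append(row)                    # NumSavedRows: SaveInputRecord()
        num_saved_rows = len(cache)          # ... which returns the cache count
        # IsBreak: stand-in for LastRowInGroup(inlink.Col1)
        is_break = i == len(rows) - 1 or rows[i + 1]["Col1"] != row["Col1"]
        # TotalPrice is evaluated before SummingPrice is reset, so it sees the old sum.
        total_price = summing_price + row["Price"] if is_break else 0
        summing_price = 0.0 if is_break else summing_price + row["Price"]
        num_rows = num_saved_rows if is_break else 0
        # --- loop: @ITERATION <= NumRows ---
        iteration = 1
        while iteration <= num_rows:
            current = cache.pop(0)           # SavedRowIndex: GetSavedInputRecord()
            output.append({
                "Col1": current["Col1"],
                "Col2": current["Col2"],
                "Price": current["Price"],
                "Percentage": current["Price"] * 100 / total_price,
            })
            iteration += 1
    return output


data = [
    {"Col1": 1000, "Col2": "abc", "Price": 100.0},
    {"Col1": 1000, "Col2": "def", "Price": 20.0},
    {"Col1": 1000, "Col2": "ghi", "Price": 60.0},
    {"Col1": 1000, "Col2": "jkl", "Price": 20.0},
    {"Col1": 2000, "Col2": "zyx", "Price": 120.0},
    {"Col1": 2000, "Col2": "wvu", "Price": 110.0},
    {"Col1": 2000, "Col2": "tsr", "Price": 170.0},
]
for out in percentage_of_group(data):
    print(out)
```

For every row that is not the last of its group, NumRows is 0, so the loop body never runs and nothing is output; all rows of the group are emitted together once the break row arrives.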

Caching selected input rows

You can call the SaveInputRecord() within an expression, so that input rows are only saved in the cache when the expression evaluates as true.

For example, you can implement the scenario described, but save only input rows where the price column is not 0. The settings are as follows:

Stage variable

Define the following stage variables:

IgnoreRow: IF (inlink.Price = 0) THEN 1 ELSE 0
NumSavedRows: IF IgnoreRow THEN SavedRowSum ELSE SaveInputRecord()
IsBreak: LastRowInGroup(inlink.Col1)
SavedRowSum: IF IsBreak THEN 0 ELSE NumSavedRows
TotalPrice: IF IsBreak THEN SummingPrice + inlink.Price ELSE 0
SummingPrice: IF IsBreak THEN 0 ELSE SummingPrice + inlink.Price
NumRows: IF IsBreak THEN NumSavedRows ELSE 0

Loop condition

Enter the following expression as the loop condition:

@ITERATION <= NumRows

Loop variables

Define the following loop variable:

SavedRowIndex: GetSavedInputRecord()

Output link metadata and derivations

Define the output link columns and their derivations:

Col1 - inlink.Col1
Col2 - inlink.Col2
Price - inlink.Price
Percentage - (inlink.Price * 100)/TotalPrice

This example produces output similar to the previous example, but the aggregation does not include Price values of 0, and no output rows with a Price value of 0 are produced.

-----------

Outputting additional generated rows

This example is based on the first example, but, in this case, you want to identify any input row where the Price is greater than or equal to 100. If an input row has a Price greater than or equal to 100, then a 25% discount is applied to the Price and a new additional output row is generated. The Col1 value in the new row has 1 added to it to indicate an extra discount entry. The original input row is still output as normal. Therefore any input row with a Price of greater than or equal to 100 will produce two output rows, one with the discounted price and one without.

The input data is as shown in the following table:

Col1	Col2	Price
1000	abc	100.00
1000	def	20.00
1000	ghi	60.00
1000	jkl	20.00
2000	zyx	120.00
2000	wvu	110.00
2000	tsr	170.00

The required output is shown in the following table:

Col1	Col2	Price	Percentage
1000	abc	100.00	50.00
1001	abc	75.00	50.00
1000	def	20.00	10.00
1000	ghi	60.00	30.00
1000	jkl	20.00	10.00
2000	zyx	120.00	30.00
2001	zyx	90.00	30.00
2000	wvu	110.00	27.50
2001	wvu	82.50	27.50
2000	tsr	170.00	42.50
2001	tsr	127.50	42.50

To implement this scenario in the Transformer stage, make the following settings:

Stage variable

Define the following stage variables:

NumSavedRowInt: SaveInputRecord()
AddRow: IF (inlink.Price >= 100) THEN 1 ELSE 0
NumSavedRows: IF AddRow THEN SaveInputRecord() ELSE NumSavedRowInt
IsBreak: LastRowInGroup(inlink.Col1)
TotalPrice: IF IsBreak THEN SummingPrice + inlink.Price ELSE 0
SummingPrice: IF IsBreak THEN 0 ELSE SummingPrice + inlink.Price
NumRows: IF IsBreak THEN NumSavedRows ELSE 0

Loop condition

Enter the following expression as the loop condition:

@ITERATION <= NumRows

The loop continues to iterate for the count specified in the NumRows variable.

Loop variables

Define the following loop variables:

SavedRowIndex: GetSavedInputRecord()
AddedRow: LastAddedRow
LastAddedRow: IF (inlink.Price < 100) THEN 0 ELSE IF (AddedRow = 0) THEN 1 ELSE 0

Output link metadata and derivations

Define the output link columns and their derivations:

Col1 - IF (inlink.Price < 100) THEN inlink.Col1 ELSE IF (AddedRow = 0) THEN inlink.Col1 ELSE inlink.Col1 + 1
Col2 - inlink.Col2
Price - IF (inlink.Price < 100) THEN inlink.Price ELSE IF (AddedRow = 0) THEN inlink.Price ELSE inlink.Price * 0.75
Percentage - (inlink.Price * 100)/TotalPrice

SaveInputRecord is called either once or twice depending on the value of Price. When SaveInputRecord is called twice, in addition to the normal aggregation, it produces the extra output record with the recalculated Price value. The Loop variable AddedRow is used to evaluate the output column values differently for each of the duplicate input rows.
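The interplay between the double save and the AddedRow / LastAddedRow loop variables is easier to follow when simulated. The Python sketch below models it under assumptions: a list stands in for the input row cache, a look-ahead stands in for LastRowInGroup(), LastAddedRow is assumed to start at 0, and `discount_rows` is a hypothetical name.

```python
def discount_rows(rows):
    """rows must be sorted by Col1. Rows with Price >= 100 are output twice."""
    cache = []
    summing_price = 0.0
    last_added_row = 0  # loop variable LastAddedRow (initial value 0 is an assumption)
    output = []
    for i, row in enumerate(rows):
        # --- stage variables ---
        cache.append(row)                        # NumSavedRowInt: SaveInputRecord()
        num_saved_row_int = len(cache)
        add_row = 1 if row["Price"] >= 100 else 0
        if add_row:
            cache.append(row)                    # second SaveInputRecord() call
            num_saved_rows = len(cache)
        else:
            num_saved_rows = num_saved_row_int
        is_break = i == len(rows) - 1 or rows[i + 1]["Col1"] != row["Col1"]
        total_price = summing_price + row["Price"] if is_break else 0
        summing_price = 0.0 if is_break else summing_price + row["Price"]
        num_rows = num_saved_rows if is_break else 0
        # --- loop: @ITERATION <= NumRows ---
        iteration = 1
        while iteration <= num_rows:
            cur = cache.pop(0)                   # SavedRowIndex: GetSavedInputRecord()
            added_row = last_added_row           # AddedRow: value from the previous iteration
            if cur["Price"] < 100:
                last_added_row = 0
            else:
                last_added_row = 1 if added_row == 0 else 0
            # The second copy of a saved row is the one that gets the discount.
            is_extra = cur["Price"] >= 100 and added_row != 0
            output.append({
                "Col1": cur["Col1"] + 1 if is_extra else cur["Col1"],
                "Col2": cur["Col2"],
                "Price": cur["Price"] * 0.75 if is_extra else cur["Price"],
                "Percentage": cur["Price"] * 100 / total_price,
            })
            iteration += 1
    return output


data = [
    {"Col1": 1000, "Col2": "abc", "Price": 100.0},
    {"Col1": 1000, "Col2": "def", "Price": 20.0},
    {"Col1": 1000, "Col2": "ghi", "Price": 60.0},
    {"Col1": 1000, "Col2": "jkl", "Price": 20.0},
    {"Col1": 2000, "Col2": "zyx", "Price": 120.0},
    {"Col1": 2000, "Col2": "wvu", "Price": 110.0},
    {"Col1": 2000, "Col2": "tsr", "Price": 170.0},
]
for out in discount_rows(data):
    print(out)
```

Because TotalPrice is summed once per input row rather than once per saved copy, the duplicated rows do not inflate the group total, and both copies of a row report the same Percentage.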

Runtime errors

The number of calls to SaveInputRecord() and GetSavedInputRecord() must match for each loop. You can call SaveInputRecord() multiple times to add to the cache, but once you call GetSavedInputRecord(), then you must call it enough times to empty the input cache before you can call SaveInputRecord() again. The examples described can generate runtime errors in the following circumstances by not observing this rule:

If your Transformer stage calls GetSavedInputRecord before SaveInputRecord, then a fatal error similar to the following example is reported in the job log:
```
APT_CombinedOperatorController,0: Fatal Error: get_record() called on 
record 1 but only 0 records saved by save_record()
```
If your Transformer stage calls GetSavedInputRecord more times than SaveInputRecord is called, then a fatal error similar to the following example is reported in the job log:
```
APT_CombinedOperatorController,0: Fatal Error: get_record() called on 
record 3 but only 2 records saved by save_record()
```
If your Transformer stage calls SaveInputRecord but does not call GetSavedInputRecord, then a fatal error similar to the following example is reported in the job log:
```
APT_CombinedOperatorController,0: Fatal Error: save_record() called on 
record 3, but only 0 records retrieved by get_record()
```
If your Transformer stage does not call GetSavedInputRecord as many times as SaveInputRecord, then a fatal error similar to the following example is reported in the job log:
```
APT_CombinedOperatorController,0: Fatal Error: save_record() called on 
record 3, but only 2 records retrieved by get_record()
```
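The pairing rule can be modelled with a small Python class. This is a toy sketch of the rule as stated above, not of the DataStage runtime: `InputRowCache` is a hypothetical name, and the error texts are my own wording rather than real job log messages.

```python
class InputRowCache:
    """Toy model: once retrieval starts, the cache must be emptied before saving again."""

    def __init__(self):
        self._rows = []
        self._draining = False  # True after a get() until the cache is empty again

    def save(self, row):
        if self._draining:
            raise RuntimeError("save called before all saved rows were retrieved")
        self._rows.append(row)
        return len(self._rows)  # count of rows currently in the cache

    def get(self):
        if not self._rows:
            raise RuntimeError("get called but no saved rows are left")
        row = self._rows.pop(0)
        self._draining = bool(self._rows)  # still draining while rows remain
        return row


cache = InputRowCache()
cache.save("row1")
cache.save("row2")
print(cache.get(), cache.get())  # both rows retrieved, so saving is allowed again
print(cache.save("row3"))
```

Saving several rows in a row is fine; what the model rejects is a save in the middle of retrieval, or a retrieval from an empty cache.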

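Heh, dizzy after that last example? So was I. Luckily I found the following bit of material, which is worth a look for now.

Transformer memory features

The DataStage Transformer can remember rows and detect key changes. For years, ETL practitioners hand-coded well-known workarounds to get the same behavior in DataStage. In a DataStage job, a key break groups together the records that share the same key, and we want to process those records as an array.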

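The Transformer has two new cache functions, SaveInputRecord() and GetSavedInputRecord(). You can save a record and retrieve it later, for example to compare two or more records inside the Transformer.

There are also new system variables and functions for looping and key-change detection: @ITERATION; LastRow(), which indicates the last row of the input; and LastRowInGroup(InputColumn), which indicates whether the value of the given column changes in the next record.

Below is an example that computes totals: on each key change, it loops over the rows and computes the total for each key.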

http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ds.parjob.dev.doc%2Ftopics%2Fspecifyingaloopcondition.html

http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r7/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ds.parjob.dev.doc%2Ftopics%2Fc_deeref_Functions_functions.html

http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r5/index.jsp?topic=%2Fcom.ibm.swg.im.iis.ds.parjob.dev.doc%2Ftopics%2Fspecifyingaloopexample4.html
