Translated from DWH PRO.
In relational data modeling theory, each entity needs a primary key, and each primary key value uniquely identifies an object. Teradata initially followed the rules of relational data modeling strictly and did not allow row duplicates in a table.
As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.
When the ANSI standard permitted row duplicates, Teradata added this option as well. The ANSI standard's stance on duplicates led Teradata to support both SET tables (no row duplicates allowed) and MULTISET tables (row duplicates allowed).
This background makes clear why the default in Teradata session mode is SET tables, while the default in ANSI session mode is MULTISET tables.
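A minimal sketch of how the session mode changes the default (the table names are illustrative); stating the table kind explicitly avoids surprises in either mode:

-- Without an explicit SET or MULTISET keyword, the session mode decides:
-- Teradata (BTET) mode creates a SET table, ANSI mode a MULTISET table.
CREATE TABLE TMP_DEFAULT
(
PK INTEGER NOT NULL
) PRIMARY INDEX (PK);

-- Being explicit gives the same table kind in both modes:
CREATE MULTISET TABLE TMP_EXPLICIT
(
PK INTEGER NOT NULL
) PRIMARY INDEX (PK);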
Let me repeat the key aspects of SET and MULTISET tables: a SET table does not allow row duplicates, while a MULTISET table does. To enforce this, Teradata has to run a DUPLICATE ROW CHECK for every row inserted into a SET table with a NUPI (Non-Unique Primary Index), comparing the new row against all existing rows with the same primary index value. A UPI (Unique Primary Index) makes this check unnecessary, as the index itself guarantees uniqueness.
Instead of a UPI, we can also use a USI (Unique Secondary Index), or any column with a UNIQUE or PRIMARY KEY constraint.
Any index or constraint which ensures uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK entirely!
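A short sketch of the two common ways to give a SET table a uniqueness guarantee (table names are illustrative):

-- A UPI enforces uniqueness through the index itself,
-- so no DUPLICATE ROW CHECK is needed:
CREATE SET TABLE TMP_UPI
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (PK);

-- A USI on a NUPI table has the same effect: the USI's uniqueness
-- check replaces the DUPLICATE ROW CHECK on insert:
CREATE SET TABLE TMP_USI
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE UNIQUE INDEX (DESCR) ON TMP_USI;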
SET tables are good candidates for performance improvement. The easiest way to find all SET tables on a Teradata system is to query the table "TABLES" in database "DBC":
SELECT * FROM DBC.TABLES WHERE checkopt = 'N' AND TABLEKIND = 'T'; -- Set tables
SELECT * FROM DBC.TABLES WHERE checkopt = 'Y' AND TABLEKIND = 'T'; -- Multiset tables
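A variation that narrows the search to a single database and returns only the relevant columns; the database name 'Sandbox' is a placeholder:

SELECT DatabaseName, TableName
FROM DBC.TABLES
WHERE checkopt = 'N'
AND TABLEKIND = 'T'
AND DatabaseName = 'Sandbox' -- placeholder: restrict to your target database
ORDER BY DatabaseName, TableName;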
All tables where uniqueness is guaranteed programmatically, as is the case when the rows are produced by a GROUP BY statement, can be switched from SET to MULTISET. In this way, we can make performance improvements. The magnitude of the improvement depends on the number of rows per primary index value.
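Since the table kind cannot simply be changed with ALTER TABLE, the usual pattern is to recreate the table, sketched here with illustrative names:

-- Recreate the SET table as MULTISET, then swap the names:
CREATE MULTISET TABLE MY_TABLE_NEW AS MY_TABLE WITH DATA;
DROP TABLE MY_TABLE;
RENAME TABLE MY_TABLE_NEW TO MY_TABLE;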
Again: the cost of the DUPLICATE ROW CHECK grows quadratically with the number of rows per Primary Index (PI) value, because each newly inserted row has to be compared against all rows already stored with the same PI value. With 2,000 rows per PI value, for example, the 2,000th insert alone requires 1,999 comparisons!
Here is an example to prove the performance penalty of a SET table with many duplicate primary index values:
In our example, we create two identical tables:
CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);
-- In the next step we check our session id, as we will need it to analyze the resource usage:
SELECT SESSION;
7376827
-- We insert random data into the Set and Multiset table but only use 500 different Primary Index values to cause
-- some impact on performance. The "descr" column has to be unique as row level duplicates would be filtered.
INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM SYS_CALENDAR.CALENDAR; -- assumed row source: any table with enough rows will do
INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM SYS_CALENDAR.CALENDAR; -- same assumed row source as above
-- We compare the CPU seconds and DISK IOs for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
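A more focused variant of the DBQL query makes the comparison easier to read; the column names come from the standard DBC.DBQLOGTBL layout, and the session id is the one returned above:

SELECT QueryText, AMPCPUTime, TotalIOCount
FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827
ORDER BY StartTime;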
The above example shows that the SET table consumes far more disk IOs and CPU seconds than the MULTISET table.
While Multiload is compatible with SET and MULTISET tables, this is not the case for the Fastload Utility. Fastload will filter row duplicates whether the table is a SET table or a MULTISET table.
If we want to get rid of row duplicates in a MULTISET table, we can use a combination of FastExport and Fastload; Fastload will filter out the row duplicates.
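An SQL-only alternative, sketched with the table names from the example above: in Teradata session mode, an INSERT...SELECT into a SET table silently discards row duplicates, so copying the MULTISET table into a SET table removes them:

CREATE SET TABLE TMP_DEDUP AS TMP_MULTISET WITH NO DATA;
INSERT INTO TMP_DEDUP SELECT * FROM TMP_MULTISET; -- duplicates are dropped by the SET semantics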