Teradata Multiset vs. Set Tables – Usage Guidelines for the Expert

Translated from DWH PRO.

In relational data modeling theory, each entity needs a primary key, and each primary key value uniquely identifies one object. Teradata initially followed the rules of relational data modeling strictly and did not allow row duplicates in a table.

As a logical data modeler, I would expect duplicate rows only in the physical data model. From a business point of view, I can’t imagine a scenario where we would need to store row-level duplicates. Data cleansing as part of the ETL/ELT chain is the correct approach to handling row duplicates.

When the ANSI standard permitted row duplicates, Teradata added this option as well. The ANSI standard's position on duplicates led Teradata to support both SET tables (no row duplicates allowed) and MULTISET tables (row duplicates allowed).

This background makes clear why the default in Teradata session mode is SET tables, while the default in ANSI mode is MULTISET tables.
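
Because the default depends on the session mode, it is safer to state the table kind explicitly. A minimal sketch (table names are hypothetical):

CREATE SET TABLE demo_set (c1 INTEGER) PRIMARY INDEX (c1);            -- never stores duplicate rows
CREATE MULTISET TABLE demo_multiset (c1 INTEGER) PRIMARY INDEX (c1);  -- may store duplicate rows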

Let me repeat the key aspects of SET and MULTISET tables:

  • In contrast to MULTISET tables, SET tables forbid duplicate rows from being inserted with an INSERT statement or created with an UPDATE statement.
  • For INSERT INTO ... SELECT statements, duplicate rows are silently discarded for SET tables (no error occurs). INSERT INTO ... VALUES statements, however, abort and return an error message for duplicate rows (see the sketch after this list).
  • There is no way to change an existing SET table into a MULTISET table. I don’t know why this limitation exists…
  • SET tables can have a negative performance impact. Each time a row is inserted or updated, Teradata checks whether this would violate row uniqueness. This test is called the DUPLICATE ROW CHECK, and it severely degrades performance if many rows with the same primary index are inserted: each new row has to be compared against all rows already stored with the same row hash, so the total number of checks grows quadratically with the number of rows per primary index value!
  • Still, the performance of SET tables is not degraded when a UPI (Unique Primary Index) is defined on the table. As the UPI itself ensures uniqueness, no DUPLICATE ROW CHECK is done.
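
A quick sketch of both INSERT behaviors, reusing the hypothetical demo_set table from above (2802 is Teradata's duplicate-row error code):

INSERT INTO demo_set VALUES (1);  -- ok
INSERT INTO demo_set VALUES (1);  -- aborts with error 2802 (duplicate row)
INSERT INTO demo_set SELECT 1;    -- no error: the duplicate is silently discarded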

Instead of a UPI, we can also use a USI (Unique Secondary Index) or any column with a UNIQUE or PRIMARY KEY constraint.

Any index or constraint that ensures uniqueness allows Teradata to bypass the DUPLICATE ROW CHECK entirely!
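
For illustration, either of the following definitions lets Teradata skip the check (a sketch with hypothetical table names):

-- Uniqueness guaranteed by a UPI:
CREATE SET TABLE demo_upi
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) UNIQUE PRIMARY INDEX (PK);

-- Uniqueness guaranteed by a USI on top of a non-unique primary index:
CREATE SET TABLE demo_usi
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK)
UNIQUE INDEX (DESCR);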

SET tables are good candidates for performance improvement. The easiest way to find all SET tables on a Teradata system is to query the table “TABLES” in database “DBC”:

SELECT * FROM DBC.TABLES WHERE CHECKOPT = 'N' AND TABLEKIND = 'T'; -- SET tables

SELECT * FROM DBC.TABLES WHERE CHECKOPT = 'Y' AND TABLEKIND = 'T'; -- MULTISET tables

All tables where uniqueness is guaranteed programmatically (as is the case, for example, when the rows are produced by a GROUP BY statement) can be switched from SET to MULTISET. In this way, we can gain performance. The magnitude of the improvement depends on the number of rows per primary index value.
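
Since an existing SET table cannot simply be altered to MULTISET (see above), the switch has to be done by recreating the table. A minimal sketch with hypothetical table names, assuming the explicitly specified MULTISET keyword overrides the table kind of the source table:

CREATE MULTISET TABLE my_table_ms AS my_table WITH DATA;  -- copy definition and rows, but as MULTISET
DROP TABLE my_table;
RENAME TABLE my_table_ms TO my_table;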

Again: the total number of DUPLICATE ROW CHECKS grows quadratically with the number of rows per Primary Index (PI) value!
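
To make this concrete: inserting the k-th row for one PI value means comparing it against the k-1 rows already stored under the same row hash, so loading n rows with the same PI value costs 1 + 2 + ... + (n-1) = n(n-1)/2 comparisons in total.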

Here is an example that demonstrates the performance penalty of a SET table with many duplicate primary index values:

In our example, we create two identical tables:

CREATE SET TABLE TMP_SET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

CREATE MULTISET TABLE TMP_MULTISET
(
PK INTEGER NOT NULL,
DESCR INTEGER NOT NULL
) PRIMARY INDEX (PK);

-- In the next step, we check our session id, as we will need it to analyze the resource usage:

SELECT SESSION;

7376827

-- We insert random data into the SET and the MULTISET table, but use only 500 distinct Primary Index
-- values to provoke an impact on performance. The "descr" column has to be unique, as row-level duplicates would be filtered out.
INSERT INTO TMP_MULTISET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM SYS_CALENDAR.CALENDAR;  -- the source table was lost in the original; any sufficiently large table works as a row generator

INSERT INTO TMP_SET
SELECT
RANDOM(1,500) AS x,
RANDOM(1,999999999) AS descr
FROM SYS_CALENDAR.CALENDAR;  -- same stand-in source as above

-- We compare the CPU seconds and disk IOs for the SET and the MULTISET table:
SELECT * FROM DBC.DBQLOGTBL WHERE SESSIONID = 7376827;
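
A slightly more focused variant (a sketch; AMPCPUTime, TotalIOCount, and StartTime are standard DBQL columns, but whether they are populated depends on your site's DBQL logging setup):

SELECT QUERYTEXT, AMPCPUTIME, TOTALIOCOUNT
FROM DBC.DBQLOGTBL
WHERE SESSIONID = 7376827
ORDER BY STARTTIME;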

[Image: DBC.DBQLOGTBL output comparing the SET and MULTISET INSERT statements]

The example above shows that the SET table consumes far more disk IOs and CPU seconds than the MULTISET table.

Set Tables / Multiset Tables and the Loading Utilities

While Multiload is compatible with both SET and MULTISET tables, this is not the case for the Fastload utility: Fastload filters out row duplicates no matter whether the table is a SET table or a MULTISET table.

If we want to get rid of row duplicates in a MULTISET table, we can therefore combine FastExport and Fastload; the Fastload step will filter out the row duplicates.
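
A minimal sketch of such a round trip. The logon string, file name, and table names are placeholders, and details such as record format options and error-table cleanup are omitted, so treat this as an outline rather than a finished script:

/* FastExport: write all rows of the MULTISET table to a flat file */
.LOGTABLE ExportLog;
.LOGON tdpid/user,password;
.BEGIN EXPORT;
.EXPORT OUTFILE rows.dat;
SELECT * FROM TMP_MULTISET;
.END EXPORT;
.LOGOFF;

/* FastLoad: reload the file into an empty target; FastLoad drops row duplicates on the way in */
LOGON tdpid/user,password;
BEGIN LOADING TMP_MULTISET_CLEAN ERRORFILES Err1, Err2;
DEFINE PK (INTEGER), DESCR (INTEGER)
FILE = rows.dat;
INSERT INTO TMP_MULTISET_CLEAN VALUES (:PK, :DESCR);
END LOADING;
LOGOFF;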
