DataStage: An Overview of Data Partitioning

Partitioning
       The aim of most partitioning operations is to end up with a set of partitions
that are as near equal size as possible, ensuring an even load across your
processors.
       When performing some operations however, you will need to take control
of partitioning to ensure that you get consistent results. A good example
of this would be where you are using an aggregator stage to summarize
your data. To get the answers you want (and need) you must ensure that
related data is grouped together in the same partition before the summary
operation is performed on that partition. DataStage lets you do this.
        A number of different partitioning methods are available. Note that
all these descriptions assume you are starting with sequential data. If you
are repartitioning already partitioned data then there are some specific
considerations (see “Repartitioning” on page 2-25):

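        The purpose of partitioning is to spread a large volume of data across a set of partitions as evenly as possible, so that each partition can be loaded onto and processed by its own processor.
        Partitioning requires an appropriate partitioning rule, so that the data stays consistent as a whole after it has been split.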

Round robin
        The first record goes to the first processing node, the second to the second
processing node, and so on. When DataStage reaches the last processing
node in the system, it starts over. This method is useful for resizing partitions
of an input data set that are not equal in size. The round robin
method always creates approximately equal-sized partitions. This method
is the one normally used when DataStage initially partitions data.

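Round robin partitioning:
        Records are taken one at a time, in the order they arrive, and placed into the partitions in turn. For example, suppose 10,000 records are split into 3 partitions. With the round robin method, DataStage puts the first record into the first partition, the second into the second partition and the third into the third partition, then wraps around: the fourth record goes to the first partition, the fifth to the second, and so on.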
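
The dealing-out described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DataStage code; the function name round_robin_partition is made up for this example.

```python
def round_robin_partition(records, num_partitions):
    """Deal records out to partitions in turn, in input order."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        # Record 0 goes to partition 0, record 1 to partition 1, and so on,
        # wrapping around after the last partition.
        partitions[index % num_partitions].append(record)
    return partitions


records = list(range(1, 10001))  # 10,000 records, numbered from 1
parts = round_robin_partition(records, 3)
print([len(p) for p in parts])   # partition sizes differ by at most one
print(parts[0][:3], parts[1][:3], parts[2][:3])
```

Because the target partition depends only on a record's position, the partition sizes can never differ by more than one record.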

Random
        Records are randomly distributed across all processing nodes. Like round
robin, random partitioning can rebalance the partitions of an input data set
to guarantee that each processing node receives an approximately equal-sized
partition. The random partitioning has a slightly higher overhead than round robin
because of the extra processing required to calculate a random value for each record.

Random partitioning:
        Records are distributed at random across the partitions.
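
A rough Python sketch of random partitioning follows. It is an illustration only; the function name and the seed parameter are assumptions added for this example. The extra overhead mentioned above corresponds to drawing one random number per record.

```python
import random


def random_partition(records, num_partitions, seed=None):
    """Send each record to a partition chosen uniformly at random."""
    rng = random.Random(seed)  # seeded only to make the demo repeatable
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[rng.randrange(num_partitions)].append(record)
    return partitions


parts = random_partition(range(10000), 4, seed=0)
sizes = [len(p) for p in parts]
print(sizes)  # roughly 2,500 each, but not exactly equal
```

Unlike round robin, the sizes here are only equal on average, so small differences between partitions are expected.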

Same
        The operator using the data set as input performs no repartitioning and
takes as input the partitions output by the preceding stage. With this partitioning
method, records stay on the same processing node; that is, they are not redistributed.
Same is the fastest partitioning method. This is normally the method DataStage uses
when passing data between stages in your job.

(I could not translate this section.)

Entire
  Every instance of a stage on every processing node receives the complete
data set as input. It is useful when you want the benefits of parallel execution,
but every instance of the operator needs access to the entire input data set.
You are most likely to use this partitioning method with stages that create lookup
tables from their input.
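
As a loose illustration of why a lookup benefits from Entire, the Python sketch below gives every node its own full copy of a small reference table while the main data is split some other way. The data and all names are invented for this example.

```python
def entire_partition(records, num_partitions):
    """Give every partition its own full copy of the records."""
    return [list(records) for _ in range(num_partitions)]


# Hypothetical reference data: a small lookup table.
country_names = [("DE", "Germany"), ("FR", "France"), ("JP", "Japan")]
lookup_copies = [dict(copy) for copy in entire_partition(country_names, 2)]

# The main stream is partitioned some other way (round robin here).
orders = [(1, "FR"), (2, "JP"), (3, "DE"), (4, "FR")]
order_parts = [orders[0::2], orders[1::2]]

# Each node resolves its own rows against its local copy of the lookup table.
enriched = [
    [(order_id, lookup_copies[node][code]) for order_id, code in part]
    for node, part in enumerate(order_parts)
]
print(enriched)
```

Every node can resolve any code locally, at the cost of holding the whole reference table once per node.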

Hash by field
  Partitioning is based on a function of one or more columns (the hash partitioning
keys) in each record. The hash partitioner examines one or more
fields of each input record (the hash key fields). Records with the same
values for all hash key fields are assigned to the same processing node.
This method is useful for ensuring that related records are in the same
partition, which may be a prerequisite for a processing operation. For
example, for a remove duplicates operation, you can hash partition records
so that records with the same partitioning key values are on the
same node. You can then sort the records on each node using the hash key
fields as sorting key fields, then remove duplicates, again using the same
keys. Although the data is distributed across partitions, the hash partitioner
ensures that records with identical keys are in the same partition,
allowing duplicates to be found.
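
The hash, sort, remove-duplicates sequence can be sketched in Python as follows. This is an illustration under assumptions, not DataStage's implementation: zlib.crc32 stands in for whatever hash function the hash partitioner really uses, and the records and function names are made up.

```python
import zlib


def hash_partition(records, key_fields, num_partitions):
    """Send records with identical key values to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = "|".join(str(record[f]) for f in key_fields)
        node = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[node].append(record)
    return partitions


def remove_duplicates(partition, key_fields):
    """Sort on the key fields, then keep the first record of each key."""
    key_of = lambda r: tuple(r[f] for f in key_fields)
    result = []
    for record in sorted(partition, key=key_of):
        if not result or key_of(result[-1]) != key_of(record):
            result.append(record)
    return result


records = [
    {"name": "Ann", "zip": "10001"},
    {"name": "Bob", "zip": "94105"},
    {"name": "Ann", "zip": "10001"},
    {"name": "Cy", "zip": "60601"},
    {"name": "Bob", "zip": "94105"},
    {"name": "Ann", "zip": "60601"},
]
keys = ["name", "zip"]
parts = hash_partition(records, keys, 3)
deduped = [remove_duplicates(p, keys) for p in parts]
print(sum(len(p) for p in deduped))  # 4 distinct (name, zip) pairs remain
```

Each partition is deduplicated independently, yet no duplicate survives overall, because two records with the same key can never land in different partitions.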
        Hash partitioning does not necessarily result in an even distribution of
data between partitions. For example, if you hash partition a data set
based on a zip code field, where a large percentage of your records are
from one or two zip codes, you can end up with a few partitions
containing most of your records. This behavior can lead to bottlenecks
because some nodes are required to process more records than other
nodes.
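
The zip code skew is easy to reproduce with invented data. In the Python sketch below, 90% of the records share one zip code; zlib.crc32 is again only a stand-in for the real hash function.

```python
from collections import Counter
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records a hash partitioner would send to each node."""
    sizes = Counter()
    for key in keys:
        sizes[zlib.crc32(str(key).encode("utf-8")) % num_partitions] += 1
    return [sizes[n] for n in range(num_partitions)]


# Made-up data: 9,000 of 10,000 records share a single zip code.
zip_codes = ["10001"] * 9000 + [str(20000 + i) for i in range(1000)]
sizes = partition_sizes(zip_codes, 4)
print(sizes)  # one partition holds at least 9,000 of the 10,000 records
```

Whichever node receives the dominant zip code ends up with at least 9,000 records, no matter how good the hash function is.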
        For example, the diagram shows the possible results of hash partitioning
a data set using the field age as the partitioning key. Each record with a
given age is assigned to the same partition, so for example records with age
36, 40, or 22 are assigned to partition 0. The height of each bar represents
the number of records in the partition.
        As you can see, the key values are randomly distributed among the
different partitions. The partition sizes resulting from a hash partitioner
are dependent on the distribution of records in the data set so even though
there are three keys per partition, the number of records per partition
varies widely, because the distribution of ages in the population is nonuniform.
        When hash partitioning, you should select hashing keys that create a large
number of partitions. For example, hashing by the first two digits of a zip
code produces a maximum of 100 partitions. This is not a large number for
a parallel processing system. Instead, you could hash by five digits of the
zip code to create up to 10,000 partitions. You also could combine a zip
code hash with an age hash (assuming a maximum age of 150), to yield
1,500,000 possible partitions.
        Fields that can only assume two values, such as yes/no, true/false,
male/female, are particularly poor choices as hash keys.
        You must define a single primary partitioning key for the hash
partitioner, and you may define as many secondary keys as are required by
your job. Note, however, that each record field can be used only once as a
partitioning key. Therefore, the total number of primary and secondary
partitioning keys must be less than or equal to the total number of fields in
the record. You specify which columns are to act as hash keys on the Partitioning
tab of the stage editor, see “Partitioning Tab” on page 3-21. An
example is shown below. The data type of a partitioning key may be any
data type except raw, subrecord, tagged aggregate, or vector (see
page 2-32 for data types). By default, the hash partitioner does case-sensitive
comparison. This means that uppercase strings appear before
lowercase strings in a partitioned data set. You can override this default if
you want to perform case-insensitive partitioning on string fields.

Hash partitioning
  This section is too long to translate; reading the English is easier.

Modulus
  Partitioning is based on a key column modulo the number of partitions.
This method is similar to hash by field, but involves simpler computation.
In data mining, data is often arranged in buckets, that is, each record has a
tag containing its bucket number. You can use the modulus partitioner to
partition the records according to this number. The modulus partitioner
assigns each record of an input data set to a partition of its output data set
as determined by a specified key field in the input data set. This field can
be the tag field.
  The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions
where:
  • fieldname is a numeric field of the input data set.
  • number_of_partitions is the number of processing nodes on which
the partitioner executes. If a partitioner is executed on three
processing nodes it has three partitions.
In this example, the modulus partitioner partitions a data set containing
ten records. Four processing nodes run the partitioner, and the modulus
partitioner divides the data among four partitions. Each input row holds a
numeric bucket followed by a date. The bucket is specified as the key field,
on which the modulus operation is calculated.
Here is the input data set. Each line represents a row:
64123  1960-03-30
61821  1960-06-27
44919  1961-06-18
22677  1960-09-24
90746  1961-09-15
21870  1960-01-01
87702  1960-12-22
4705   1961-12-13
47330  1961-03-21
88193  1962-03-12
  The following table shows the output data set divided among four partitions
by the modulus partitioner.
Partition 0: (no rows)
Partition 1: 61821 1960-06-27, 22677 1960-09-24, 4705 1961-12-13, 88193 1962-03-12
Partition 2: 21870 1960-01-01, 87702 1960-12-22, 47330 1961-03-21, 90746 1961-09-15
Partition 3: 64123 1960-03-30, 44919 1961-06-18
  Here are three sample modulus operations corresponding to the values of
three of the key fields:
  • 22677 mod 4 = 1; the data is written to Partition 1.
  • 47330 mod 4 = 2; the data is written to Partition 2.
  • 64123 mod 4 = 3; the data is written to Partition 3.
  None of the key fields can be divided evenly by 4, so no data is written to
Partition 0.
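
The worked example above can be checked with a short Python sketch. The function name modulus_partition is made up for this illustration; the rows are the ten input rows listed earlier.

```python
def modulus_partition(rows, num_partitions):
    """partition_number = key field mod number_of_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for bucket, date in rows:
        partitions[bucket % num_partitions].append((bucket, date))
    return partitions


rows = [
    (64123, "1960-03-30"), (61821, "1960-06-27"), (44919, "1961-06-18"),
    (22677, "1960-09-24"), (90746, "1961-09-15"), (21870, "1960-01-01"),
    (87702, "1960-12-22"), (4705, "1961-12-13"), (47330, "1961-03-21"),
    (88193, "1962-03-12"),
]
parts = modulus_partition(rows, 4)
for number, part in enumerate(parts):
    # Within a partition, rows appear in input order.
    print(number, [bucket for bucket, _ in part])
```

Partition 0 comes out empty and the other three partitions hold four, four and two rows, matching the result described above.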
  You define the key on the Partitioning tab (see “Partitioning Tab” on
page 3-21).

Modulus partitioning

Range
  Divides a data set into approximately equal-sized partitions, each of which
contains records with key columns within a specified range. This method
is also useful for ensuring that related records are in the same partition.
  A range partitioner divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
  In order to use a range partitioner, you have to make a range map. You can
do this using the Write Range Map stage, which is described in Chapter 55.
  The range partitioner guarantees that all records with the same partitioning
key values are assigned to the same partition and that the
partitions are approximately equal in size so all nodes perform an equal
amount of work when processing the data set.
  An example of the results of a range partition is shown below. The partitioning
is based on the age key, and the age range for each partition is indicated by
the numbers in each bar. The height of the bar shows the size of the partition.
  All partitions are of approximately the same size. In an ideal distribution,
every partition would be exactly the same size. However, you typically
observe small differences in partition size.
  In order to size the partitions, the range partitioner uses a range map to
calculate partition boundaries. As shown above, the distribution of partitioning
keys is often not even; that is, some partitions contain many
partitioning keys, and others contain relatively few. However, based on
the calculated partition boundaries, the number of records in each partition
is approximately the same.
  Range partitioning is not the only partitioning method that guarantees
equivalent-sized partitions. The random and round robin partitioning
methods also guarantee that the partitions of a data set are equivalent in
size. However, these partitioning methods are keyless; that is, they do not
allow you to control how records of a data set are grouped together within
a partition.
  In order to perform range partitioning your job requires a write range map
stage to calculate the range partition boundaries in addition to the stage
that actually uses the range partitioner. The write range map stage uses a
probabilistic splitting technique to range partition a data set. This technique
is described in Parallel Sorting on a Shared-Nothing Architecture Using
Probabilistic Splitting by DeWitt, Naughton, and Schneider in Query
Processing in Parallel Relational Database Systems by Lu, Ooi, and Tan, IEEE
Computer Society Press, 1994. In order for the stage to determine the partition
boundaries, you pass it a sorted sample of the data set to be range
partitioned. From this sample, the stage can determine the appropriate
partition boundaries for the entire data set. See Chapter 55, “Write
Range Map Stage,” for details.
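
The idea of deriving boundaries from a sorted sample can be sketched in Python. This is a simplification, not the probabilistic splitting technique used by the Write Range Map stage: it picks boundary keys at evenly spaced positions in the sample. The age data and the function names are invented for this example.

```python
import bisect


def make_range_map(sorted_sample, num_partitions):
    """Pick num_partitions - 1 boundary keys at evenly spaced sample positions."""
    n = len(sorted_sample)
    return [sorted_sample[(i * n) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(keys, range_map):
    """Place each key in the partition whose key range contains it."""
    partitions = [[] for _ in range(len(range_map) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(range_map, key)].append(key)
    return partitions


# Made-up, non-uniform ages: 60 people aged 18, 59 aged 19, ... 1 aged 77.
ages = [age for age in range(18, 78) for _ in range(78 - age)]
sample = sorted(ages)[::10]          # a sorted sample of the data set
range_map = make_range_map(sample, 3)
parts = range_partition(ages, range_map)
print(range_map)
print([len(p) for p in parts])       # close to equal despite the skewed ages
```

The young, densely populated age range gets a narrow key interval and the sparse older range a wide one, which is how the partitions end up holding similar numbers of records.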
  When you come to actually partition your data, you specify the range map
to be used by clicking on the property icon next to the Partition type field.
The Partitioning/Collection properties dialog box appears and allows you
to specify a range map (see “Partitioning Tab” on page 3-21 for a description
of the Partitioning tab).

DB2
  Partitions an input data set in the same way that DB2 would partition it.
For example, if you use this method to partition an input data set
containing update information for an existing DB2 table, records are
assigned to the processing node containing the corresponding DB2 record.
Then, during the execution of the parallel operator, both the input record
and the DB2 table record are local to the processing node. Any reads and
writes of the DB2 table would entail no network activity.
  See the DB2 Parallel Edition for AIX, Administration Guide and Reference for
more information on DB2 partitioning.
  To use DB2 partitioning on a stage, select a Partition type of DB2 on the
Partitioning tab, then click the Properties button to the right. In the Partitioning/Collection
properties dialog box, specify the details of the DB2 table whose partitioning
you want to replicate (see “Partitioning Tab” on page 3-21 for a description
of the Partitioning tab).

Auto
  The most common method you will see on the DataStage stages is Auto.
  This just means that you are leaving it to DataStage to determine the best
partitioning method to use depending on the type of stage, and what the
previous stage in the job has done. Typically DataStage would use round
robin when initially partitioning data, and same for the intermediate
stages of a job.
