Tuesday, January 14, 2014

Partition methodology in DataStage

Given the numerous options for keyless and keyed partitioning, the following
objectives help to form a methodology for assigning partitioning:
Choose a partitioning method that gives close to an equal number of rows in
each partition and minimizes overhead. This ensures that the processing
workload is evenly balanced, minimizing overall run time.

The partition method must match the business requirements and stage
functional requirements, assigning related records to the same partition if
required.
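To illustrate the first objective, the balance of a partitioning can be checked by comparing per-partition row counts against the ideal even split. This is a minimal Python sketch, not DataStage code; the function name and skew metric are my own:

```python
from collections import Counter

def partition_skew(assignments):
    """Given one partition number per row, return the ratio of the largest
    partition's row count to the ideal (even) count. 1.0 means a perfectly
    balanced workload; higher values mean more skew and a longer run time,
    because the job finishes only when the busiest partition finishes."""
    counts = Counter(assignments)
    ideal = len(assignments) / len(counts)
    return max(counts.values()) / ideal

# A balanced assignment of 8 rows across 4 partitions:
print(partition_skew([0, 1, 2, 3, 0, 1, 2, 3]))  # -> 1.0
# A skewed assignment (partition 0 holds 5 of the 8 rows):
print(partition_skew([0, 0, 0, 0, 0, 1, 2, 3]))  # -> 2.5
```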

Any stage that processes groups of related records (generally using one or
more key columns) must be partitioned using a keyed partition method.
This includes, but is not limited to the following stages:

– Aggregator
– Change Capture
– Change Apply
– Join, Merge
– Remove Duplicates
– Sort

Keyed partitioning might also be necessary for Transformers and BuildOps
that process groups of related records.
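To illustrate why these stages need keyed partitioning, here is a minimal Python sketch of hash partitioning (not DataStage code; the row layout and helper function are hypothetical). Every row with the same key value is routed to the same partition, so a grouping stage sees the whole group:

```python
import zlib

def hash_partition(row, key_columns, num_partitions):
    """Assign a row (a dict) to a partition by hashing its key column values.
    All rows sharing the same key values land in the same partition, which is
    what grouping stages like Aggregator, Join, and Remove Duplicates need."""
    key = '|'.join(str(row[c]) for c in key_columns)
    # zlib.crc32 is stable across runs, unlike Python's built-in hash()
    return zlib.crc32(key.encode()) % num_partitions

orders = [
    {'cust_id': 101, 'amount': 50},
    {'cust_id': 202, 'amount': 75},
    {'cust_id': 101, 'amount': 20},
]
parts = [hash_partition(r, ['cust_id'], 4) for r in orders]
# Both cust_id 101 rows get the same partition number:
print(parts[0] == parts[2])  # -> True
```

Note that hash partitioning balances rows only as well as the key values are distributed; a key with few unique values, or one very frequent value, produces skewed partitions.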

In satisfying the requirements of this second objective, it might not be possible
to choose a partitioning method that gives close to an equal number of rows
in each partition.

Unless the partition distribution is highly skewed, minimize repartitioning,
especially in cluster or grid configurations.
Repartitioning data in a cluster or grid configuration incurs the overhead of
network transport.

The partition method must not be overly complex.
The simplest method that meets these objectives generally is the most efficient and yields the best performance.

Using these objectives as a guide, the following methodology can be applied:
1. Start with Auto partitioning (the default)
2. Specify Hash partitioning for stages that require groups of related records
a. Specify only the key columns that are necessary for correct grouping as
long as the number of unique values is sufficient
b. Use Modulus partitioning if the grouping is on a single integer key column
c. Use Range partitioning if the data is highly skewed and the key column
values and distribution do not change significantly over time (Range Map
can be reused)
3. If grouping is not required, use round-robin partitioning to redistribute data
equally across all partitions
This is especially useful if the input dataset is highly skewed or sequential.
4. Use Same partitioning to optimize end-to-end partitioning and to minimize
repartitioning
Be mindful that Same partitioning retains the degree of parallelism of the
upstream stage.
In a flow, examine upstream partitioning and sort order and attempt to
preserve them for downstream processing. This might require re-examining key
column usage in stages and re-ordering stages in a flow (if business requirements permit).
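The alternatives in steps 2b and 3 can be sketched in plain Python (illustrative only, not DataStage internals): modulus partitioning for a single integer key, and round-robin for even redistribution when no grouping is required:

```python
import itertools

def modulus_partition(key_value, num_partitions):
    """Modulus partitioning for a single integer key column: rows with the
    same key still land in the same partition, without the cost of hashing."""
    return key_value % num_partitions

def round_robin(rows, num_partitions):
    """Round-robin partitioning: deal rows out in turn, ignoring content.
    This guarantees near-equal row counts per partition even when the input
    is sorted or heavily skewed, but it preserves no grouping."""
    counter = itertools.cycle(range(num_partitions))
    return [(next(counter), row) for row in rows]

print(modulus_partition(101, 4))  # -> 1
assigned = round_robin(['a', 'b', 'c', 'd', 'e'], 4)
print([p for p, _ in assigned])  # -> [0, 1, 2, 3, 0]
```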

Across jobs, persistent datasets can be used to retain the partitioning and sort
order. This is particularly useful if downstream jobs are run with the same
degree of parallelism (configuration file) and require the same partition and sort order.
