DataStage Questions and Answers: Partitioning Examples

In this section, we apply the partitioning methodology defined here to several example job flows.

Partitioning example 1: Optimized partitioning

The Aggregator stage only outputs key column and aggregate result columns. To add aggregate columns to every detail row, a Copy stage is used to send the detail rows to an Inner Join and an Aggregator. The output of the Aggregator is sent to the second input of the Join. The standard solution is to Hash partition (and Sort) the inputs to the Join and Aggregator stages as shown in Figure 6-16.

Figure 6-16 Standard Partitioning assignment

However, on closer inspection, the partitioning and sorting of this scenario can be optimized. Because the Join and Aggregator use the same partition keys and sort order, we can move the Hash partition and Sort before the Copy stage, and

apply Same partitioning to the downstream links, as shown in Figure 6-17.

Figure 6-17 Optimized Partitioning assignment

Partitioning example 2: Use of Entire partitioning

In this example, a Transformer is used to extract data from a single header row of an input file. In the Transformer, a new output column is defined on the header and detail links using a single constant value derivation. This column is used as the key for a subsequent Inner Join to attach the header values to every detail row. Using a standard solution, both inputs to the Join are Hash partitioned and sorted on this single join column (either explicitly, or through Auto partitioning). This is depicted in Figure 6-18.

Figure 6-18 Standard partitioning assignment for a Join stage

Although Hash partitioning guarantees correct results for stages that require groupings of related records, it is not always the most efficient solution, depending on the business requirements. Although functionally correct, the solution has one serious limitation. Remembering that the degree of parallel operation is limited by the number of distinct values, the single value join column assigns all rows to a single partition, resulting in sequential processing.

To optimize partitioning, consider that the single header row is really a form of
reference data. An optimized solution is to alter the partitioning for the input links to the Join stage, as depicted in Figure 6-19.

Use round-robin partitioning on the detail input to distribute rows across all partitions evenly.
Use Entire partitioning on the header input to copy the single header row to all partitions.

Figure 6-19 Optimized Partitioning assignment based on business requirements

Because we are joining on a single value, there is no need to pre-sort the input to the Join. We revisit this in the Sorting discussion.

To process a large number of detail records, the link order of the Inner Join is
significant. The Join stage operates by reading a single row from the Left input
and reading all rows from the Right input that match the key values. For this reason, the link order in this example must be set so that the single header row is assigned to the Right input, and the detail rows are assigned to the Left input, as shown in Figure 6-20.

Figure 6-20 Specifying link order in Join stage

If defined in reverse of this order, the Join attempts to read all detail rows from
the right input (because they have the same key column value) into memory.
For advanced users, there is one further detail in this example. Because the Join
waits until it receives an End of Group (new key value) or End of Data (no more
rows on the input dataset) from the Right input, the detail rows in the Left input
buffer to disk to prevent a deadlock. Changing the output derivation on the header row to a series of numbers instead of a constant value establishes the End of Group and prevent buffering to disk.

DataStage Questions and Answers

Tuesday, January 14, 2014

Partitioning Examples

No comments:

Post a Comment