In this section, we apply the partitioning methodology defined here to several example job flows.
Partitioning example 1: Optimized partitioning
The Aggregator stage only outputs key column and aggregate result columns. To add aggregate columns to every detail row, a Copy stage is used to send the detail rows to an Inner Join and an Aggregator. The output of the Aggregator is sent to the second input of the Join. The standard solution is to Hash partition (and Sort) the inputs to the Join and Aggregator stages as shown in Figure 6-16.
Figure 6-16 Standard Partitioning assignment
However, on closer inspection, the partitioning and sorting of this scenario can be optimized. Because the Join and Aggregator use the same partition keys and sort order, we can move the Hash partition and Sort before the Copy stage, and
apply Same partitioning to the downstream links, as shown in Figure 6-17.
Figure 6-17 Optimized Partitioning assignment
In this example, a Transformer is used to extract data from a single header row of an input file. In the Transformer, a new output column is defined on the header and detail links using a single constant value derivation. This column is used as the key for a subsequent Inner Join to attach the header values to every detail row. Using a standard solution, both inputs to the Join are Hash partitioned and sorted on this single join column (either explicitly, or through Auto partitioning). This is depicted in Figure 6-18.
Figure 6-18 Standard partitioning assignment for a Join stage
To optimize partitioning, consider that the single header row is really a form of
reference data. An optimized solution is to alter the partitioning for the input links to the Join stage, as depicted in Figure 6-19.
- Use round-robin partitioning on the detail input to distribute rows across all partitions evenly.
- Use Entire partitioning on the header input to copy the single header row to all partitions.
Figure 6-19 Optimized Partitioning assignment based on business requirements
To process a large number of detail records, the link order of the Inner Join is
significant. The Join stage operates by reading a single row from the Left input
and reading all rows from the Right input that match the key values. For this reason, the link order in this example must be set so that the single header row is assigned to the Right input, and the detail rows are assigned to the Left input, as shown in Figure 6-20.
Figure 6-20 Specifying link order in Join stage
the right input (because they have the same key column value) into memory.
For advanced users, there is one further detail in this example. Because the Join
waits until it receives an End of Group (new key value) or End of Data (no more
rows on the input dataset) from the Right input, the detail rows in the Left input
buffer to disk to prevent a deadlock. Changing the output derivation on the header row to a series of numbers instead of a constant value establishes the End of Group and prevent buffering to disk.
No comments:
Post a Comment