DataStage Questions and Answers: Sorting

Traditionally, the process of sorting data uses one primary key column and, optionally, one or more secondary key columns to generate a sequential ordered result set. The order of key columns determines the sequence and groupings in the result set. Each column is specified with an ascending or descending sort order. This is the method the SQL databases use for an ORDER BY clause, as illustrated in the following example, sorting on primary key LName (ascending), secondary key FName (descending).

The input data is shown in Table 7-1.

Table 7-1 Input data

After Sorting by LName, FName, the results are as in Table 7-2.

Table 7-2 Sort results

However, in most cases there is no need to globally sort data to produce a single
sequence of rows. Instead, sorting is most often needed to establish order in
specified groups of data. This sort can be done in parallel.

For example, the Remove Duplicates stage selects either the first or last row
from each group of an input dataset sorted by one or more key columns. Other
stages (for example, Sort Aggregator, Change Capture, Change Apply, Join,

Merge) require pre-sorted groups of related records.

7.1 Partition and sort keys

For parallel sort in DataStage (DS) parallel jobs, the following qualifications
apply: Partitioning is used to gather related records, assigning rows with the same key column values to the same partition.

Sorting is used to establish group order in each partition, based on one or
more key columns.

Note: By definition, when data is re-partitioned, sort order is not maintained.
To restore row order and groupings, a sort is required after repartitioning.

In the following example, the input dataset from Table 7-1 on page 115 is
partitioned on the LName and FName columns. Given a 4-node configuration
file, you would see the results depicted in Table 7-3, Table 7-4, Table 7-5, and
Table 7-6 on page 118.

Applying a parallel sort to this partitioned input dataset, using the primary key
column LName (ascending) and secondary key column FName (descending),
would generate the resulting datasets depicted in Table 7-7, Table 7-8, Table 7-9, and Table 7-10.

The partition and sort keys do not have to match. For example, secondary sort
keys can be used to establish order in a group for selection with the Remove
Duplicates stage (which can specify first or last duplicate to retain). Say that an
input dataset consists of order history based on CustID and Order Date. Using
Remove Duplicates, you can select the most recent order for a given customer.

To satisfy those requirements you could perform the following steps:
1. Partition on CustID to group related records
2. Sort on OrderDate in Descending order
3. Remove Duplicates on CustID, with Duplicate To Retain=First

7.2 Complete (Total) sort

If a single, sequential ordered result is needed, in general it is best to use a two
step process:
1. Partition and parallel Sort on key columns.
2. Use a Sort Merge collector on these same key columns to generate a
sequential, ordered result set.
This is similar to the way parallel database engines perform their parallel sort
operations.

7.3 Link sort and Sort stage

DataStage provides two methods for parallel sorts:

Standalone Sort stage This is used when execution mode is set to Parallel.
Sort on a link This is used when using a keyed input partitioning method.

By default, both methods use the same internal sort package (the tsort operator).

The Link sort offers fewer options, but is easier to maintain in a DataStage job, as there are fewer stages on the design canvas. The Standalone sort offers more options, but as a separate stage makes job maintenance more complicated.

In general, use the Link sort unless a specific option is needed on the
stand-alone stage. Most often, the standalone Sort stage is used to specify the

Sort Key mode for partial sorts.

7.3.1 Link sort

Sorting on a link is specified on the Input/Partitioning stage options, when
specifying a keyed partitioning method. (Sorting on a link is not available with
Auto partitioning, although the DS parallel framework might insert a Sort if
required). When specifying key columns for partitioning, the Perform Sort option
is checked. In the Designer canvas, links that have sort defined have a Sort icon
in addition to the Partitioning icon, as shown in Figure 7-1.

Figure 7-1 Link Sort Icon

Additional properties can be specified by right-clicking the key column, as shown

in Figure 7-2.

Figure 7-2 Specifying Link Sort options

Key column options let the developer specify the following options:
Key column usage: sorting, partitioning, or both
Sort direction: Ascending or Descending
Case sensitivity (strings)
Sorting character set: ASCII (default) or EBCDIC (strings)
Position of nulls in the result set (for nullable columns)

7.3.2 Sort stage

The standalone Sort stage offers more options than the sort on a link, as

depicted in Figure 7-3.

Figure 7-3 Sort stage options

Specifically, the following properties are not available when sorting on a link:

Sort Key Mode (a particularly important performance optimization)
Create Cluster Key Change Column
Create Key Change Column
Output Statistics
Sort Utility (do not change this)
Restrict Memory Usage

Of the options only available in the standalone Sort stage, the Sort Key Mode is

most frequently used.

Note: The Sort Utility option is an artifact of previous releases. Always specify
the DataStage Sort Utility, which is significantly faster than a UNIX sort.

7.4 Stable sort

Stable sorts preserve the order of non-key columns in each sort group. This
requires additional overhead in the sort algorithm, and thus a stable sort is
generally slower than a non-stable sort for the same input dataset and sort keys.

For this reason, disable Stable sort unless needed.

By default, the Stable sort option is disabled for sorts on a link and enabled with
the standalone Sort stage.

7.5 Subsorts

In the standalone Sort stage, the key column property, Sort Key Mode, is a
particularly powerful feature and a significant performance optimizer. It is used
when resorting a sub-grouping of a previously sorted input dataset, instead of
performing a complete sort. This subsort uses significantly less disk space and
CPU resource, and can often be performed in memory (depending on the size of
the new subsort groups).

To resort based on a sub-group, all key columns must still be defined in the Sort
stage. Re-used sort keys are specified with the “Do not Sort (Previously Sorted)”
property. New sort keys are specified with the Sort Key Mode property, as shown

in Figure 7-4.

Figure 7-4 Sort Key Mode property

To perform a sub-sort, keys with the “Do not Sort (Previously Sorted)” property
must be at the top of the list, without gaps between them. The key column order for these keys must match the key columns and order defined in the
previously-sorted input dataset.

If the input data does not match the key column definition for a sub-sort, the job aborts.

7.6 Automatically-inserted sorts

By default, the parallel framework inserts sort operators as necessary to ensure
correct results. The parallel job score (see Appendix E, “Understanding the
parallel job score” on page 401) can be used to identify automatically-inserted
sorts, as shown in Figure 7-5.

Figure 7-5 Inserted Sort operator

Typically, the parallel framework inserts sorts before any stage that requires
matched key values or ordered groupings (Join, Merge, Remove Duplicates, Sort
Aggregator). Sorts are only inserted automatically when the flow developer has
not explicitly defined an input sort.

Though ensuring correct results, inserted sorts can be a significant performance
impact if they are not necessary. There are two ways to prevent the parallel
framework from inserting an un-necessary sort:

Insert an upstream Sort stage on each link, define all sort key columns with the “Do not Sort (Previously Sorted)” Sort Mode key property.
Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This verifies sort order but does not perform a sort, aborting the job if data is not in the required sort order.

Revisiting the partitioning examples in 6.4, “Partitioning examples” on page 108,
the environment variable $APT_SORT_INSERTION_CHECK_ONLY must be set
to prevent the DS parallel framework from inserting unnecessary sorts before the Join stage.

7.7 Sort methodology

Using the rules and behavior outlined in the previous section, the following
methodology must be applied when sorting in a parallel data flow:

1. Start with a link sort.
2. Specify only necessary key columns.
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not
available on a Link sort:

– Sort Key Mode, Create Cluster Key Change Column, Create Key Change
Column, Output Statistics
– Always specify the DataStage Sort Utility for standalone Sort stages
– Use the “Do not Sort (Previously Sorted)” Sort Key Mode to resort a
sub-group of a previously-sorted input dataset

5. Be aware of automatically-inserted sorts.
Set $APT_SORT_INSERTION_CHECK_ONLY to verify but do not establish
required sort order.

6. Minimize the use of sorts in a job flow.

7. To generate a single, sequential ordered result set use a parallel Sort and a
Sort Merge collector

7.8 Tuning sort

Sort is a particularly expensive task which requires CPU, memory, and disk
resources.

To perform a sort, rows in the input dataset are read into a memory buffer on
each partition. If the sort operation can be performed in memory (as is often the
case with a sub-sort) no disk I/O is performed.

By default, each sort uses 20 MB of memory per partition for its memory buffer.
This value can be changed for each standalone Sort stage using the Restrict
Memory Usage option (the minimum is 1 MB/partition). On a global basis, the
APT_TSORT_STRESS_BLOCKSIZE environment variable can be use to
specify the size of the memory buffer, in MB, for all sort operators (link and

standalone), overriding any per-sort specifications.

If the input dataset cannot fit into the sort memory buffer, results are temporarily spooled to disk in the following order:

1. Scratch disks defined in the current configuration file (APT_CONFIG_FILE) in
the sort named disk pool
2. Scratch disks defined in the current configuration file default disk pool
3. The default directory specified by the environment variable TMPDIR
4. The directory /tmp (on UNIX) or C:/TMP (on Windows) if available
The file system configuration and number of scratch disks defined in parallel
configuration file can impact the I/O performance of a parallel sort. Having a
greater number of scratch disks for each node allows the sort to spread I/O
across multiple file systems.

7.8.1 Sorts and variable-length fields

Although it is usually recommended to define the maximum length of variable
length fields, there are situations when it is better to leave their lengths unbound.

The parallel framework always allocates the space equivalent to their maximum
specified lengths. If most values are much shorter than their maximum lengths,
there will be a large amount of unused space being moved around between
operators as well as to/from datasets and fixed format files. That happens, for
instance, when an address field is defined as "varchar(500)" but most addresses
are 30 characters long.

This severely impacts the performance of sort operations: the more unused
bytes a record holds, the more unnecessary data is moved to and from scratch
space.

In those situations, it is better to leave the length of those fields unspecified, as
the records will only allocate the exact space to hold the field values.
This rule must be applied judiciously, but it may result in great performance

gains.

DataStage Questions and Answers

Tuesday, January 14, 2014

Sorting

No comments:

Post a Comment