Tuesday, January 14, 2014

Sorting

Traditionally, the process of sorting data uses one primary key column and, optionally, one or more secondary key columns to generate a sequentially ordered result set. The order of the key columns determines the sequence and groupings in the result set, and each column is specified with an ascending or descending sort order. This is the method SQL databases use for an ORDER BY clause, as illustrated in the following example, which sorts on primary key LName (ascending) and secondary key FName (descending).

The input data is shown in Table 7-1.

Table 7-1 Input data


 After Sorting by LName, FName, the results are as in Table 7-2.

 Table 7-2 Sort results

However, in most cases there is no need to globally sort data to produce a single
sequence of rows. Instead, sorting is most often needed to establish order in
specified groups of data. This sort can be done in parallel.

For example, the Remove Duplicates stage selects either the first or last row
from each group of an input dataset sorted by one or more key columns. Other
stages (for example, Sort Aggregator, Change Capture, Change Apply, Join,
Merge) require pre-sorted groups of related records.

7.1 Partition and sort keys

For parallel sort in DataStage (DS) parallel jobs, the following qualifications apply:

  • Partitioning is used to gather related records, assigning rows with the same key column values to the same partition.
  • Sorting is used to establish group order in each partition, based on one or more key columns.

Note: By definition, when data is re-partitioned, sort order is not maintained.
To restore row order and groupings, a sort is required after repartitioning.
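The note above can be sketched in plain Python (this is an illustration, not DataStage code): two partitions that are each sorted lose their ordering once their rows are redistributed.

```python
# A plain-Python sketch of why repartitioning loses sort order: two
# source partitions are each sorted, but after their combined row
# stream is redistributed round-robin, the new partitions are not.

p0, p1 = [1, 3, 5], [2, 4, 6]          # each partition sorted

# Repartition: rows from both partitions arrive as one interleaved
# stream and are dealt round-robin onto two new partitions.
stream = p0 + p1
new_parts = [[], []]
for i, row in enumerate(stream):
    new_parts[i % 2].append(row)

print(new_parts)  # [[1, 5, 4], [3, 2, 6]] -- order is lost
# A sort after repartitioning restores per-partition order:
print([sorted(p) for p in new_parts])  # [[1, 4, 5], [2, 3, 6]]
```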

In the following example, the input dataset from Table 7-1 on page 115 is
partitioned on the LName and FName columns. Given a 4-node configuration
file, you would see the results depicted in Table 7-3, Table 7-4, Table 7-5, and
Table 7-6 on page 118.





Applying a parallel sort to this partitioned input dataset, using the primary key
column LName (ascending) and secondary key column FName (descending),
would generate the resulting datasets depicted in Table 7-7, Table 7-8, Table 7-9, and Table 7-10.


The partition and sort keys do not have to match. For example, secondary sort keys can be used to establish order in a group for selection with the Remove Duplicates stage (which can specify the first or last duplicate to retain). Say that an input dataset consists of order history based on CustID and OrderDate. Using Remove Duplicates, you can select the most recent order for a given customer.

To satisfy those requirements you could perform the following steps:
1. Partition on CustID to group related records
2. Sort on OrderDate in Descending order
3. Remove Duplicates on CustID, with Duplicate To Retain=First
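The three steps above can be sketched in plain Python (not DataStage code); the data values and the two-partition layout are made up for illustration.

```python
# Partition on CustID, sort on OrderDate descending, and retain the
# first row per CustID group (Duplicate To Retain = First).
from itertools import groupby

orders = [
    {"CustID": 101, "OrderDate": "2013-01-15"},
    {"CustID": 102, "OrderDate": "2013-06-01"},
    {"CustID": 101, "OrderDate": "2013-11-30"},
    {"CustID": 102, "OrderDate": "2013-03-20"},
]

# Step 1: partition on CustID so related records share a partition.
num_parts = 2
parts = [[] for _ in range(num_parts)]
for row in orders:
    parts[row["CustID"] % num_parts].append(row)

# Steps 2-3: in each partition, sort OrderDate descending in each
# CustID group, then keep the first row of each group.
latest = []
for part in parts:
    part.sort(key=lambda r: r["OrderDate"], reverse=True)  # secondary key
    part.sort(key=lambda r: r["CustID"])                   # primary key (stable)
    for cust_id, group in groupby(part, key=lambda r: r["CustID"]):
        latest.append(next(group))  # most recent order per customer

print({r["CustID"]: r["OrderDate"] for r in latest})
# {102: '2013-06-01', 101: '2013-11-30'}
```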

7.2 Complete (Total) sort

If a single, sequentially ordered result is needed, it is generally best to use a two-step process:
1. Partition and parallel Sort on key columns.
2. Use a Sort Merge collector on these same key columns to generate a
sequential, ordered result set.
This is similar to the way parallel database engines perform their parallel sort
operations.
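The two-step process can be sketched in plain Python: sort each partition independently (the parallel sort), then perform a k-way merge on the same keys, which is what the Sort Merge collector does.

```python
# Step 1: parallel sort per partition; Step 2: merge sorted partitions
# into one sequential, ordered result set.
import heapq

partitions = [["Ford", "Audi"], ["BMW", "Volvo"], ["Chevy", "Dodge"]]

# Step 1: each partition is sorted on the key columns independently.
for p in partitions:
    p.sort()

# Step 2: Sort Merge collector -- k-way merge of the sorted partitions.
result = list(heapq.merge(*partitions))
print(result)  # ['Audi', 'BMW', 'Chevy', 'Dodge', 'Ford', 'Volvo']
```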

7.3 Link sort and Sort stage

DataStage provides two methods for parallel sorts: 

  • Standalone Sort stage: used when the stage's execution mode is set to Parallel.
  • Sort on a link: used when specifying a keyed input partitioning method.

By default, both methods use the same internal sort package (the tsort operator).

The Link sort offers fewer options, but is easier to maintain in a DataStage job, as there are fewer stages on the design canvas. The Standalone sort offers more options, but as a separate stage makes job maintenance more complicated.

In general, use the Link sort unless a specific option is needed on the stand-alone stage. Most often, the standalone Sort stage is used to specify the Sort Key mode for partial sorts.

7.3.1 Link sort

Sorting on a link is specified on the Input/Partitioning stage options, when
specifying a keyed partitioning method. (Sorting on a link is not available with
Auto partitioning, although the DS parallel framework might insert a Sort if
required). When specifying key columns for partitioning, the Perform Sort option
is checked. In the Designer canvas, links that have sort defined have a Sort icon
in addition to the Partitioning icon, as shown in Figure 7-1.

Figure 7-1 Link Sort Icon

Additional properties can be specified by right-clicking the key column, as shown

in Figure 7-2.


Figure 7-2 Specifying Link Sort options

Key column options let the developer specify the following:
  • Key column usage: sorting, partitioning, or both
  • Sort direction: Ascending or Descending
  • Case sensitivity (strings)
  • Sorting character set: ASCII (default) or EBCDIC (strings)
  • Position of nulls in the result set (for nullable columns)

7.3.2 Sort stage

The standalone Sort stage offers more options than the sort on a link, as depicted in Figure 7-3.


Figure 7-3 Sort stage options

Specifically, the following properties are not available when sorting on a link:

  • Sort Key Mode (a particularly important performance optimization)
  • Create Cluster Key Change Column
  • Create Key Change Column
  • Output Statistics
  • Sort Utility (do not change this)
  • Restrict Memory Usage

Of the options only available in the standalone Sort stage, the Sort Key Mode is
most frequently used.

 Note: The Sort Utility option is an artifact of previous releases. Always specify
the DataStage Sort Utility, which is significantly faster than a UNIX sort.

7.4 Stable sort

Stable sorts preserve the order of non-key columns in each sort group. This
requires additional overhead in the sort algorithm, and thus a stable sort is
generally slower than a non-stable sort for the same input dataset and sort keys.

For this reason, disable Stable sort unless needed.

By default, the Stable sort option is disabled for sorts on a link and enabled with
the standalone Sort stage.
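Stable-sort semantics are easy to demonstrate in plain Python (whose built-in sort is always stable): rows with equal key values keep their input order.

```python
# Stable sort: the relative order of rows with equal keys is preserved.

rows = [("Smith", "Bob"), ("Jones", "Amy"), ("Smith", "Ann")]

stable = sorted(rows, key=lambda r: r[0])  # sort on LName only
print(stable)
# [('Jones', 'Amy'), ('Smith', 'Bob'), ('Smith', 'Ann')]
# "Bob" still precedes "Ann" because that was their input order; a
# non-stable sort would be free to emit the two Smith rows either way.
```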

7.5 Subsorts

In the standalone Sort stage, the key column property, Sort Key Mode, is a
particularly powerful feature and a significant performance optimizer. It is used
when resorting a sub-grouping of a previously sorted input dataset, instead of
performing a complete sort. This subsort uses significantly less disk space and
CPU resource, and can often be performed in memory (depending on the size of
the new subsort groups).

To resort based on a sub-group, all key columns must still be defined in the Sort stage. Re-used sort keys are specified with the “Do not Sort (Previously Sorted)” property. New sort keys are specified with the Sort Key Mode property, as shown in Figure 7-4.


Figure 7-4 Sort Key Mode property

To perform a sub-sort, keys with the “Do not Sort (Previously Sorted)” property
must be at the top of the list, without gaps between them. The key column order for these keys must match the key columns and order defined in the
previously-sorted input dataset.

If the input data does not match the key column definition for a sub-sort, the job aborts.
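A subsort can be sketched in plain Python: the input is already sorted on the primary key (LName), so only each LName group is re-sorted on the new key (FName), with no full re-sort of the dataset.

```python
# Subsort: re-sort only within groups of a previously sorted dataset.
from itertools import groupby

# Previously sorted on LName (ascending).
rows = [("Jones", "Zoe"), ("Jones", "Amy"), ("Smith", "Bob"), ("Smith", "Ann")]

resorted = []
for lname, group in groupby(rows, key=lambda r: r[0]):
    # LName: "Do not Sort (Previously Sorted)"; FName: sorted per group.
    resorted.extend(sorted(group, key=lambda r: r[1]))

print(resorted)
# [('Jones', 'Amy'), ('Jones', 'Zoe'), ('Smith', 'Ann'), ('Smith', 'Bob')]
```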

7.6 Automatically-inserted sorts

By default, the parallel framework inserts sort operators as necessary to ensure
correct results. The parallel job score (see Appendix E, “Understanding the
parallel job score” on page 401) can be used to identify automatically-inserted
sorts, as shown in Figure 7-5.

Figure 7-5 Inserted Sort operator

Typically, the parallel framework inserts sorts before any stage that requires
matched key values or ordered groupings (Join, Merge, Remove Duplicates, Sort
Aggregator). Sorts are only inserted automatically when the flow developer has
not explicitly defined an input sort.

Though they ensure correct results, inserted sorts can have a significant performance impact if they are not necessary. There are two ways to prevent the parallel framework from inserting an unnecessary sort:


  • Insert an upstream Sort stage on each link, defining all sort key columns with the “Do not Sort (Previously Sorted)” Sort Key Mode property.
  • Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This verifies the sort order but does not perform a sort, aborting the job if the data is not in the required sort order.

Revisiting the partitioning examples in 6.4, “Partitioning examples” on page 108,
the environment variable $APT_SORT_INSERTION_CHECK_ONLY must be set
to prevent the DS parallel framework from inserting unnecessary sorts before the Join stage.

7.7 Sort methodology

Using the rules and behavior outlined in the previous section, the following
methodology must be applied when sorting in a parallel data flow:

1. Start with a link sort.
2. Specify only necessary key columns.
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not
available on a Link sort:

– Sort Key Mode, Create Cluster Key Change Column, Create Key Change
Column, Output Statistics
– Always specify the DataStage Sort Utility for standalone Sort stages
– Use the “Do not Sort (Previously Sorted)” Sort Key Mode to resort a
sub-group of a previously-sorted input dataset

5. Be aware of automatically-inserted sorts. Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.

6. Minimize the use of sorts in a job flow.

7. To generate a single, sequentially ordered result set, use a parallel Sort and a Sort Merge collector.

7.8 Tuning sort

Sort is a particularly expensive task which requires CPU, memory, and disk
resources.

To perform a sort, rows in the input dataset are read into a memory buffer on
each partition. If the sort operation can be performed in memory (as is often the
case with a sub-sort) no disk I/O is performed.

By default, each sort uses 20 MB of memory per partition for its memory buffer. This value can be changed for each standalone Sort stage using the Restrict Memory Usage option (the minimum is 1 MB per partition). On a global basis, the APT_TSORT_STRESS_BLOCKSIZE environment variable can be used to specify the size of the memory buffer, in MB, for all sort operators (link and standalone), overriding any per-sort specifications.

If the input dataset cannot fit into the sort memory buffer, results are temporarily spooled to disk in the following order:

1. Scratch disks defined in the current configuration file (APT_CONFIG_FILE) in a disk pool named "sort"
2. Scratch disks defined in the current configuration file's default disk pool
3. The default directory specified by the environment variable TMPDIR
4. The directory /tmp (on UNIX) or C:/TMP (on Windows), if available
The file system configuration and the number of scratch disks defined in the parallel configuration file can impact the I/O performance of a parallel sort. Having a greater number of scratch disks for each node allows the sort to spread I/O across multiple file systems.
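As an illustrative sketch (the node name, host name, and paths below are assumptions, not from the source), a configuration file entry that places a dedicated scratch disk in the "sort" pool might look like this:

```
{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/ds" {pools ""}
    resource scratchdisk "/scratch_sort" {pools "sort"}
    resource scratchdisk "/scratch" {pools ""}
  }
}
```

With this layout, sort spill goes to /scratch_sort first (the "sort" pool), falling back to /scratch (the default pool) per the spool order listed above.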

7.8.1 Sorts and variable-length fields

Although it is usually recommended to define the maximum length of variable-length fields, there are situations where it is better to leave their lengths unbound.

The parallel framework always allocates the space equivalent to their maximum
specified lengths. If most values are much shorter than their maximum lengths,
there will be a large amount of unused space being moved around between
operators as well as to/from datasets and fixed format files. That happens, for
instance, when an address field is defined as "varchar(500)" but most addresses
are 30 characters long.

This severely impacts the performance of sort operations: the more unused
bytes a record holds, the more unnecessary data is moved to and from scratch
space.

In those situations, it is better to leave the length of those fields unspecified, as the records then allocate only the exact space needed to hold the field values. This rule must be applied judiciously, but it may result in great performance gains.


Partitioning Examples

In this section, we apply the partitioning methodology to several example job flows.

Partitioning example 1: Optimized partitioning

The Aggregator stage only outputs key column and aggregate result columns. To add aggregate columns to every detail row, a Copy stage is used to send the detail rows to an Inner Join and an Aggregator. The output of the Aggregator is sent to the second input of the Join. The standard solution is to Hash partition (and Sort) the inputs to the Join and Aggregator stages as shown in Figure 6-16.

Figure 6-16 Standard Partitioning assignment

However, on closer inspection, the partitioning and sorting of this scenario can be optimized. Because the Join and Aggregator use the same partition keys and sort order, we can move the Hash partition and Sort before the Copy stage, and
apply Same partitioning to the downstream links, as shown in Figure 6-17.

Figure 6-17 Optimized Partitioning assignment

Partitioning example 2: Use of Entire partitioning

In this example, a Transformer is used to extract data from a single header row of an input file. In the Transformer, a new output column is defined on the header and detail links using a single constant value derivation. This column is used as the key for a subsequent Inner Join to attach the header values to every detail row. Using a standard solution, both inputs to the Join are Hash partitioned and sorted on this single join column (either explicitly, or through Auto partitioning). This is depicted in Figure 6-18.



Figure 6-18 Standard partitioning assignment for a Join stage

Although Hash partitioning guarantees correct results for stages that require groupings of related records, it is not always the most efficient solution, depending on the business requirements. While functionally correct, this solution has one serious limitation: because the degree of parallel operation is limited by the number of distinct key values, the single-value join column assigns all rows to a single partition, resulting in sequential processing.

To optimize partitioning, consider that the single header row is really a form of
reference data. An optimized solution is to alter the partitioning for the input links to the Join stage, as depicted in Figure 6-19.

  • Use round-robin partitioning on the detail input to distribute rows across all partitions evenly.
  • Use Entire partitioning on the header input to copy the single header row to all partitions.

Figure 6-19 Optimized Partitioning assignment based on business requirements

Because we are joining on a single value, there is no need to pre-sort the input to the Join. We revisit this in the Sorting discussion.
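The optimized layout can be sketched in plain Python (not DataStage code; column names and values are made up for illustration): the single header row is copied to every partition (Entire), detail rows are spread round-robin, and each partition joins locally with no sort.

```python
# Entire partitioning for the header, round-robin for the details,
# then a local inner join within each partition.

header = {"join_key": 1, "file_date": "2014-01-14"}
details = [{"join_key": 1, "amt": a} for a in (10, 20, 30, 40)]

num_parts = 2
detail_parts = [details[i::num_parts] for i in range(num_parts)]  # round-robin
header_parts = [header] * num_parts                               # Entire

joined = []
for hdr, part in zip(header_parts, detail_parts):
    for row in part:
        if row["join_key"] == hdr["join_key"]:  # inner join, per partition
            joined.append({**row, "file_date": hdr["file_date"]})

print(len(joined))  # 4 -- every detail row carries the header value
```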

To process a large number of detail records, the link order of the Inner Join is
significant. The Join stage operates by reading a single row from the Left input
and reading all rows from the Right input that match the key values. For this reason, the link order in this example must be set so that the single header row is assigned to the Right input, and the detail rows are assigned to the Left input, as shown in Figure 6-20.


Figure 6-20 Specifying link order in Join stage

If defined in the reverse order, the Join attempts to read all detail rows from the Right input (because they have the same key column value) into memory.

For advanced users, there is one further detail in this example. Because the Join waits until it receives an End of Group (new key value) or End of Data (no more rows on the input dataset) from the Right input, the detail rows in the Left input are buffered to disk to prevent a deadlock. Changing the output derivation on the header row to a series of numbers instead of a constant value establishes the End of Group and prevents buffering to disk.

Partition methodology in DataStage

Given the numerous options for keyless and keyed partitioning, the following objectives help to form a methodology for assigning partitioning:

Choose a partitioning method that gives close to an equal number of rows in each partition, and that minimizes overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.

The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required.

Any stage that processes groups of related records (generally using one or
more key columns) must be partitioned using a keyed partition method.
This includes, but is not limited to the following stages:

– Aggregator
– Change Capture
– Change Apply
– Join, Merge
– Remove Duplicates
– Sort

It might also be necessary for Transformers and BuildOps that process
groups of related records.

In satisfying the requirements of this second objective, it might not be possible
to choose a partitioning method that gives close to an equal number of rows
in each partition.

Unless partition distribution is highly skewed, minimize repartitioning, especially in cluster or grid configurations. Repartitioning data in a cluster or grid configuration incurs the overhead of network transport.

The partition method must not be overly complex. The simplest method that meets these objectives is generally the most efficient and yields the best performance.

Using these objectives as a guide, the following methodology can be applied:
1. Start with Auto partitioning (the default)
2. Specify Hash partitioning for stages that require groups of related records
a. Specify only the key columns that are necessary for correct grouping as
long as the number of unique values is sufficient
b. Use Modulus partitioning if the grouping is on a single integer key column
c. Use Range partitioning if the data is highly skewed and the key column
values and distribution do not change significantly over time (Range Map
can be reused)
3. If grouping is not required, use round-robin partitioning to redistribute data
equally across all partitions
This is especially useful if the input dataset is highly skewed or sequential
4. Use Same partitioning to optimize end-to-end partitioning and to minimize
repartitioning
Be mindful that Same partitioning retains the degree of parallelism of the
upstream stage
In a flow, examine up-stream partitioning and sort order and attempt to preserve them for down-stream processing. This might require re-examining key column usage in stages and re-ordering stages in a flow (if business requirements permit).

Across jobs, persistent datasets can be used to retain the partitioning and sort
order. This is particularly useful if downstream jobs are run with the same
degree of parallelism (configuration file) and require the same partition and sort order.
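The partitioning methods named in the methodology can be sketched in plain Python (these are illustrations of the assignment rules, not DataStage internals):

```python
# Hash, modulus, and round-robin partition assignment sketches.

num_parts = 4

def hash_partition(key, num_parts=num_parts):
    # Keyed: rows with the same key value always share a partition.
    # (A real engine uses a stable hash; Python's hash() of an int is
    # the int itself, which keeps this sketch deterministic.)
    return hash(key) % num_parts

def modulus_partition(int_key, num_parts=num_parts):
    # Keyed: cheaper than hashing when the key is a single integer.
    return int_key % num_parts

def round_robin_partition(row_number, num_parts=num_parts):
    # Keyless: spreads rows evenly regardless of content.
    return row_number % num_parts

keys = [101, 102, 101, 103]
print([modulus_partition(k) for k in keys])  # [1, 2, 1, 3]
# Both rows with key 101 land in the same partition, as grouping requires.
```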

Fatal Error: Fatal: Right truncation of string data : RDBMS Code 3996

We faced this issue while testing our jobs in DataStage 8.5 during the Teradata 13 to 14 upgrade.

Explanation: The user tried to assign a longer string to a shorter string destination. This error can also occur if the length of one or more range boundaries in a RANGE_N function is a string literal with a length greater than that of the test value.

But the above explanation did not help us, since the SQL in question ran successfully on Teradata 13; moreover, it also ran fine on Teradata 14, but only when we ran it manually using Teradata SQL Assistant.

The problem at our end:

We store the year as 2014.00 (a DECIMAL data type) in a table and then convert it to 2014 as a CHAR. We used (CAST(Year AS CHAR(4)) Year) to convert the decimal to char directly. This returned 2014 on Teradata 13, but on Teradata 14 it failed with the above error. Changing it to CHAR(5) lets the SQL run successfully, but the result is "2014." (with a trailing dot, which we do not want).

So we changed the SQL to first convert the decimal to an integer, and then the integer to char, as below: (CAST(CAST(Year AS INTEGER) AS CHAR(4)) Year). This SQL ran successfully and gave us the result we expected.