Loading...
Five strategies for writing a proper join.

Tech2Tech

Hands On

Data Connections

Five strategies for writing a proper join.

The best way to access data on a single table on a Teradata Database is to use the primary index, followed by secondary indexes. However, once you begin joining tables together, queries become more complicated, and proper joining is critical. Poorly written joins can be costly in terms of response time and performance.

When a query is submitted, the Teradata optimizer will choose the best join strategy based on the query join type and constraint (WHERE or ON clause). Because using an index is preferred whenever possible, different strategies can be used to access data, and the primary and secondary indexes play a significant role in the join strategies. Although the user can help influence the strategy, the optimizer is the definitive authority. These join strategies are discussed below.

1. Merge

There are four merge join strategies:


Co-locate

The co-located strategy utilizes the primary index on both tables to complete the join. It can be used only if the join columns and primary index columns (or partitioned primary indexes) are the same on both tables within the join.

In this case, no spool space is required, because the data resides on the same AMPs. No redistribution is necessary, as the join is AMP-local. The optimizer will perform a merge join using a row hash match scan. This is the fastest type of join because no data must be moved.


Redistribute One of the Tables

This merge strategy can be used if one of the join columns is a primary index and the other is a non-indexed column.

With this approach, data must be redistributed, and the optimizer has a couple of redistribution options to consider. First, it can duplicate the entire smaller table on all AMPs. Second, it can leave the table that is using the primary index column alone and move the rows from the non-indexed column’s table in spool space.


Redistribute Both of the Tables

The optimizer will utilize this option when the join is performed on neither table’s primary index column. The data from both tables is redistributed into spool and sorted by hash code.

Because data from both tables must be hashed and redistributed, this process is both time-consuming and inefficient.

In order to redistribute the data, the values of the non-indexed columns are hashed. Once hashed, the rows are moved to spool onto the appropriate AMPs, and then the rows are physically joined.


Create a Copy of the Smaller Table in Spool

The fourth merge strategy is implemented when one of the tables being joined is small enough to justify using the strategy. If so, the optimizer will take the smallest table in the join and duplicate it across all of the AMPs. This strategy has the potential to be very efficient.

When the optimizer determines that the table is small enough, it will create a copy of the table within the spool space on each AMP. After the table has been copied to all AMPs, the join can easily be completed, because the data for both tables coexists on each AMP.

Collecting statistics on the columns of tables being joined will help influence this join strategy.

2. Hash

A hash join can be performed only if one or both of the join tables can fit into the AMP’s memory, rather than spool. Disk operations are always more time-intensive than memory operations because they require many physical read/write operations.

The AMP will always try to fit the smallest table into memory. Once the table has been selected for memory insertion, the join column value of the smaller table will be hashed and sorted. The column from the larger table will then be used to perform the match against the rows of the smaller table, which are stored in the AMP’s memory. The hash join is completed quickly because the larger table does not need to be sorted or redistributed into spool.

Join Types

As of Teradata 13, you can join up to 128 tables in a single SQL statement. The optimizer will analyze the user’s query and review the join types within the query. Here are the different types of joins:

  • Inner: Returns the matched rows from both tables. No unmatched rows are returned.
  • Left outer: Returns the matched rows from both tables. It also returns any unmatched rows from the table on the left side of the join.
  • Right outer: Returns the matched rows from both tables. It also returns any unmatched rows from the table on the right side of the join.
  • Full outer: Returns both the matched and unmatched rows from both tables.
  • Self: Compares rows inside the table against other rows inside the same table.
  • Cartesian: Compares each row in the first table to each row in the second table, resulting in the return of all rows in the first table multiplied by the number of rows in the second table.

—S.W. and E.R

3. Nested

A nested join uses either a unique primary index or a unique secondary index from one of the join columns in order to retrieve a single row. Using this type of join requires a WHERE clause, with the indexed column’s value set equal to a particular value, in order to limit the data to a single row.

The nested join approach is extremely efficient because the optimizer can easily find the WHERE clause information as it is indexed and also because it has to redistribute only a single row across all AMPs. Once the row has been redistributed into all AMPs’ spool space, the joins can complete and the data can be returned.

4. Exclusion

This type of join strategy is either a merge or product join strategy. An exclusion join is used when you are trying to discover what rows do not match with a row in another table.

Unlike the first three joins, the purpose of an exclusion join is to prevent certain data from returning. It can be used only if the following operators are used in the query: NOT IN, EXCEPT or MINUS. The exclusion join will always result in a full-table scan.

5. Cartesian and Product

These joins compare each row in one table with each row in another table because of the lack of a join equality condition. Almost always, this type of join is the result of an erroneously written query.

To avoid these types of joins, make sure your query is using an equality condition, such as E.DEPT = D.DEPT. If a query joins N tables, there must be N-1 join conditions.


Users Have Influence

The optimizer will analyze the users’ SQL statement and determine the best join plan to successfully complete the query. However, users do have some influence on the join strategy. Simple techniques such as employing primary and secondary indexes efficiently and collecting stats on columns used in the WHERE and ON clauses will provide the optimizer key pieces of information to ensure that the optimal plan is in place to complete the SQL request. Finally, use the EXPLAIN facility for your query plans.


Your Comment:
  
Your Rating:

Comments
 
Thanks for Excellent stuff..

6/7/2011 9:20:01 AM
— Anonymous
 
OK, Cartesian product is bad. But there cases where it is intentional and valid. How is it done by Teradata? Smaller table duplicated on all AMPS, correct?

12/15/2010 12:05:47 PM
— Anonymous
 
Well covered theory part, would be great if added some examples

9/7/2010 2:34:51 AM
— Anonymous
 
Excellent post! Lack of examples ...

9/4/2010 1:27:08 AM
— Anonymous