Tech2Tech
Hands On
Data Connections
Five strategies for writing a proper join.
by Steve Wilmes and Eric Rivard
The best way to access data on a single table on a Teradata Database is to use the primary index, followed by secondary indexes. However, once you begin joining tables together, queries become more complicated, and proper joining is critical. Poorly written joins can be costly in terms of response time and performance.
When a query is submitted, the Teradata optimizer will choose the best join strategy based on the query join type and constraint (WHERE or ON clause). Because using an index is preferred whenever possible, different strategies can be used to access data, and the primary and secondary indexes play a significant role in the join strategies. Although the user can help influence the strategy, the optimizer is the definitive authority. These join strategies are discussed below.
1. Merge
There are four merge join strategies:
Co-locate
The co-located strategy utilizes the primary index on both tables to complete the join. It can be used only if the join columns and primary index columns (or partitioned primary indexes) are the same on both tables within the join.
In this case, no spool space is required, because the data resides on the same AMPs. No redistribution is necessary, as the join is AMP-local. The optimizer will perform a merge join using a row hash match scan. This is the fastest type of join because no data must be moved.
Redistribute One of the Tables
This merge strategy can be used if one of the join columns is a primary index and the other is a non-indexed column.
With this approach, data must be redistributed, and the optimizer has a couple of redistribution options to consider. First, it can duplicate the entire smaller table on all AMPs. Second, it can leave the table that is using the primary index column alone and move the rows from the non-indexed column’s table in spool space.
Redistribute Both of the Tables
The optimizer will utilize this option when the join is performed on neither table’s primary index column. The data from both tables is redistributed into spool and sorted by hash code.
Because data from both tables must be hashed and redistributed, this process is both time-consuming and inefficient.
In order to redistribute the data, the values of the non-indexed columns are hashed. Once hashed, the rows are moved to spool onto the appropriate AMPs, and then the rows are physically joined.
Create a Copy of the Smaller Table in Spool
The fourth merge strategy is implemented when one of the tables being joined is small enough to justify using the strategy. If so, the optimizer will take the smallest table in the join and duplicate it across all of the AMPs. This strategy has the potential to be very efficient.
When the optimizer determines that the table is small enough, it will create a copy of the table within the spool space on each AMP. After the table has been copied to all AMPs, the join can easily be completed, because the data for both tables coexists on each AMP.
Collecting statistics on the columns of tables being joined will help influence this join strategy.
2. Hash
A hash join can be performed only if one or both of the join tables can fit into the AMP’s memory, rather than spool. Disk operations are always more time-intensive than memory operations because they require many physical read/write operations.
The AMP will always try to fit the smallest table into memory. Once the table has been selected for memory insertion, the join column value of the smaller table will be hashed and sorted. The column from the larger table will then be used to perform the match against the rows of the smaller table, which are stored in the AMP’s memory. The hash join is completed quickly because the larger table does not need to be sorted or redistributed into spool.
3. Nested
A nested join uses either a unique primary index or a unique secondary index from one of the join columns in order to retrieve a single row. Using this type of join requires a WHERE clause, with the indexed column’s value set equal to a particular value, in order to limit the data to a single row.
The nested join approach is extremely efficient because the optimizer can easily find the WHERE clause information as it is indexed and also because it has to redistribute only a single row across all AMPs. Once the row has been redistributed into all AMPs’ spool space, the joins can complete and the data can be returned.
4. Exclusion
This type of join strategy is either a merge or product join strategy. An exclusion join is used when you are trying to discover what rows do not match with a row in another table.
Unlike the first three joins, the purpose of an exclusion join is to prevent certain data from returning. It can be used only if the following operators are used in the query: NOT IN, EXCEPT or MINUS. The exclusion join will always result in a full-table scan.
5. Cartesian and Product
These joins compare each row in one table with each row in another table because of the lack of a join equality condition. Almost always, this type of join is the result of an erroneously written query.
To avoid these types of joins, make sure your query is using an equality condition, such as E.DEPT = D.DEPT. If a query joins N tables, there must be N-1 join conditions.
Users Have Influence
The optimizer will analyze the users’ SQL statement and determine the best join plan to successfully complete the query. However, users do have some influence on the join strategy. Simple techniques such as employing primary and secondary indexes efficiently and collecting stats on columns used in the WHERE and ON clauses will provide the optimizer key pieces of information to ensure that the optimal plan is in place to complete the SQL request. Finally, use the EXPLAIN facility for your query plans.
Steve Wilmes, founder and CEO of Cerulium Corp., has more than 20 years’ experience in the computer industry and is a Teradata Certified Master.
Eric Rivard, vice president of Cerulium Corp., has more than 15 years’ experience in the computer industry and is a Teradata Certified Master.