Table Dressing
Explore three methods to modify large table structures.
by Gregory J. Sannik, with contributions from Joe Rarey
Even with the most flexible and extensible physical designs, it is nearly impossible to avoid making changes to physical table definitions. Finding the most efficient means of modifying structural definitions of large tables is crucial to limiting the impact on system resources and to the time users are without access to their data.
Three methods are used to change a table's physical layout:
- ALTER TABLE modifies the original table with the new definition
- INSERT/SELECT is used to copy the data from the original table into a new one where the definition is updated
- ’N+1’ is a new, slender table—created by performing INSERT/SELECT—that has the original table's primary key and primary index, plus the new column(s) definition
Factors to consider
Selecting the appropriate strategy for changing table definitions is case-specific and depends on several factors. Data availability is most important, because any system downtime must be acceptable to the business community. The system resource requirements (disk and CPU) and the time required to complete the process must also be examined. Finally, if a column has to be dropped, the ability to revert to the original version of the table is important.
Other considerations include the data size and access frequency of the data columns, the role of the table to be changed and the amount of spare disk capacity on the platform. The data types and nullability of the new columns, and whether the data population strategy will be retroactive or progressive, are significant. The NULL presence bits in the table header indicate the nullability status of the column. The ALTER TABLE process can be very quick if a column is added that is defined as NULL, has available presence bits and is not back-filled with data. In this case, ALTER TABLE is strictly limited to a table header change. In addition, data blocks are not rewritten.
How row widths are managed for the largest tables is also important. To take less space and time when altering and copying table definitions, multiple vertical table partitions should be created rather than a single, monolithically wide table. For instance, when redefining a table that is 100TB in size, 100TB of available space is required—at least temporarily—to execute ALTER TABLE or perform the INSERT/SELECT procedure. This is particularly true when the large table's structural and content volatility is a concern.
Method comparisons
While the end result of table restructuring is the same, the approach is different for each method. The table and the following descriptions list some attributes:
1. ALTER TABLE
In this approach, the structural definitions are modified in an existing table. Under certain circumstances, however, this practice can cause every row in the table to be rewritten, making it a potentially significant database operation. A rewrite can take hours on a large table, even with the highly optimized, block-oriented method used by the Teradata Database. In addition, when an ALTER TABLE request is aborted, a rollback will not occur and the operation will complete.
The time required to perform ALTER TABLE can be affected by:
- Number of indexes
- Degree of index uniqueness
- Number and width of rows
- Table size
- Degree of primary index data skew
- Column type and nullability
- Execution process, such as table rewrite or table header modification
2. INSERT/SELECT
Use a CREATE TABLE statement to create a new version of the original table with a ‘_NEW' suffix and revised layout. Execute an INSERT/SELECT to copy the data from the original structure to the new one. Rename the original table with the suffix ‘_OLD' and drop the ‘_NEW' suffix from the table newly created. Using this method, the original table can still be accessed as the new table is built, thus increasing data availability.
3. ’N+1’
In this less-frequently used technique, a very narrow ‘N+1’ version of the original table is created and populated that contains its primary key and primary index, plus the new column(s). A view is then defined that joins the ‘N+1’ version with the original table. This creates the illusion of a single data structure with one or more additional columns. Soft referential integrity (RI) can be used to join eliminate the ‘N+1’ table when the new column(s) are not referenced. Subsequent column additions are handled in the new table as INSERT/SELECTs or ALTER TABLEs. Creating an ‘N+2’,‘N+3’... ‘N+Ω’ version of the table to support future additions is recommended only when the number and size of the columns in the prior version match those in the original table.
A downside to this approach is that it results in additional table(s), which increases extract, transform and load (ETL) complexity. A benefit is that creating the ‘N+1’ table consumes fewer resources because it acts as a "slender" vertical partition of the original table that can be restructured independent of (but used in conjunction with) the existing table or its vertical partitions. This significantly decreases the time data is unavailable.
Also, if the new columns are populated strictly on a progression basis and include a compression specification, the bytes will be smaller. Smaller bytes will reduce the table size, enabling INSERT/SELECT to complete more quickly. Finally, if a problem occurs, modifications can be backed out by changing the view that joins the old and new vertically partitioned tables.
“Finding the most efficient means of modifying structural definitions of large tables is crucial to limiting the impact on system resources and to the time users are without access to their data.”
What's the best approach?
The ALTER TABLE approach can be more efficient and will complete faster than the INSERT/SELECT method but only when conditions are right. Of these two common methods, INSERT/SELECT is the better approach when a problem occurs during the deployment of the data structure modification. INSERT/SELECT makes it easy to back out the data structure (such as renaming an old table or new table, or dropping a new table) while the issue is being fixed, thus minimizing access interruption.
To ensure synchronization as the data structure is changed, organizations typically process a few ETL cycles before dropping off the old table. This can decrease how often a large table definition is restructured to an average of only once or twice a year, if at all.
Deciding to partition should not be based solely on the ease of modifying the underlying data definition in the table. The data access frequency identified in the database query SQL analysis and object log table should also be considered. In addition, findings from rigorous benchmarking in the development and testing phases are important, followed by the execution of a sound plan for data structure and content.
Ultimately, the key to a successful and least-disruptive modification to a large data table is knowing the benefits and drawbacks of each approach.
Gregory J. Sannik, principal consultant with Teradata Professional Services, has worked in IT since 1978 and is a data warehousing expert in the financial services, insurance and healthcare industries.
Joe Rarey started his IT career in 1966 and joined Teradata in 1987, working in the professional services and engineering fields.