Tech2Tech
Applied Solutions 5
Scalable approach
to real-time data access
Teradata Parallel Transporter provides timely data movement to enable active data warehousing.
by Alex Yung
A recent trend emerging in the data warehousing marketplace is the continued integration of data warehouses into the enterprise IT infrastructure. And the concept of an active data warehouse—the timely, integrated store of detailed data available for both tactical and strategic business decision making—is finding wider acceptance across industries.
The traits in active data warehousing are supplemental to traditional data warehouse functionality. For example, the complex, strategic decision queries normally found can be expanded to include short, tactical queries, with real-time data feeds—and possibly event-driven updates—at the same time. This is what makes it active.
A major difference between an active data warehouse and a traditional data warehouse is that the former is not bound by predefined and rigid load cycles. Since source data might appear at unpredictable times, active data warehousing offers an advantage because the data can be continuously loaded, usually with 24x7 availability. The resulting high data freshness and consistency are requirements for the success of the active data warehouse.
Teradata Parallel Transporter
Specifically optimized for the Teradata Database, Teradata Parallel Transporter is a flexible, high-performance tool that facilitates real-time data extraction, transformation and loading. It supports an infrastructure that enables parallel execution of the product’s components—known as operators—to integrate with the infrastructure in a plug-in fashion to perform these functions:
- Load. Places data into an empty table using Teradata FastLoad
- Update. Loads data and applies updates to new and existing tables using Teradata MultiLoad; the updates can be applied either conditionally or unconditionally based on user-defined rules
- Export. Extracts data from Teradata tables using Teradata FastExport
- Stream. Loads data and applies updates to new and existing tables by using multiple SQL protocol sessions in a high-performance, workload-balancing manner; the updates can be applied either conditionally or unconditionally based on user-defined rules; this operator can be used for loading data from continuous data sources, such as queuing systems (e.g., Microsoft Message Queuing (MSMQ) and WebSphere MQ) and enterprise application integration products
- SQL Inserter. Loads data, including large objects (LOBs), into a new or an existing table using a single SQL protocol session
- SQL Selector. Extracts data, including LOBs, from an existing table using a single SQL protocol session
- Open Database Connectivity (ODBC). Extracts data from external third-party ODBC sources
- Data Connector. Supports simultaneous, parallel reading of multiple data sources, such as various types of files or queuing systems; also allows writing to external data sources
Operators provide access to external resources, such as files, database management system tables, message queues such as MSMQ, WebSphere MQ, etc. Operators also perform various filtering and transformation functions. The Teradata Parallel Transporter uses data streams to transmit data between the operators. A typical data stream connects two operators: The producer writes the data to the data stream, and the consumer reads the data from the data stream. Operators connected by the data stream must share the same schema (i.e., data layout and format) for the data to be interchanged.
The use of multiple parallel instances of operators, as well as operator multi-session capabilities, enables Teradata Parallel Transporter jobs to be scaled up to maximize the use of available CPU resources on ETL client hosts. This results in shorter job runtimes if either I/O or CPU creates a bottleneck in ETL processing.
Scalable access to message queues
Teradata Parallel Transporter offers several ways for moving data from third-party sources into the Teradata Database. Aside from using the producer and consumer operators for data extraction and loading, Teradata Parallel Transporter also allows external units, called access modules, to be used as plug-ins through the Data Connector operator.
Access modules are software that encapsulate the details of access to various data stores, such as files, tapes, named pipes and message queues. The Data Connector operator, which acts as an adapter for access modules, insulates Teradata Parallel Transporter from knowing the inner working of these modules, thus allowing them to be user-defined and -constructed and then executed under Teradata Parallel Transporter as if they were operators. Teradata Parallel Transporter provides these access modules, all of which are checkpoint-restartable:
- Named pipes Access Module allows users to load data into Teradata Database from a named pipe.
- WebSphere MQ Access Module allows users to load data into Teradata Database from a message queue using WebSphere MQ message queuing middleware.
- Java Message Service (JMS) Access Module allows users to load data into Teradata Database from a JMS-enabled messaging system.
Scalable, continuous loading using the Stream operator
Like Teradata TPump, the Stream operator is a general-purpose load utility for the Teradata Database that supports continuous loading. Unlike TPump, which runs as a single-process application, the Stream operator can scale to run with multiple instances in a Teradata Parallel Transporter job while supporting most TPump features. The Stream operator uses standard SQL to insert, update or delete data. It also supports UPSERT, an operation that allows rows to be inserted if they are not found for an update.

Click to enlarge
The UNION ALL feature enables parallel access to MQs via the MQ or JMS access modules. As shown in figure 1, multiple copies of access modules can be launched by multiple instances of the Data Connector operator for reading transactional data from the same or different MQs. This parallel arrangement, which enables data parallelism, could significantly improve the performance of data extraction. For recovery purposes, both the JMS and MQ Access Modules provide support for a fallback data file for storing checkpoint data and for a copy of the data between checkpoints.
Scalable access to transactional files
While the MQ approach in transactional processing is common in active data warehousing, the file-oriented approach is increasing in popularity because of its simplicity and ease of control. Many companies monitor and store thousands—sometime hundreds of thousands—of transactions per day. This transactional data is collected and usually stored as files in directories before they are merged into the enterprise-wide data warehouse. These files are considered active, meaning they are created as transactions are collected.
ACTIVE approach
Unlike batch file processing in which files are stationary and typically contain huge numbers of rows, files that represent transactions in the active data warehouse environment are dynamic and relatively small—a few hundred rows per file, on average. The term dynamic means these files can be created or removed from the directory while the ETL job is running.
Because these files represent real-time transactional data flow, they are usually created in a short time and made available for updates, in which they are processed in or close to time-sequence order, and committed to the data warehouse in a timely manner. With Teradata Parallel Transporter, users can employ the active directory scan feature to continuously collect data from these directories, based on a user-defined time interval, while the Data Connector operator activates the start and stop time for the entire scan job.
The Data Connector operator also processes files in the source directories that meet the user-specified file name criteria, which can include wild cards. Whenever the defined scan interval expires, the Data Connector operator scans the directory for new files since the last interval. It then reads the rows from each of the files collected and sends them to the consumer operator—which is usually the Stream operator for continuous loading purposes. If no new files are found, the Data Connector operator waits for the defined interval to expire before scanning the directory again.

Click to enlarge
Figure 2 shows a typical real-time database-loading scenario. Multiple copies of the Data Connector operator can collect data in real time from multiple sources with different types (or schemas) and process them with the Teradata Parallel Transporter SQL-like language (i.e., SELECT/CASE/WHERE). The data sets are then written to their respective output data streams. The UNION ALL operation, aside from providing parallel processing of the data sources, merges the source data into a union-compatible row set, which is then sent to the target system.
All of these operations can be accomplished in a single job without resorting to intermediate files or staging tables. The user can also define checkpoint rules, whereby data can be committed to the database according to established time intervals or the number of files processed. Such periodic transactional commits enable data to be incrementally and readily available for real-time queries or further transactional processing.
Files can also be sorted by request based on creation date/time (for ordered transaction processing) and automatically archived once they have been committed to the target tables. To commit transactions, the Data Connector operator can be directed to take a checkpoint based on a time interval or the number of files that have been processed.
Mini-batch approach
Like the active directory scan, this approach allows transactional files to be collected periodically from a directory and their contents loaded into the Teradata Database. Unlike the active directory scan, the mini-batch approach uses the Load or Update operator instead of the Stream operator for loading the data. There are trade-offs between active, continuous and mini-batch loading.
In terms of data freshness, the active directory scan or the scalable access to MQs provides close to real-time loading. The mini-batch approach, on the other hand, could have a longer data-refresh cycle time, which could be further compromised by the number of files being loaded per time interval. This is because the mini-batch loading is usually done with the Load or Update operator, and these lock the target table until the loading is done. However, while the refresh time might be delayed in the mini-batch approach, this bulk-loading method could be more cost-effective because less overhead is incurred compared with the row-at-a-time update using the Stream operator.
Despite the differences in the two approaches, continuous and mini-batch loading are not mutually exclusive. In fact, combining the two approaches can obtain a good balance of data freshness, throughput performance and use of system resources.
Extract, load and transform approach
Although periodic load with the Load or Update operator offers the block-at-a-time performance of high-volume data loading, it has restrictions for applications. For example, it does not support target tables with unique secondary indexes (USI), join indexes (JI), referential integrity (RI) or triggers.
With the extract, load and transform (ELT) approach, however, these restrictions can be avoided. By first loading a small batch of files into a staging table, you can use SQL statements (such as INSERT-SELECT, UPDATE-FROM or MERGE-INTO) to apply the data from the staging table to the target table.
The ELT approach can also take advantage of the SQL bulk load operations that are available within the Teradata Database. These operations not only support MERGE-INTO but also enhance INSERT-SELECT and UPDATE-FROM. This enables primary, fallback and index data processing with block-at-a-time optimization.
The Teradata bulk load operations also allow users to define their own error tables to handle errors from operations on target tables. These are separate and different from the Update operator’s error tables. Furthermore, the no primary index (NoPI) table feature also extends the bulk load capabilities. By allowing NoPI tables, Teradata can load a staging table faster and more efficiently.
Benefits
Teradata Parallel Transporter provides flexible, high-performance load and unload facilities that enable population of the data warehouse. It supports many traditional functions of Teradata utilities, as well as many that are new and advanced.
Through its parallel execution design, Teradata Parallel Transporter improves ETL performance while providing a uniform interface. It also facilitates active data warehousing by allowing continuous and scalable access to data sources and targets, thus keeping the data fresh and with maximum throughput.
In all, Teradata Parallel Transporter drives business value by enabling real-time decision making as required by enterprise-wide business intelligence (BI) applications.
Alex Yung is the architect of Teradata Parallel Transporter.