Loading...

Tech2Tech

Applied Solutions 5

Scalable approach
to real-time data access

 

Teradata Parallel Transporter provides timely data movement to enable active data warehousing.

A recent trend emerging in the data warehousing marketplace is the continued integration of data warehouses into the enterprise IT infrastructure. And the concept of an active data warehouse—the timely, integrated store of detailed data available for both tactical and strategic business decision making—is finding wider acceptance across industries.

The traits in active data warehousing are supplemental to traditional data warehouse functionality. For example, the complex, strategic decision queries normally found can be expanded to include short, tactical queries, with real-time data feeds—and possibly event-driven updates—at the same time. This is what makes it active.

A major difference between an active data warehouse and a traditional data warehouse is that the former is not bound by predefined and rigid load cycles. Since source data might appear at unpredict­able times, active data warehousing offers an advantage because the data can be continuously loaded, usually with 24x7 availability. The resulting high data fresh­ness and consistency are requirements for the success of the active data warehouse.

Teradata Parallel Transporter

Specifically optimized for the Teradata Database, Teradata Parallel Transporter is a flexible, high-performance tool that facilitates real-time data extraction, transformation and loading. It supports an infrastructure that enables parallel execution of the product’s components—known as operators—to integrate with the infrastructure in a plug-in fashion to perform these functions:

  • Load. Places data into an empty table using Teradata FastLoad
  • Update. Loads data and applies updates to new and existing tables using Teradata MultiLoad; the updates can be applied either condi­tionally or unconditionally based on user-defined rules
  • Export. Extracts data from Teradata tables using Teradata FastExport
  • Stream. Loads data and applies updates to new and existing tables by using multiple SQL protocol sessions in a high-performance, workload-balancing manner; the updates can be applied either conditionally or unconditionally based on user-defined rules; this operator can be used for loading data from continu­ous data sources, such as queuing systems (e.g., Microsoft Message Queuing (MSMQ) and WebSphere MQ) and enterprise application inte­gration products
  • SQL Inserter. Loads data, including large objects (LOBs), into a new or an existing table using a single SQL protocol session
  • SQL Selector. Extracts data, including LOBs, from an existing table using a single SQL protocol session
  • Open Database Connectivity (ODBC). Extracts data from external third-party ODBC sources
  • Data Connector. Supports simultane­ous, parallel reading of multiple data sources, such as various types of files or queuing systems; also allows writ­ing to external data sources

Operators provide access to external resources, such as files, database manage­ment system tables, message queues such as MSMQ, WebSphere MQ, etc. Operators also perform various filtering and transformation functions. The Teradata Parallel Transporter uses data streams to transmit data between the operators. A typical data stream connects two operators: The producer writes the data to the data stream, and the consumer reads the data from the data stream. Operators connected by the data stream must share the same schema (i.e., data layout and format) for the data to be interchanged.

The use of multiple parallel instances of operators, as well as operator multi-session capabilities, enables Teradata Parallel Transporter jobs to be scaled up to maxi­mize the use of available CPU resources on ETL client hosts. This results in shorter job runtimes if either I/O or CPU creates a bottleneck in ETL processing.

Scalable access to message queues

Teradata Parallel Transporter offers several ways for moving data from third-party sources into the Teradata Database. Aside from using the producer and consumer operators for data extraction and loading, Teradata Parallel Transporter also allows external units, called access modules, to be used as plug-ins through the Data Connector operator.

Access modules are software that encap­sulate the details of access to various data stores, such as files, tapes, named pipes and message queues. The Data Connector operator, which acts as an adapter for access modules, insulates Teradata Parallel Transporter from knowing the inner work­ing of these modules, thus allowing them to be user-defined and -constructed and then executed under Teradata Parallel Transporter as if they were operators. Teradata Parallel Transporter provides these access modules, all of which are checkpoint-restartable:

  • Named pipes Access Module allows users to load data into Teradata Database from a named pipe.
  • WebSphere MQ Access Module allows users to load data into Teradata Database from a mes­sage queue using WebSphere MQ message queuing middleware.
  • Java Message Service (JMS) Access Module allows users to load data into Teradata Database from a JMS-enabled messaging system.

Scalable, continuous loading using the Stream operator

Like Teradata TPump, the Stream operator is a general-purpose load utility for the Teradata Database that sup­ports continuous loading. Unlike TPump, which runs as a single-process application, the Stream operator can scale to run with multiple instances in a Teradata Parallel Transporter job while supporting most TPump features. The Stream operator uses standard SQL to insert, update or delete data. It also supports UPSERT, an opera­tion that allows rows to be inserted if they are not found for an update.

Click to enlarge

The UNION ALL feature enables parallel access to MQs via the MQ or JMS access modules. As shown in figure 1, multiple copies of access modules can be launched by multiple instances of the Data Connector operator for reading trans­actional data from the same or different MQs. This parallel arrangement, which enables data parallelism, could signifi­cantly improve the performance of data extraction. For recovery purposes, both the JMS and MQ Access Modules provide support for a fallback data file for storing checkpoint data and for a copy of the data between checkpoints.

Scalable access to transactional files

While the MQ approach in transac­tional processing is common in active data warehousing, the file-oriented approach is increasing in popularity because of its simplicity and ease of control. Many companies monitor and store thousands—sometime hundreds of thousands—of transactions per day. This transactional data is collected and usually stored as files in directories before they are merged into the enterprise-wide data warehouse. These files are considered active, meaning they are created as trans­actions are collected.

ACTIVE approach

Unlike batch file processing in which files are stationary and typically contain huge numbers of rows, files that represent transactions in the active data warehouse environment are dynamic and relatively small—a few hundred rows per file, on average. The term dynamic means these files can be created or removed from the directory while the ETL job is running.

Because these files represent real-time transactional data flow, they are usually cre­ated in a short time and made available for updates, in which they are processed in or close to time-sequence order, and committed to the data warehouse in a timely manner. With Teradata Parallel Transporter, users can employ the active directory scan feature to continuously collect data from these directories, based on a user-defined time interval, while the Data Connector operator activates the start and stop time for the entire scan job.

The Data Connector operator also processes files in the source directories that meet the user-specified file name criteria, which can include wild cards. Whenever the defined scan interval expires, the Data Connector operator scans the directory for new files since the last interval. It then reads the rows from each of the files collected and sends them to the consumer operator—which is usually the Stream operator for continuous loading purposes. If no new files are found, the Data Connector operator waits for the defined interval to expire before scanning the directory again.

Click to enlarge

Figure 2 shows a typical real-time data­base-loading scenario. Multiple copies of the Data Connector operator can collect data in real time from multiple sources with dif­ferent types (or schemas) and process them with the Teradata Parallel Transporter SQL-like language (i.e., SELECT/CASE/WHERE). The data sets are then written to their respective output data streams. The UNION ALL operation, aside from providing parallel processing of the data sources, merges the source data into a union-compatible row set, which is then sent to the target system.

All of these operations can be accom­plished in a single job without resorting to intermediate files or staging tables. The user can also define checkpoint rules, whereby data can be committed to the database according to established time intervals or the number of files processed. Such periodic transactional commits enable data to be incrementally and readily available for real-time queries or further transactional processing.

Files can also be sorted by request based on creation date/time (for ordered transaction processing) and automatically archived once they have been committed to the target tables. To commit transac­tions, the Data Connector operator can be directed to take a checkpoint based on a time interval or the number of files that have been processed.

Mini-batch approach

Like the active directory scan, this approach allows transactional files to be collected periodically from a directory and their contents loaded into the Teradata Database. Unlike the active directory scan, the mini-batch approach uses the Load or Update operator instead of the Stream operator for loading the data. There are trade-offs between active, continuous and mini-batch loading.

In terms of data freshness, the active directory scan or the scalable access to MQs provides close to real-time loading. The mini-batch approach, on the other hand, could have a longer data-refresh cycle time, which could be further compromised by the number of files being loaded per time interval. This is because the mini-batch loading is usually done with the Load or Update operator, and these lock the target table until the loading is done. However, while the refresh time might be delayed in the mini-batch approach, this bulk-loading method could be more cost-effective because less overhead is incurred compared with the row-at-a-time update using the Stream operator.

Despite the differences in the two approaches, continuous and mini-batch loading are not mutually exclusive. In fact, combining the two approaches can obtain a good balance of data freshness, throughput performance and use of system resources.

Extract, load and transform approach

Although periodic load with the Load or Update operator offers the block-at-a-time performance of high-volume data loading, it has restrictions for applica­tions. For example, it does not support target tables with unique secondary indexes (USI), join indexes (JI), referen­tial integrity (RI) or triggers.

With the extract, load and transform (ELT) approach, however, these restrictions can be avoided. By first loading a small batch of files into a staging table, you can use SQL statements (such as INSERT-SELECT, UPDATE-FROM or MERGE-INTO) to apply the data from the staging table to the target table.

The ELT approach can also take advan­tage of the SQL bulk load operations that are available within the Teradata Database. These operations not only support MERGE-INTO but also enhance INSERT-SELECT and UPDATE-FROM. This enables primary, fallback and index data processing with block-at-a-time optimization.

The Teradata bulk load operations also allow users to define their own error tables to handle errors from operations on target tables. These are separate and different from the Update operator’s error tables. Furthermore, the no primary index (NoPI) table feature also extends the bulk load capa­bilities. By allowing NoPI tables, Teradata can load a staging table faster and more efficiently.

Benefits

Teradata Parallel Transporter provides flexible, high-performance load and unload facilities that enable population of the data warehouse. It supports many traditional functions of Teradata utilities, as well as many that are new and advanced.

Through its parallel execution design, Teradata Parallel Transporter improves ETL performance while providing a uniform interface. It also facilitates active data warehousing by allowing continuous and scalable access to data sources and targets, thus keeping the data fresh and with maxi­mum throughput.

In all, Teradata Parallel Transporter drives business value by enabling real-time decision making as required by enterprise-wide business intelligence (BI) applications.


Your Comment:
  
Your Rating:

Comments
 
Thanks a lot for such a crisp and to the point description of TPT and the different load techniques.

12/18/2011 1:46:08 PM
— Anonymous
 
Thanks a lot for such a crisp and to the point description of TPT and the different load techniques.

12/18/2011 1:44:47 PM
— Anonymous