Applied Solutions 2
All about availability
Data synchronization and a dual active solution using ELT are key.
Data warehouse systems are increasingly becoming part of the mission-critical infrastructure for an enterprise. As data warehouses evolve from back-room reporting solutions to front-line sources of business intelligence (BI), the service levels for availability of decision support capability become more aggressive. Information must be available when decisions are required in the enterprise. And, increasingly, decisions are required on a 24x7 basis.
The extreme in high-availability implementation is to protect the delivery of BI capability within the enterprise from downtime caused by natural and human disasters that can take out a full data center.
As data warehousing becomes more important to the enterprise and solutions for redundant systems implementation become more cost-effective, disaster recovery for BI is emerging as a best practice in the industry. Implementation of data synchronization, along with the extract, load and transform (ELT) approach to data cleansing and integration, can be used to support multi-systems deployment for high availability.
The dual active approach
Traditional implementations of disaster recovery are deployed using a hot standby system with the capability to switch over when the primary system becomes incapacitated. This style of implementation means that the hot standby system is not being used “productively,” except when a disaster occurs.
In more progressive implementations, both the primary and secondary systems in a disaster recovery configuration are “live” and “productive” for executing data warehousing workloads. This is called the dual active approach. In situations where service levels for recovery require both systems to be up to date in data content, it makes sense to put the secondary system to work by load-balancing across the two systems. Greater throughput on the data warehousing workload is achieved when both systems work together, rather than the secondary system sitting in idle mode waiting for a disaster to occur.
Teradata Query Director can be used to transparently load-balance queries across the primary and secondary systems. In the case of a disaster at one of the two data centers, the entire workload will be directed to fail over to the surviving system. One danger when sharing workload across two systems is that end users begin to take the additional performance for granted. Careful capacity planning must be undertaken to ensure that if one of the two systems goes down, enough performance capacity will be available in the surviving system to execute the critical workload for the enterprise.
In the Teradata environment, introducing a second physical system into the architecture implies that data must be synchronized between the two Teradata platforms. An inventory of analytic applications must be performed to understand what data is considered mission-critical—and, therefore, required in both the primary and secondary systems.
Of course, it is possible to keep all data in both systems for a full dual active solution, but it is often attractive to configure the secondary system with only a subset of the data and processing power in order to yield a more economical solution. The secondary system is typically 30% to 50% of the size of the primary system, depending on the number of mission-critical applications running on top of the data warehouse.
Several methods can be used to synchronize data across Teradata systems in a dual active implementation. The active replication method is recommended when near-immediate synchronization is required between the systems, and volumes are relatively low. Active replication involves performing change-data capture on (usually) the primary system to identify all insert, update and delete operations, and then replicating it to the secondary system. Data changed on the primary system is captured and then applied to the secondary system in near real time. This method can handle thousands of rows per second using replication services provided by the Oracle GoldenGate software on top of the Teradata relational database management system. (See figure 1.)
The packaging of a solution for metadata-driven orchestration of ELT processes in a dual active deployment is a huge leap forward for practitioners of
active data warehousing.
For high-volume data synchronization it is more effective to use bulk data management techniques. When data is updated relatively infrequently (e.g., nightly), the Teradata Data Mover utility can be used to effectively take data loaded into the primary system and bring it across to the secondary system. Performance is highly optimized using this approach.
Use of the Teradata Data Mover approach is independent of the method by which data is acquired into the primary system. This means the existing extract, transform and load (ETL) infrastructure for getting data into the primary system remains unchanged, and Teradata Data Mover is simply used as a post-processing step to synchronize with the secondary system.
When data volumes are high and multiple-times-per-day loading frequencies demand tight coupling with data integration processes, the dual-load approach will be the most attractive. Using a traditional ETL approach, data is extracted from the source systems, transformed and cleansed to meet target table requirements; then loading jobs are initiated, in parallel, against both systems.
For efficiency reasons, the loads proceed asynchronously and monitoring capabilities provided by Teradata Multi-Systems Manager ensure that data content stays in sync within defined tolerances. If data freshness service levels are not in line, then Query Director is informed that one of the systems has fallen behind, and the workload is funneled to the most up-to-date copy of the data while the other system catches up. Checkpoint-restart capabilities in the loading utilities ensure that data is never lost—even if one or both of the systems experiences a failure.
As data volumes grow and transformation rules become sophisticated, many data warehouse users have begun executing the ELT approach rather than the ETL method. With ELT, data extracted from source systems is loaded into “raw” staging tables that closely resemble the source system files from which they were extracted. Metadata-driven SQL generation then creates transformation programs that execute directly in the data warehouse on the staging tables to orchestrate transformation and loading into target tables. Transformation rules that rely upon access to historical data can simply join to that data directly within the data warehouse rather than extracting into a separate ETL server, as would be the case in the traditional method.
Besides the ease of in-database processing, the ELT approach also benefits from the performance advantage of the Teradata system’s massively parallel processing capability and obviates the need for large ETL servers.
Efficiency and simplicity are fundamental design principles for successful deployment of ELT with a dual active architecture. A packaged solution is essential for driving the simplicity of the implementation. Because hand-coded scripts and homegrown failover implementations are expensive to deploy and prone to error, many have migrated from custom-developed ETL programs to the metadata-driven tools approach that currently dominates the industry.
Click to enlarge
Evolving these packaged tools to handle automation of more advanced data acquisition services that consider disaster recovery requirements is the next step in the progression of these solutions. Oracle Data Integrator Enterprise Edition is an early innovator in this space.
One of the challenges in the deployment of dual load for ELT implementations is managing the trade-off between simplicity and efficiency. The simplest approach for dual load is complete symmetry. In other words, implementation of data transformations and loading on the two systems is as identical as possible within the constraints of the design requirements.
With the ELT approach, this would mean a single extraction of data from the source systems, followed by dual load into staging areas of the two data warehouse databases, and then execution of the generated SQL to perform transformations before inserting into the target tables on both systems.
The issue is that the symmetrical approach requires the transformation logic to be executed twice—once on the primary system and again on the secondary system. This can be expensive compared with dual-load implementations with the ETL approach in which transformations are executed only once into both systems, prior to dual load. On the other hand, the in-database processing efficiencies offered by ELT can be a significant advantage for this approach.
Designing a universal solution to address all of the ELT challenges inherent in a dual active deployment may not be possible. Several factors will influence the best solution for a good design, some of which are:
- Relative size of the primary and secondary data warehouse servers
- Complexity of the data transformations to be performed
- Data volumes to be processed
- Availability of network bandwidth
- Amount of data reduction or expansion from source to target
- Mini batch window size for completing data acquisition services
- Operational maturity of the data warehouse environment
These considerations, and many more, can drive quite different design trade-offs with significant implications upon the deployment architecture.
To accommodate trade-offs associated with the factors described above, Teradata and Oracle formulated two distinct design options for implementation of ELT in a dual active environment. In both design options, the primary and secondary data warehouse platforms are allowed to have different platform configurations (e.g., the secondary platform is often configured with less performance capability than the primary platform). Both design options also permit only a subset of tables to be present from the primary to the secondary platform (or vice versa) to enable disaster recovery without full duplication of all data across both systems.
Click to enlarge
The first design option, referred to as parallel load and transform, leverages the two data warehouse systems from Teradata symmetrically with concurrent load and transform on both platforms. The second design option, referred to as load, transform and synchronize, uses an asymmetric design whereby a single Teradata system performs a given set of transformations and then pushes the transformed data to the other system for loading into target tables. This is done in addition to inserting the transformed data into its own target tables.
Parallel load and transform
In the parallel load and transform app-roach, Oracle Data Integrator Enterprise Edition first extracts the source data from the various source systems into flat files that it stores in the landing and recovery area. The staging areas of both systems are then loaded in parallel. The third step, in which transformations are executed, occurs simultaneously in both Teradata systems to populate the target enterprise data warehouse (EDW) tables.
The initial extraction is a key step in this design scenario. The ELT processes read only once from the source systems to avoid extra processing against the source application platforms. It is also critical to avoid a point-to-point dependence between the sources and the target dual active data warehouse environment in which one of the two systems may be down. In this way, recovery capability and high availability are provided at all times.
The landing and recovery area acts as a file storage space that decouples the sources from the targets and archives the extracted files as sets identified by a generated unique batch ID for each run. It is required that the landing and recovery area be deployed using a fault-tolerant file system with disaster recovery capability to guarantee no data loss and continued operation during a catastrophic event.
In the second step, files are bulk-loaded into the staging areas of each system. The metadata-driven generation of Teradata Parallel Transporter scripts is used to orchestrate high-performance loading from the landing and recovery area into staging-area tables on each system. The same ELT logic, with only one copy of the metadata and execution scripts, is used for both systems. The loading processes can run independently and in parallel.
This is also the process by which a potential crash recovery occurs. Oracle Data Integrator Enterprise Edition checks the high-water mark of the latest batch ID for each system and determines which sets of files need to be loaded in the staging area. If one of the two systems had been down and unable to successfully process extracts during previous runs, these sets of files would automatically be re-played at the next run, or when the system is up again.
The third step is where high-performance data transformation and cleansing take place. With ELT processing, Oracle Data Integrator Enterprise Edition uses metadata to generate optimized Teradata SQL to drive set processing for transforming and moving data from the staging area to the target tables of each system. Both Teradata systems are active and act as transformation engines. The data consumed in this step captures all records from the staging area that have not yet been integrated into the target tables, based on the last successful batch ID run.
Load, transform and synchronize
In the load, transform and synchronize approach only one of the Teradata engines will execute a given set of transformation logic. Rather than having two systems run the same transformations in parallel, one of them (usually the more powerful) acts as the transformation server.
Click to enlarge
Source data is first loaded directly in the staging area of this system. The data is then transformed and simultaneously inserted into the target tables and pushed to a landing and recovery area. The final step ensures that the secondary system consumes the transformed data from the landing area. In case of failure of the primary system, the secondary system takes the lead and becomes the transformation server. In other words, the secondary server switches from passive to active to guarantee data acquisition for the warehouse.
Similar to the previous design scenario, the landing and recovery area acts as a reliable queuing system for extracted source data waiting to be processed. This allows for decoupling of the two systems. Data files are grouped as sets and identified by the same concept of batch ID as in the previous design scenario.
When the load starts, the primary system first checks for any pending changes that it has not yet consumed. This situation may occur if it had previously failed and the secondary system had taken the lead, thus pushing final transformed data to the landing and recovery area. Once pending data is consumed and the current system is considered up to date with the other system, it bulk-loads its staging area from the source applications’ data.
The next step of the process is performing the data transformation and cleansing logic using ELT processing and optimized Teradata SQL on the primary system. The transformed data is inserted to the target and pushed to the landing and recovery area. In the last step the secondary system is updated by loading the transformed data file snapshots for all batch IDs that have not yet reached the second system.
Efficiency and simplicity are two fundamental design principles for successful deployment of ELT with a dual active architecture.
In the overall process, the concept of batch ID is propagated across steps to identify the data sets. With its modular metadata architecture, Oracle Data Integrator Enterprise Edition supports the ability to transparently switch the primary and secondary systems in case of failure by updating context variables and re-using a single instance of the transformation metadata. (See figure 3.) Oracle Data Integrator Enterprise Edition acts as the overall conductor of this architecture, orchestrating the various steps and recovering from any unexpected failure.
Choosing the best approach
The parallel transform and load design is the simplest approach because it is completely symmetrical. The systems in the dual active configuration are behaving identically and the approach requires very little modification of an ELT job originally developed for a single-system implementation. When data volumes are relatively low or when the transformation processing is not very heavy, this is generally the recommended approach.
However, when significant transformation processing is required, it may not be advisable to duplicate execution on both systems. This is especially true if the secondary system is configured significantly smaller than the primary system. In these cases, use of the load, transform and synchronize approach avoids execution of the transformation logic on the secondary system by pushing the results of the transforms executed on the primary system.
In some cases, the volume of data pushed from the primary system after transformation will be significantly smaller than what was extracted from the source systems because of filtering or aggregation in the transformation logic. In these cases, the push approach can save significant resources in network traffic and loading, in addition to eliminating the redundant transformations.
The disadvantage of the load, transform and synchronize approach is that it is a bit more complicated than the parallel transform and load approach. The asymmetry of the design makes it harder to understand, and additional metadata must be constructed to orchestrate the push from the transformation server to the secondary disaster recovery system. The metadata architecture makes it so that only one copy of the rules for transformation and pushing need be constructed as logical constructs—using context variables to determine the physical assignment of the server to use for transformation and the server to use for synchronization.
In some cases the parallel transform and load approach will be the best method to use, and in other cases the load, transform and synchronize approach will be best. The use of Oracle Data Integrator Enterprise Edition for dual active implementation of ELT allows either technique to be deployed, or both of them at the same time on different sets of source-to-target mappings.
Delivering high availability
The packaging of a solution for metadata-driven orchestration of ELT processes in a dual active deployment is a huge leap forward for practitioners of active data warehousing. Previous solutions involving lots of custom-coded scripting and a hodgepodge of tools were difficult to consistently maintain and deploy.
The joint development between the Oracle and Teradata research and development resources sets a new standard for cost-effectively delivering high availability for mission-critical data acquisition services using the ELT approach.