Loading...
4 steps to diagnosing and treating enterprise data warehouse problems.

Tech2Tech

Applied Solution 2

Keep Your Data Healthy

4 steps to diagnosing and treating enterprise data warehouse problems.

Health problems often warrant a visit to the doctor’s office, leading to tests to help the physician determine the issue. These tests measure results against a pre-determined range of acceptable numbers. If the results are higher or lower than what’s considered normal, the doctor prescribes treatment.

But it may not end there. Ongoing checkups ensure that the numbers remain in the safe range. The fact is that humans are intricate beings who require regular healthcare.

Enterprise data warehouses (EDWs) are also intricate. An EDW spans verticals—sources, centralized repository and targets. The “health” of its core, i.e., the quality of the data, determines the well being of the entire data warehouse. Quality converts data into information/knowledge. Unless the core is healthy, the abundant information becomes useless, especially for downstream systems that use that data. The "unhealthy" data could lead to inaccurate decisions, which would negatively affect business growth.

Hence, a diagnosis and treatment plan for the data’s overall good health becomes a basic necessity.

Diagnosis

EDW data can be prone to error because of potentially questionable data coming in from source systems or insufficient extract, transform and load (ETL) processes. Real-time access of the warehouse via Web-based technologies adds to the chaos. Given this multi-fold scope of possible data corruption, multiple user groups are responsible for the diagnosis.

STEP 1: Identification and Selection

Businesses need to assess the data quality and raise red flags in case of any issues. High-level dimensions for assessing data quality are accuracy, completeness, timeliness and consistency—the importance of each dimension being business-dependent. There is no one-size-fits-all model, so businesses play a primary role in identifying suspect data.

Next, the technology group has to run with the identified data using inputs from business to select potentially unhealthy data. It’s important that the business and the technology group agree upon the outcome of this step.

The driving force behind this activity is business knowledge, not technology (tools/utilities).

Analysis Requires Technology

Accurate profiling depends heavily on technology, and numerous smart tools and utilities are available that provide data analysis results more quickly and easily. Teradata Profiler carries out this analysis. It also:

  • Is part of Teradata Warehouse Miner
  • Includes a Windows-based client utility, normally accessed via a graphic user interface
  • Generates optimized Teradata SQL available for export as a result of all functions
  • Supports concurrent usage, interactive profiling, profiling script development and batch execution
  • Processes any number of columns in Teradata tables or views with-out movement
  • Maintains Teradata security and administration standards
  • Provides the option to store results as Teradata objects and/or return the results to the calling program

—T.F.B.

STEP 2: Profiling and Analysis

Profiling is the process of examining data available in either source systems or the warehouse and then gathering statistics and information about it to arrive at potential data issues.

Profiling is executed on identified and selected suspect data to run various kinds of analysis:

  • Values. Column-level data popula-tion insights—nulls, blanks, unique values, zeroes, positives and negatives
  • Statistical. Information on numeric columns—minimum, maximum, mean and standard deviation
  • Frequency. Validations on codes, indicators, flags, types and categories in general
  • Histogram/referential integrity. Columns/set of columns uniqueness determination and detection of orphan records

Manual analysis of the profiling results is crucial to uncovering genuine data issues and their causes, and it needs to be conducted by both the business and technical teams.

Treatment

Causes identified during diagnosis need to be studied in detail, their root causes traced, and potential fixes formulated and applied across verticals.

STEP 3: Identify and Fix Causes

Consider data flowing in from different source systems that feed a single data store. At the end of the day, the namesake master data store has ample duplicates and bad master data overall. Profiling reveals the duplicates, and the analysis leads to different source systems that provide such data.

One way to deal with this issue is to have a separate master data store managed in-house and a cross-reference that maps source information to such master data. Processes would need to be defined to get the incoming information correctly mapped to the master data in order to avoid duplicates and orphan records.

Such a fix involves people, processes and technology spanning verticals.

Data Quality Scorecard Service

Teradata Professional Services provides a data quality scorecard implementation service consisting of Data Quality Rules Manager, a set of Teradata tables to store the rules/results and a scorecard for users, data stewards and governance bodies to:

  • Understand quality issues
  • Evaluate improvement opportunities
  • Measure progress over time

STEP 4: Monitor and Control

Treatment mandates regular monitoring. With fixes in place, ongoing monitoring is imperative to verify successful results and for ensuring that the data remains healthy.

Monitoring needs data quality rules to be defined and established. These rules are checks that validate various information scenarios against a pre-defined safe range and allow discovery of issues in an automated fashion. The scope of such rules is vast and includes:

  • Attribute domain constraints
  • Relational integrity
  • Historical data
  • State-dependent objects
  • Attribute dependency

Data quality scorecards allow summarized and detailed results to be presented to a broad audience, namely the governance body, data stewards and end users.

Routine checkups help prevent health issues for humans, and they can do the same for an EDW’s core. Prevention is always better than needing a cure, and proactive measures help businesses reap great benefits in the long run.


Your Comment:
  
Your Rating:

Comments
 
interesting article.Good Job.

8/17/2010 4:48:51 PM
— Anonymous