Tech2Tech
Applied Solution 2
Keep Your Data Healthy
4 steps to diagnosing and treating enterprise data warehouse problems.
by Taher F. Borsadwala
Health problems often warrant a visit to the doctor’s office, leading to tests to help the physician determine the issue. These tests measure results against a pre-determined range of acceptable numbers. If the results are higher or lower than what’s considered normal, the doctor prescribes treatment.
But it may not end there. Ongoing checkups ensure that the numbers remain in the safe range. The fact is that humans are intricate beings who require regular healthcare.
Enterprise data warehouses (EDWs) are also intricate. An EDW spans verticals—sources, centralized repository and targets. The “health” of its core, i.e., the quality of the data, determines the well being of the entire data warehouse. Quality converts data into information/knowledge. Unless the core is healthy, the abundant information becomes useless, especially for downstream systems that use that data. The "unhealthy" data could lead to inaccurate decisions, which would negatively affect business growth.
Hence, a diagnosis and treatment plan for the data’s overall good health becomes a basic necessity.
Diagnosis
EDW data can be prone to error because of potentially questionable data coming in from source systems or insufficient extract, transform and load (ETL) processes. Real-time access of the warehouse via Web-based technologies adds to the chaos. Given this multi-fold scope of possible data corruption, multiple user groups are responsible for the diagnosis.
STEP 1: Identification and Selection
Businesses need to assess the data quality and raise red flags in case of any issues. High-level dimensions for assessing data quality are accuracy, completeness, timeliness and consistency—the importance of each dimension being business-dependent. There is no one-size-fits-all model, so businesses play a primary role in identifying suspect data.
Next, the technology group has to run with the identified data using inputs from business to select potentially unhealthy data. It’s important that the business and the technology group agree upon the outcome of this step.
The driving force behind this activity is business knowledge, not technology (tools/utilities).
STEP 2: Profiling and Analysis
Profiling is the process of examining data available in either source systems or the warehouse and then gathering statistics and information about it to arrive at potential data issues.
Profiling is executed on identified and selected suspect data to run various kinds of analysis:
- Values. Column-level data popula-tion insights—nulls, blanks, unique values, zeroes, positives and negatives
- Statistical. Information on numeric columns—minimum, maximum, mean and standard deviation
- Frequency. Validations on codes, indicators, flags, types and categories in general
- Histogram/referential integrity. Columns/set of columns uniqueness determination and detection of orphan records
Manual analysis of the profiling results is crucial to uncovering genuine data issues and their causes, and it needs to be conducted by both the business and technical teams.
Treatment
Causes identified during diagnosis need to be studied in detail, their root causes traced, and potential fixes formulated and applied across verticals.
STEP 3: Identify and Fix Causes
Consider data flowing in from different source systems that feed a single data store. At the end of the day, the namesake master data store has ample duplicates and bad master data overall. Profiling reveals the duplicates, and the analysis leads to different source systems that provide such data.
One way to deal with this issue is to have a separate master data store managed in-house and a cross-reference that maps source information to such master data. Processes would need to be defined to get the incoming information correctly mapped to the master data in order to avoid duplicates and orphan records.
Such a fix involves people, processes and technology spanning verticals.
STEP 4: Monitor and Control
Treatment mandates regular monitoring. With fixes in place, ongoing monitoring is imperative to verify successful results and for ensuring that the data remains healthy.
Monitoring needs data quality rules to be defined and established. These rules are checks that validate various information scenarios against a pre-defined safe range and allow discovery of issues in an automated fashion. The scope of such rules is vast and includes:
- Attribute domain constraints
- Relational integrity
- Historical data
- State-dependent objects
- Attribute dependency
Data quality scorecards allow summarized and detailed results to be presented to a broad audience, namely the governance body, data stewards and end users.
Routine checkups help prevent health issues for humans, and they can do the same for an EDW’s core. Prevention is always better than needing a cure, and proactive measures help businesses reap great benefits in the long run.
Taher F. Borsadwala, PMP and a Teradata Certified Master, specializes in enterprise data management methodologies and implementations. He works in the Teradata Enterprise Data Management Center of Expertise.