Loading...

Tech2Tech

Applied Solutions 4

Know your data

Data Explorer provides fundamental analyses for profiling.

Analytics provides a wealth of strategic insight to help decision makers arrive at well-informed choices. Best practices begin with the data warehouse, which integrates vast quantities of detailed data from disparate sources and operational systems. Once your data has been integrated, you must explore and understand the characteristics of that data—including its anomalies—before building reports or creating predictive models. Poor-quality data compromises virtually all analytics, can add weeks to your project and critically devalues the data warehouse.

Profiling, or data exploration, is commonly used to assess data quality. However, many profiling tools cannot scale to match the ever-increasing volume of data that is required for analysis. The tools fall short because as the data is extracted to the server then merged and profiled, it can cause an I/O bottleneck.

Rather than move the data out of the system, Teradata Analytic Data Set Generator provides a unique approach in which the processing is done directly in the database and only the profiling results are returned to the analyst. In-database exploration is fast and simple, and it delvers a wealth of information in an easily understood visual format.

Automated data exploration

The Teradata Analytic Data Set Generator feature, Data Explorer, is a broad-based analysis that quickly identifies problem areas and anomalies that require further investigation. Exploration can be done on selected columns across multiple tables within the data warehouse.

Data Explorer was designed to automate analysis of large amounts of data, using four fundamental techniques: values, statistical, frequency and histogram analyses. It uses built-in logic to determine which of the four analyses to perform on specific columns.

Values Analysis

The goal of values analysis is to provide a basic assessment of the data to determine obvious cases where its values fall outside the expected parameters. It identifies the following:

  • Data types
  • NULL and non-NULL
  • Number of unique values
  • Blank characters
  • Zero, positive and negative numerics

The analysis results are graphically displayed either in a relational circle or a bar chart. (See figure 1A.) In this way, data entry errors or an abundance of NULL variables that can skew your analysis can be quickly identified. For example, when analyzing employee income, any negative numbers appearing in a column would fall outside normal parameters, prompting further evaluation.

Values analysis can also recognize constants in your data if the number of unique values equals "1." The results can flag potential errors or identify variables that provide minimal value to your analysis.

image

Click to enlarge

Statistical Analysis

To validate business assumptions about your data, a statistical analysis can be performed on columns that are numeric or date types. The analyst can select any of these measures:

  • Minimum and maximum ranges
  • Mean and standard deviation
  • Skewness, kurtosis
  • Standard error
  • Variance and coefficient of variance
  • Sum, uncorrected and corrected sum of squares

The results are displayed in a box-and-whiskers graph in which the top of the whisker is the maximum and the bottom is the minimum. The box represents standard deviation and the mean is identified with a dot. By examining the location of the box in the graph, you can determine skewness of the data values. (See figure 1B.)

Frequency Analysis

The frequency analysis counts the number of occurrences a particular value appears in a column and helps the analyst determine its validity.

For example, if four unique values appear in the "Gender" column, the analyst may discover that “Gender” had a data entry error, or that the data was used for another purpose, such as categorizing “minor” as a value. (See figure 1C.)

Histogram Analysis

The histogram analysis is performed on columns of numeric or date types, and groups the data according to user-defined ranges, also referred to as bins. The analyst has an option of defining the ranges by number of bins, width of each bin or equal distribution of data across the bins. The histogram charts the distribution of the variables allowing the analyst to validate business assumptions.

Distribution of "Age" as a bell curve may be appropriate for your analysis, but if your target audience is a particular age group, any skewed results may uncover another age group with an affinity to your product. (See figure 1D.)

Best Practice

Understanding your data and identifying and correcting its anomalies are critical steps for a successful analytic project. The increasing data volumes and complexity, along with the pressure to deliver accurate and timely results, are driving many analysts to adopt in-database exploration as a best practice. The profiling capabilities of Teradata Analytic Data Set Generator provide an efficient, easy and accurate method to address quality issues.


Your Comment:
  
Your Rating:

Comments
 


8/30/2011 4:56:59 PM
— Anonymous