Loading...

Tech2Tech

Applied Solutions 2

efficient

 

Quick, efficient and accurate

Teradata Analytic Data Set Generator accelerates model development.

The accuracy and usefulness of an analytic model is directly proportional to the relevance and freshness of the data. If the data is incomplete, inaccurate, outdated or irrelevant, then developing useful analytics will be difficult, if not impossible.

Thus analysts and data management teams dedicate a large percentage of their development time to preparing the data to ensure that it correctly models their business for accurate analysis.

Wide and deep

In most organizations, data is usually scattered across multiple data stores or tables inside the data warehouse, making it unsuitable for input directly into analytic data mining tools. These tools require data to be in a single flat table to perform their analysis. Analytic modeling requires that the organization’s data be stored in a table called an analytic data set (ADS). This table, which contains all of the data elements for a particular segment, is fed directly into the data mining tools for analysis.

Within an ADS, each row of data represents one observation or entity being modeled, such as a customer. Columns in the table contain all attributes or characteristics of that customer. It is easy to imagine how wide and deep an ADS can be if the organization has millions of customers who each have thousands of attributes.

Eliminate data movement

To create an ADS, the data must be explored, aggregated, transformed and manipulated. New variables are derived, and tables are joined and merged. The traditional approach to building an ADS is to extract data from the data warehouse and temporarily store it in an analytic server where these tasks are completed. This iterative process requires processing large volumes of data and is very labor-intensive. In fact, more than 70% of an analyst’s time is often spent solely on assembling the analytic data.

The improved method is to build the ADS directly in the Teradata Database using the Teradata Analytic Data Set Generator. Because all tasks can be easily represented as a SQL expression, the data can stay in the warehouse. Not only is the costly task of data movement eliminated, but the parallel processing power of the Teradata Database is also leveraged. At least half the time analysts spend preparing the ADS is saved when in-database processing is used.

image

Click to enlarge

Preparing the ADS

The Teradata Analytic Data Set Generator provides a Windows-based graphical user interface (GUI) and wizards to simplify the ADS-building process. The GUI supplies users with easy access to the data within the warehouse and to the system’s analytical functions. Included in many of the ADS-building features are wizards that facilitate and automate common tasks such as calculating new predictive variables from different data combinations, profiling data elements to identify patterns and outliers, transforming data, and integrating variables and tables into a consolidated ADS.

The Variable Creation analysis derives new variables from the data for better predictors and greater analytic insight. With a drag-and-drop GUI, the user selects SQL keywords and operations that include arithmetic and logical operations, date/time operators, aggregation functions and ordered analytic functions.

Users can generate similar variables at once with an operation called “dimensioning.” For example, target marketing requires analysis of customer transactions. While the analyst has the detailed sales transactions, the key metrics—recency, frequency and monetary (RFM)—are necessary to complete the analysis. The Variable Creation analysis calculates the RFM variables based on transaction data. Then the user applies the RFM calculations across Product and Time dimensions, creating RFM variables for each dimension. The result is multiple sets of RFM calculations. (See figure.)

Most of an analyst’s needs for metrics building are handled by the Variable Creation feature, including generating indicator variables, expanding metrics into numerous dimensions and producing sophisticated ranks. Often, once a metric is derived, additional transformations are necessary.

Data transformations

Data is frequently formatted to meet the specifications of an analytic method. For instance, most neural networks require numeric data; in other cases, numeric data must be converted to nominal formats such as “bins” and numeric codes. These are just a couple of the required transformations.

Conversions are performed as simple SQL expressions using the Variable Creation feature. Transformations that require a more elaborate SQL structure are handled using the Variable Transformation feature. This function accepts a single table as input, performs a column-by-column transformation and creates a new table with the transformed variables. It can also export captured SQL and embed it into a user’s favorite tool.

Described are some transformation techniques and their purposes:

  • Binning changes continuous numeric variables into categorical values.
  • Design code converts a categorical data element into at least one numeric data element, thus creating a new binary variable for each unique value.
  • Recoding transforms a categorical data element into a new coding scheme and maps individual values, NULL values or any number to a new value.
  • Rescaling adjusts a continuous numeric value between an upper and lower boundary. Linear, sigmoid and Z-score are some rescaling methods.

To correct missing data in an ADS, the user has several options: exclude any data elements that are frequently omitted from the records, delete records that have missing data, extrapolate values that are left blank from other known values, use a predictive model to assign the value, or use a common value such as the average.

Finally, the multiple tables or views developed through the Variable Creation and Variable Transformation analyses must be consolidated into a final ADS. The Build Analytic Data Set feature provides a GUI to select tables and variables, as well as a wizard to help establish recommended join paths and primary indexes.

Save time and effort

Data preparation is the most time-consuming step of analytic development. It is also the most critical and deserves careful consideration and evaluation.

Many companies that have mature analytic practices have built sophisticated ADS infrastructures that are reusable, shareable and automatically refreshed as new data is updated in the enterprise data warehouse. Such an infrastructure, often referred to as an analytic sandbox, provides analytic-ready data, allowing the analyst to bypass much of the ADS development and focus instead on the modeling task.

For analysts who have adopted the in-database approach to creating an ADS, the development cycles and time-to-value have been dramatically reduced while the precision and usefulness of analytics have significantly increased.


Your Comment:
  
Your Rating:

Comments