Tech2Tech
Hands On
In the Details
Granular data can drive more predictive analytical models.
by Barry Leventhal
Constructing a good analytical model requires powerful predictive data. Yet, rather than exploiting the detailed interactions in an organization’s data warehouse, many models are built entirely from standard data extracts containing simple aggregate variables.
The reasons for relying on standard extracts are clear and understandable. In retail banking, for example, a typical data extract may contain as many as 2,000 variables aggregated to the customer (or account) level. This data set can cover customer attributes, product holdings, volume and values of transactions by product category, time period and so on.
While this extract provides a summary of customer behavior, it does not contain the most predictive variables for modeling a specific product and activity. A mortgage lender building an attrition model to predict which customers are most likely to close their loans will require more specific mortgage-related factors, such as the loan amount, monthly payment and comparative interest rates.
The same rationale holds true for any activity or behavior pattern to be predicted—specific, detailed information will make the most accurate analysis.
Predictive power
Detailed data comes in different flavors, including:
- Point-of-sale transactions, market baskets in retail
- Debit and credit transactions in finance
- Call detail records (CDRs) in telco
The common theme is that detailed data provides individual instances of customer behaviors or interactions that are more predictive than summary data. This power can be harnessed in various ways:
- Individual usage preferences can be tracked. For example, identifying each customer’s preferred channel for transactions can help a financial institution target communications and plan resources.
- Significant behaviors or events may be identified and used to trigger an immediate follow-up activity by the organization. A customer’s abnormally large bank deposit could prompt a sales call from the investments department, for instance.
- More specific variables may be derived for use in analytical models, resulting in more accurate customer segmentation and targeting.
All of these applications will involve preliminary exploration to understand the data, followed by data manipulation to create data sets for analysis and modeling.
Variables from detailed data
In a Teradata environment, the ideal place to generate detailed variables is in-database, since their calculation will typically involve aggregations across large numbers of customer interactions. Also, the speed of computation will benefit from the parallel processing power of the system.
While the required variable derivations may be programmed manually in SQL, a better alternative is to employ Teradata Analytic Data Set Generator, essentially a highly sophisticated SQL generator.
As a component of Teradata Warehouse Miner, Teradata Analytic Data Set Generator provides data profiling and exploration analyses, plus tools to create variables, build analytic data sets and perform other data manipulation tasks. A benefit is that the definitions of the analytic data sets are stored centrally as metadata so they may be easily shared, reused and refreshed.
Within Teradata Analytic Data Set Generator, the Variable Creation analysis allows new predictor variables to be built through a visual SQL generator. (See table, below.) With a few mouse clicks, the user can select data elements and apply SQL elements and functions.
Additionally, each predictor variable may be constructed by combining variable and dimension definitions:
Variable Definition
The data to be aggregated, methods of aggregation and calculations to be performed to derive each new variable is specified. A variable may be a simple aggregation of detailed records, such as “number of calls” (count) and, similarly, sum, minimum or maximum call duration. Alternatively, it may be more complex—such as a ratio, percentage, trend or moving average.
Dimension Definition
A dimension, generated in SQL as a Teradata CASE statement, indicates a subset of records for which a variable is to be calculated. Two possibilities for a telco company are “day of week = Monday” and “call type = voice.” Dimensions are defined separately from variables but may be combined to provide more detail, as in “day of week = Monday and call type = voice.” Having been defined, dimensions may be applied to create the predictor variables to be tested, for example, “number of voice calls made on a Monday.”
The complexity of variables and dimensions that may be built using the Variable Creation analysis are virtually unlimited.
Flexibility
The key benefit of using detailed data for analytics is its flexibility—starting at the finest level of granularity, any number of behavior patterns can be built that may help predict the target outcome. The decision as to which measures will be useful requires an understanding of customer behavior combined with analytical expertise. At this stage, the analyst is creating variables to test hypotheses about which factors influence the outcome to be predicted.
In the telco industry, a reasonable hypothesis is that the more people you regularly call, the less likely you are to stop using your phone. Therefore, churn analysis may benefit from details of each customer’s “calling circle,” such as how many different numbers are regularly called, what proportion of call time these account for, and whether these numbers are on the same network. Information obtained from CDRs enables this analysis, whereas most standard data extracts do not carry this level of detail—unless, of course, those measures had already been identified and calculated.
After the calling circle metrics are identified, those variables could be derived for a sample of churners and non-churners and used as part of a model. Furthermore, the variables could be incorporated in the standard extract for further analysis and future modeling.

Click to enlarge
Developing models
Making this hypothesis a reality, a customer life cycle model was built for a telco using one segment of its file. The aim was to predict survival probabilities over time for each individual, so that the results could be employed as part of the customer management process. The model was developed using a standard set of inputs, including monthly summaries of inbound and outbound calls, and other variables such as demographics, handset features and payment data.
Once the initial model (Model 0) was built, an extract of CDR data was selected by the analysts to create additional sets of detailed call variables at monthly, weekly and daily levels. Four other models were then developed. Each of them used the variables in Model 0, plus an additional set, as follows:
- Model 1—monthly variables built from detailed data
- Model 2—weekly variables built from detailed data
- Model 3—daily variables built from detailed data
- Model 4—the best inputs from all of the models
All of the models that used detailed data were dramatically more powerful than Model 0. (See figure, below.) These results suggested to the telco that the existing standard data extract needed to be redesigned.

Click to enlarge
Highly effective
The benefits of using detailed data to create predictor variables in analytical models can be significant, and this is one valuable way to exploit the granular information stored in your data warehouse.
Teradata Analytic Data Set Generator is a highly effective toolkit for exploring data and creating predictor variables from detailed data. A key advantage is that all projects and analyses are stored centrally in the Teradata Database, where they may be easily shared, reused and refreshed.
Barry Leventhal, director of advanced analytics for Teradata UK, has more than 20 years of experience in this field.
Photography by Shutterstock