Plug in With Muscle
The functionality of R combined with a Teradata Database provides an innovative solution for advanced analytics.
The “R project,” commonly known as “R,” is a powerful solution to implement analytic methods for business applications such as churn, cross-selling and credit risk analysis. With this solution, users can easily enact all of the required steps to prepare, run and interpret statistical analysis.
Teradata supports R as a cost-effective option for companies. Like Linux, Apache and Firefox, it is an open-source program—free for anyone to use and modify—encouraging organizations to explore analytic techniques and experiment with analytic applications without procurement and licensing of software.
Leverage In-Database Functions
Ordinarily, analysts must move data into the R environment, which can be a challenge depending on the data volume and source. To address this, Teradata developed an add-on that enables users to push key analytic tasks directly into the database for processing. This eliminates the need to move information from the data warehouse into an R data frame. The R add-on allows users to connect to the Teradata Database, map data frames to tables within the database, and call more than 45 in-database analytic functions from R.
The add-on takes a unique approach to data frames by establishing a pointer (virtual table) to Teradata Database tables, which eliminates the need to move the entire table into the R environment.
The add-on also provides the programmer with the opportunity to leverage the processing power of the Teradata Database with the R interface. The advantages of using R in-database include:
- Keeping data movement to a minimum
- Supporting big data processing
- Executing R process steps in parallel
R at Work
The end-to-end process of analytical modeling starts with business specifications. The process addresses statistical data preparation, the actual modeling, and preparation of (recurrent) scoring after the modeling.
To be usable by statistical methods, information generally needs to be organized as a data set or matrix. In the case of churn prediction, for example, the most basic element of information is the line-level Customer Analytic Record (CAR). During data preparation, behavioral details, such as the number of calls or minutes of use, are aggregated to a weekly or monthly level. Finally, information about whether the individual line has churned is attached. Typically, a CAR covers 100 to 300 attributes per line.
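The aggregation step described above can be sketched in base R on a few made-up call records. This is only an illustration of the idea: the table layout, the column names (line_id, month, minutes) and the values are assumptions, and in practice this preparation would run in-database rather than on a local data frame.

```r
# Hypothetical call-level usage records; names and values are illustrative only.
calls <- data.frame(
  line_id = c(1, 1, 1, 2, 2, 3),
  month   = c("2012-01", "2012-01", "2012-02", "2012-01", "2012-02", "2012-02"),
  minutes = c(12, 8, 30, 5, 15, 40)
)
calls$call_count <- 1  # one row per call, so summing this column counts calls

# Aggregate behavioral details to one row per line and month.
car <- aggregate(cbind(minutes, call_count) ~ line_id + month,
                 data = calls, FUN = sum)
names(car) <- c("line_id", "month", "minutes_of_use", "number_of_calls")

# Attach the churn flag for each line (hypothetical outcome data).
churn_flags <- data.frame(line_id = c(1, 2, 3), churn_event = c(0, 1, 0))
car <- merge(car, churn_flags, by = "line_id")

print(car)
```

Each row of car now holds the monthly behavior of one line together with its churn flag, which is the shape the modeling steps expect.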
For regression analysis, which is usually the preferred option for analyzing churn, a certain number of churned lines are combined with a number of lines that are still active. The resulting sample of records is called an analytic data set (ADS).
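As a rough sketch of how such a sample might be assembled, the base R code below keeps every churned line and draws an equal number of active lines. The simulated CAR, the column names and the 1:1 ratio are all assumptions made for illustration; the article does not prescribe a particular ratio.

```r
set.seed(42)  # fixed seed so the sample is reproducible

# Hypothetical CAR: 1,000 lines, the first 50 of which have churned.
car <- data.frame(
  line_id        = 1:1000,
  minutes_of_use = round(runif(1000, 0, 600)),
  age            = sample(18:80, 1000, replace = TRUE),
  churn_event    = rep(c(1, 0), times = c(50, 950))
)

churned <- car[car$churn_event == 1, ]
active  <- car[car$churn_event == 0, ]

# Keep all churned lines and sample the same number of active lines
# without replacement (the 1:1 ratio is an illustrative choice).
active_sample <- active[sample(nrow(active), nrow(churned)), ]
ads <- rbind(churned, active_sample)

table(ads$churn_event)
```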
An essential part of the modeling process is the preparation of the ADS. Data preparation should take place completely in-database, and the best practice is to use Teradata ADS Generator. To initiate modeling with R, users can create a Teradata data frame in R that points to the modeling ADS.
If the model will be used for scoring, for example, R can export the regression model in Predictive Model Markup Language (PMML) for import into the Teradata Database. Other options include handing over parameters with the coefficients command, plus code parsing or scoring with R.
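The parameter handover option can be sketched as follows: fit a logistic regression, read its parameters with coef(), and rebuild the score from those numbers alone. The data here are simulated, and the generated SQL expression is a hypothetical illustration of what could be handed to the database, not output of the Teradata add-on.

```r
set.seed(1)
n <- 500

# Simulated ADS; the column names mirror the article, the values are made up.
ads <- data.frame(
  minutes_of_use = runif(n, 0, 600),
  age            = sample(18:80, n, replace = TRUE)
)
ads$churn_event <- rbinom(
  n, 1, plogis(1 - 0.01 * ads$minutes_of_use + 0.01 * ads$age)
)

model <- glm(churn_event ~ minutes_of_use + age, family = binomial, data = ads)
b <- coef(model)  # named vector: (Intercept), minutes_of_use, age

# Score from the coefficients alone, using the logistic function.
manual_score <- plogis(b["(Intercept)"] +
                       b["minutes_of_use"] * ads$minutes_of_use +
                       b["age"] * ads$age)

# Reference score computed by R itself.
r_score <- predict(model, newdata = ads, type = "response")

# Hypothetical SQL scoring expression built from the same coefficients.
scoring_sql <- sprintf(
  "1 / (1 + EXP(-(%.6f + %.6f * minutes_of_use + %.6f * age)))",
  b[1], b[2], b[3]
)
cat(scoring_sql, "\n")
```

Because the manual score matches what predict() returns, the three coefficients are all that needs to leave R for recurrent scoring of this simple model.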
This code example shows the syntax used to generate histograms (see figure) for a churn analysis and to create the results in Table 1:
tdf <- td.data.frame("CHURN_ADS", "CHURN_SOURCE_DB")
td.hist(tdf, "minutes_of_use")
td.hist(tdf, "age")
Table 1 provides an analysis summary. The following code fits the churn model that produces the corresponding R output:
A_churn_model <- glm(formula = churn_event ~ minutes_of_use +
age, family = binomial, data = tdf)
Table 2 shows the estimated regression coefficients for minutes of use and age attributes. A regression analysis typically uses many more types of attributes.
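For readers who want to see how a coefficient table of this kind is obtained, the base R sketch below fits the same model formula to simulated data and reads the estimates from summary(). The data and the resulting numbers are made up for illustration and are not the values reported in Table 2.

```r
set.seed(7)
n <- 500

# Simulated stand-in for the ADS; values are illustrative only.
ads <- data.frame(
  minutes_of_use = runif(n, 0, 600),
  age            = sample(18:80, n, replace = TRUE)
)
ads$churn_event <- rbinom(
  n, 1, plogis(1 - 0.01 * ads$minutes_of_use + 0.01 * ads$age)
)

model <- glm(churn_event ~ minutes_of_use + age, family = binomial, data = ads)

# Coefficient table: estimate, standard error, z value and p-value per term.
coef_table <- summary(model)$coefficients
print(coef_table)

# Exponentiated estimates read as odds ratios per unit of each attribute.
odds_ratios <- exp(coef(model))
print(odds_ratios)
```

An odds ratio below 1 for an attribute means that higher values of it are associated with lower odds of churn in the fitted model.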
Increased Benefits, Reduced Costs
The functionality of R and its free access to numerous statistical techniques give users of Teradata systems a powerful environment for advanced analytics. Teradata’s add-on package allows users to capitalize on the benefits of R and leverage in-database processing for analytical experimentation, prototyping and development, then deploy models using commercial tools. This reduces development costs, makes emerging analytic techniques available and accelerates delivery with reduced risk.