

Hands On

Pick Your Programming Language for In-Database Analytics

The “Scripting and Language Support” feature on the Teradata Database lets users run scripts in whatever language they choose.

Data has never been as important or prolific as it is today. It’s being produced not only by humans, but also by machines and devices. No longer bound by rows, columns and tables, data can be structured or unstructured and in name-value pairs, graphs and other forms.

Users need the right tools and algorithms to process today’s data at scale and gain insights from it. Tools such as R, Python, SAS and MATLAB are very popular in the data science community and can be leveraged by other users as well. To that end, Teradata® Database 15.0 and later releases offer a new feature called “Scripting and Language Support” that provides the ability to run R, Python and Perl scripts directly on the database. This gives application developers the flexibility to run scripts written in the language of their choice while taking advantage of the massive parallelism of the Teradata Database.

As analysts know, processing data where it resides is much more efficient than first offloading it in bulk to a remote engine. With data scientists and others finding new ways of performing data exploration and analysis, the “Scripting and Language Support” feature enables users to pick their own programming language and run analytics inside the database, preventing the expensive and time-consuming transfer of data.

Perform Queries With Script Table Operator

The “Scripting and Language Support” feature provides a Script Table Operator (STO), a powerful SQL-based tool that can execute any valid Linux command including R, Ruby, Perl and Python scripts. The simplified syntax of a query using STO is:

SELECT <list of fields> FROM SCRIPT(
  ON <input table/view>
  SCRIPT_COMMAND('<any valid Linux command to execute scripts>')
  RETURNS('<list of return columns>')
) AS Result;
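For illustration, here is a minimal sketch of a Python script that such a SCRIPT_COMMAND could invoke. The two-column input layout is an assumption for this example: the script reads tab-delimited rows from stdin, transforms one field, and writes tab-delimited rows to stdout for the RETURNS clause to convert into typed columns.

```python
import sys

# Hypothetical example: each input row arrives on stdin as one line,
# with fields separated by a tab (STO's default delimiter).
def transform(line):
    pkey, name = line.rstrip('\n').split('\t')
    # Emit one tab-delimited output row per input row; the RETURNS
    # clause in the query maps these fields back to typed columns.
    return '%s\t%s' % (pkey, name.upper())

if __name__ == '__main__':
    for line in sys.stdin:
        print(transform(line))
```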

4 Best Practices for Running Scripts

“Scripting and Language Support,” a feature offered in Teradata® Database 15.0 and later versions, allows R, Python and Perl to be used directly on the database. These best practices will help users be successful:

  • Test the scripts before trying to run them via the Script Table Operator (STO).
  • Write all input, output and other operations within a try-catch block.
  • Use stderr to log intermediate results, which are written to the script log located at /var/opt/teradata/tdtemp/uiflib/
  • Monitor memory usage, since these scripts run on the Teradata nodes.
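The second and third practices above can be sketched as a small Python pattern. The tab-delimited field layout and the doubling computation here are hypothetical, chosen only to show the shape of the error handling:

```python
import sys

# Sketch of the try/except and stderr-logging pattern from the best
# practices above; the two-field input layout is a hypothetical example.
def process_line(line):
    """Return an output row, or None if the input row is malformed."""
    try:
        pkey, value = line.rstrip('\n').split('\t')
        return '%s\t%s' % (pkey, int(value) * 2)  # placeholder computation
    except Exception as exc:
        # stderr is captured in the script log on the node, so bad rows
        # can be diagnosed without failing the whole query.
        sys.stderr.write('skipping row %r: %s\n' % (line, exc))
        return None

if __name__ == '__main__':
    for line in sys.stdin:
        out = process_line(line)
        if out is not None:
            print(out)
```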

The Input parameter in this syntax is a single input stream that contains rows from a table, view or query, or is optionally empty. Rows are sent to the script or command being executed via stdin as delimiter-separated strings. The output is whatever the script writes to stdout; those values are converted into the data types specified by the RETURNS clause. The SCRIPT_COMMAND clause contains a valid Linux command used to execute scripts written in a language of your choice.

In addition, STO offers the capability to install, replace, drop and redistribute files in the database. The external files, which can be a script, text or even a flat data file, are referred to as User Installed Files, or UIFs. Once installed, UIFs can be used via the STO.

The external stored procedure (XSP) SYSUIF.INSTALL_FILE installs an external file on all nodes of the Teradata system, while the session parameter SEARCHUIFDBPATH sets the database search path before STO execution. This means that any UIF installed in the databases specified in SEARCHUIFDBPATH can be used by STO.

Run MapReduce Using STO

STO also lets companies benefit from MapReduce by using large, complex data sets to gain deeper, richer insights into a business and its processes. MapReduce can be leveraged for data analytics, text analysis and classification, data mining and other uses. If companies have a large computational problem that can benefit from distributing the work on Teradata’s scalable platform, then MapReduce is a natural way to handle it.

The following example describes a MapReduce use case invoked by STO. The fact that STO runs the user script on all AMPs makes it suitable for a MapReduce job. The AMPs act as mappers and reducers during the STO execution. In this example, the user is logged on as “testuser.”

Step 1. Create a table to load the data:

create table lobout_romeo(pkey integer, a CLOB (2000000) CHARACTER SET LATIN);

Step 2. Install the user files in the database testuser:

call sysuif.install_file('romeo', 'romeo.txt', 'cz!romeo.txt');
-- The script file names mapper.py and reducer.py are assumed here:
call sysuif.install_file('mapper', 'mapper.py', 'cz!mapper.py');
call sysuif.install_file('reducer', 'reducer.py', 'cz!reducer.py');

Step 3. Set the search path for the user scripts and the data file. If needed, you can specify multiple database names here:

.set session searchuifdbpath = testuser;

Step 4. Use the cat command to insert the contents of data file romeo.txt into the lobout_romeo table:

insert into lobout_romeo select * from SCRIPT(
ON (select 1)
SCRIPT_COMMAND('cat ./testuser/romeo.txt')
RETURNS('pkey integer', 'a CLOB(2000000) CHARACTER SET LATIN')
) as tab;

Step 5. Run the scripts to find the top five words used in the text. The HASH BY clause in the query redistributes the mapper output so that all rows for the same word land on the same AMP for the reducer:

select top 5 "count", word from SCRIPT(
  ON ( select word, "count" from SCRIPT(
         ON (select a from lobout_romeo)
         SCRIPT_COMMAND('python ./testuser/mapper.py')
         RETURNS('word varchar(40) CASESPECIFIC', '"count" int')
       ) as map_out
     ) HASH BY word
  SCRIPT_COMMAND('python ./testuser/reducer.py')
  RETURNS('word varchar(40) CASESPECIFIC', '"count" int')
) as red_out ORDER BY "count" DESC;

     count  word
----------  ----------
       611  the
       546  I
       471  and
       432  to
       398  a

A Python script was used in the preceding example. The script breaks down into mapper and reducer code.

This mapper code counts the words in the input:

import sys

# Each input row arrives on stdin as a line of text.
for line in sys.stdin:
    # Emit one (word, 1) pair per word, tab-delimited.
    for word in line.split():
        print('%s\t%s' % (word, 1))

This reducer code aggregates the counts for each word:

import sys
from collections import defaultdict

# Sum the counts per word. A dictionary avoids relying on the
# mapper output arriving sorted; HASH BY only guarantees that all
# rows for a word reach the same AMP.
counts = defaultdict(int)
for line in sys.stdin:
    word, count = line.split('\t', 1)
    counts[word] += int(count)

# Emit one (word, total) row per distinct word.
for word, total in counts.items():
    print('%s\t%s' % (word, total))
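Per the first best practice, both scripts can be tested outside the database before being installed. A small local harness that simulates the map and reduce stages in-process might look like this (the function names here are illustrative, not part of any Teradata API):

```python
from collections import Counter

# Illustrative local harness: run the mapper and reducer logic
# in-process to check results before installing the scripts
# with SYSUIF.INSTALL_FILE.
def map_words(text):
    # Same logic as the mapper: one (word, 1) pair per word.
    return [(word, 1) for word in text.split()]

def reduce_pairs(pairs):
    # Group-and-sum, as the reducer does; a Counter removes any
    # need for the pairs to arrive sorted.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

if __name__ == '__main__':
    sample = 'to be or not to be'
    print(reduce_pairs(map_words(sample)).most_common(2))
```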

Advance Your Analytics

Analytical programs, like pattern matching for fraud detection or predictive analysis to anticipate sales, require a scalable platform that can meet growing memory and computing requirements. Exporting data to external compute clusters limits the types of analytics users can run.

But with STO, users can now focus on performing analytics rather than wasting time managing memory and computing resources. The feature empowers users to take advantage of new and powerful analytical tools and run analytics directly on data residing in the database while benefiting from Teradata’s scalable architecture. 

Ranjan Priyadarshi is the Teradata® Database 15.0 product manager. He has more than 15 years of experience in data warehousing and analytics.

Preeti Javaji is a software engineer for Teradata. She has been associated with Teradata for more than 10 years.
