Tech2Tech

Hands On

Turn Text Into Information

Regular expression makes business sense of unstructured data.

The explosion of new types of data offers opportunities and challenges for busi­nesses. Information is coming from a wider variety of sources, such as blogs and mobile devices, that can give organiza­tions greater insight. But some of this new data, particularly unstructured data, is not well suited to SQL.

Options to Unlock Unstructured Data

Teradata offers solutions that support unstructured data for analysis. Teradata Aster and Attensity unlock unstructured information that is typically text-heavy, causing irregularities and ambiguities that make the data difficult to under­stand using traditional computer programs.

The Teradata Aster MapReduce Platform consists of the Aster Data nCluster and SQL-MapReduce. The Aster Data nCluster is a massively parallel processing (MPP) database that provides end-to-end data and analytic processing, allowing organi­zations to examine large data sets with unprecedented granularity and depth of analysis. This results in faster performance than other architectures and provides scalability to terabytes and even petabytes of data. The nCluster uses a row and column database with an integrated analytics engine to provide a powerful plat­form for high-performance, advanced analytics on large volumes of data, which greatly improves the ability of enterprise users to make informed decisions.

SQL-MapReduce is a software framework that supports distributed computing of large data sets on clusters of computers. MapReduce processes large amounts of unstructured data at high speed for analysis.

Attensity is a text analytics software solution that applies semantic, linguistic and statistical engines to transform various forms of unstructured data into a structured form. It creates output in a structured

Typically, unstructured data makes up about 80% of the information within an organization. Unlike structured data, which can be easily stored in rows and columns in a data warehouse, unstructured data usually has long strings of characters and may contain dates, numbers and facts that can be difficult to analyze.

That’s where regular expression can help. A function in the Teradata Database, regular expression can eliminate the com­plex programming necessary to interpret blogs and other text strings in SQL. It can perform sophisticated analysis on textual data, based on characters, words, patterns of letters, patterns of numbers, etc., to derive business information. The recent—and significant—deluge of unstructured data has increased the need for organiza­tions to perform this type of analysis.

Put Regular Expression to Work

Many languages, such as Java, Perl, Ruby and Python, can perform regular expression. Using it for pattern matching in Teradata SQL is done through the LIKE command. However, there are a lot of limitations in specifying patterns that could include numbers, case-sensitive data, optional conditions, etc. While the simpler patterns could be handled using Teradata functions and the stored procedure, handling code logic can become difficult, even impossible, with these functions and consume a lot of development and testing time.

Using regular expression in Teradata SQL simplifies performing text searches and text manipulation.

This SearchRegex_Func.java code demonstrates the simple usage of regular expression with a Java String statement:

public class SearchRegex_Func {
public static String main(String args1, String args2) throws Exception {
if(args1.matches(args2)){ return args1; } else return null; } }

A file class and jar file along with a stored procedure will need to be created.

This statement compiles the code:

$/opt/teradata/jvm64/jdk5/bin/javac SearchRegex_Func.java

The file SearchRegex_Func.class is cre­ated. This statement creates the jar file:

$/opt/teradata/jvm64/jdk5/bin/jar -cf SearchRegex_jar.jar SearchRegex_Func.class

The file SearchRegex_jar.jar is created. To load the jar file (Java Archive) into the database, log in using bteq and install the jar file. Make sure the necessary privi­leges are granted on the packages in the SQLJ database:

CALL SQLJ.INSTALL_JAR(‘CJ!/tmp/SearchRegex_jar.jar’, ‘SearchRegex_jar’, 0);

Where /tmp is the location of the jar file created in the earlier step.

The user also needs the “CREATE EXTERNAL PROCEDURE” and “CREATE FUNCTION” on the database where the function is created:

REPLACE FUNCTION SearchRegex_fun (str1 VARCHAR(32000),
str2 VARCHAR(1000))
RETURNS VARCHAR(32000)
LANGUAGE JAVA NO SQL
PARAMETER STYLE JAVA
EXTERNAL NAME ‘SearchRegex_jar:SearchRegex_Func.main’;

 

Table 1: Original Data; Table 2: Result Set

The data in table 1 is provided by testing the code.

The next task is to list all the column data that does not match this pattern of text:

1. The text should start with “Harry” case specific

2. Followed by 1 or more spaces (includes space, tabs, etc.)

3. Followed by text “Potter” case insensitive

4. Followed by any number of characters

5. Ends with digits

Table 2 shows the results of executing the command:

SELECT descr FROM book
whe re searchregex_fun(descr,’^Harry(\s)*((?i)Potter).*\d’) is null;

Take It to the Next Level

Advanced regular expressions can be achieved with the package java.util.regex.* and using the objects Pattern and Matcher, which provide more functions, such as:

  • Position of the occurrence of the pattern
  • Number of occurrences
  • Replace strings
  • Split the text

Read on for an example of how to update the data provided above using an advanced regular expression for both Linux and BTEQ.

This example shows how to update data using advanced regular expression:

SearchReplaceRegex_Func.java
import java.util.regex.*;
public class SearchReplaceRegex_Func {
     public static String main(String strtext, String regtext, String replacetext)
         {
          Pattern p = Pattern.compile(regtext);
          Matcher m = p.matcher(strtext);
          String result=””;
          if (m.find()) {
            result = m.replaceAll(replacetext);
             }
          else{
            result=strtext;
          }
          return result;
      }
  }

For Linux:

$/opt/teradata/jvm64/jdk5/bin/javac SearchReplaceRegex_Func.java
$/opt/teradata/jvm64/jdk5/bin/jar -cf SearchReplaceRegex_jar.jar 
SearchReplaceRegex_Func.class

For BTEQ:

CALL SQLJ.INSTALL_JAR(‘CJ!/tmp/SearchReplaceRegex_jar.jar’, ‘SearchReplaceRegex_jar’, 
0);

REPLACE FUNCTION SearchReplaceRegex_Func (strtext VARCHAR(32000), regtext 
VARCHAR(1000), replacetext VARCHAR(1000))
RETURNS VARCHAR(32000)
LANGUAGE JAVA NO SQL
PARAMETER STYLE JAVA
EXTERNAL NAME ‘SearchReplaceRegex_jar:SearchReplaceRegex_Func.main’;

The data can be updated with the fol­lowing statement:

UPDATE book
 SET descr=SearchReplaceRegex_Func (descr,’^(\s)*((?i)Harry)(\s)*((?i)Potter)’,
’Harry Potter’) 
WHERE searchregex_fun(descr,’^((?i)Harry)(\s)*((?i)Potter).*\d’) IS NOT NULL
Table 3: Updated Data

Table 3 shows updated information that is:

1. Starting with any number of spaces

2. Followed by the word “Harry” case insensitive

3. Followed by any number of spaces

4. Followed by the word “Potter” case insensitive

The java function could use the “jdbc:default:connection” database driver for performing in-database operations, where the full capabilities of Java and Teradata can be leveraged. The text data could also be split into separate columns using XML for further analysis, based on the pattern sequence or specific text.

Faster, Simpler Analysis

Regular expression is a powerful analyz­ing solution that opens up opportunities for gathering information from textual and other unstructured data. As organi­zations seek to capitalize on the business value created by analyzing the growing volumes of unstructured data, regular expressions can help them make the process faster and simpler.


Your Comment:
  
Your Rating:

Comments
 


2/13/2012 12:26:24 PM
— Anonymous
NetApp Q1-2014