eBay Corporate Headquarters


Tech Spotlight

Sold on Machine-Generated Data

A Teradata solution enables eBay to easily process and analyze one of the fastest-growing areas of big, diverse data.

The next wave of big, diverse data is coming from machines capable of firing off information in volumes and velocity that dwarfs all other forms of data. Additionally, machine-generated data is unlike most seen today because it is variable during definition and production, and it is semi-structured.

Its very nature requires a different approach to handle it effectively and derive full business value. The most common format is name-value pairs, which is used extensively by mega online websites to record user actions and site responses. There can be an arbitrary number of name-value pairs, in any order, and they can change frequently, which is challenging for traditional extract, transform and load (ETL) functions and databases.

Best Approach

In 2008, eBay was facing challenges posed by the growth rate in Web log records. Business users understood the value of the data and wanted to analyze the information using tools they already knew, rather than receiving batch reports from IT. That meant the company needed to move away from restructuring log files using an ETL approach and find a better process that would:

  • Make the data available interactively to more users
  • Scale to petabytes
  • Improve flexibility to deal with changing data
  • Avoid requiring schema changes every time a Web log changes

Several approaches were investigated but only one allowed for name-value pair processing simplicity and interactivity. Although Apache™ Hadoop® and NoSQL are often considered first for processing name-value pair data because they are seen as part of the big, diverse data phenomenon, they are not necessarily the best options. Lack of a highly interactive business intelligence (BI) tool interface prevents the majority of business users from using these tools since inserting a programmer and batch job between them and the data is less than ideal. However, traditional relational database designs suffer from the need for ETL and schema changes every time more name-value pairs
are added.

The solution was found in the Teradata® Database 14.0, which maintains the flexibility of name-value pairs while providing all the benefits of an integrated data warehouse. It stores the query string as-is in a Varchar field, while specialized SQL functions parse out specific name-value pairs at query time. In addition to the Varchar string, the Web log table contains standard keys and timestamp columns, and other fixed fields. To further improve usability and performance, popular name-value pairs that seldom change can be extracted to their own columns at load time and exploited directly by business users.

Teradata 14.0 offers three functions for processing the data (See "Code for Handling Name-Value Pair Data"):

  • NVP extracts name-value pair
  • STRTOK extracts a single item from a delimited list
  • STRTOK_SPLIT_TO_TABLE transposes or pivots a delimited list into multiple rows of a column

Key Advantages

The Teradata solution provides several advantages to eBay. Normally, every change in input format requires changes in ETL and database schemas. With Teradata 14.0, the flexibility inherent in the name-value approach is maintained throughout the ETL processing and storage. Only at the last moment, at query time, are the name-value pairs parsed out. Business users gain direct access to the query string data, which they can manipulate and analyze using common SQL skills.

The figure shows the architecture of the eBay solution. It includes the traditional enterprise data warehouse (EDW). Supporting thousands of users with petabytes of structured data, the Teradata Active EDW 6690 uses sophisticated workload and data storage management tools to maximize the system for balanced performance and availability.

The second component, Singularity, helps hundreds of users. It handles name-value pair Web log data, with the largest table holding more than 3 petabytes of event data and providing a few thousand tables for additional context. The system includes a Teradata Extreme Data Appliance 1700 that is optimized for deep data analytics.

A Hadoop-based environment is the final piece. It consists of two clusters of more than 2,000 nodes each to hold semi-structured data and hundreds of EDW tables that feed deep discovery and pattern detection applications.

The infrastructure linking the three platforms easily moves large amounts of data back and forth quickly and on demand. This design keeps the systems closely synchronized in near real time.

Data in Analysts’ Hands

It’s critical for eBay to get extensive, rapidly changing Web log data into the hands of its business analysts quickly and in a form they can easily manipulate and query. The built-in functionality of Teradata 14.0 meets those needs.

The database enables the parsing and reformatting of name-value pair data at the time of querying in a powerful BI environment, allowing the inherent flexibility and compactness of the original format to be maintained as long as possible. The analysis via standard SQL enables business planners to improve page layouts and product placements, and drive higher sales.

Code for Handling Name-Value Pair Data

This code example shows how the functions enabled by Teradata® Database 14.0 can be applied to automotive sensor data. Processing name-value pair data has benefits for a wide variety of industries.

Define table and insert two sample log data lines:

	create volatile table car_data (log_ts timestamp(6), vin_nbr bigint,
	sensor_name varchar(128), payload varchar(4096) character set unicode)
	on commit preserve rows;
	Insert into car_data('2012-01-01
	Insert into car_data('2012-01-01

Extract the temperature and pressure for fuel:

	select log_ts,vin_nbr,nvp(payload,'temperature') as
	Temperature,nvp(payload,'pressure') as Pressure from car_data where
	sensor_name = 'Fuel';

  log_ts vin_nbr Temperature Pressure
1 2012-01-01 14:12:31:000000 123 85 3

Extract the first cylinder with a malfunction:

	Select log_ts,vin_nbr,strtok​(nvp​(payload,​'cyl_list'),​',',1) as
	firstBadCyl from car_data where sensor_name = 'Cylinders'

  log_ts vin_nbr firstBadCyl
1 2012-01-01 14:12:31.000000 789 2

Dr. Barry Devlin is a widely respected consultant, lecturer and author. He is the founder and principal of 9sight Consulting.

Your Comment:
Your Rating:

Good to see how big data challenges are getting handled. May be going forward a set-top box kind of approach de-linking main database thereby handling frequent changes to input data and to ETL and database schema's.

3/14/2014 9:48:32 AM
— Anonymous
Fuzzy Logix