A Look at Two Big Data Architectures
A business case illuminates the differences between Apache™ Hadoop® YARN and Teradata® Aster SNAP Framework™.
Benefiting from big data requires businesses to manage, in an integrated way, a wide variety of information that can be highly complex. That variety, in both the structure and the form of the data, means that no single engine or data store can hold, manage and access all of it. In addition, no single framework has been sufficient for handling the greater variety of analytics used by organizations.
Solving business problems now requires multiple data structures and analytic techniques, which has given rise to new big data architectures. This has led to a growing interest in an integrated platform that enables the use of multiple data stores and analytics within a single solution. As a result, two main architectures have emerged: Apache™ Hadoop® YARN and Teradata® Aster Seamless Network Analytical Processing (SNAP) Framework™.
New Architecture for Resource Management
YARN stands for “Yet Another Resource Negotiator” and was added as part of Hadoop 2.0. YARN packages the resource management capabilities from Hadoop 1.0 so new engines can use them. This also streamlines MapReduce to do what it does best—process data.
With YARN, organizations can now run multiple applications in Hadoop, all sharing a common resource management layer. Previously, applications had to manage their own relationship with the cluster and its resources. That put applications in contention for key resources, such as reducer slots, which delayed work and reduced the throughput of the cluster. This also placed a burden on the application programmer.
YARN makes managing large clusters easier and reduces the need for application programmers to deal with resource management issues. When more than one application is required, data can be written to the Hadoop Distributed File System (HDFS) by one application and then read by the next one in the processing chain. MapReduce is now just one of many applications running on the cluster. New applications include Tez for interactive query and Giraph for graph processing. YARN also enables a greater range of open-source big data solutions.
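The idea of a shared resource layer can be illustrated with a toy sketch. The code below is not the YARN API. The class name ToyResourcePool, its methods and the container counts are all hypothetical, chosen only to show several applications asking one common layer for capacity instead of competing for it on their own.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourcePool:
    """Hypothetical stand-in for a shared resource layer (not the YARN API)."""
    total_containers: int
    allocations: dict = field(default_factory=dict)

    def free(self) -> int:
        # Containers not currently held by any application.
        return self.total_containers - sum(self.allocations.values())

    def request(self, app: str, wanted: int) -> int:
        # Grant up to `wanted` containers, limited by what is free.
        granted = min(wanted, self.free())
        if granted:
            self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app: str) -> None:
        # Return all of an application's containers to the pool.
        self.allocations.pop(app, None)


pool = ToyResourcePool(total_containers=10)
granted_mapreduce = pool.request("mapreduce", 6)
granted_giraph = pool.request("giraph", 6)   # only part of this fits
pool.release("mapreduce")                    # finished work frees capacity
granted_tez = pool.request("tez", 5)
```

Because every request passes through one pool, no application needs its own logic for tracking what the rest of the cluster is using.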
A Unified Framework
Like YARN, the Teradata Aster SNAP Framework allows various language constructs to be used for diverse classes of data structures and problem solving. But the two architectures are fundamentally different. The SNAP Framework lets users integrate the various language constructs into a single query within an extended version of SQL. This enables graph, MapReduce, R and tabular query functions to be woven together in one query.
The SNAP Framework also has a layer with four major integrated data management functions: a single optimizer, a single executor, a single unified SQL interface and common storage services. This layer is aware of the different data stores and analytic techniques in use, which enables it to integrate data access and optimization.
Another layer of the SNAP Framework offers a set of data stores. There is a row and a column store for tabular data, and a file store for unstructured data. Other stores can be introduced for additional varieties of data. All of these stores are capable of distributing data across the cluster and can also support parallel operations on large volumes of data. Information in all of the stores is accessible via SQL queries created in another layer, and all data and specialized processing capabilities are accessible from within a unified SQL framework.
Approaches to Problem Solving
To best identify the strengths of each architecture, consider a use case in which a business needed to increase customer satisfaction and boost profits. This first required determining which customers were most likely to influence others and gauging customer sentiment. The company could then target certain individuals with premium products and services that were more profitable. Interaction data was available from call detail records, customer transactions and machine logs.
With YARN, a programmer approached the problem by building a graph of telephone numbers and their interactions. Each customer was a node in the graph and each contact an edge connecting those nodes, which resulted in millions of nodes and billions of edges. With Giraph, the graph engine for Hadoop, the programmer built and analyzed the graph using Pig scripts and Giraph functions. The Pig script used PageRank to produce a list of phone numbers with a score indicating how influential each customer was. Pig then wrote the list of numbers and scores to HDFS.
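The influence scoring step can be sketched in a few lines. This is a minimal power-iteration PageRank in plain Python, not the Pig script or Giraph code used in the case. The phone numbers, the damping factor of 0.85, the iteration count and the choice to treat each call as a directed edge from caller to callee are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over directed (caller, callee) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by numbers that never place a call is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                # Each caller splits its rank equally among the numbers it calls.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical call detail records: (caller, callee).
calls = [
    ("555-0101", "555-0100"),
    ("555-0102", "555-0100"),
    ("555-0103", "555-0100"),
    ("555-0100", "555-0101"),
]
influence = pagerank(calls)
most_influential = max(influence, key=influence.get)
```

On a graph with millions of nodes and billions of edges the same iteration has to be distributed across the cluster, which is the job a graph engine takes on.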
Then the programmer used YARN to load call center text notes into a Hive table and write procedural MapReduce programs to analyze sentiment. The output of the analysis was then written to another Hive table to get an aggregated sentiment score for each phone number. Once the tables were sorted, joined and aggregated, each customer was given two scores—one for influence and one for sentiment.
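The sentiment and join steps can be sketched the same way. The example below is plain Python rather than Hive or MapReduce. The word lists, the sample notes and the influence values are hypothetical, and a simple word-count score stands in for whatever sentiment analysis the programmer actually wrote.

```python
from collections import defaultdict

# Hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"great", "happy", "thanks", "resolved"}
NEGATIVE = {"angry", "cancel", "slow", "broken"}


def note_sentiment(text):
    """Score one note: positive word count minus negative word count."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def score_customers(notes, influence):
    """Aggregate sentiment per phone number, then join with influence."""
    per_phone = defaultdict(list)
    for phone, text in notes:
        per_phone[phone].append(note_sentiment(text))
    sentiment = {p: sum(s) / len(s) for p, s in per_phone.items()}
    # Inner join: keep only customers that have both scores.
    return {p: (influence[p], sentiment[p]) for p in influence if p in sentiment}


notes = [
    ("555-0100", "Great service, thanks"),
    ("555-0100", "Line was slow"),
    ("555-0101", "Angry, wants to cancel"),
]
influence_scores = {"555-0100": 0.48, "555-0101": 0.45, "555-0102": 0.04}
customer_scores = score_customers(notes, influence_scores)
```

Each stage here (scoring, aggregating, joining) corresponds to a separate step that in the YARN approach writes its result out before the next stage reads it.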
With the SNAP Framework, the problem was solved with one SQL statement. The statement used the PageRank function to calculate the influence associated with each customer. Built-in functions parsed the text in the call center records and scored the sentiment expressed in each one. A single SQL query then joined together and aggregated data on customers, calls and call center interactions. Each customer was listed in a single table with two scores—one for influence and one for sentiment. Solving the problem required 10 steps for YARN and four for the SNAP Framework. (See table.)
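The general pattern of calling an analytic function from inside one SQL query can be shown with Python's built-in sqlite3 module. This is generic SQLite, not Teradata Aster syntax, and it does not reproduce the statement used in the case. The table names, the sample rows and the sentiment function are hypothetical, and the influence scores are assumed to have been computed already.

```python
import sqlite3

# Hypothetical word list, for illustration only.
POSITIVE = {"great", "thanks"}
NEGATIVE = {"angry", "cancel", "slow"}


def sentiment(text):
    """Positive word count minus negative word count for one note."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it like a built-in.
conn.create_function("sentiment", 1, sentiment)
conn.executescript("""
    CREATE TABLE influence (phone TEXT PRIMARY KEY, score REAL);
    CREATE TABLE call_notes (phone TEXT, note TEXT);
""")
conn.executemany(
    "INSERT INTO influence VALUES (?, ?)",
    [("555-0100", 0.48), ("555-0101", 0.45), ("555-0102", 0.04)],
)
conn.executemany(
    "INSERT INTO call_notes VALUES (?, ?)",
    [
        ("555-0100", "Great service, thanks"),
        ("555-0100", "Line was slow"),
        ("555-0101", "Angry, wants to cancel"),
    ],
)

# One statement scores the text, aggregates it and joins it to influence.
rows = conn.execute("""
    SELECT i.phone, i.score AS influence, AVG(sentiment(n.note)) AS sentiment
    FROM influence AS i
    JOIN call_notes AS n ON n.phone = i.phone
    GROUP BY i.phone, i.score
    ORDER BY i.phone
""").fetchall()
conn.close()
```

Each row holds a phone number with its influence and sentiment scores, and no intermediate tables are written between the scoring and the join.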
Both architectures represent an advancement within their domains. YARN brings new capabilities and value to Hadoop users. It can make them more efficient users of their clusters and let them apply the resources of those clusters to a greater variety of big data problems. Clusters can simultaneously run MapReduce for general analytics, Giraph for graph processing and other specialized engines. YARN also helps organizations gain more value from their investment in Hadoop.
The SNAP Framework likewise provides specialized functions for executing MapReduce and graph processing, but implementation is deeply integrated. The language is integrated so users can get results with SQL, and the query planning and execution are joined together to optimize processing across the entire operation. All data management operations in the Teradata Aster Database share a common set of storage services, which increases efficiency when highly varied data types are involved in the same analysis. This allows the SNAP Framework to execute analysis in much less time, using fewer resources.
Richard Winter, CEO of Winter Corp., specializes in large database technology and big data implementation. He has more than 20 years of experience in the field.