Tech2Tech

Insider's Warehouse

Together, showing the way

MapReduce provides a complementary programming model for high-performance analytic applications.

As data management professionals transition high-performance computing technologies such as cloud computing and analytical databases into the mainstream, a small cadre of analytical programmers has recognized the value of massively parallel paradigms like Google’s MapReduce as another weapon in the analytics arsenal. These programmers are gravitating toward implementations such as Hadoop, an open-source MapReduce framework, to support the development of high-performance data-analysis applications. But do these two approaches to analytic application development really have to be an either/or choice?

While these techniques for high-performance analytics might seem mutually exclusive, synergies exist between the distributed, massively parallel batch applications suited to MapReduce and high-performance data warehouses, and those synergies can result in an enhanced business intelligence (BI) program.

What is MapReduce?

MapReduce is not a database system. It is typically used to analyze Web statistics on hundreds, sometimes thousands, of Web application servers without moving the data into a data warehouse. More precisely, it is a programming model introduced and described by Google researchers for parallel, distributed computation involving massive data sets (ranging from hundreds of terabytes to petabytes). As opposed to the familiar procedural/imperative approaches used by Java or C++ programmers, MapReduce’s programming model mimics functional languages (notably Lisp and APL), mostly because of its dependence on two basic operations that are applied to sets or lists of key/value pairs:

  • Map. Describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
  • Reduce. Combines the set of values associated with each intermediate key output by the Map operation to provide the results

For many applications that process massive data sets, the computations applied during the Map phase to each input key/value pair are independent of one another. This combination of data independence and computational independence means that the data and the computations can be distributed across multiple storage and processing units and automatically parallelized. This allows the programmer to exploit scalable, massively parallel processing resources for increased processing speed and performance.
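
The two operations can be sketched in a few lines of Python. This is a minimal, single-process illustration of the model only, not Google's or Hadoop's API; the names `map_reduce`, `hits_mapper` and `sum_reducer`, and the sample log records, are assumptions invented for the example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy, single-process MapReduce: map every record, group by key, reduce."""
    groups = defaultdict(list)
    # Map phase: each input key/value pair is processed on its own, so these
    # calls could run on different machines and in any order.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: combine all intermediate values that share a key.
    return {key: reducer(key, values) for key, values in groups.items()}

def hits_mapper(server, log_line):
    # Emit one intermediate pair (url, 1) per request line.
    url = log_line.split()[1]
    yield url, 1

def sum_reducer(url, counts):
    # Add up the 1s emitted for this URL.
    return sum(counts)

# Invented sample data: (server name, "METHOD URL") pairs.
logs = [
    ("web01", "GET /home"),
    ("web02", "GET /cart"),
    ("web01", "GET /home"),
    ("web03", "GET /home"),
]

totals = map_reduce(logs, hits_mapper, sum_reducer)
print(totals)  # {'/home': 3, '/cart': 1}
```

Because each mapper call sees only its own record, reordering or splitting `logs` yields the same totals, which is the independence property that makes distribution possible.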

Scalability

A simple example can demonstrate MapReduce’s scalability: counting the number of times each word appears in a massive collection of Web pages. A recursive approach to solving this challenge considers incrementally smaller data “chunks” for analysis:

  • The total number of occurrences of each word in the entire collection is equal to the sum of the occurrences of each word in each document.
  • The total number of occurrences of each word in each document can be computed as the sum of the occurrences of each word in each paragraph.
  • The total number of occurrences of each word in each paragraph can be computed as the sum of the occurrences of each word in each sentence.

Figure: MapReduce in action

From a functional perspective, the programmer’s goal is to map each word to its number of occurrences in all of the documents. This suggests the context for both the Map function, which allocates a data chunk to each processing node and then asks that node to map each word to its count, and the Reduce function, which collects the word/count pairs from all of the processing nodes and adds together the counts for each particular word. (See figure.)
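
The word-count example can be sketched as follows. This is a simplified, single-process illustration in which each string stands in for the chunk given to one processing node; the function names `map_chunk` and `reduce_counts` and the sample text are assumptions for the example, not part of any framework.

```python
import re
from collections import Counter

def map_chunk(chunk):
    """Map: turn one chunk of text into word/count pairs for that chunk."""
    words = re.findall(r"[a-z']+", chunk.lower())
    return Counter(words)

def reduce_counts(partial_counts):
    """Reduce: add together the per-chunk counts for each word."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Invented sample "documents"; each stands in for a chunk sent to one node.
documents = [
    "The quick brown fox. The lazy dog.",
    "The dog barks. The fox runs.",
]

by_document = reduce_counts(map_chunk(doc) for doc in documents)

# Same data split at sentence granularity instead of document granularity.
sentences = [s for doc in documents for s in doc.split(".") if s.strip()]
by_sentence = reduce_counts(map_chunk(s) for s in sentences)

print(by_document["the"], by_document["fox"], by_document["dog"])  # 4 2 2
print(by_document == by_sentence)  # True
```

The final comparison reflects the recursive decomposition described earlier: the totals are the same whether the chunks are documents or sentences, because the per-chunk counts are simply summed.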

The basic steps are simple, and the implementation of the program is straightforward. The programmer relies on the underlying runtime system to distribute the data to be analyzed to the processing nodes, instantiate the Map and Reduce directives across the processor pool, initiate the Map phase, coordinate the communication of the intermediate results, initiate the Reduce phase, and then collect and collate the final results. Some example applications include:

  • Document aggregations, such as sorting, word counts, phrase counts, and building inverted indexes for word and phrase searching
  • Real-time statistical blog and traffic analysis to facilitate offer placement and dynamic product recommendations
  • Data enhancements associated with data migration, data extraction, content tagging, standardization and other types of transformations
  • Data mining algorithms such as clustering, classification, market basket analysis or abandoned cart analysis
  • Social network analyses associated with social media webs and interactive behavior assessment
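
As one illustration of the applications listed above, an inverted index for word searching fits the same Map and Reduce shape. The sketch below is a single-process approximation; the document ids, sample text and function names are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce: collapse the document ids for one word into a sorted list."""
    return sorted(set(doc_ids))

def build_inverted_index(documents):
    # Group the intermediate pairs by word, then reduce each group.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, found_in in map_document(doc_id, text):
            grouped[word].append(found_in)
    return {word: reduce_postings(word, ids) for word, ids in grouped.items()}

# Invented sample collection keyed by a hypothetical document id.
documents = {
    "doc1": "parallel data analysis",
    "doc2": "data warehouse reports",
    "doc3": "parallel warehouse loading",
}

index = build_inverted_index(documents)
print(index["data"])       # ['doc1', 'doc2']
print(index["parallel"])   # ['doc1', 'doc3']
print(index["warehouse"])  # ['doc2', 'doc3']
```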

In the past, the programmer was responsible for considering the level of granularity for computation (document versus paragraph versus sentence) and the mechanics of data distribution and intermediate communication. With MapReduce, the mechanics of data distribution and communication are handled by the model, freeing the application programmer to focus on solving the problem instead of its implementation details. In fact, the simplicity of MapReduce allows the programmer to describe the expected results of each computation while relying on the compiler and runtime systems to parallelize the work and provide fault tolerance.
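
A toy runtime can show how that division of labor might look. The sketch below assumes one simple fault-tolerance strategy, re-running a failed task, which works because the tasks are independent; it uses a local thread pool in place of a cluster and does not describe how any particular framework is implemented. The names `run_map_phase` and `flaky_word_count` and the simulated failure are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_map_phase(chunks, map_task, workers=4, max_attempts=3):
    """Toy runtime: hand each chunk to a worker and re-run any task that fails."""
    def attempt(chunk):
        for attempt_number in range(1, max_attempts + 1):
            try:
                return map_task(chunk)
            except RuntimeError:
                if attempt_number == max_attempts:
                    raise  # give up after the last allowed attempt
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results come back in the same order as the input chunks.
        return list(pool.map(attempt, chunks))

failed_once = set()

def flaky_word_count(chunk):
    """Application code: count words. Fails the first time it sees a chunk
    containing 'fox' to imitate a lost worker (simulated, for illustration)."""
    if "fox" in chunk and chunk not in failed_once:
        failed_once.add(chunk)
        raise RuntimeError("simulated node failure")
    return Counter(chunk.split())

chunks = ["the quick fox", "the lazy dog", "the fox again"]
partials = run_map_phase(chunks, flaky_word_count)
total = sum(partials, Counter())

print(total["the"], total["fox"])  # 3 2
print(len(failed_once))            # 2
```

The application function contains no scheduling or retry logic of its own; the two simulated failures are absorbed by the runtime and the final counts are unaffected.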

Programming models versus database management

High-performance analytical database appliances and the MapReduce programming model are both appealing, albeit for different reasons. The appliances have a long history of providing fast reporting and analysis. Yet as a framework for application development, MapReduce helps address some challenges in distilling and analyzing massive amounts of data that are not easily solved in other ways. Despite the various opinions about the benefits of selecting one approach over the other, MapReduce and a high-performance analytical database appliance are not mutually exclusive.

Operational and transactional systems generating billions of log entries, real-time Web statistics, unstructured documents and millions of call detail records are examples of continuously streaming data sources that must be combined and analyzed, with the results reported to key stakeholders in the business environment. The high-performance analytical database appliance is nicely suited to that type of application. MapReduce applications, by contrast, are for batch processing of large volumes of many different types of data and generally exhibit these characteristics:

  • Have little or no data dependence
  • Analyze massive data volumes
  • Are amenable to massive parallelism
  • Use structured and unstructured data
  • Require limited inter-process communication

What differentiates these two approaches? Essentially, each supports different business analytics needs in different ways. The MapReduce programming paradigm supports the design, implementation and deployment of scalable analysis algorithms and is used for batch processing of many types of data, from unstructured data to information stored in databases. But MapReduce is not a database, nor was it intended to replace a database system.

The high-performance analytical database appliance approach is designed to manage structured information within a data warehouse or to restructure previously semi-structured data. Business analysts and other BI consumers interact with the data warehouse via ad hoc queries, generated reports, visualization, and integrated data mining and predictive models. The warehouse provides drill-down and dimensional views for interactive analysis. Yet the analytical database is not a programming model and is not intended to provide the capabilities of one.

Suitability and synergy

You might think the inherent scalability, fault tolerance and simple yet flexible computing framework of the MapReduce approach are more suitable for analyzing large sets of data. Or you might suggest that the efficiency, speed of storage, and speed of data access for reporting and analysis of the high-performance analytical database appliance make it the better choice. But the two approaches are not only similar in many ways; some business applications are also significantly improved by combining them, particularly in environments driven by collaboration between power analysts and the casual business users who exploit their results.

For example, a MapReduce application can be used for complex analysis combining various real-time data feeds, potentially in combination with data sourced from existing high-performance analytical database appliances. Aggregations, results or summaries can be forwarded to the data warehouse and made available to typical business users for drill-down or through standard reports. Also, tighter coupling between the high-performance analytical database appliance and a MapReduce environment can supplement standard SQL queries with services provided by MapReduce applications. Example applications that can benefit include:

  • Extraction and transformation, in which a large part of the extraction, integration and consolidation of data to be loaded into a data warehouse is parallelized and distributed across a MapReduce batch application
  • Continuous aggregation, wherein the MapReduce application aggregates captured data streams (Web statistics, for example), the results of which are fed into an analytical database for slicing and dicing
  • Time series analysis, where long customer transaction histories are mined for interesting patterns that can be used to signal emerging business opportunities in real time
  • Connectivity and network analysis, in which a MapReduce application profiles patterns associated with individuals in a social network and enhances existing customer profiles managed by the data warehouse for purposes such as affinity marketing, fraud detection, behavior analysis or analyzing persons of interest
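
The continuous aggregation pattern in the list above can be sketched end to end. In this illustration a small in-process function stands in for the MapReduce job and Python's built-in `sqlite3` module stands in for the analytical database; both substitutions, along with the table name `page_hits` and the sample click stream, are assumptions made for the example.

```python
import sqlite3
from collections import Counter

# Invented sample click stream: (page, hour of day) per request.
events = [
    ("/home", 9), ("/cart", 9), ("/home", 9),
    ("/home", 10), ("/cart", 10), ("/cart", 10),
]

def aggregate_batch(batch):
    """MapReduce stand-in: map each event to ((page, hour), 1), then sum per key."""
    return Counter((page, hour) for page, hour in batch)

# sqlite3 is only a stand-in for the analytical database in this sketch.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE page_hits (page TEXT, hour INTEGER, hits INTEGER)")

# Forward the aggregated summary, not the raw events, to the warehouse.
summary = aggregate_batch(events)
warehouse.executemany(
    "INSERT INTO page_hits (page, hour, hits) VALUES (?, ?, ?)",
    [(page, hour, hits) for (page, hour), hits in summary.items()],
)

# A business user can now slice the summary with ordinary SQL.
rows = warehouse.execute(
    "SELECT page, SUM(hits) FROM page_hits GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('/cart', 3), ('/home', 3)]
```

Only four summary rows reach the database for six raw events here; at realistic volumes that reduction is what lets the warehouse serve interactive queries over data it never had to ingest in raw form.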

To meet growing computational and data management requirements, scalable tools must be adopted, and both analytic database management systems and MapReduce enable massive data analytics. A collaborative approach is better, though: any application that demands rapid analysis and requires frequent updates and integration with the structured environment to meet real-time analytical needs can benefit from combining the techniques.

So although these two approaches are not replacements for each other, there is a promising synergy between them for certain classes of applications. Combining these high-performance analysis techniques will help organizations identify and respond to emerging business opportunities more rapidly and successfully.

