Choose the best MapReduce platform to get maximum business value from your data.
Technologies for big data analytics are evolving rapidly. This rapid evolution has sparked interest in new analytic approaches such as Hadoop MapReduce and Hive, and in MapReduce extensions to relational database management systems (DBMSs).
MapReduce allows organizations to rapidly process and analyze very large volumes of multi-structured data, enabling them to make smarter decisions faster. To realize the full value of this data, organizations need a platform that lets users ingest, structure and analyze information. They can then extract business value that was not easy to discover in the data’s raw, native form.
When and Where to Use MapReduce
Typically, programmers prefer the procedural approaches to accessing and manipulating data offered by Hadoop MapReduce, while non-programmers prefer the declarative manipulation languages, such as SQL, offered by relational DBMSs. (The figure below shows the roles of relational DBMSs and Hadoop Hive.) However, the availability of an SQL-like language in Hadoop Hive and the addition of MapReduce functions to relational DBMSs complicate this simple division.
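The difference between the two styles can be seen in a minimal sketch. It computes the same per-page totals twice: once procedurally in plain Python, and once declaratively in SQL. The sample rows, the table name `hits` and both function names are hypothetical, and SQLite (from Python's standard library) stands in for a relational DBMS purely for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, bytes_sent) pairs.
ROWS = [("/home", 120), ("/cart", 300), ("/home", 80), ("/cart", 50), ("/help", 10)]


def procedural_totals(rows):
    """Procedural style: the programmer spells out how to scan and accumulate."""
    totals = defaultdict(int)
    for page, size in rows:
        totals[page] += size
    return dict(totals)


def declarative_totals(rows):
    """Declarative style: state the result wanted and let the engine plan the access."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (page TEXT, bytes_sent INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    cursor = conn.execute("SELECT page, SUM(bytes_sent) FROM hits GROUP BY page")
    result = dict(cursor)  # each row is a (page, total) pair
    conn.close()
    return result
```

Both functions return the same mapping of page to total bytes. The procedural version fixes the access strategy in code, while the SQL version leaves that choice to the query engine.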
MapReduce programs can process data stored in different file and database systems. Each platform for the programs has specific advantages—and drawbacks:
HIVE FOR IMPROVING MAPREDUCE DEVELOPMENT
For sequential processing of very large multi-structured data files such as Web logs, use Hadoop Hive or Hadoop MapReduce. Hive’s main benefit is its ability to dramatically improve the simplicity and speed of MapReduce development. The Hive optimizer also makes processing interrelated files easier, and its SQL-like syntax makes it easy for non-programmers who are comfortable with SQL to use.
The downside is that Hive’s optimizer is not fully insulated from the underlying file system. As a result, the user is frequently required to aid the optimizer with language constructs in order to process more complex queries. Handling traditional SQL-like queries extends Hive’s use to structured data, but Hive cannot substitute for the functionality, usability, performance and maturity of a relational DBMS.
DBMS FOR ISOLATING DATA
SQL users who want physical data independence should use a relational DBMS. It keeps the logical and physical views of data completely isolated from each other. This has the advantage of allowing vendors to extend or add runtime and data storage engines without affecting existing applications. The Teradata Aster Database, for example, includes a runtime engine for MapReduce processing and a data store that enables both row and column data storage.
Adding MapReduce to a relational DBMS extends its use to multi-structured data. Some vendors now support MapReduce functions inside the DBMS. This offers the benefits of deploying user-defined functions and also adds the advantages of MapReduce to the relational DBMS environment—such as the ability to process multi-structured data using SQL.
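As a rough illustration of the user-defined function idea, the sketch below registers a Python function with a SQL engine and then calls it from a query to pull structure out of raw log lines. It is not SQL-MapReduce and does not use any Teradata Aster interface: SQLite stands in for the DBMS, and the log format, the `raw_log` table and the `extract_path` function are all hypothetical.

```python
import re
import sqlite3

# Hypothetical raw log lines standing in for multi-structured data.
RAW_LINES = [
    '10.0.0.1 "GET /home" 200',
    '10.0.0.2 "GET /cart" 500',
    '10.0.0.1 "GET /home" 200',
    'garbled line',
]

# Assumed line layout: client address, quoted request, three-digit status.
LINE_PATTERN = re.compile(r'^(\S+) "(\w+) (\S+)" (\d{3})$')


def extract_path(line):
    """User-defined function: return the request path, or None if the line is malformed."""
    match = LINE_PATTERN.match(line)
    return match.group(3) if match else None


def count_paths(lines):
    """Load raw lines into a table and count requests per path using SQL plus the UDF."""
    conn = sqlite3.connect(":memory:")
    # Make the Python function callable from SQL under the name extract_path.
    conn.create_function("extract_path", 1, extract_path)
    conn.execute("CREATE TABLE raw_log (line TEXT)")
    conn.executemany("INSERT INTO raw_log VALUES (?)", [(line,) for line in lines])
    rows = conn.execute(
        "SELECT extract_path(line) AS path, COUNT(*) FROM raw_log "
        "WHERE extract_path(line) IS NOT NULL GROUP BY path ORDER BY path"
    ).fetchall()
    conn.close()
    return rows
```

Calling `count_paths(RAW_LINES)` skips the malformed line and returns one row per request path. The parsing logic is procedural, but the person writing the query only sees an ordinary SQL function.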
Although data independence makes life easier for non-programmers, the disadvantage is that experienced developers have little or no control over how data is accessed and processed. Instead, they have to rely on the relational optimizer to make the right decisions about how data is accessed.
HADOOP FOR FAST PROCESSING
Hadoop is a good choice for organizations with large amounts of multi-structured data, allowing them to process petabytes of data in a timely, cost-effective manner. Non-relational systems like Hadoop are not new, but they are now designed to exploit commodity hardware in a large-scale distributed computing environment and have been made available as open source.
Hadoop has several components:
- File System (HDFS) stores and replicates large files across multiple machine nodes. It can be a source or target file system for MapReduce programs.
- MapReduce is the programming model for distributing the processing of large data files (usually HDFS files) across a large cluster of machines.
- Hive offers the SQL-like language (HiveQL) and optimizer that create MapReduce jobs for analyzing large data files.
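The MapReduce programming model itself can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle and reduce steps. It does not use the Hadoop API, nothing is distributed across machines, and the three-field log format and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (status_code, 1) pair for each well-formed log line."""
    parts = line.split()
    if len(parts) == 3:  # assumed layout: client address, path, status
        yield parts[2], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `run_job(["10.0.0.1 /home 200", "10.0.0.2 /cart 500", "10.0.0.1 /home 200"])` counts requests per status code. In a real cluster the map and reduce functions would run in parallel on many nodes, with the framework handling the shuffle between them.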
Hadoop does have its drawbacks. HDFS supports multiple readers but only one writer, and because it does not provide an index mechanism, it is best suited to read-only applications. The actual location of data within an HDFS file is hidden from applications and external software, so software built on top of HDFS has little control over data placement and little knowledge of data location. This can make it difficult to optimize performance.
Although Hadoop MapReduce can process large amounts of data, coding map and reduce programs using low-level procedural interfaces is time-consuming.
RELATIONAL DBMS MAPREDUCE FOR DEEP DIVES
If an organization needs to run sophisticated analyses on a diverse set of both structured and multi-structured data, a good choice is a relational DBMS that supports MapReduce, such as the Teradata Aster MapReduce platform. Merging MapReduce with SQL (SQL-MapReduce) preserves the declarative and storage-independence benefits of SQL while exploiting the power of the MapReduce procedural approach to extend SQL’s analytic capabilities. SQL-MapReduce also provides a library of prebuilt analytic functions to speed the development of analytic applications. These functions cover path, pattern, statistical, graph, text and cluster analysis, as well as data transformation.
Custom functions can be written in several languages, including Java, for use in batch and interactive environments. One key objective of the Teradata Aster Database is to make it easier for less-experienced users to exploit the analytical capabilities of existing and packaged MapReduce functions.
TAKE FULL ADVANTAGE
Big data analytics and associated technologies offer significant business benefits. For data that remains outside the integrated data warehouse, developers should carefully evaluate whether to use a relational DBMS such as the Teradata Aster Database or a non-relational system such as Hadoop with Hive. Because of the many approaches and components now available, organizations must think of this new infrastructure as an extended data warehouse that is essential if they are to reap the full benefits of their data.