The Need for Query Speed
Presto is a fast, scalable and flexible open-source SQL on Hadoop query engine.
In Italian, “Presto” means fast. In the tech world, it means an open-source distributed SQL query engine for Apache™ Hadoop® that runs interactive analytic queries against data sources of all sizes. Through a single query, data is accessed where it resides. Typically, this means data in a Hadoop Distributed File System (HDFS). However, unlike some other SQL on Hadoop engines, Presto can also query other data sources such as Apache Cassandra™, relational databases or even proprietary data stores.
Presto complements Teradata® QueryGrid™ within the Teradata Unified Data Architecture™. It serves as a key engine to enable interactive querying against Hadoop, while Teradata QueryGrid allows queries to be initiated from the Teradata Database and the Teradata Aster Database, all through a common SQL protocol.
Unlike some SQL on Hadoop engines, Presto is not a SQL front-end for a general MapReduce or other analytical engine. Instead, it’s a dedicated SQL query engine, architected much like a standard database. It does not need to leverage a proprietary SQL-like language because it speaks standard ANSI SQL.
Presto is architected like many massively parallel processing (MPP) databases: a main coordinator node handles the parsing, analysis and planning of the query execution, and worker nodes then execute the queries and handle data tasks such as joins and aggregations. Yet it differs from a database in that it does not maintain its own tables, indexes or catalog data. Instead, it accesses data and tables that exist on other platforms.
In the simplified system architecture (see figure), the user sends SQL to the Presto coordinator. The scheduler wires together the execution pipeline, assigns work to the nodes closest to the data and monitors progress.
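The division of labor between coordinator and workers can be pictured with a small Python sketch. It is only an analogy under stated assumptions, not Presto code: the names `worker_partial`, `run_query` and `splits` are hypothetical, threads stand in for worker nodes, and the final merge is placed in the coordinating function purely for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial(split):
    """Worker (hypothetical): compute a partial aggregate over one split of data."""
    return sum(split), len(split)


def run_query(splits):
    """Coordinating step (hypothetical): assign one split per worker, then merge.

    Each worker returns a (sum, count) pair, so the average can be
    finished without any worker seeing another worker's rows.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(splits))) as pool:
        partials = list(pool.map(worker_partial, splits))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three splits of illustrative data, as if spread across three worker nodes.
splits = [[12, 20], [16], [11, 13, 18]]
result = run_query(splits)
```

Returning partial sums and counts, rather than per-split averages, is what makes the merged answer correct when splits have different sizes.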
The pipeline execution model runs multiple stages at once and streams data from one stage to the next as it becomes available. The query and execution engine uses operations designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages to avoid unnecessary I/O overhead and associated latency. The solution offers:
Super High-Speed In-Memory Processing
Created by Facebook to meet the analytic needs of extremely large data-driven organizations, Presto’s pure memory-based architecture is built for speed. It performs jobs in memory, with all stages of a task pipelined to enable high-speed queries. All data transfers in the worker nodes occur memory-to-memory, with no disk I/O.
Although this provides extremely high-speed processing, the caveat is that everything must fit in memory, which must be taken into account when looking at the type of workloads to be performed. For example, long-running ETL-type workloads on Hadoop might perform poorly. However, more interactive-type workloads are well suited for Presto.
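The pipelined, in-memory execution described in this section can be pictured with Python generators. This is a single-process analogy, not Presto's implementation: the stage names `scan`, `keep_large` and `total` and the `trace` list are hypothetical, and a real engine streams data between stages across the network rather than between generator functions.

```python
trace = []  # records the order in which stages touch each value


def scan(values):
    """Stage 1 (hypothetical): emit values one at a time."""
    for v in values:
        trace.append(("scan", v))
        yield v


def keep_large(rows, threshold):
    """Stage 2 (hypothetical): pass along values above the threshold."""
    for v in rows:
        trace.append(("filter", v))
        if v > threshold:
            yield v


def total(rows):
    """Stage 3 (hypothetical): consume values as they arrive and sum them."""
    acc = 0
    for v in rows:
        trace.append(("aggregate", v))
        acc += v
    return acc


# No stage waits for the previous one to finish; nothing is written to disk.
result = total(keep_large(scan([5, 12, 20]), 10))
```

Inspecting `trace` afterward shows the stages interleaved value by value: the value 12 is scanned, filtered and aggregated before 20 is even scanned.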
From the beginning, Presto was designed to be distribution agnostic. It is portable across Hadoop distributions and not tied to a specific one such as Cloudera®, Hortonworks or MapR®, so companies can change the distribution they use without impacting the underlying query environment.
Presto has the unique ability to query data that lives in Hadoop and in other database management systems (DBMSs). The open-source community has created numerous plug-ins that allow querying of systems such as MySQL, PostgreSQL, Cassandra and Kafka, making Presto an extremely powerful query tool across data platforms, including Hadoop.
Presto has a wide array of query capabilities, one of which is pushdown. Users can create a query that joins data across multiple sources, such as MySQL and Hadoop. Query processing can be pushed down to the underlying data solution, depending on the connector and the platform. For example, if a query is:
select avg(column) from postgrestable where postgrestable.column > 10;
- Presto will push the following query into PostgreSQL:
select column from postgrestable where postgrestable.column > 10;
- Then Presto will handle the avg(column) part from the results returned from PostgreSQL.
In some cases, Presto will leverage predicate pushdown and perform the aggregation part of the query within its own engine. The platform has numerous built-in analytical and windowing functions as well as map and array support.
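The split in the pushdown example above can be mimicked in a few lines of Python. This is a minimal sketch under stated assumptions, not Presto or connector code: an in-memory SQLite database stands in for PostgreSQL, and the table `measurements`, its `value` column and the helper names are all hypothetical.

```python
import sqlite3


def fetch_filtered(conn, threshold):
    """Source side: only the filter (predicate) runs inside the database."""
    cur = conn.execute(
        "SELECT value FROM measurements WHERE value > ?", (threshold,)
    )
    return [row[0] for row in cur]


def average(values):
    """Engine side: the aggregation runs over the rows the source returned."""
    return sum(values) / len(values) if values else None


# Hypothetical source table standing in for postgrestable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)", [(5,), (12,), (20,), (8,), (16,)]
)

pushed_down = fetch_filtered(conn, 10)  # rows that survive the pushed-down filter
result = average(pushed_down)           # aggregation done outside the source
```

Only the three rows that pass the filter leave the source database, which is the point of pushing the predicate down: less data crosses the boundary before the aggregation runs.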
Bringing Open Source to the Enterprise
Teradata is contributing to Presto’s open-source development to increase adoption of the software within the enterprise. Enhancements such as YARN integration to enable resource management within Hadoop, support for enterprise-class ODBC/JDBC drivers, improved security and BI tools certification, along with customer support and professional services, will allow organizations to more effectively use their analytics environment to run interactive queries at scale and leverage their existing BI and visualization tools.
Mark Shainman is the program manager for Teradata Presto and for Teradata’s competitive programs, which include Oracle, Netezza and Microsoft SQL Server migration.