This article, part of a series on Presto SQL, explains what Presto SQL is and how beginners can start using it. Presto is a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB. One can even query data from multiple data sources within a single query.
Let’s begin with what Presto is. Presto is a massively parallel processing (MPP) query engine that allows users to execute SQL queries against a wide range of data sources. If you define a database as software that stores data and processes it, Presto does not fall under the database category. Rather, I prefer to call it a data or computing engine, because Presto itself does not provide a storage solution. Instead, Presto focuses on how to query different data sources such as MySQL, SQL Server, Hive, Cassandra, and even CSV files. Presto achieves this flexibility through its plugin architecture, in which connectors sit between the Presto engine and the underlying data sources.
In the future, if you find a new database that should be supported by Presto, you only need to write a new connector to connect that database with Presto. Though it looks like the connectors are doing the heavy lifting here, they actually provide only a simple API to connect to the underlying database. For example, connectors tell Presto which tables are available in the underlying database and how to read raw data from them. Given that information, Presto decides how to process the data and respond to a user’s request. The coolest thing here is that you can join a table from one database with a table in another database. For example, consider a bank that keeps account details in a MySQL database and transaction history in Hive; it does not need to migrate data from one system to the other to join them. Presto supports SQL like the following query out of the box:
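To illustrate, a cross-catalog join could look like the sketch below. The catalog, schema, table, and column names here are hypothetical; replace them with the ones defined in your own setup.

SELECT t.txn_id, t.amount, a.account_name
FROM hive.bank.transactions AS t
JOIN mysql.bank.accounts AS a
  ON t.account_id = a.account_id
WHERE t.txn_date >= DATE '2020-01-01';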
Connector vs Catalog
As discussed earlier, a connector is a Presto plugin that tells Presto how to connect to a given type of database. For example, open-source Presto comes with a built-in MySQL connector. Using this connector, you can create a catalog to connect to a MySQL server. In other words, a connector is a plugin, and using it you can create any number of catalogs. For example, if you have two MySQL servers, you can create two property files in the etc/catalog folder as shown below to connect to those two MySQL servers.
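A minimal sketch of such catalog files, assuming the two servers are reachable at mysql1.example.com and mysql2.example.com (the host names and credentials are placeholders), could look like this:

etc/catalog/mysql1.properties:
connector.name=mysql
connection-url=jdbc:mysql://mysql1.example.com:3306
connection-user=presto
connection-password=secret

etc/catalog/mysql2.properties:
connector.name=mysql
connection-url=jdbc:mysql://mysql2.example.com:3306
connection-user=presto
connection-password=secret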
Note that the connector name is mysql in both catalog property files. With these property files in the etc/catalog folder, if you execute SHOW CATALOGS from the command line, you will see an output similar to this:
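Assuming the two catalog files sketched above, the output would list both catalogs along with the built-in system catalog:

 Catalog
---------
 mysql1
 mysql2
 system
(3 rows)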
High-level Components of Presto
As a distributed system, Presto has two types of servers: Coordinator and Worker. The community version is designed to run with one coordinator and one or more workers. Some companies have their own versions of highly available coordinators, and there is a GitHub issue tracking HA deployment for the community version. However, the HA scenario is out of the scope of this article, so we will continue with the assumption that there is always a single coordinator and one or more workers. Of course, the same server can act as both a coordinator and a worker, but this is not recommended in a production environment.
Presto CLI
The command-line tool is available to download from the official website, or, if you build Presto from the source code, you can find it in the presto/presto-cli/target folder. You can run presto-cli-xxx-executable.jar from the terminal, and there is a whole bunch of command-line parameters available to use. By default, presto-cli connects to the coordinator running at localhost:8080. If the Presto server you want to connect to is running at a different location, use the --server parameter to specify the target.
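For example, assuming the coordinator is running at coordinator.example.com:8080 and you want to start in the mysql1 catalog sketched earlier (the server address, catalog, and schema are all placeholders), the invocation could look like this:

./presto-cli-xxx-executable.jar --server coordinator.example.com:8080 --catalog mysql1 --schema bank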
Presto JDBC Driver
Presto also provides a JDBC driver to connect to the cluster. Similar to the CLI, you can either download it from the official website or find it in the built source code (presto/presto-jdbc/target).
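As a rough sketch, and assuming the driver jar is on the classpath, connecting from Java could look like the following. The host, catalog, schema, table, and user below are placeholders, and the JDBC URL prefix may differ between Presto versions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // URL format: jdbc:presto://<coordinator-host>:<port>/<catalog>/<schema>
        String url = "jdbc:presto://coordinator.example.com:8080/mysql1/bank";
        // Presto requires a user name; the password is null here because no authentication is configured
        try (Connection connection = DriverManager.getConnection(url, "presto-user", null);
             Statement statement = connection.createStatement();
             ResultSet resultSet = statement.executeQuery("SELECT account_id, account_name FROM accounts")) {
            while (resultSet.next()) {
                System.out.println(resultSet.getString("account_name"));
            }
        }
    }
}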
Both the Presto CLI and the JDBC driver use RESTful APIs under the hood. Therefore, there is always an opportunity to use a plain REST client to connect to the Presto server.
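As a minimal sketch, assuming no authentication is configured on the cluster, submitting a query over REST could be as simple as the following; the JSON response contains a nextUri that you keep polling to retrieve the results.

curl -X POST -H "X-Presto-User: presto-user" -d "SELECT 1" http://localhost:8080/v1/statement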
Presto Dashboard
The Presto dashboard is a tool to get statistics about submitted queries and the cluster itself. You can see how many workers are available in the cluster and various statistics about each query, including the execution time, input size, output size, and more advanced details on how the query is compiled into a physical plan and how it is scheduled on the available workers.
Unlike the CLI and the JDBC driver, the dashboard is a read-only platform. There are some third-party tools, such as Apache Superset or Airpal, that allow you to run SQL from a GUI. You can find the list of available tools on the official website.
Coordinator
The coordinator is a Presto server that receives SQL queries, compiles them, optimizes them, and schedules them on workers. In addition, it maintains the cluster of workers and serves the dashboard. A cluster without a coordinator is not usable. In the community version of Presto, a discovery service runs along with the coordinator, and workers register themselves with it to form the cluster.
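As a rough sketch, a coordinator’s etc/config.properties could look like the following; the memory limits and the discovery URI are placeholders, so consult the deployment documentation of your Presto version for suitable values.

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080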
Worker
As the name suggests, workers execute the queries. The coordinator compiles a SQL query into a logical plan and sends fragmented logical plans, combined with the data source information, to the workers; these units of work are known as tasks. Workers execute those tasks and produce the output.
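A worker’s etc/config.properties, under the same assumptions as the coordinator sketch above, might look like this:

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery.uri=http://coordinator.example.com:8080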
In-House Terms
I am working on a separate article entirely about the internals of Presto, but a few terms are introduced here to get you familiar with them. You do not need to know any of these if you use Presto just to execute SQL queries. However, if you are a database engineer working with the internals of databases, these keywords should already be in your dictionary.
Split
The smallest unit of data a Presto operator can process. Connectors generate splits according to their own logic. For example, the MySQL connector may generate one split representing a full table, whereas the Hive connector may create 100 splits for a single ORC file. A split object itself does not carry the data. Instead, it holds information about the data location. For example, a Hive split may carry the ORC file name, offset, and length needed to read the piece of data defined by that split.
Logical Plan
A directed acyclic graph generated by compiling the SQL query. Presto uses this tree representation to analyze and optimize the query.
Stage
A sub-tree (which does not necessarily have branches) of the logical plan after fragmentation is called a stage.
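You can inspect both forms from the CLI with the EXPLAIN statement. For example, using the hypothetical tables from the earlier join, the first statement below prints the logical plan and the second prints the fragmented (distributed) plan, where each fragment corresponds to a stage:

EXPLAIN (TYPE LOGICAL)
SELECT t.txn_id, a.account_name
FROM hive.bank.transactions AS t
JOIN mysql.bank.accounts AS a ON t.account_id = a.account_id;

EXPLAIN (TYPE DISTRIBUTED)
SELECT t.txn_id, a.account_name
FROM hive.bank.transactions AS t
JOIN mysql.bank.accounts AS a ON t.account_id = a.account_id;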
Task
A stage bundled with a split and ready for scheduling is called a task. A task tells the worker what to do and where to get the data from.
Physical Plan
Workers create a physical plan by converting a given stage into an operator pipeline. Usually, the nodes of the logical plan are replaced by operators, and some additional operators are injected in between to connect them.
For more details about the internals of Presto, read the “Presto: SQL on Everything” paper. I hope this article gives you a basic introduction to Presto and its tools. If you have any questions or suggestions, please comment below.