Presto lately renamed to TrinoDB is a distributed big data SQL engine initially developed by Facebook and later open-sourced and being led by the community. This article explains the internals of Presto SQL Join algorithms and how Presto executes join operations internally.
Implementation of Joins
Almost all database engines out there join only two tables at a time. Even if there are more than two tables to be joined in the SQL query, the database will join the first two tables and join the output with the third one and continue this for the remaining tables. Database engineers refer those two tables involved in a join operation as Build Table and Probe Table.
Build Table
The build table is the table that is used to create an in-memory index. Usually, the build table has to be read completely before reading the probe table.
Probe Table
The probe table is read row by row once the Build Table is read and stored in memory. Each row read from the probe table will be joined with the build table based on the join criteria.
Presto uses the right table in a logical plan after optimization as the Build Table and the left table in a logical plan as the Probe Table. Note that the tables in a logical plan are not necessary to be in the same order as they are in the SQL query. Presto has some cost-based optimizers which may reorder the join to keep the smallest table on the right side (i.e. build table) so that it can fit in the memory. If the join reordering optimizer is disabled or if connector specific statistics (for example Hive statistics) are disabled, then Presto will not reorder the join query. In such a situation, it is advisable to keep the smallest table on the right side of the join so that Presto can use it as the build table.
Join Algorithms
Databases use different algorithms to join two tables depending on the data type and join type. For example, SQL Server uses the Nested Loop algorithm, Merge Join algorithm, Hash Join algorithm, and Adaptive Join algorithm. By the time of writing this article, the open-source Presto SQL engine employs the Nested Loop algorithm and Hash Join algorithm to support all different join types discussed in Presto SQL: Types of Joins. This section briefly explains the Nested Loop algorithm and the Hash Join algorithm and discusses the applicability of other algorithms in Presto to improve the performance.
Nested Loop Algorithm
As the name suggests, the nested loop algorithm joins two tables using nested loops. An array joining example is used below to explain the nested loop join algorithm. Suppose you are given two arrays of integers and asked to print the Cartesian product of those arrays, how would you solve this problem? A naive approach to print the Cartesian product of two arrays is given below.
The above code has a loop inside another loop to print the Cartesian product of two arrays. The above problem is a modest version of cross-joining two tables in a database where each table has a single column. The nested loop algorithm takes O(n²)
time complexity because it has to join each row from the probe table with every row from the build table. Since every combination is required, the cross-join operation cannot be performed better than O(n²)
time complexity. Presto is using the nested loop algorithm to execute cross-join operations and this is why cross-join takes a long time if the joining tables are extremely large. It is not recommended to join two large tables without a join condition because of the O(n²) time complexity.
Hash Join Algorithm
The hash join algorithm generates a hash key for the columns in the build table which are used in the equality join conditions like left.x = right.y AND left.z = right.w
. Each of such equality conditions is called join equi criteria. Though the term equi criteria is widely being used in the database domain, they are also known as equality conditions. To make use of a Hashing algorithm, let’s consider the problem of printing all Customers along with their Order information. The Customer and the Order classes used in this problem are defined below. Observe that both classes have a common attribute: custKey.
Coming back to the question: how can we print all customers along with their orders? Knowing the nested loop algorithm, one can simply apply the nested loop algorithm with an if condition inside the loop as given below:
Though the nested loop join works, it is inefficient as it iterates n² times given n customers and n orders. An efficient solution can be developed using a Hashtable to store all Orders using the equijoin criteria: custKey as the hash key. Then as we iterate through the list of Customers, we can generate the hash value of Customer.custKey and get the list of Orders having the same custKey as given below:
In the above algorithm, separate chaining is used to avoid hash collisions because there is a high chance of having multiple Orders placed by the same customer. A hash value using the equijoin criteria columns is used to store the build table in buckets. Then the same hashing algorithm is applied to the equijoin criteria columns of the probe table to find the bucket containing matching items. Though the worst-case time complexity of the Hash Join algorithm is O(n²) in an average case, it is expected to be O(n).
The aforementioned problem can be defined as a SQL query as given below to join the Customer table with the Orders table.
All join operations with equijoin criteria are executed using the hash join algorithm in Presto. However, join operations are not limited to equijoin criteria. For example, two tables can be joined if the column value is greater than or less than the value of another column as in the following query:
The Hash Join algorithm is not suitable for join conditions with inequality constraints. First, it is hard to come up with a perfect hashing algorithm that maintains the inequality property of the input (i.e given x > b does not guarantee that hash(a) > hash(b)). Secondly, even if we come up with a hashing function that satisfies the inequality requirement, we cannot simply join all values from a bucket. To join in-equal rows, every row greater/less than the given column should be matched. Therefore, Presto is using the Nested Loop algorithm with a filter instead of the Hash Join algorithm to perform joins with non-equijoin criteria.
Having at least one equijoin criteria in Presto SQL improves the performance a lot. Though the open-source Presto SQL employs only the Nested Loop algorithm and the Hash Join algorithm for join operations, Merge Join is another well-known algorithm used in relational databases. The following section introduces the Merge Join algorithm and explains why the Presto community did not consider adding support for the Merge Join algorithm.
Merge Join
The merge Join algorithm is coming from the well-known Merge-Sort algorithm. The merge-sort algorithm has two phases: sorting and merging. Assuming two arrays are already sorted, they can be merged in O(n)
time complexity. Presto could implement this algorithm by sorting the build table and probe table using the columns used in equijoin criteria and then by performing a merge operation. Ignoring the sorting part, the merge join algorithm is expected to perform better than the aforementioned algorithms but the Presto community found it requires both tables to be sorted in memory which is time-consuming in the world of Big Data and maybe even infeasible considering the limited memory. However, if there is a chance for the data to be sorted in the underlying data source, the merge join algorithm can be a better candidate.
In my opinion, if the build table is small enough to fit in memory it wouldn’t be a bad choice to sort it and compare the probe table rows with the build table using a binary search algorithm. It may improve join operations with inequality conditions like greater than or less than. Presto also supports relational databases which usually have less amount of data compared to big data storage. If the use case is joining two tables from a relational database or if a table from a relational database is joined with a table from a Hadoop filestore, there is a chance to ask the underlying relational database to return sorted results. Therefore, I feel that Merge Join is still a candidate to consider even in the big data domain.
Keeping my thoughts aside, the Presto community has done a good job already and they keep improving the engine to conquer the big data domain. In another article, I will cover how distributed join works in Presto. If you have any questions, please comment below. I will try my best to clear your doubts.