This article is by David Hare and originally appeared on the Alteryx Data Science Blog here: https://community.alteryx.com/t5/Engine-Works-Blog/Presto-will-it-Alteryx/ba-p/420935
I'll start this series with Presto, a technology that many customers have been asking about recently. Presto is a distributed SQL query engine that can work with many data source systems. It uses a MPP (massively parallel processing) design where SQL queries can be processed in parallel across a cluster of machines. This may sound identical to other MPP style data sources like Hadoop or Teradata, however the major difference here is Presto is not storing the data. In fact, Presto is much more flexible and pluggable as it can be configured to query multiple back end source systems through "connectors." Many organizations are adopting Presto as a single engine to query against all available data sources.
In order to find out if Alteryx can work with Presto, I needed an environment to test. Presto follows a familiar architecture where you have one Master and multiple Workers. In Presto terms the master is called the "Coordinator" and is responsible for handling client connections and distributing work to the workers. The workers retrieve the data from the source systems defined by the connectors, and work together to process the requests directed by the Coordinator.
This fits perfectly with Hadoop architecture so I collocated the Presto Coordinator on a Hadoop NameNode and configured Presto Workers on 3 Hadoop DataNodes. After configuring the workers and coordinator to all talk to each other, I was able to validate my new Presto cluster using the Web Interface:
Connectors and Catalogs and Schemas! Oh My!
With Presto there are a couple new terms to be aware of in addition to the database schema that define how to connect to data sources. A "connector" defines how Presto connects to the back end data source system, including the server hostname, port, and credentials. A "catalog" contains schema information and references a specific connector. All of these are easily configured in properties files, and in no time at all I had connections to Hive, MongoDB, PostgreSQL, Amazon Redshift, and Microsoft SQL Server:
[root@presto-coordinator catalog]# ls -alh
-rw-r--r-- 1 root root 100 May 22 15:41 hive.properties
-rw-r--r-- 1 root root 82 May 22 14:38 mongodb.properties
-rw-r--r-- 1 root root 133 May 22 14:25 postgresql.properties
-rw-r--r-- 1 root root 131 May 22 15:19 redshift.properties
-rw-r--r-- 1 root root 146 May 22 14:48 sqlserver.properties
Querying Your Data
Presto includes a command line tool for running queries against specific data sources that have been configured in your Presto cluster. It can be launched against a catalog and schema to allow users to run queries interactively:
[root@presto-coordinator presto]# ./presto --catalog hive --schema default
presto:default> show tables;
Query 20190529_124127_00007_k5hzu, FINISHED, 4 nodes
Splits: 53 total, 53 done (100.00%)
0:00 [4 rows, 513B] [71 rows/s, 2.57KB/s]
Or, the query can be specified directly on the command:
[root@presto-coordinator presto]# ./presto --catalog hive --schema default -- execute "SELECT * FROM acmebank_stress_test WHERE loan_type='Term loan' AND collateral>=100000;"
"2008-06-30 00:00:00.000","100012891","36","32","BB","Pool5","1952021.0", "Term loan","","1764250.0","112457.708875471","506059.689939621", "-13899.7728170082","-13775.9730136383","-110529.743351998","","","", "1764250.0","0.007967812730301","0.0","0","253029.84496981","197363.279076452", "168686.563313207","0.151388441875723","267087.058579244","154629.349703773", "0.087645940033313","0.26293782009994","510786.287443657","506059.689939621", "0.990746428359126","0.15","47","2","WA","West","Retail Lending","9","Mortgage" ,"Consumer","Bank B","Collateral"
Will it Alteryx?
With an environment up and running, it's time to answer the question on everyone's mind, "Presto, will it Alteryx?" The answer is yes, it will! Although not listed as a Supported Alteryx Data Source, Presto data can be read in through the standard Input tool using a Generic ODBC connection. To support this, the machine where Alteryx is running (server, laptop, desktop) needs the Presto ODBC Driver installed.
Additionally, an ODBC configuration must exist for each data source you intend to query. The Presto Coordinator hostname and port must be specified, along with any credentials required to make the connection:
Catalog and Schema selections allow you to specify the intended data source to connect to:
Once the ODBC Driver has been configured, the Input tool can be easily configured to pull data from Presto into Alteryx Designer for processing:
While this blog is not meant to provide performance details, I did observe consistently faster results reading a test table from Presto (~12 seconds) compared to reading from Hive (14 sec) or Impala (24 sec). Additionally, I was able to use the In-Database tools with Presto by specifying a Generic ODBC Connection and the processing completed in under 3 seconds!
Alteryx working with Presto opens up many opportunities for data analysis as Presto continues to adopt more connectors to other data sources. Currently not listed as a supported data source for the standard input tools or In-Database tools, Presto can still be queried using a generic ODBC connection. With Presto's growing popularity it is likely to be considered as a future addition to the supported data sources list.
If you have any technologies you would like to see explored in future installments of the "Will it Alteryx" series, please leave a comment below!
- "presto" is Italian for quickly.
- Presto was originally developed at Facebook to provide faster queries against data in Hive.