* This article was written by Cristian Scutaru and originally published on Medium: "All about the new Snowflake Hybrid Tables" (Snowflake publication, Jul 2022).
Here is a collection of everything we know at this point about the new feature announced at the Snowflake Summit 2022: Hybrid Tables, for OLTP workloads. The feature is still in private preview at this moment.
Unistore will support three different workloads
- Snowflake is still primarily a data warehouse, where the storage is optimized for OLAP (OnLine Analytical Processing). Different entries within each micro-partition, and even JSON data within a VARIANT cell, are stored compressed in a columnar format, to allow this kind of optimization.
- The second workload already supported by Snowflake is the Data Lake. We can access structured and semi-structured data from cloud object storage through external tables. References can point to unstructured data, such as videos, images or PDFs.
- The new “Unistore” will include a third type of workload, optimized for OLTP (OnLine Transaction Processing), through these new “hybrid tables”.
Why is this feasible, and why is it here to stay?
All data warehouses — including Snowflake, Amazon Redshift and Google BigQuery — use compressed columnar data storage, optimized for analytical workloads. Amazon Redshift was initially a PostgreSQL relational database, changed to store data by column instead of by row. OLAP queries may take seconds or minutes to process huge quantities of data, usually through grouping and aggregations. A typical example is a month-to-month report of all company sales.
The new hybrid tables will simply store data by rows instead, just like typical OLTP relational databases such as Oracle, Microsoft SQL Server or PostgreSQL. This by itself allows faster individual reads and writes. OLTP transactions must have sub-second latency, from potentially large numbers of users in parallel. Typical examples are getting all past orders for a customer (a key-based lookup) and adding one new order, with all its related data, through an ACID transaction. Proper ACID transactions are a must, with row-level locking.
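As a sketch, a typical OLTP access pattern on a hybrid table might look like this (table and column names are hypothetical):

```sql
-- Fast key-based lookup: all past orders for one customer
SELECT * FROM orders WHERE customer_id = 1042;

-- Add one new order with its related data, as a single ACID transaction
BEGIN;
INSERT INTO orders (order_id, customer_id, created_at)
  VALUES (90001, 1042, CURRENT_TIMESTAMP());
INSERT INTO order_items (order_id, product_id, quantity)
  VALUES (90001, 555, 2);
COMMIT;
```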
For hybrid tables, query performance might be affected when performing large range scans or complex analytic calculations. Consider preprocessing some data to reduce the load.
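One possible way to preprocess, assuming the same hypothetical table names: periodically materialize heavy aggregates from the hybrid table into a standard (columnar) table, keeping large scans off the row store:

```sql
-- Periodically refresh a standard table with pre-aggregated data,
-- so analytical reports do not range-scan the hybrid table
CREATE OR REPLACE TABLE daily_order_totals AS
SELECT created_at::DATE  AS order_date,
       COUNT(*)          AS order_count,
       SUM(total_amount) AS revenue
FROM   orders
GROUP  BY order_date;
```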
What else will make the hybrid tables faster?
Most data warehouses, Snowflake included, do not enforce referential integrity, in order to speed up analytical queries. OLTP transactions are rather about fast access to individual records, or to a small number of records, by a key. For this, they need enforced primary keys. To support different access patterns, they may also need secondary indexes.
Remember that, until now, Snowflake did not support indexes; a new SHOW HYBRID INDEXES command will display them. The new hybrid tables will require primary keys — with the PRIMARY KEY clause — to prevent inserting duplicates. They will also support optional secondary indexes — with the INDEX keyword, or the UNIQUE keyword for a single column with unique values.
Referential integrity is also usually enforced in OLTP systems. There is speculation that Snowflake could go even further and distribute data horizontally using partition keys, the way most key-value NoSQL stores do.
Transaction scope will be limited to the database where the hybrid table resides.
How to create a hybrid table
With the typical DDL command, and with a new HYBRID keyword:
CREATE [OR REPLACE] HYBRID TABLE [IF NOT EXISTS] <table_name> ...
The command will declare one or more columns as the primary key, plus optional (unique or not) secondary indexes. CTAS (CREATE TABLE … AS SELECT) and CREATE TABLE … LIKE can also be used. In the metadata views, the new table type will appear with an is_hybrid flag set to True.
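A minimal sketch, assuming the announced syntax and hypothetical names — a primary key, a unique column, and a non-unique secondary index:

```sql
CREATE OR REPLACE HYBRID TABLE orders (
  order_id     NUMBER PRIMARY KEY,     -- required, enforced primary key
  order_code   VARCHAR UNIQUE,         -- unique secondary index on one column
  customer_id  NUMBER NOT NULL,
  created_at   TIMESTAMP_NTZ,
  INDEX idx_customer (customer_id)     -- optional non-unique secondary index
);

SHOW HYBRID INDEXES;        -- display the secondary indexes
SHOW TABLES LIKE 'ORDERS';  -- the is_hybrid flag appears in the output
```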
Materialized views, streams, replication, cloning and data sharing may not be supported right away for hybrid tables. But support will be added in time, as it already exists for time travel, data governance and user-defined functions.
Billing the new OLTP workloads the same way could pose some problems. Expect different pricing for the new hybrid tables.
Can we combine the different workloads?
The ideal would be one single table supporting both OLTP and OLAP workloads. After all, Snowflake tries to avoid data silos. But we know this is very difficult to achieve.
What Snowflake already does: when data is stored in a hybrid table, in row format to support fast single-row transactions, it is also automatically copied into a columnar format. This means transactional data is immediately available for analytics as well. When querying large quantities of data, going through the columnar format can make the query up to 50x faster.
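In principle, the same hybrid table can then serve both access patterns (hypothetical names; total_amount is an assumed column):

```sql
-- OLTP: sub-second single-row lookup, served from the row store
SELECT * FROM orders WHERE order_id = 90001;

-- OLAP: large aggregation, transparently served from the columnar copy
SELECT DATE_TRUNC('month', created_at) AS month,
       COUNT(*)                        AS order_count,
       SUM(total_amount)               AS revenue
FROM   orders
GROUP  BY month
ORDER  BY month;
```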
This approach is called HTAP (Hybrid Transactional and Analytical Processing), and it has already been used in the past for OAP (Operational Analytical Processing).
Writing to two different data stores internally, supporting ACID transactions, and ensuring strong consistency for the internal replication are obvious challenges, and we may in fact pay a price in performance. It remains to be seen whether operations on these tables are as fast as in Oracle or other similar database systems.
What about the competition?
With these types of supported workloads — OLAP, OLTP and data lakes — it looks like Snowflake can target different markets:
- Other similar cloud data warehouses — like Amazon Redshift and Google BigQuery — do not yet support transactional workloads at this level. Their data is stored compressed and optimized by column.
- Relational database systems like Oracle, Microsoft SQL Server or even PostgreSQL are primarily transaction-oriented. But they are also frequently used for analytics, or have tightly integrated OLAP-specialized products. Expect more incentives now for people to migrate from their RDBMSs to Snowflake.
- I see Apache Spark, and in particular Databricks, as major competitors to Snowflake in the area of compute engines. While Databricks can perform analytics on top of various existing big data storage, Snowflake has much better data governance. Both work on data lakes and data lakehouses, but Snowflake has now extended the possibilities with this mix of OLAP and OLTP support.
- It’s safe to assume that hybrid tables are here to stay, even if they are still in private preview. I see no reason why we could not already start selling this big important feature to our own clients.
- Transactions on the new hybrid tables should be fully ACID-compliant. This means that nested transactions should roll back as well when the containing parent transaction fails. On standard tables this used to be a problem, as nested transactions were always committed separately, no matter what.
- We should start expanding our Oracle/SQL Server/PostgreSQL to Snowflake migration offers with the transfer of OLTP workloads. However, some PoCs and benchmarks are needed to truly understand how fast this data access can be compared to the established products.
- HTAP is a great technique, but writing the same data twice — in both a row-based and a columnar format — may introduce some delays. Once again, I see a general need for valid, universal benchmarks, followed by a case-by-case analysis of each client's workloads.
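As a closing sketch of the expected ACID behavior (hypothetical names again), a rollback should atomically undo every statement in the transaction — nothing becomes visible to other users:

```sql
BEGIN;
UPDATE orders SET customer_id = 1043 WHERE order_id = 90001;
DELETE FROM order_items WHERE order_id = 90001;
-- Changed our mind: undo both statements atomically
ROLLBACK;
```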