From data warehousing to data lakes – now what?

This article is by Victor Ng and originally appeared on the Frontier Enterprise Blog here: https://www.digiconasia.net/features/from-data-warehousing-to-data-lakes-now-what

 

Processing, sharing and securing the immense amount of data in organizations today require a strategic platform built for the cloud.

Data warehouses are simply not optimized for processing unstructured data. That’s where data lakes made their entrance.

But organizations today require systems for diverse data applications for both the cloud and on-premise – including analytics, real-time monitoring, data science, machine learning and AI – to derive value from structured and unstructured data, and everything else that falls in between.

A common approach has been to use multiple systems – a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph and image databases. However, this introduces complexity and delay as data professionals invariably need to move, copy or share data from system to system.

DigiconAsia sought out some insights from Peter O’Connor, Vice President APJ at Snowflake, into the challenges and strategies to ensure the right data is available to the right people at the right time.

Why are traditional data warehouses not suitable for the data needs of organizations today?

O’Connor: Legacy or traditional data warehouses have too many limitations and overheads compared to the modern ‘built for cloud’ SaaS alternatives that are available today.

These limitations include a lack of elasticity and scalability to cater for the increasing volumes of workloads and users (either for concurrency or data science), security deficiencies, and difficulty natively ingesting all data types such as semi-structured data (JSON, XML, Avro, Parquet, and ORC). Legacy warehouses are also excessively expensive, operationally challenging, and short on modern features and functionality. This translates into system outages for upgrades, increasing operational expense, general lack of flexibility, delays in commissioning new users and workloads, and an overall inefficient environment.

There is a growing customer need for consumption/OpEx cost models, instant resource availability, secure real-time data sharing, native ingestion and joining of all data types, faster performance, and availability on multiple public cloud platforms, which allows choice.

Snowflake resolves all of the above challenges.

What about data lakes? What are their advantages and disadvantages in the cloud era?

O’Connor: Siloed data doesn’t allow users to gain timely insight into different data collectives. The cloud allows companies to securely store all of their data in a single low-cost location. Once there, it’s available to a wider user base to query. Having all your data in a single data lake makes administration simpler and more efficient. On-premise data lakes, such as those built on Hadoop, have considerable limitations. They are complex and costly to manage. It is not so difficult to get data in, but it is challenging to get meaningful insight out.

Innovative SaaS companies like Snowflake offer features like zero-copy clones for Dev/Test, which reduce the need to store multiple database copies and so save storage and operational costs. Clones can be provisioned instantly, saving days or weeks of time and speeding up testing and application development. There is no limit to the number of clones users can create.

Time travel allows users to query data as it was at a point in time up to 90 days in the past. This effectively removes the need to complete daily backups, which also saves infrastructure cost and operational overhead.

Enterprises need more and more data, for better applications of AI and analytics, besides governance and compliance requirements. But how could infrastructure keep up with the data explosion?

O’Connor: All enterprises have growing amounts of data both structured and semi-structured. The boom in IoT data, weblog data, sensor data etc. creates opportunities to join these different data sources together with structured data into single queries allowing deeper insight.

On-premise infrastructure can’t keep up. Customer data centers are costly (power, cooling, real estate etc.) plus on-premise technology is expensive compared to cloud alternatives. Customer data center technologies are becoming obsolete faster than ever.

Growing amounts of data require infrastructure and cost models that can accommodate fluctuations in demand, including scaling compute and storage independently of each other. Snowflake is the most economical and flexible platform to adjust to changing user needs. It is infinitely scalable, i.e. any amount of resource is instantly available when customers need it and can then be scaled down as demand dissipates.

Being in the cloud gives companies a more efficient way of sourcing third-party data, which can be instantly added to enrich their own data. Once in Snowflake, companies can share data with partners, suppliers, regulators etc. in real time, and can monetize that data should this be of interest.

We see the need for better and faster data sharing across cloud and on-premise infrastructures. How could this be achieved today?

O’Connor: Data sharing methods for on-premise solutions haven’t changed in decades. It’s FTP or EDI, which is often done infrequently and in an insecure way. Often the requesting or target entity receives out-of-date data simply because transmission methods are slow and cumbersome.

Today, there are more efficient and secure ways of sharing data when in the cloud. Snowflake users can share data in real time without the data moving from where it’s stored. For example, users can securely share any amount of data in real time with whomever they choose. Data sharing can be executed across different public clouds and across multiple global regions. Real-time data sharing has significant benefits including optimizing supply chains, faster decision making, increased productivity, and potential data monetization through public and private data exchanges.

What about data security, with so much data being shared and transmitted across a wider network of systems?

O’Connor: Data security is mandatory for sensitive data. Regulators and other industry governing bodies have strict guidelines around data security. Data stored in the cloud is no exception.

Customers should explore all levels of data security before committing data to the cloud or to any platform for that matter. Data should be encrypted at rest and in flight at all times. Some vendors charge for encryption; others like Snowflake provide this as a standard built-in feature.

Snowflake is the industry leader in cloud security for data warehousing/data lakes. Security was a foundational pillar of Snowflake’s architecture from day one.

Features include site access controlled through IP whitelisting and blacklisting, managed through network policies; multi-factor authentication (MFA) for increased security for account access by users; OAuth for authorized account access without sharing or storing user login credentials; support for user SSO (single sign-on) through federated authentication; controlled access to all objects in an account (users, warehouses, databases, tables, etc.) through a hybrid model of DAC (discretionary access control) and RBAC (role-based access control); encryption of files stored in stages (for data loading/unloading); periodic rekeying of encrypted data; and support for encrypting data using customer-managed keys. Finally, customers should be requesting vendor security validations such as SOC 1 Type 2 compliance, SOC 2 Type 2 compliance, HIPAA compliance (where necessary), and PCI DSS compliance.