The One Thing Every Data Warehouse Should Consider

August 17th, 2020

This article is by Clive Bearman and originally appeared on the Qlik Blog here: https://blog.qlik.com/the-one-thing-every-data-warehouse-should-consider

 

For the past year, I’ve encountered hundreds of companies that have embarked upon a cloud data warehouse initiative. Although each had different business drivers and motives for adoption, one similarity kept occurring – a commonality so strong that it appeared regardless of use case. Consequently, it didn’t matter whether they were migrating from a legacy data warehouse like IBM Netezza or Teradata, or whether they were starting a fresh implementation with Amazon Redshift, Azure Synapse, Google BigQuery or Snowflake. The common trait that arose was that virtually all, except the odd outlier, didn’t use a data warehouse design methodology.

 

Nothing to See Here, Move Along

“OK, boomer!” I hear you shout. “Data warehouse methodologies are old hat, Daddio!” In fact, I heard much of the same pushback from many prospects. My favorite responses were as follows:

  • “We don’t have time for that. We need to build the warehouse today.”
  • “A data warehouse methodology is too complicated.”
  • “Data warehouse design is too slow.”
  • “We don’t have that sort of expertise.”
  • “We’re only doing a quick project. That’s too bloated.”
  • “We’ll just write SQL for ourselves.”

At first glance, much of these reasons seem fair. After all, legacy data warehouses that relied on methodologies have a less-than-stellar reputation for being responsive to business needs – especially when you compare them to today’s cloud offerings. So, it makes sense to cut out the middleman and pipe the data straight from the source to the cloud. Also, much of the appeal of a cloud data warehouse is that it is infinitely more flexible than a traditional solution. So, what’s the problem?

Getting Out Over Our Skis

It’s true: Cloud data warehouses offer much greater flexibility, because they scale compute and storage to match demand. They are also extremely easy to adopt. In many instances, you can just slide a credit card to get started and provision a data warehouse with a few simple clicks of a mouse. However, it’s this seductive simplicity that encourages us to cut corners, leading us to build and fill multiple data warehouses without a thought.

The more warehouses we build, the more tables, fields and data we need to manage. As our warehouses grow, our focus shifts to wrangling the SQL that manages the data. As a result, what began as a few simple scripts balloons to a mass of spaghetti code. The quick and nimble cloud data warehouse is now difficult to change, and we’ve come full circle.

A Framework For Agility

This is where a data warehouse design methodology can help. The three most common design methodologies in use today are as follows:

1. Inmon – Corporate Information Factory

  • Bill Inmon’s top-down approach models the data warehouse in a hub and spoke pattern, where entities and relationships are in third normal form (3NF). Data marts are created as needed for reporting and analytics.

2. Kimball – Dimensional Modelling

  • Ralph Kimball’s bottom-up approach starts with the metrics and measurements for reporting and creates marts as star schemas. Dimensional modelling optimizes the data warehouse for fast retrieval of data.

3. Linstedt – Data Vault

  • Dan Linstedt’s Data Vault architecture is a hybrid approach that combines the best of 3NF and dimension modeling. This technique enables historical storage of data, integration of data from different operational systems, and lineage tracing of incoming data. The Data Vault approach is based on the concept of Hubs, Links and Satellites.

Although a detailed description of each methodology goes beyond the scope of this blog post, it’s important to recognize the benefits of employing any of the methods with a cloud data warehouse. The chief benefit of using a warehouse design methodology is that it can be automated with tooling. Most automation solutions actually generate the SQL code required to implement the methodology and provide a framework for the entire data warehouse lifecycle. Ultimately, this can save you hours of SQL coding and reduce code management pain, which means you can spend more time on designing and delivering the data for your analytics projects.