This article was written by Derek Slater and originally appeared on the Snowflake Blog here: https://www.snowflake.com/blog/reducing-mass-data-fragmentation/
A few years ago, Kent Graziano joined a big organization to work on its data. The first problem was that nobody really knew what data existed or where it lived. Graziano spent his first three months on the job investigating data sources and targets, ultimately creating an enterprise data map to illustrate all the flows. It wasn’t pretty.
“In the end, I discovered that the same data was being sent to three or four places,” he said. In one case raw data was transformed and stored in a data warehouse, then moved from there into another warehouse—which was also pulling in the original raw data.
Graziano, who recently retired from his post as Chief Technical Evangelist at Snowflake, said this scenario is entirely common. Data scattered and copied in lakes, warehouses, data marts, SaaS platforms, spreadsheets, test systems, and more. That’s mass data fragmentation, or, more colloquially, data sprawl or data puddles.
Indeed, 75% of organizations do not have a complete architecture in place to manage an end-to-end set of data activities including integration, access, governance, and protection, according to IDC’s State of the CDO research, December 2021. This lack of governance combines with legacy systems, shadow IT, and good intentions to pave the road to a lot of fragmentation.
While achieving a single source of truth isn’t always immediately realistic, it’s increasingly vital to reduce the number of data puddles everywhere in order to increase the efficiency, accuracy, consistency, and value of an organization’s analytics work.
So What? How Data Sprawl Hurts Business
To understand the full potential payoff for getting it right, it’s worth diving further into the causes and impacts of today’s fragmented state.
Graziano cited another company that he found storing the same several hundred terabytes of data in three different places. “They had an Oracle data warehouse that was normalized, but that server didn’t have enough power so they put their dimensional models on another one, and then a Hadoop infrastructure for the data scientists” to analyze the same information, he said.
Mergers and acquisitions are certainly one source of the problem.
“The problems are technical debt and shadow IT,” said Wayne Sadin, an analyst for Acceleration Economy with three decades of experience in CIO, CTO, and CDO roles. “You buy 12 companies and you have 172 databases—14 aren’t made anymore, 6 don’t have an owner anymore—and then 500 spreadsheets…” He related the story of a big client whose largest database was connected to a PC under someone’s desk. This fact was only discovered by luck when the IT department moved to a different location.
Sadin said bringing IT in at the very end of M&A discussions means there’s no opportunity to start building a truly thoughtful integration plan. “George Shultz once said, ‘If you want me in on the landing, put me in on the takeoff,’” said Sadin.
Beyond mergers, Graziano contends that data silos often proliferate because the business is trying to solve performance problems with a one-off approach, rather than creating an overarching go-forward architecture. These solutions can help address the need of the day, but the total scale and impact on cost, performance, and data fragmentation can be hard to calculate.
“There isn’t a really good evaluation methodology other than to challenge the vendors on their story, talk to verifiable references, and find your peer—‘Can my architect talk to your architect?’” Graziano said.
“Query acceleration software, data virtualization, in-memory analytics software … all that is trying to solve underlying performance problems with your architecture,” he said. “If you’re writing queries in a SQL layer, but it pushes those queries down to be executed on the source systems, it will inevitably have a performance impact.”
Sadin pointed out that shadow IT reproduces essentially the same problem. And line-of-business employees who turn to public cloud storage or unauthorized applications aren’t the root issue. Instead, the problem often stems from the way IT budgets are governed.
“The business has a problem to solve, so they go to the investment committee and say, ‘We need X dollars.’ Then the typical answer from IT is ‘You can have 80% of what you’re asking for,’” he said.
“But the business still has to solve the last 20%. So they go find a solution. These days it’s very inexpensive to get into a data solution at a low cost. So just like we had app sprawl, now we have data sprawl.”
The most obvious impact of all this fragmentation is massive overspending on duplicate data storage, but Graziano and Sadin both agreed that’s actually just the tip of the iceberg.
Worse, “it leads to the infamous ‘competing results’ in executive meetings,” said Graziano. Different groups performing similar analysis on different data sets—perhaps working off data pulled a few hours apart—claim different outcomes and advocate for different decisions.
Conflicting reports, business decisions based on outdated data, predictive models built on incomplete data—the list of negative effects goes on.
The Road to Unified Data
So how can organizations solve mass data fragmentation? Ultimately, the answer will lie in unified architecture and governance.
Graziano advocates for a three-tier data architecture, consisting of:
- the raw data
- the transformed, cleaned, normalized data
- a presentation layer
The first layer has to be persistent wherever the organization needs to maintain traceability and auditability, Graziano said: what “in the old days we would have called a persistent staging area.”
The second layer is a “curated” or golden layer, borrowing terms from master data management: the historical, time-stamped repository that becomes the single source of facts.
Last, there’s the consumption layer, “where you put the picture together that makes sense to the business,” he said. While the data scientists might be looking at that semi-raw, stage-two data, “the business doesn’t need to see that. They want multidimensional views, the ability to find data in a format they can understand.”
This approach, said Graziano, effectively reworks the traditional ETL (extract, transform, load) process into ELT: “The goal is to move the data once, then use it many times.”
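The three-tier, move-once/use-many flow Graziano describes can be sketched in miniature. The following is a hypothetical illustration only (the table names, columns, and sample data are invented, and Python’s built-in SQLite serves as a stand-in for a real cloud warehouse): raw data is loaded once and kept for traceability, transformed in place into a curated layer, and exposed to the business through a presentation view.

```python
import sqlite3

# Stand-in warehouse; in practice this would be a shared cloud platform.
conn = sqlite3.connect(":memory:")

# Tier 1: raw data, loaded once and kept intact — the "persistent staging area."
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, " 19.50", "2024-01-01"),
        (2, "5.25 ", "2024-01-01"),
        (2, "5.25 ", "2024-01-01"),  # duplicate feed record
    ],
)

# Tier 2: curated ("golden") layer — the transform happens inside the
# warehouse (ELT, not ETL): clean types, trim whitespace, deduplicate.
conn.execute(
    """
    CREATE TABLE curated_orders AS
    SELECT DISTINCT id, CAST(TRIM(amount) AS REAL) AS amount, loaded_at
    FROM raw_orders
    """
)

# Tier 3: presentation layer — a business-friendly view over curated data,
# not another copy of it.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT loaded_at AS day, SUM(amount) AS revenue
    FROM curated_orders
    GROUP BY loaded_at
    """
)

print(conn.execute("SELECT * FROM daily_revenue").fetchall())
# → [('2024-01-01', 24.75)]
```

The point of the sketch is that the raw rows are extracted and loaded exactly once; every downstream shape—curated tables, business views—is derived inside the same platform rather than copied out to another system.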
Of course, just as Rome wasn’t built in a day, a new end-to-end architecture doesn’t spring up overnight.
From a practical point of view, Sadin describes a process that unfolds in three different stages, which may happen sequentially or simultaneously. “I call it ‘patch, polish, perfect.’”
Patching is for obvious and acute delivery problems. If a system is failing or out of compliance and has to be fixed immediately, the correct step may be to simply fix the local spreadsheet or database. It’s not a permanent solution, but you can’t wait for one.
Sadin’s “polish” step may involve robotic process automation or other larger-scale work. “It’s not an underlying, architected solution,” he said, but it involves finding additional places to deliver improved business performance and value.
“Now I take a breath and architect, or consolidate data sprawl” in the third “perfect” stage, he said. In the messy reality of CIOs and data professionals, however, most need to work in all three modes at the same time.
The key to succeeding is not to start with the data, but with the business needs. “If I have a few minutes to talk to the CEO, I’d say, ‘Here’s a crayon and a piece of paper, draw me the report you want,’” he laughed. “With each business line, you ask ‘what do you need?’, so you start with the processes.”
Governance and Sandboxes Keep Fragmentation in Check
Even as the consolidated and architected vision starts to take shape, Graziano said there are typically some individuals or groups, sometimes with a good deal of organizational clout, who will insist, “Look, just give me [a copy of] the data so I can transform it on my desktop.”
The organization that gives an unqualified yes to that request is right back on the road to fragmentation. “Governance rules have to stop that,” Graziano said. However, for those with a legitimate business need, a sandbox can be appropriate. “You’re not going to continually update that [copied] data, but you set up a sandbox, let them play around in it, and when they figure out what they need done, only then do you put it into production.” Then the sandbox can be removed instead of perpetuated.
Whatever phases, projects, or decisions your own company requires to get mass data fragmentation under control, the determination and discipline to do so have never been more important. After all, this is a business problem, not a technical problem.
“You need to use data not only better than anyone in your industry, but better than anybody who might enter your industry from the outside,” Sadin said. “The value of managing your data thoughtfully is greater than ever.”