This article is by Sean Adams and originally appeared on the Alteryx Engine Works Blog here: https://community.alteryx.com/t5/Engine-Works-Blog/Beyond-Single-Pass-Analytics-What-Changes-With-Scale/ba-p/513519
When we begin with analytics, either as a new practice within your firm, or starting a team in a new area–much of your work may initially be what I call "single-pass analytics," where the data goes from source, to prep, straight to the end-point.
This is natural and normal in the beginning. The question for this article is, "does this change as you scale, and if so, then how?" From my experience, our work changes significantly with scale across many different axes, and this series of articles will tackle each of these one by one:
- How we think about data in the analytics process
- Where to apply data & analytics
- Reusable tools and widgets (accelerators)
- Skills & Shared methods
Part 1: How we think about data
For this first article, we will cover how we think about data, and how this grows as your analytics efforts scale.
Broadly speaking, data analysis has a predictable set of steps. On different projects, these may be done in parallel, in another order, or you may group together (or some may have been done for you by a previous project or a central team).
- Understanding: What questions are we trying to answer?
- Sourcing: Finding and getting access to the data you need. Often there are several different pieces you need.
- Cleaning: Fixing the data so that you can analyze it - think about removing duplications, fixing missing values, correcting dates & text.
- Enriching: Adding onto the data - for example, if I give you the 2-letter state code (NJ or NY), adding on the full state name, or possibly state population.
- Preparing: Joining to other data, summarizing, etc.
- Validating: Does the data make sense? Can we confirm that we haven't lost or invented anything along the way?
- Analysis: Figuring out what the data is telling you.
What are type 1 and type 2 analytics (or Mode 1 and Mode 2)?
- Type 1 analytics is where you are focusing on a consistent and repeatable reporting/tracking. Examples may be tracking key risk indicators for your team (attrition, client churn), or performance measures (customer conversion rate, manual touches on a production process, etc.). The key here is that the reporting/analysis is somewhat consistent over time, so you need a degree of stability and can often define requirements up-front.
- Type 2 analytics are exploratory/discovery-based. You may not know upfront what you're going to find or what exactly you're looking for, so it's harder to define a plan with a fixed timeline. When you're doing exploratory analysis (internally I call this "rapid tear-down analysis"), you may not benefit from doing a large amount of data-engineering or building for long-term maintenance because you're still discovering what's important. Moreover, often once you find the important signals, you'll build these into a Type 1 pipeline.
Why is this distinction useful? Well, you will think differently about type 1 analytics than type 2 and manage the project differently. The classic type 2 project in our world is when the boss says, “please can you figure out why X happened?” Where classic type 1 would be “please can you give us P&L by department every month?"
Where do we spend our time?
Our experience is that at least 80-90% of the time spent in analytics is spent on sourcing, cleaning, enriching, and preparing data, with the remainder spent on insight. So, once you've done your first project, you may already be starting to think about how to store this data so that you don't have to re-do all this work for the next project.
How does this change as you scale? This will never be the exact stages that you and your team/organisation go through, but it may be useful to help us follow the story. For this article, we will focus on how your treatment of data changes, and later we will cover culture, skills, and reusable components/widgets, etc.
Stage 1: The Beginning
You may start with network drive with spreadsheets (try to create a good folder-structure with names, for example, client data in one folder and product data in another). The limitation is that this doesn't scale well (more than a million rows becomes slow) and it's not great for a team of 2 or more.
Every team is doing single-path analytics – the data starts in raw sources, is processed in Alteryx, and then output into a report, alert, or another file.
Stage 2: Starting to Store Your Results
From there you can graduate to a file-based database like SQL Lite or MS Access to store some work-in-progress
- At this stage, you may need an Alteryx Server to automate these kinds of tasks.
Stage 3: Bigger Data = Bigger Database
If you continue to be successful, you may need to get a database so that you can store your data, and to process larger volumes. Note: as soon as someone says, "how does this measure compare to last week?" - you probably need a database to start keeping history.
At this stage, your data picture above has changed:
- From Source → prep → consumption
- Now it goes from source → prep → database → consumption
Stage 4: Do we all need the same ingredients?
Once you have 3 or 4 people working on related datasets, it's important to talk about data and how you can reuse. There are tools to help like Alteryx Connect or even a simple data dictionary. Don't overlook the power of a chat room, too. ("Does anyone have data about Giraffes?") You begin to have multiple different Alteryx jobs running from the same data and you need to think about notifying folks before changing datasets that may be used by other people. This requires data lineage (again, Alteryx Connect can help).
You will also need to think about data governance. California just passed the California Consumer Protection Act, which gives consumers rights over their data. You will need specific focus on data retention, accuracy, and access control.
Stage 5: Data Engineering becomes a role
If you are successful in stage 4 and your team continues to grow, you may find that you are starting to need several different versions of a piece of data. One person needs sales summarised by month, another person needs only sales for stores in western states where a promotion was underway. Rather than having five different people connecting to your sales system and all of you pulling subsets of the sales data into your reporting tables, your data picture now changes again:
Sources → Raw/Staging → Clean, Enrich, Conforming, Prep → Publication to end metrics env
Source: Microsoft Enterprise Data
3 key takeaways
- Source the data once, clean it, and enrich it. This cleaned data becomes a team asset
- Engineer the data for easy analytics, open this up to a broad audience of people who can consume it
- This accelerates your analytics in 3 ways:
- Less time looking for/sourcing/cleaning data
- Quicker time to insight because more people can access the cleaned data using various end-user analysis tools (you can spend more time on data and less time adjusting a report)
- Shared data assets (Enterprise shared dimensions) mean that you immediately get consistent analytics without having to manage raw data
NOTE: we’re not advocating that every project needs the full force of the Stage 5 model – there will be many type 2 analytics projects that need a quick turnaround and are done in a straight-through analytics pipe. However, as your data assets become cleaner and more managed, your entire organization will feel the benefit of faster and cleaner analytics.