Inheriting SAS jobs that are held together with duct tape and bubble gum
A plumber gets a call from two different customers, both need help with a leaky pipe. The plumber gets to the first house, goes into the basement finds the pipe that’s leaking just needs to be tightened, but other than that the plumbing is in perfect condition.
When the plumber arrives at the second house, in the basement all of the pipes are held together with duct tape, bubble gum and cloth rags. The plumber informs the customer the entire plumbing needs to be ripped out and redone.
Both customers have the exact same problem, a leaky pipe. The first job costs $50 bucks and takes 10 minutes to fix. For the second job, the plumber creates a quote for tens of thousands of dollars to replace all of the pipes.
What’s does plumbing have to do with Data??
I know you’re thinking, why the heck are you talking about plumbing? I came here to learn about data, not plumbing. It’s weird I know… but I love analogies and metaphors, it’s why my favorite song writer is Paul Westerberg. As I continue to work with people that are in the process of modernizing old SAS, SQL or MS Access projects, I keep coming back to this metaphor of duct tape and bubble gum.
Here’s a common scenario: You get a new role at your organization. As a result, you inherit a pre existing SAS job. There are 2 ways this can go:
#1: The SAS code was built by a very skilled developer and they left detailed notes on how the program code works.
#2: The code doesn’t make a lot of sense and over time has become slower to execute on growing data volumes. You ask your colleagues about the project and nobody knows anything about it. It turns out it was created about 10 years ago by some guy named Bob. Also, Bob deleted his Facebook page, threw his phone in the trash, moved to an island somewhere in the Pacific and nobody’s heard from him since.
If there is a problem with the downstream reporting in either scenario above, with #1 it can be an easy fix. With #2 you might as well rebuild from scratch. Does the metaphor begin to make sense now?
How can Trifacta help?
Trifacta is a data preparation platform that utilizes visualization and intelligent guidance powered by machine learning to accelerate how data is prepared for analytics or reporting. Because you interact with the data in real time with your mouse, you don’t need to learn any new code in order to prepare it. Also, because each preparation step is recorded in natural language, it’s very easy for someone new to step in and understand exactly what is happening in a workflow. There is also the ability to add comments to provide more context to anyone who may pick up the project after you. Here’s an example of a Trifacta recipe with four different steps:
Let’s compare SAS code to Trifacta recipes. In the example below, both SAS and Trifacta are removing blank values from 2 columns (var1 and var2) and then combining those columns together. The SAS code on the left is exactly what you’d expect, it looks like coding language and is understandable if you are proficient in SAS. The Trifacta Recipe on the right is logged in natural language, but accomplishes the same result. Almost anyone would be able to open this recipe and see what the steps in Trifacta are doing.
When using SAS code, you typically need to know exactly what you need to do before you start. In Trifacta’s grid interface you can interact with your data, which will generate suggested transformations that you can use on your data. This allows you to discover possibilities within your data while you work. Below we’ve simply highlighted over the dash in the user_id column. To the right, Trifacta has provided suggestions of possible transformations.
Predictive transformations in Trifacta is really helpful for 2 reasons:
- It allows you to see how your transformations are impacting your data in real time with a preview. You can see the result before you actually add it to your flow. If you look at the SAS example above, there is the run command (run;) at the end. You have to run this code at scale in order to see what it does. In common scenarios where SAS scripts are 100s of lines long, this can be an incredibly painful process. Trifacta utilizes data sampling in the grid interface, which allows you to generate the preview of results in real-time.
- The Trifacta suggestions allow you to discover functions that might be more efficient or effective to use. The recipe can be edited at any point, so you have the freedom to try out different things before impacting your overall job.
The Moral of the Story
SAS code is like a large, traditional tool box that every plumber carried around in the past. The tool box was large and cumbersome, but gave the plumber the ability to pretty much fit the pipes under every condition. It got the job done, but required a huge amount of manual effort. Modern day plumbers are learning new techniques and have better equipment, allowing them to make less faulty pipe systems,that scale to large volumes of water more quickly and with a smaller toolbox.
Just like plumbing, if your data analytics stack is old and clunky, there’s going to come a point where you will need to modernize and possibly rebuild. The question will be do you want to continue to use the jury-rigged solution and throw some duct tape on the problem so it lasts a little while longer, or do you want to properly fix it?