All the hype around companies becoming ‘data-driven’ might give one the impression that organizations’ analytics engines are well-oiled machines. In fact, for most organizations, the image of analysts rapid-firing data-backed prescriptions in response to the latest developments couldn’t be further from the truth. More often than not, going from business problem to data-driven solution is a lengthy, complicated process, so much so that by the time an answer is provided, the question itself has often changed.
So what really goes on? Here’s a snapshot of how the proverbial data ‘sausage’ is made.
Finding the data
The analyst must first determine which tables hold the data required to answer the question, and where they reside. At best, this involves extensive keyword searching, which works only if the right keywords were attached to the dataset in the first place. Often they weren’t, and the analyst has to do significant detective work to discover the exact names of the tables they’re hunting for.
Compounding this issue is the fact that many companies employ multiple databases, data warehouses and data lakes. If the data sprawls across OracleDB, Teradata and Hadoop, it may be difficult to compile the information to answer what should be a straightforward question.
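The discovery problem can be sketched in a few lines. Here is a toy metadata catalog searched by keyword; the table names and tags are made up, but they illustrate why keyword search fails when the right tags were never attached:

```python
# Hypothetical metadata catalog: table names mapped to the keywords
# someone (maybe) attached when the table was created.
catalog = {
    "cust_mstr_v3": {"keywords": ["crm", "accounts"]},       # customer data, poorly tagged
    "orders_2023_final": {"keywords": ["sales", "orders"]},
    "tmp_rev_extract": {"keywords": []},                     # no tags at all
}

def search(keyword):
    """Return table names whose keyword list contains the search term."""
    return [name for name, meta in catalog.items()
            if keyword in meta["keywords"]]

print(search("orders"))    # ['orders_2023_final']
print(search("customer"))  # [] -- the table exists, but the keyword was never attached
```

Searching for “customer” returns nothing even though a customer table exists, which is exactly the point at which the detective work begins.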
Moving the data
The next step typically involves copying all the required data into a repository that your BI tools can access. Aside from the time, resources and computing power needed to forklift larger datasets, this creates issues of version control: if you’re working on a copy of the data, an update to the original won’t make it into your analysis.
Deciding which version of ‘the truth’ is true
Assuming you locate all the data you need, you now face duplicates and overlapping datasets. At this point the analyst has to ask: which tables are the most accurate and up to date? Which will require the least preprocessing before analysis can be performed?
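Choosing among overlapping tables often comes down to a crude ranking by recency and completeness. A minimal sketch, with entirely hypothetical table metadata:

```python
from datetime import date

# Three overlapping customer tables (names and stats are invented).
candidates = {
    "customers_v1":     {"last_updated": date(2019, 3, 1),  "rows": 48_000},
    "customers_v2":     {"last_updated": date(2021, 6, 15), "rows": 52_500},
    "customers_backup": {"last_updated": date(2020, 1, 9),  "rows": 52_500},
}

# Rank by recency first, then by row count as a rough completeness proxy.
best = max(candidates, key=lambda t: (candidates[t]["last_updated"],
                                      candidates[t]["rows"]))
print(best)  # customers_v2
```

In practice the ranking criteria are messier (lineage, ownership, data quality checks), but the question being answered is the same: which copy is ‘the truth’?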
Compounding the difficulty are issues of regulatory compliance and customer data privacy that must be resolved before the data can be analyzed. Since GDPR, CCPA and other government and industry regulations impose heavy penalties for lapses in governance and oversight, data that isn’t properly governed can bring a project to a grinding halt.
Mapping out the assembly of the data
In cases where data must be combined from multiple tables, the analyst then has to map out how to assemble it. This involves, among other things, deciding which columns can be ‘joined’. Since the analyst often doesn’t have complete insight into the contents of each table, this can be like a matchmaker trying to spark a romance between two people they’ve never met. In other words, getting it right requires a lot of trial and error.
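One common trial-and-error tactic is to measure how many values in a candidate key column actually find a match in the other table. A sketch, with invented column names and data:

```python
# Two tables whose join relationship is unknown in advance.
orders = [{"cust_id": "C1"}, {"cust_id": "C2"}, {"cust_id": "C9"}]
customers = [{"id": "C1"}, {"id": "C2"}, {"id": "C3"}]

def join_overlap(left, left_col, right, right_col):
    """Fraction of left-side key values that find a match on the right."""
    right_keys = {row[right_col] for row in right}
    hits = sum(1 for row in left if row[left_col] in right_keys)
    return hits / len(left)

overlap = join_overlap(orders, "cust_id", customers, "id")
print(overlap)  # ~0.67: a plausible join key, with some orphaned keys to investigate
```

A high overlap suggests the columns really do describe the same entities; a low one sends the analyst back to the drawing board.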
Writing a SQL query
The previous step only generates the map. To get the data, the analyst has to write a SQL statement based on the map that pulls together all the pieces. If they’re lucky, and the data is already neatly collected in one table, this isn’t too hard. However, in the more common case where the data resides in multiple tables, it can require serious SQL expertise. And when the data happens to be in multiple systems, the required expertise is ratcheted up even further.
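To make the ‘map becomes a query’ step concrete, here is a minimal sketch using an in-memory SQLite database; the table and column names are invented for illustration:

```python
import sqlite3

# Build two toy tables that the "map" says should be joined on
# customers.id = orders.cust_id.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id TEXT, region TEXT);
    CREATE TABLE orders (cust_id TEXT, amount REAL);
    INSERT INTO customers VALUES ('C1', 'EMEA'), ('C2', 'APAC');
    INSERT INTO orders VALUES ('C1', 100.0), ('C1', 50.0), ('C2', 75.0);
""")

# The assembly map becomes a JOIN on the matched columns.
rows = con.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.cust_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 150.0)]
```

With two tables this is manageable; real assembly queries routinely join five or ten, each with its own quirks of keys, grain and null handling.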
Executing the SQL query
Even once the SQL statement is assembled, if it spans multiple data repositories it can’t be executed without software like Presto, which is engineered to run a single query across federated systems. Such software, however, requires a high degree of technical expertise to implement in working environments.
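As a rough stand-in for what a federated engine does, SQLite’s ATTACH lets one query join tables that live in two separate database files. This is only an analogy, not how Presto works, and all names below are illustrative:

```python
import os
import sqlite3
import tempfile

# Two separate database files, standing in for two separate systems.
tmp = tempfile.mkdtemp()
warehouse = os.path.join(tmp, "warehouse.db")
lake = os.path.join(tmp, "lake.db")

with sqlite3.connect(warehouse) as con:
    con.execute("CREATE TABLE customers (id TEXT, name TEXT)")
    con.execute("INSERT INTO customers VALUES ('C1', 'Acme')")
with sqlite3.connect(lake) as con:
    con.execute("CREATE TABLE clicks (cust_id TEXT, n INTEGER)")
    con.execute("INSERT INTO clicks VALUES ('C1', 42)")

# One query spanning both "systems".
con = sqlite3.connect(warehouse)
con.execute("ATTACH DATABASE ? AS lake", (lake,))
row = con.execute("""
    SELECT c.name, l.n
    FROM customers AS c
    JOIN lake.clicks AS l ON l.cust_id = c.id
""").fetchone()
print(row)  # ('Acme', 42)
```

A real federated engine has to do this across different vendors, network boundaries and query dialects, which is where the implementation expertise comes in.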
Pre-processing the data
OK, now you’ve located and assembled your data, and retrieved it across your federated data environment. If you thought you were ready to analyze, think again. Without significant pre-processing, most datasets can’t be run through computer models. With luck, the problems are as simple as inconsistent formatting, missing data or outliers; frankly, even those take some time to address. More serious ones can add weeks to the project.
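Even the ‘simple’ problems add up. A sketch of what handling inconsistent formats, missing values and outliers looks like, on entirely made-up records:

```python
from datetime import datetime

# Raw records with the three "easy" problems: inconsistent date formats,
# a missing value, and an obvious outlier.
raw = [
    {"name": " Alice ", "signup": "2023-01-05", "spend": "120.50"},
    {"name": "BOB",     "signup": "05/01/2023", "spend": None},       # missing spend
    {"name": "carol",   "signup": "2023-02-11", "spend": "9999999"},  # outlier
]

def parse_date(s):
    """Accept both date formats seen in the raw data."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {s}")

clean = []
for rec in raw:
    spend = float(rec["spend"]) if rec["spend"] is not None else 0.0
    clean.append({
        "name": rec["name"].strip().title(),   # normalize casing/whitespace
        "signup": parse_date(rec["signup"]),
        "spend": spend,
        "outlier": spend > 100_000,            # crude threshold for illustration
    })

print([r["name"] for r in clean])     # ['Alice', 'Bob', 'Carol']
print([r["outlier"] for r in clean])  # [False, False, True]
```

Each of these decisions (impute or drop missing values? what counts as an outlier?) takes judgment, which is why even ‘clean-ish’ data eats time.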
Analyzing the data
After much travail we’ve arrived at the fun part, where the analyst actually gets to analyze. Of course, if a month (or two or three) has elapsed, the original problem that prompted the whole process may have evolved significantly. The executive who requested the analysis may have changed their mind and now want something different, placing the analyst in a game of real-life Chutes & Ladders where they’re constantly backtracking to hit a continually moving target. But, leaving that aside, at least we’ve finally delivered the sausage to the sandwich maker!
As bad as all this is, it’s only going to get worse as the variety, volume and complexity of new data sources increase and governance demands grow. At this point you’re probably asking: why hasn’t someone automated this pain-in-the-neck process? Let me tell you about a little platform called Data Navigation System (DNS)...