Data preparation: A necessary step, an annoying hurdle?

When discussing data analytics, professionals working with the technology often refer to the process of collecting raw information and turning it into "actionable intelligence" – a phrase my colleagues and I find to be overused.

Liberal use of marketing jargon aside, it's important to delve into how organic data is translated into the beautiful data visualizations professionals across the globe use on a regular basis. The process itself involves preparing data for concrete analysis. 

Data preparation: A step-by-step overview 
First off, why is data prep necessary? Failing to examine a wide breadth of information can produce insights that are misleading, disjointed or incomplete. Just to clarify: It's not about aggregating reports, surveys, comments or other data that support a scientist's or business leader's assertions about a marketing trend, industry best practice and so forth. Rather, the process involves finding data that is factual and accurate and that adds value to analysis.

In other words, it's a screening protocol that ultimately produces "clean" data sets that are easy for analytics programs to scrutinize. Different specialists would likely give you a different number of steps for properly prepping data for analysis, but the following is a generally sound method:

  1. Collect data: This consists of vetting, or reviewing the validity of, disparate sources of information. For instance, if you're aggregating results from multiple surveys, it's important to assess how query answers were submitted and under what circumstances.
  2. Coding: Once concrete, accurate data sets are created, it's important to assign them alphabetic or numeric codes so that an analytics solution will be able to scrutinize them efficiently.
  3. Define structures: This is where database assignment comes into play. NoSQL environments can store unstructured data, while SQL ecosystems hold data that fits a predefined structure. Overall, this step is quite complex, because defining unstructured information can produce ambiguous answers – an issue I'll discuss later in this post.
  4. Enter data: From there, the refined or "clean" data can be entered into whatever system is conducting a comprehensive analysis.
  5. Further assessment: After the previous step is complete, the analytics solution can further assess information for inconsistencies, errors, faulty logic or extreme values.
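
To make the five steps a little more concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample survey rows, the list of trusted sources, the yes/no coding scheme, the table layout and the "three times the median" rule for extreme values are all hypothetical, and a real project would pick its own.

```python
import sqlite3
from statistics import median

# Hypothetical raw survey rows: (respondent_id, source, answer, age)
RAW_ROWS = [
    (1, "web_form", "Yes", 34),
    (2, "web_form", "no", 41),
    (3, "unknown", "yes", 29),    # unvetted source, dropped in step 1
    (4, "phone", " YES ", 38),
    (5, "phone", "maybe", 36),    # answer with no code, dropped in step 2
    (6, "web_form", "No", 420),   # likely a typo, flagged in step 5
]

TRUSTED_SOURCES = {"web_form", "phone"}  # assumption: vetted ahead of time
ANSWER_CODES = {"yes": 1, "no": 0}       # assumption: simple numeric coding


def collect(rows):
    """Step 1: keep only rows from sources that passed vetting."""
    return [r for r in rows if r[1] in TRUSTED_SOURCES]


def code_answers(rows):
    """Step 2: map free-text answers to numeric codes; drop uncodable rows."""
    coded = []
    for rid, source, answer, age in rows:
        key = answer.strip().lower()
        if key in ANSWER_CODES:
            coded.append((rid, source, ANSWER_CODES[key], age))
    return coded


def load(rows):
    """Steps 3 and 4: define a table structure, then enter the clean rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, "
        "answer_code INTEGER NOT NULL, age INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def flag_extremes(conn, factor=3):
    """Step 5: flag ages more than `factor` times the median age."""
    ages = [row[0] for row in conn.execute("SELECT age FROM responses")]
    cutoff = factor * median(ages)
    query = "SELECT id FROM responses WHERE age > ? ORDER BY id"
    return [row[0] for row in conn.execute(query, (cutoff,))]


clean_rows = code_answers(collect(RAW_ROWS))
conn = load(clean_rows)
flagged_ids = flag_extremes(conn)
print(len(clean_rows), "clean rows; flagged ids:", flagged_ids)
```

Note that the sketch only flags the suspicious row rather than deleting it. Deciding whether an extreme value is an error or a genuine outlier is exactly the kind of judgment call that still falls to the analyst.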

A time-consuming process 
A common complaint among many data scientists is that data prep takes far too long. For business professionals who want to be able to freely conduct business intelligence initiatives whenever it's most convenient for them, it's frustrating when a team of analysts informs them that they require more time to parse through the collected information. 

The New York Times noted that most data scientists report spending between 50 and 80 percent of their time preparing digital data for further analysis. Obviously, this hinders quick action from the enterprise perspective, but University of Washington professor of computer science Jeffrey Heer maintained that the idea of an analytics solution being able to scrutinize raw information with no manual intervention is somewhat absurd. 

"It's an absolute myth that you can send an algorithm over raw data and have insights pop up," Heer told the news source. 

It's easy for a CEO to tell a team of data scientists to "work faster," but this won't change the fact that analysts need to spend time interpreting ambiguous human language and translating it into a format that an algorithm can understand.

How can technology help? 
Heer's conclusion that big data analytics solutions are incapable of interpreting raw information is largely correct, but developers of the technology aren't ignoring the painstaking process of data prep. Software companies focusing heavily on data visualization tools and other such programs are looking at their products from a new perspective – that of the dedicated, hard-working, overwhelmed data scientist.