Incomplete data: What it is, what to do about it

How can data aggregated in real time be incomplete? Ask any data scientist this question and he or she will probably list a dozen ways in which poorly designed data mining tools and practices can produce fragmented information.

"If you're analyzing incomplete data, your insights may only be half-correct, if at all."

Unless you know for sure that your data aggregation systems are providing you with complete information, there's a chance the "actionable insights" you're hoping to glean are only partly correct, or not correct at all. So, let's discuss what defines incomplete data, how it affects business decisions and what you can do about it.

What does it mean to have incomplete data?
According to the Inter-university Consortium for Political and Social Research, incomplete data sets are created when a subject under scrutiny does not have information regarding "one or more of the relevant variables." For example, if you want to predict how many honey bees will exist in North America 10 years from now, but fail to aggregate reliable data regarding depopulation, then your estimations will be deemed untrustworthy.
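
To make that definition concrete, here is a minimal Python sketch. The field names (`colonies`, `depopulation_rate`) and the sample records are hypothetical, invented only for illustration. The idea is simply to count how many records lack a value for each variable the analysis depends on.

```python
from typing import Any


def missing_by_variable(records: list[dict[str, Any]], required: list[str]) -> dict[str, int]:
    """Count, for each required variable, how many records lack a usable value."""
    counts = {name: 0 for name in required}
    for record in records:
        for name in required:
            # Treat both an absent key and an explicit None as missing.
            if record.get(name) is None:
                counts[name] += 1
    return counts


# Hypothetical survey records; the values are made up for illustration.
records = [
    {"region": "A", "colonies": 120, "depopulation_rate": 0.12},
    {"region": "B", "colonies": 95},  # no depopulation data at all
    {"region": "C", "colonies": None, "depopulation_rate": 0.30},
]

report = missing_by_variable(records, ["colonies", "depopulation_rate"])
print(report)
```

A report like this won't fill the gaps, but it does show which variables a forecast is resting on incomplete data for.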

The perspective of the ICPSR is obviously one of academics, or those who manually collect information for the purposes of a study. So, data aggregation tools are exempt from discussions regarding incomplete data, correct? Absolutely not.

"It's reasonable to deduce that a data aggregation solution may not be perfect."

Ultimately, data collection software is created by people. However talented these programmers may be, they're prone to making mistakes; it's simply a result of being human. Therefore, it's reasonable to deduce that the algorithms within an information aggregation solution may not be airtight. Someone who isn't aware of this may formulate vague objectives, such as "find information about honey bees." Although very few data collection applications work this way, it's an example of how ambiguous queries produce ambiguous, unsubstantiated results.

An example of incomplete information 
Speak with any marketing guru about the information Twitter can provide, and he or she will likely tell you that connecting your data collection tool to one of the social media platform's application programming interfaces will be the best business decision you ever make. However, for all the insight the platform gives enterprises into specific consumer behavior, some may suggest that manually reading tweets delivers more reliable data.

"The Twitter Streaming API only provides access to 1 percent of the social media platform's total data stream."

ProgrammableWeb acknowledged that the Twitter Streaming API fails to supply analysts with as much information as they need. "More data" doesn't always equate to "complete data," but in this instance it usually does. First, the API only provides access to a "peculiar" 1 percent of the social media platform's total data stream. Researchers from Arizona State University and Carnegie Mellon University found that this "spritzer," drawn from the larger "firehose" to which the API doesn't allow access, produced biased results.

Another program, the Twitter Search API, doesn't enable researchers to query a specific date from the past; it only lets them retrieve posts made during the previous seven days.

Essentially, both of these APIs will fail to produce the kind of results businesses are looking for: validated, authentic conclusions.

What can be done about incomplete data? 
In regard to the Twitter-specific challenges, ProgrammableWeb recommended either partnering with the social media platform to gain full access to its data or designing a program that continuously harvests information at the maximum request threshold. The latter option, however, is just one of many obstacles a company would have to work through.
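
As a rough idea of what harvesting "at the maximum request threshold" might involve, here is a minimal Python sketch of a throttle that spaces requests evenly so they stay within a limit. Everything in it is an assumption made for illustration: `Throttle`, `harvest` and `fetch_page` are hypothetical names, `fetch_page` is a stand-in rather than a real Twitter call, and the limit of 100 requests per second is not a documented Twitter figure.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space calls evenly so roughly `max_requests` happen per `window_seconds`."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = window_seconds / max_requests
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Optional[float] = None  # earliest time for the next call

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.min_interval


def harvest(fetch_page: Callable[[], list], throttle: Throttle, pages: int) -> list:
    """Call `fetch_page` `pages` times, pausing as the throttle requires."""
    results = []
    for _ in range(pages):
        throttle.wait()
        results.extend(fetch_page())
    return results


# Hypothetical stand-in for a real API call; returns one page of posts.
def fetch_page() -> list:
    return ["post"]


# Illustrative limit only, not a documented Twitter figure.
throttle = Throttle(max_requests=100, window_seconds=1.0)
posts = harvest(fetch_page, throttle, pages=3)
print(len(posts))
```

The clock and sleep functions are passed in so the pacing can be checked without actually waiting. A production harvester would also need to handle errors, retries and storage, which is part of why this option is only one of the obstacles.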

From a broader perspective, two steps can be taken to ensure complete data is collected:

  1. Assess your sources: Don't assume anything. Make sure the sources from which you aggregate information are accredited and thorough.
  2. Use a reliable application: Purchasing a data collection and analysis program from a developer that focuses solely on constructing analytics solutions will ensure the solution has received the appropriate amount of attention.
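
One simple way to start assessing a source is to check whether what you collected actually covers the period you care about. The Python sketch below lists the days in a date range for which no record was collected. The function name `missing_days` and the dates are hypothetical, chosen only for illustration, and a gap-free calendar is of course only one sign of completeness.

```python
from datetime import date, timedelta


def missing_days(collected: set[date], start: date, end: date) -> list[date]:
    """Return each day from start to end (inclusive) with no collected record."""
    gaps = []
    day = start
    while day <= end:
        if day not in collected:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical collection dates; made up for illustration.
collected = {date(2015, 3, 1), date(2015, 3, 2), date(2015, 3, 4)}
gaps = missing_days(collected, date(2015, 3, 1), date(2015, 3, 5))
print(gaps)
```

If the list of gaps is not empty, any conclusion drawn for that period rests on partial information and should be treated accordingly.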