When discussing any IT-related topic, it's easy to reach for a term that sounds related to the subject but doesn't actually describe the function or technology you're talking about.
For example, I've often heard people use "server provisioning" to describe "server virtualization," when in fact the two are quite different. However, basic similarities between them can mislead or confuse people who are new to IT.
In light of this issue, I've taken the liberty of identifying six big data terms you may either misuse or not fully understand.
1. Data profiling
Often confused with data quality assessment (which is explained later on), data profiling involves scrutinizing the information available within a specific source to identify consistency, logic and uniqueness. TechTarget's Margaret Rouse noted that it can present analysts with metrics that allow them to make data quality assessments at a later time. Ultimately, data profiling initiates a diagnosis of a data set and produces rules and anomalies within that data set.
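To make the idea concrete, here is a minimal Python sketch of what a profiling pass might compute for each column of a small data set: how many values are missing, how many are distinct, and whether the column is unique. The sample records and the `profile_column` helper are hypothetical and exist only for illustration; real profiling tools report far more.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: size, missing values, distinct values, uniqueness."""
    # Treat None and the empty string as missing (an assumption for this sketch).
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "is_unique": len(counts) == len(present),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical sample data, not taken from any real source.
records = [
    {"id": 1, "species": "great white", "length_m": 4.6},
    {"id": 2, "species": "hammerhead", "length_m": 3.7},
    {"id": 3, "species": "", "length_m": 5.1},
    {"id": 4, "species": "great white", "length_m": None},
]

# Profile every column found in the first record.
profile = {col: profile_column([r[col] for r in records]) for col in records[0]}

for col, stats in profile.items():
    print(col, stats)
```

A report like this is where the "rules and anomalies" come from: a unique `id` column suggests a rule ("ids must not repeat"), while the blank species and missing length are anomalies worth flagging.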
2. Data quality assessment
Making a data quality assessment consists of applying the rules produced during the data profiling stage and testing the information to see which sources pass and which fail. This approach is crucial to the success of any data analysis project, as it tells professionals which data sources are reliable, irrelevant or inaccurate.
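As a rough illustration, the Python sketch below applies a handful of rules to two data sources and reports which ones pass and which fail. The rule names, the source names and the `assess` helper are all assumptions made up for this example; in practice the rules would come out of a profiling step.

```python
def assess(source_rows, rules):
    """Return the names of the rules that at least one row in the source breaks."""
    failed = []
    for name, check in rules.items():
        if not all(check(row) for row in source_rows):
            failed.append(name)
    return failed


# Hypothetical rules of the kind a profiling step might produce.
rules = {
    "id_present": lambda row: row.get("id") is not None,
    "species_not_blank": lambda row: bool(row.get("species")),
    "length_plausible": lambda row: row.get("length_m") is not None
    and 0 < row["length_m"] < 20,
}

# Hypothetical data sources, each a list of records.
sources = {
    "aquarium_log": [{"id": 1, "species": "great white", "length_m": 4.6}],
    "field_notes": [{"id": 2, "species": "", "length_m": 3.7}],
}

report = {name: assess(rows, rules) for name, rows in sources.items()}

for name, failures in report.items():
    print(name, "PASS" if not failures else f"FAIL: {failures}")
```

Keeping the rules as named checks makes the report readable: a failing source tells you not just that it failed, but which rule it broke.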
3. Natural language processing (NLP)
Often labeled as one of the "sexier" components of data visualization tools, NLP is powered by algorithms (sets of mathematical instructions designed to analyze data) built to help computers better understand human speech and writing. NLP is what powers social analytics programs and visualization software that focuses on written documents, audio and video.
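Real NLP systems are far more sophisticated than anything that fits in a few lines, but a tiny Python sketch can show one basic building block: breaking text into words and counting the ones that matter. The stop-word list, the sample sentence and the helper names here are assumptions chosen for illustration only.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(text, n=3):
    """Return the n most frequent tokens that are not stop words."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(n)


# Hypothetical sample text.
review = "The sharks are fast. Sharks are curious, and the divers love sharks."
print(top_terms(review, 1))
```

Counting terms like this is only a first step; understanding meaning, tone or speech requires models well beyond word frequency.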
4. Gamification
Bernard Marr, a big data expert and frequent contributor to Smart Data Collective, described gamification as the process of making a game out of a scenario or task that typically does not involve such activity.
"In big data terms, gamification is often a powerful way of incentivizing data collection," wrote Marr.
Implementing gamification throughout data collection can actually be quite simple. For instance, suppose you're the head of a team in charge of aggregating information about different species of shark. You could set a rule stating that the person who finds the most valuable data sources on the subject wins two days of paid time off.
5. Biometrics
Speaking of the biological differences among sharks, biometrics is one facet of big data that is typically favored by botanists, zoologists and other scientists studying living things. Marr noted that biometrics uses analytics to distinguish people based on one or more physical characteristics. Biometrics can power facial, iris or fingerprint recognition technology, for example.
6. Distributed file system
Distributed file systems are built to house large volumes of data across two or more storage devices, typically servers. This method is employed to reduce the expenses associated with storing large amounts of information on a single computer. The Hadoop Distributed File System (HDFS), the storage layer of Hadoop, is one example, and the wider Hadoop software comes with additional features that can support sophisticated data analysis programs.
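The core idea can be sketched in a few lines of Python: cut a file into fixed-size blocks, then spread copies of each block across several servers. This is a toy model under stated assumptions, not how HDFS or any real system places data; the server names, the block size and the round-robin placement are all hypothetical choices for illustration.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes using simple round-robin."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replicas)
        ]
    return placement


# Hypothetical file contents and server names.
data = b"x" * 1000
nodes = ["server-a", "server-b", "server-c"]

blocks = split_into_blocks(data, 256)
placement = place_blocks(len(blocks), nodes)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on two servers in this sketch, losing any single server still leaves one copy of each block available, which is the basic reason such systems replicate data.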