What Is Data Preparation?

Data preparation is the process of collecting, cleaning, and consolidating data into one file or data table, primarily for use in analysis.

Data preparation is most often used when:

  • Handling messy, inconsistent, or un-standardized data
  • Trying to combine data from multiple sources
  • Reporting on data that was entered manually
  • Dealing with data that was scraped from an unstructured source such as PDF documents



Altair Monarch

Altair Monarch is the industry’s leading solution for self-service data preparation.

  • Built for business users, not rocket scientists
  • Automatically extracts data from reports and web pages
  • Lets you combine and clean data, then use it with your favorite tools



The key steps in data preparation:

  • Data analysis – The data is audited for errors and anomalies that need to be corrected. For large datasets, data preparation applications help by producing metadata and uncovering problems.
  • Creating an intuitive workflow – A workflow is then formulated: a sequence of data prep operations that addresses the errors found.
  • Validation – The workflow is next tested against a representative sample of the dataset. The workflow may need adjusting as previously undetected errors surface.
  • Transformation – Once the workflow has proven effective, the transformation is carried out, and the actual data prep process takes place.
  • Backflow of cleaned data – Finally, steps must be taken so that the clean data replaces the original dirty data in its sources.
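
To make the sequence concrete, here is a minimal Python sketch of the first four steps. It is an illustration under stated assumptions, not how any particular data preparation tool works: the sample records, the field names (name, amount), and every function name are hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"name": " Alice ", "amount": "10.50"},
    {"name": "Bob", "amount": "n/a"},
    {"name": "", "amount": "7"},
]


def audit(rows):
    """Data analysis: count the problems found across the rows."""
    issues = Counter()
    for row in rows:
        if not row["name"].strip():
            issues["missing name"] += 1
        elif row["name"] != row["name"].strip():
            issues["untrimmed name"] += 1
        try:
            float(row["amount"])
        except ValueError:
            issues["non-numeric amount"] += 1
    return issues


def trim_name(row):
    """Workflow operation: remove surrounding whitespace from the name."""
    return {**row, "name": row["name"].strip()}


def parse_amount(row):
    """Workflow operation: convert the amount to a float, or None if invalid."""
    try:
        amount = float(row["amount"])
    except ValueError:
        amount = None
    return {**row, "amount": amount}


# Workflow: an ordered sequence of data prep operations.
WORKFLOW = [trim_name, parse_amount]


def run_workflow(rows, steps=WORKFLOW):
    """Transformation: apply every operation, in order, to every row."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
        cleaned.append(row)
    return cleaned


def validate(rows):
    """Validation: check that cleaned rows satisfy the rules the workflow targets."""
    return all(
        row["name"] == row["name"].strip()
        and (row["amount"] is None or isinstance(row["amount"], float))
        for row in rows
    )


issues = audit(records)                     # 1. audit the data
sample_ok = validate(run_workflow(records[:2]))  # 2-3. test the workflow on a sample
cleaned = run_workflow(records) if sample_ok else records  # 4. transform everything
print(dict(issues))
print(cleaned)
```

The backflow step is left out on purpose, because writing the cleaned rows back depends entirely on where the original data lives. Note also that the audit still reports the empty name in the third record after cleaning: the sketch's workflow has no operation for it, which is the kind of gap the validation step is meant to surface.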



Here’s an example:

There are multiple values that are commonly used to represent the same U.S. state. A state like California could be represented by ‘CA’, ‘Cal.’, ‘Cal’ or ‘California’ to name a few.

A data preparation tool could be used in this scenario to flag an unexpected number of unique values: because there are only 50 states in the U.S., a unique count greater than 50 signals a problem. These values would then need to be standardized so that every row uses either the abbreviation or the full spelling, but not a mix of both.
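
The state example can be sketched in a few lines of Python. This is a hedged illustration only: the alias table, the function names, and the sample values are assumptions, and a real table would need to cover every state and every variant seen in the data.

```python
# Hypothetical alias table mapping known variants to a two-letter abbreviation.
# Only two states are listed here to keep the sketch short.
STATE_ALIASES = {
    "ca": "CA",
    "cal": "CA",
    "california": "CA",
    "ny": "NY",
    "new york": "NY",
}

EXPECTED_MAX_STATES = 50  # there are only 50 U.S. states


def normalize_state(value):
    """Return the standard abbreviation, or None if the value is not recognized."""
    key = value.strip().rstrip(".").lower()
    return STATE_ALIASES.get(key)


def too_many_unique(values, limit=EXPECTED_MAX_STATES):
    """Flag a column whose count of distinct values exceeds the expected limit."""
    return len(set(values)) > limit


raw = ["CA", "Cal.", "Cal", "California", "NY", "N.Y."]

normalized = [normalize_state(v) for v in raw]

# Values with no known mapping are kept aside for manual review
# instead of being guessed at.
unrecognized = [v for v, n in zip(raw, normalized) if n is None]

print(normalized)
print(unrecognized)
```

Returning None for unknown spellings, rather than passing them through, keeps unmapped values visible so they can be added to the alias table or corrected by hand.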


Want to learn more? Check out these whitepapers:

Gartner Report: Embrace Self-Service Data Prep

Bridge the Gap Between Business Agility and Governance

Extending Self-Service Data Preparation Through Automation
