Data preparation is a fundamental aspect of the modeling process. In fact, it is the most important part of the process since it occupies up to 80% of the total time of the project. The objective of data preparation is to prepare what is known as the modeling view or mining view.
The modeling view is the table that contain all the variables needed for the development and validation of the predictive model. It contains the target variable (dependent variable) and the candidate predictors (independent variables). The modeling view also contains other fields not needed specifically for modeling, but are used for other related purposes such as ID variables and fields used in reporting or model deployment.
The input to the data preparation step is the raw data coming from operational systems, external data, survey data, and data generated within the organization from other analytical platforms such as BI and OLAP tools.
Data preparation for modeling is not a one-step process, it is an iterative process that interacts with the data exploration, and modeling and validation. As a result, there is no standardized process for data preparation. However, there are some common categories of operations that are used. These are:
- Sourcing and combining the data: these operations include data import, merging, appending, sampling, and combining data from different sources.
- Data cleansing and transformations: these include filtering variables, treatment of missing values, summarization and aggregation, generation of new variables, binning and reducing number of categories, ranks, and reshaping variable distributions (such as Box-Cox transformations), and weight of evidence transformations.
- Variable reduction: calculation of predictive power and variable selection, variable reduction using principal component analysis, factor analysis, and variable clustering.
Data preparation operations are usually performed using a combination of scripting languages and point-and-click tools. Common platforms for data preparation today include the languages of SQL, R, Python, and the Language of SAS. In most cases the process involves using more than one of these languages. Point-and-click tools that operate on top of these languages, and as stand-alone tools, are also used to create a repeatable processes. However, the need to write hand-crafted scripts always exist to adapt to the special needs of the particular data of the problem. For that reason most modern modeling platforms, such as that of Angoss, include more than one (or all) of these languages.