Data science
March 31st 2021

Data Mining: statistical analysis

Raffaele Bua

Introduction

The main goal of data mining analysis is to extract knowledge from a raw dataset by exploiting clustering, association rules and classification methods. Before applying these methods, a deep understanding of the data is mandatory. By data understanding we mean the processes aimed at:

- data semantic analysis: bringing out the semantics of each dataset feature;
- statistical analysis: extracting statistical indices and charts from the dataset columns;
- assessing data quality: managing outliers and missing values, exploiting the outcomes of the statistical analysis;
- variable transformations: converting categorical data into numeric types and normalizing numerical features;
- pairwise correlations and eventual elimination of redundant variables: removing redundant features with a high correlation index.

Data semantics

In this case study, the given dataset contains car purchase entries from auctions held in the United States. Each entry contains an entry identifier, the classification of the purchase as good or bad, the purchase date, the auction provider, the year of the vehicle, the age of the vehicle, the vehicle manufacturer, model, trim level, sub model, color and transmission, the wheel type id and wheel type name, the vehicle odometer reading, the manufacturer's country, the vehicle size, whether the vehicle belongs to one of the top three American manufacturers, acquisition and current prices (distinguished by auction/retail and clean/average condition), a demand indicator, the guarantee level, a unique buyer identifier, the seller zip code, the state code where the car was purchased, the vehicle cost at purchase time, an online purchase flag and the warranty price. The dataset was downloaded from the Kaggle competition Data Mining 2019/2020 - Unipi. Knowing the data semantics helps drive the dimensionality reduction of the dataset, because pairwise correlation indices alone are not enough to decide which columns to drop. This analysis is also useful to transform the data in the proper way. The feature data types are: numerical, boolean, date time, categorical and enumerable. Numerical features contain real numbers as values. Boolean features contain only True/False values. Date time features contain date values. Categorical features contain values belonging to a set of classes (e.g. colors, models, etc.). Enumerable features correspond to ordinal categorical data. The data type of a feature determines the kind of statistical analysis and data transformation performed on it.
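As a concrete starting point, here is a minimal sketch that loads the dataset and groups a few columns by the semantic types just described. The file name training.csv and the column subsets are assumptions based on the usual layout of Kaggle competition data, not the full mapping used in this work.

```python
import pandas as pd

# Load the auction dataset; the file name is an assumption based on the
# usual layout of Kaggle competition data.
df = pd.read_csv("training.csv")

# Illustrative grouping of columns by semantic type (partial; names not
# cited in the text, e.g. IsBadBuy or Auction, are hypothetical).
numerical = ["VehicleAge", "VehOdo", "VehBCost", "WarrantyCost"]
boolean = ["IsBadBuy", "IsOnlineSale"]
categorical = ["Auction", "Make", "Model", "Color", "Transmission"]

# Parse the purchase date explicitly, since read_csv loads it as a string.
df["PurchDate"] = pd.to_datetime(df["PurchDate"])

print(df[numerical].describe())
print(df[categorical].nunique())
```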

Distribution of the variables and statistics

The analysis of variable distributions and statistics brings out hidden properties of, and relationships among, the features. This section presents these analyses, reporting the notable results. A distinction is made between categorical/boolean data and numerical data, because their distribution and statistical analyses differ. The RefId column, representing the row id, is ignored in this work because it carries no information about the purchase: it is just an identifier.

Numerical features analysis

The numerical features of the dataset are the columns: VehicleAge, VehOdo, MMRAcquisitionAuctionAveragePrice, MMRAcquisitionAuctionCleanPrice, MMRAcquisitionRetailAveragePrice, MMRAcquisitonRetailCleanPrice, MMRCurrentAuctionAveragePrice, MMRCurrentAuctionCleanPrice, MMRCurrentRetailAveragePrice, MMRCurrentRetailCleanPrice, VehBCost and WarrantyCost.


Figure 1 - Distribution of numerical data.

Figure 1 shows the distributions of the numerical features as histograms with 300 bins. Observing the charts carefully, it is clear that the feature ranges differ, so a standardization of the values will help in the next tasks. It is also clear that the shapes of the distributions resemble the normal distribution. Peaks at zero can be noticed for the features with the MMR prefix in their name; these peaks represent missing values, which will be managed in the next sections. The VehicleAge feature has a range very different from the other numerical ones: its distribution (not shown in this report) can be approximated by a Gaussian distribution with values from 0 to 9.
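A sketch of how such histograms might be produced with pandas and matplotlib, continuing from the loading sketch above; the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# Full list of numerical columns, as enumerated in the previous section.
numerical = [
    "VehicleAge", "VehOdo",
    "MMRAcquisitionAuctionAveragePrice", "MMRAcquisitionAuctionCleanPrice",
    "MMRAcquisitionRetailAveragePrice", "MMRAcquisitonRetailCleanPrice",
    "MMRCurrentAuctionAveragePrice", "MMRCurrentAuctionCleanPrice",
    "MMRCurrentRetailAveragePrice", "MMRCurrentRetailCleanPrice",
    "VehBCost", "WarrantyCost",
]

# One histogram per numerical feature, 300 bins as in Figure 1.
df[numerical].hist(bins=300, figsize=(16, 12))
plt.tight_layout()
plt.show()
```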


Figure 2 - Statistical indices of numerical data.

Figure 2 contains the table of statistical indices. The count column is useful to determine the volume of missing values in a specific feature, as the reported counts cover non-missing values only; the dataset contains 58386 rows in total. Mean and standard deviation (std) are useful to remove outliers and to normalize values. The min and max columns define the feature ranges. The columns named 25%, 50% and 75% show the feature quartiles and are useful to understand the distribution of the data, with the goal of removing the outliers. A better representation of the quartiles is the box plot, shown in Figure 3. The box plot diagram shows that the numerical columns contain many outliers to remove.


Figure 3 - Box plot of numerical data of dataset without outliers removal.
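Both the table of statistical indices and the box plots can be obtained with a few lines of pandas, again as a sketch continuing from the snippets above.

```python
import matplotlib.pyplot as plt

# count, mean, std, min, quartiles and max, as reported in Figure 2.
print(df[numerical].describe())

# Box plot of the raw numerical features, as in Figure 3.
df[numerical].plot(kind="box", figsize=(16, 6), rot=90)
plt.tight_layout()
plt.show()
```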

Categorical and boolean features analysis

Categorical and boolean features are treated differently from the numerical ones. Mean and standard deviation are not meaningful for categorical/boolean features, and outliers are identified by looking at frequencies rather than box plots. Histograms of the class distributions are therefore very useful, together with statistical indices such as count, the number of unique values (classes), top (the mode) and the frequency of the top class. In these statistics, count contains the number of values of a feature in the dataset, unique defines the number of classes of the feature, top represents the most frequent class (the mode) and freq the frequency of the top class. The purchase date (PurchDate column) must be processed to be used effectively, converting it to a number and treating it as an ordered category. Categorical data histograms are shown in Figure 5; in this image some x-axis labels have been removed where they overlap. The distributions of the categorical features are useful to detect outliers: if the frequency of a class is lower than a threshold, the values of that class are considered outliers.
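A sketch of these categorical statistics and of the date conversion follows; the list of categorical columns is an assumption, except for the columns named in the text.

```python
import pandas as pd

# Hypothetical list of categorical columns for illustration.
categorical = ["Auction", "Make", "Color", "Transmission", "WheelType",
               "Nationality", "Size", "TopThreeAmericanName", "VNST"]

# describe() on string columns yields exactly count, unique, top and freq.
print(df[categorical].describe())

# Class frequencies of a single feature, e.g. the vehicle color.
print(df["Color"].value_counts())

# Convert PurchDate to an ordinal number, so that it can be treated as an
# ordered category in the following analyses.
df["PurchDate"] = df["PurchDate"].map(pd.Timestamp.toordinal)
```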

Assessing data quality (missing values, outliers)

Missing values and outliers negatively affect the performance of clustering, association rules and classification tasks. This section presents the criteria followed to manage them. Missing values are defined as NaN, or as zero for the price features (a price cannot be equal to zero). There are many approaches to handling missing values; in this work, whenever a value is missing, the whole row is deleted from the dataset. For this reason the features PRIMEUNIT and AUCGUART are removed entirely, since they contain only 2683 values out of 58386 rows (about 4.6% of the total).

For outliers, a distinction is made between numerical and categorical features. For numerical data, values beyond 3σ from the mean (above +3σ or below −3σ) are considered outliers, and each row containing one is removed from the dataset. Figure 4 shows the box plots of the numerical features after outlier removal. For categorical features, values belonging to low-frequency classes are considered outliers and their rows are removed. This approach involves a trade-off in defining the frequency threshold; in this work the threshold was set to the mean class frequency minus 20% of that mean.

After removing so many rows while cleaning missing values and outliers, a feature may end up containing the same value in every row. Such a feature is useless and is better removed from the dataset. The Transmission feature falls into this case: it contains only AUTO values after missing value and outlier removal, so this column has been dropped.
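The whole cleaning step might look like the following sketch, reusing the numerical and categorical lists defined above; the 0.8 factor encodes the "mean frequency minus 20%" threshold.

```python
import numpy as np

# Zero prices are missing values: turn them into NaN.
mmr_cols = [c for c in df.columns if c.startswith("MMR")]
df[mmr_cols] = df[mmr_cols].replace(0, np.nan)

# Drop the two almost-empty columns, then every row with a missing value.
df = df.drop(columns=["PRIMEUNIT", "AUCGUART"])
df = df.dropna()

# 3-sigma rule: keep only rows whose numerical values lie within mean ± 3σ.
z = (df[numerical] - df[numerical].mean()) / df[numerical].std()
df = df[(z.abs() <= 3).all(axis=1)]

# Frequency-based outliers: drop rows whose class frequency is below the
# mean class frequency reduced by 20%.
for col in categorical:
    freq = df[col].value_counts()
    df = df[df[col].isin(freq[freq >= freq.mean() * 0.8].index)]

# After cleaning, Transmission contains only AUTO and is dropped.
df = df.drop(columns=["Transmission"])
```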


Figure 4 - Box plot of numerical data after outliers removal.

Variables transformations

Variable transformation involves pre-processing the data in order to improve the performance of the subsequent tasks: clustering, association rules and classification. This pre-processing phase includes the transformation of categorical features into numbers, the standardization of numerical features and then the discretization of numerical features.

Categorical features

Categorical features in the given dataset contain strings which represent classes. Those strings are converted into numbers (ordered only if the category is orderable). This processing step makes it possible to compute the pairwise correlations and to apply clustering algorithms.
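A minimal sketch of this encoding with pandas category codes; the alphabetical code assignment is an arbitrary simplification, since the exact mapping is not specified here.

```python
# Replace each class string with an integer code; cat.codes assigns codes
# following the (alphabetical) category order, which is only meaningful
# for genuinely ordinal features.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes
```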


Figure 5 - Categorical data histograms.

Standardization

Numeric features are standardized using z-score normalization, z = (x − μ) / σ, to facilitate the pairwise correlation computation and machine learning applications.
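As a sketch, reusing the numerical list from above:

```python
# z-score standardization: mean 0 and standard deviation 1 per feature.
df[numerical] = (df[numerical] - df[numerical].mean()) / df[numerical].std()
```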

Discretization

Numeric features are discretized after the standardization transformation. Thanks to z-score standardization, the numeric values are distributed in the range from −3 to +3. In this work the numeric values are discretized via quantization with a step equal to 0.1, so the discretization yields 60 bins. The RefId feature has also been removed from the dataset because it is just the row identifier. The date feature (PurchDate) has been removed from the dataset for dimensionality reduction reasons; it could have been split into day, month and year features, but for simplicity this approach was discarded.
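A sketch of the quantization and of the column drops described above:

```python
import numpy as np

# Quantize with step 0.1: standardized values in [-3, +3] fall into
# 60 bins of width 0.1 each.
df[numerical] = np.floor(df[numerical] / 0.1).astype(int)

# Drop the row identifier and the purchase date.
df = df.drop(columns=["RefId", "PurchDate"])
```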

Pairwise correlations and eventual elimination of redundant variables

Pairwise correlations on the transformed dataset highlight some strictly dependent variables. Figure 6 shows the correlation indices of the dataset features, and the correlation matrix highlights some important correlations. VehYear and VehicleAge are strictly correlated, so only VehYear is kept and VehicleAge is dropped. WheelTypeID and WheelType are the same feature, having a correlation index equal to one; only WheelType is kept and WheelTypeID is dropped. The features MMRAcquisitionAuctionAveragePrice, MMRAcquisitionAuctionCleanPrice, MMRAcquisitionRetailAveragePrice, MMRAcquisitonRetailCleanPrice, MMRCurrentAuctionAveragePrice, MMRCurrentAuctionCleanPrice, MMRCurrentRetailAveragePrice and MMRCurrentRetailCleanPrice are correlated with each other, and they are correlated with the vehicle cost too, so only the one with the minimum correlation index with the vehicle cost is kept. The MMR features with the minimum correlation index with the vehicle cost are MMRAcquisitionRetailAveragePrice, MMRCurrentRetailAveragePrice and MMRCurrentRetailCleanPrice. Among these, only MMRAcquisitionRetailAveragePrice is kept, because it represents the acquisition price of the vehicle in the retail market, in average condition, at purchase time. The removed features are therefore: MMRAcquisitionAuctionAveragePrice, MMRAcquisitionAuctionCleanPrice, MMRAcquisitonRetailCleanPrice, MMRCurrentAuctionAveragePrice, MMRCurrentAuctionCleanPrice, MMRCurrentRetailAveragePrice, MMRCurrentRetailCleanPrice, WheelTypeID and VehicleAge.
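A sketch of the correlation analysis and of the final feature removal:

```python
# Pearson pairwise correlation matrix of the transformed dataset (Figure 6).
corr = df.corr()

# Correlation of every feature with the vehicle cost, used to choose
# which MMR feature to keep.
print(corr["VehBCost"].abs().sort_values(ascending=False))

# Features found redundant by the analysis above.
redundant = [
    "VehicleAge", "WheelTypeID",
    "MMRAcquisitionAuctionAveragePrice", "MMRAcquisitionAuctionCleanPrice",
    "MMRAcquisitonRetailCleanPrice", "MMRCurrentAuctionAveragePrice",
    "MMRCurrentAuctionCleanPrice", "MMRCurrentRetailAveragePrice",
    "MMRCurrentRetailCleanPrice",
]
df = df.drop(columns=redundant)
```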


Figure 6 - Pairwise correlation index of dataset features.

Pre-processed dataset

The outcome of this data understanding phase is a pre-processed dataset that is better usable by the subsequent algorithms for tasks like clustering, association rules and classification. Summarizing, the obtained dataset contains 21 features and 38485 rows.