Feature scaling is an important step in data preprocessing. Several machine learning algorithms, such as linear regression, logistic regression, and neural networks, rely on fine-tuning weights and biases to generalize well. However, features on very different scales can prevent such models from performing well, because features with larger values dominate those with smaller values. Scikit-learn offers a variety of scalers to perform feature scaling. This article briefly describes the effect of different scalers on data with outliers. For demonstration purposes, the California Housing dataset from scikit-learn is considered. This dataset consists of 8 attributes, and the task is to predict the median house value, expressed in units of $100,000.

Importing required libraries
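The original code snippet was not preserved in this copy; a minimal sketch of the imports this walkthrough needs might look like the following (the aliases and the exact set of utilities are assumptions):

```python
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn pieces used later in the article
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
```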
Load the dataset
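The loading step likely looked roughly like this sketch; the variable names `housing` and `df` are assumptions:

```python
from sklearn.datasets import fetch_california_housing

# Fetch the California Housing data as a pandas DataFrame
housing = fetch_california_housing(as_frame=True)
df = housing.frame  # 8 feature columns plus the MedHouseVal target
print(df.head())
```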
Data information
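A sketch of the inspection step, judging from the output shown below (the `df` variable is an assumption carried over from the loading step, recreated here so the snippet runs on its own):

```python
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame
df.info()  # prints dtypes, non-null counts and memory usage
```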
Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB

Now, let us analyze the distribution of each feature.
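One way to plot the per-feature distributions, as a sketch; the grid layout and the use of seaborn histograms with a KDE overlay are assumptions:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# One histogram (with a KDE overlay) per column, in a 3x3 grid
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.ravel(), df.columns):
    sns.histplot(df[col], kde=True, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```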
Output:

[Figure: Distribution plots]

We can observe that median income has a long-tailed distribution, while average occupancy has a few large outliers. Hence, we consider these 2 features for studying the impact of different scalers on outliers. Furthermore, let us see the impact of latitude and longitude on the target variable.

Plot Median House value with respect to Latitude and Longitude
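A sketch of the location plot described above, colouring each point by its median house value (the marker size and colormap are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame

# Colour each (longitude, latitude) point by its median house value
sc = plt.scatter(df["Longitude"], df["Latitude"],
                 c=df["MedHouseVal"], s=2, cmap="viridis")
plt.colorbar(sc, label="MedHouseVal")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Median house value by location")
plt.show()
```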
Output:

[Figure: Effect of latitude and longitude on median house value]

We can observe that houses from a particular region have high values. Hence, latitude and longitude are also considered for training the model. Ultimately, latitude, longitude, and median income are used to train the model. Now the dataset is split into training and test sets.

Split Train and Test Data
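A sketch of the split, using the three features named above; the 80/20 split ratio and `random_state=42` are assumptions:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

df = fetch_california_housing(as_frame=True).frame

# Median income, latitude and longitude as features; median house value as target
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```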
Define the Joint Distribution Plot function

Here we define a function that plots the joint distribution of two features.
Define the Regression model

Here we define a Stochastic Gradient Descent (SGD) regressor, fit it on the training data, predict on the test data X_test, and use mean squared error to evaluate the model.
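The description above can be sketched as a small helper function; the function name and `random_state=42` are assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Fit an SGD regressor and return the mean squared error on the test set."""
    model = SGDRegressor(random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions)
```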
Without scaling the data
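A self-contained sketch of the unscaled baseline (the split parameters are assumptions, and the setup is recreated here so the snippet runs on its own). Note that gradient descent on raw, unscaled features tends to diverge, which is what produces the enormous MSE reported below:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Recreate the data and split so this snippet runs standalone
df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# SGD on raw features: the differing feature scales destabilize the updates
model = SGDRegressor(random_state=42)
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print("MSE without scaling:", mse)
```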
Output:

[Figure: Joint distribution without scaling]

MSE : 1.3528478529652859e+29

Effect of Standard Scaler

Scikit-learn offers StandardScaler() for performing standardization, which rescales each feature to zero mean and unit variance.
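A sketch of this step with StandardScaler(); the scaler is fit on the training set only, then applied to both splits (the setup is recreated so the block runs standalone):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize: zero mean, unit variance (statistics learned from train only)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with standard scaler:", mse)
```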
Output:

[Figure: Effect of Standard Scaler]

Accuracy with standard scaler : 2.927062092029926e-07

We can observe that the scales of the 2 features are still quite different: the scaled median income ranges from about -2 to 6, whereas the scaled average occupancy ranges from about -0.1 to 0.6. Therefore, the standard scaler cannot guarantee balanced feature scales in the presence of outliers.

Effect of MinMax Scaler

Scikit-learn offers MinMaxScaler() for normalization.
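The same skeleton with MinMaxScaler(), as a sketch (setup recreated so the block runs standalone; split parameters are assumptions):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Rescale each feature to the [0, 1] range based on train-set min/max
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with minmax scaler:", mse)
```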
Output:

[Figure: Effect of MinMax Scaler]

Accuracy with minmax scaler : 0.005800840622553663

MinMaxScaler reduces the range of feature values to [0, 1]. However, the effective spread is quite different for the 2 features under consideration: because of its outliers, average occupancy is squeezed into a much narrower band than median income.

Effect of MaxAbs Scaler

MaxAbsScaler() is similar to MinMaxScaler(). However, the range of scaled values depends on the sign of the feature values: [0, 1] if only positive values are present, [-1, 0] if only negative values are present, and [-1, 1] if both positive and negative values are present.
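The same skeleton with MaxAbsScaler(), which divides each feature by its maximum absolute value (a sketch; setup recreated so the block runs standalone):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Divide each feature by its maximum absolute value on the train set
scaler = MaxAbsScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with maxabs scaler:", mse)
```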
Output:

[Figure: Effect of Max Absolute Scaler]

Accuracy with maxabs scaler : 0.006259459937552328

In this case, it produces results similar to MinMaxScaler().

Effect of Robust Scaler

Scikit-learn provides RobustScaler(), which centers each feature on its median and scales by the inter-quartile range, making it less sensitive to outliers.
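The same skeleton with RobustScaler(), as a sketch (setup recreated so the block runs standalone):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Center on the median and scale by the inter-quartile range
scaler = RobustScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with robust scaler:", mse)
```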
Output:

[Figure: Effect of Robust Scaler]

Accuracy with robust scaler : 1.931800745764342e+20

Effect of Power Transform

PowerTransformer() applies a monotonic power transformation (Yeo-Johnson by default) to make each feature's distribution more Gaussian-like.
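The same skeleton with PowerTransformer(), as a sketch (setup recreated so the block runs standalone; the default Yeo-Johnson method is assumed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Yeo-Johnson transform (default), then zero-mean/unit-variance output
scaler = PowerTransformer()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with power transform:", mse)
```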
Output:

[Figure: Effect of Power Transform]

Accuracy with power transform : 0.09313345304679063

Effect of Quantile Transformer

QuantileTransformer() maps each feature to a uniform (or normal) distribution based on its quantiles, which spreads out the most frequent values and strongly reduces the impact of outliers.
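The same skeleton with QuantileTransformer(), as a sketch (setup recreated so the block runs standalone; the default uniform output distribution is assumed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Map each feature to a uniform distribution via its empirical quantiles
scaler = QuantileTransformer(random_state=42)
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with quantile transformer:", mse)
```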
Output:

[Figure: Effect of Quantile Transformer]

Effect of Normalizer

Normalizer() rescales each sample (row) to unit norm, rather than scaling each feature (column) independently.
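The same skeleton with Normalizer(), as a sketch (setup recreated so the block runs standalone; the default L2 norm is assumed):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Rescale every row to unit L2 norm (a per-sample, not per-feature, operation)
scaler = Normalizer()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = SGDRegressor(random_state=42)
model.fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))
print("MSE with Normalizer:", mse)
```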
Output:

[Figure: Effect of Normalizer]

Accuracy with Normalizer : 1.3216531944583318

Model Accuracy

Let us print the error (MSE) of each model.
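The comparison above can be reproduced in one loop over all the scalers, as a sketch (setup recreated so the block runs standalone; the dictionary keys mirror the labels in the output below):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   PowerTransformer, QuantileTransformer,
                                   RobustScaler, StandardScaler)

df = fetch_california_housing(as_frame=True).frame
X = df[["MedInc", "Latitude", "Longitude"]]
y = df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scalers = {
    "standard scaler": StandardScaler(),
    "minmax scaler": MinMaxScaler(),
    "maxabs scaler": MaxAbsScaler(),
    "robust scaler": RobustScaler(),
    "power transform": PowerTransformer(),
    "QuantileTransformer": QuantileTransformer(random_state=42),
    "Normalizer": Normalizer(),
}

# Fit the same SGD regressor on each scaled version and record the test MSE
results = {}
for name, scaler in scalers.items():
    model = SGDRegressor(random_state=42)
    model.fit(scaler.fit_transform(X_train), y_train)
    preds = model.predict(scaler.transform(X_test))
    results[name] = mean_squared_error(y_test, preds)
    print(f"Accuracy with {name} : {results[name]}")
```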
Output:

General Accuracy : 2.6731951662759754e+28
Accuracy with standard scaler : 6.1103137019148144e-06
Accuracy with minmax scaler : 0.005842587280784744
Accuracy with maxabs scaler : 0.006149996463712555
Accuracy with robust scaler : 1.0374102250716266e+23
Accuracy with power transform : 0.09243653923790224
Accuracy with QuantileTransformer : 0.12639279065095635
Accuracy with Normalizer : 1.322663822919674

It is evident that feature scaling improves the results significantly. Note that the values reported here are mean squared errors, so lower is better: in the current scenario, the standard scaler gives the lowest error, followed by the min-max and max-abs scalers.
Reference: https://www.geeksforgeeks.org