Multicollinearity is a common issue in regression models, where predictor variables are highly correlated. This can lead to unstable estimates of regression coefficients, making it difficult to determine the effect of each predictor on the response variable. Principal Component Analysis (PCA) is a powerful technique for addressing this issue by transforming the original correlated variables into a set of uncorrelated variables called principal components. This article explores how PCA can be applied to logistic regression to remove multicollinearity and improve model performance.

Understanding Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, meaning they carry similar information about the variance in the response variable. This can lead to several problems: coefficient estimates become unstable and highly sensitive to small changes in the data, their standard errors are inflated, and it becomes difficult to judge the individual contribution of each predictor to the response.
Principal Component Analysis (PCA) for Multicollinearity

PCA can eliminate multicollinearity between features by merging highly correlated variables into a set of uncorrelated variables. It is an unsupervised pre-processing technique that relies on an orthogonal linear transformation of the data.
How Does PCA Deal with Multicollinearity?

PCA constructs new features (principal components) as linear combinations of the initial variables. The components are mutually uncorrelated, which avoids multicollinearity, and they are ordered so that the first principal component captures the largest share of the information (variance) in the initial variables.
One can find the principal components of a training set by using a standard matrix factorization technique called Singular Value Decomposition (SVD). PCA assumes that the dataset is centered around the origin; SVD then yields the unit vectors that point in the directions of the principal components.

Detecting and Visualizing Multicollinearity

To better understand multicollinearity, we can make use of the iris dataset. It consists of 3 different types of irises (Setosa, Versicolour, and Virginica) and has 4 features: sepal length, sepal width, petal length, and petal width. Let's load the iris dataset. The code is as follows:
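A minimal sketch of this step, assuming scikit-learn's bundled copy of the iris dataset (the `iris` variable is reused in later snippets):

```python
from sklearn.datasets import load_iris

# Load the iris dataset (150 samples, 4 numerical features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Inspect the feature names
print(iris.feature_names)
```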
Output: ['sepal length (cm)', 'sepal width (cm)',
'petal length (cm)', 'petal width (cm)']

Visualizing Correlation with a Scatter Plot Diagram

We can make use of a scatter plot diagram to plot every numerical attribute against every other numerical attribute. If there is a linear upward or downward trend (correlation) for more than one combination, we can conclude that there is multicollinearity. Let's visually understand which features are highly correlated. The code is as follows:
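A sketch of this plotting step, assuming Seaborn's pairplot() and the `iris` object loaded earlier (the `df` dataframe is reused below):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert the NumPy array into a Pandas dataframe for easier plotting
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Plot every numerical attribute against every other numerical attribute
sns.pairplot(df)
plt.show()
```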
Output:

[Figure: scatter plot of every numerical attribute against every other numerical attribute]

Here, we converted the iris dataset from a NumPy array to a Pandas dataframe, which makes plotting and correlation calculation easier. The dataframe is passed to the pairplot() method to plot every numerical attribute against every other numerical attribute. Clearly, there is a high correlation between petal length and petal width. We can also notice a strong correlation between sepal length and petal length.

Calculating the Correlation Value

With a scatter plot alone, we cannot quantify how strongly each pair of features is correlated. To obtain the actual values, we can make use of the Pandas dataframe's corr() method. Let's look at how much each attribute correlates with the other attributes.
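A sketch of the correlation calculation, continuing with the `df` dataframe from the previous snippet:

```python
# Pairwise Pearson correlation between all numerical attributes
corr_matrix = df.corr()
print(corr_matrix)
```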
Output:

                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000         -0.117570           0.871754          0.817941
sepal width (cm)           -0.117570          1.000000          -0.428440         -0.366126
petal length (cm)           0.871754         -0.428440           1.000000          0.962865
petal width (cm)            0.817941         -0.366126           0.962865          1.000000

Using the corr() method of the Pandas dataframe, we identified how much each attribute correlates with the other attributes.
Note: A correlation value close to 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation. Let's draw a heatmap for better visualization.
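A sketch using Seaborn's heatmap() on the same correlation matrix (the annot and cmap settings are illustrative choices):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the correlation matrix as an annotated heatmap
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()
```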
Output:

[Figure: heatmap of the correlation matrix]

Steps to Perform PCA for Removing Multicollinearity

We can go through the steps needed to implement PCA (a small sketch of these steps follows the list). They are as follows:

1. Center the data so that every feature has zero mean (and, typically, scale it to unit variance).
2. Compute the covariance matrix of the prepared features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix (or, equivalently, apply SVD to the centered data).
4. Sort the components by the amount of variance they explain and keep the top k components.
5. Project the original data onto the selected components to obtain the new, uncorrelated features.
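For intuition, a minimal NumPy sketch of these steps (illustrative only; the rest of the article relies on Scikit-Learn instead):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Step 1: center the data around the origin
X_centered = X - X.mean(axis=0)

# Steps 2-3: SVD of the centered data; the rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 4: keep the top k components
k = 3
W = Vt[:k].T                 # 4 x k projection matrix

# Step 5: project the data onto the selected components
X_projected = X_centered @ W
print(X_projected.shape)     # (150, 3)
```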
1. Implementing PCA to Remove Multicollinearity

Sklearn provides a handy class to implement PCA, so we don't need to implement the above steps ourselves. Let's apply principal component analysis (PCA) to the iris dataset.
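A minimal sketch, reusing the `iris` object loaded earlier and keeping three components as described below:

```python
from sklearn.decomposition import PCA

# Reduce the 4 original features to 3 uncorrelated principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(iris.data)

print(X_pca.shape)   # (150, 3)
```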
Scikit-Learn's PCA class makes use of SVD to implement PCA. Since we set n_components to 3, PCA creates 3 new features that are linear combinations of the 4 original features. Let's plot the irises across the three PCA dimensions. The code is as follows:
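A sketch of the 3-D scatter plot, assuming Matplotlib's 3-D projection and colouring the points by species:

```python
import matplotlib.pyplot as plt

# Plot the samples across the three principal components, coloured by species
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=iris.target)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.show()
```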
Output:

[Figure: PCA feature plot of the irises across the three PCA dimensions]

Here we plotted the three PCA components against each other using Matplotlib's scatter plot method. We can check the explained variance ratio of each principal component.
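A one-line sketch, reusing the fitted `pca` object from above:

```python
# Proportion of the dataset's variance captured by each principal component
print(pca.explained_variance_ratio_)
```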
Output:

[0.92461872 0.05306648 0.01710261]

From the output, we can conclude that 92.4% of the dataset's variance lies along the first PC, 5.3% along the second PC, and 1.7% along the third PC. The second and third PCs therefore carry very little information. Let's plot the explained variance ratio as a bar graph for each principal component.
Output:

[Figure: bar graph of the explained variance for each PC]

2. Training Logistic Regression with PCA

We have already applied PCA to the iris dataset. Now we can train a model using logistic regression. Let's split the PCA-transformed iris dataset into training and test sets. The code is as follows:
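A sketch of the split; the test_size and random_state values are assumptions (a 30% test split of 150 samples yields the 45 test instances implied by the accuracy reported later):

```python
from sklearn.model_selection import train_test_split

# Split the PCA-transformed features and the class labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.3, random_state=42
)
```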
Here we used the train_test_split() method from Scikit-Learn to split the dataset into training and test sets. Now we can use the training set to train a logistic regression model. The code is as follows:
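A minimal sketch of the training step (max_iter is an illustrative setting to ensure convergence):

```python
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the PCA-transformed training data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
```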
We trained our model using logistic regression. Let's check the prediction score of our model on the test data.
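A sketch of the evaluation step using accuracy_score on the held-out test set:

```python
from sklearn.metrics import accuracy_score

# Evaluate the trained model on the test set
y_pred = log_reg.predict(X_test)
print("The accuracy score:", accuracy_score(y_test, y_pred))
```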
Output:

The accuracy score: 0.9777777777777777

Conclusion

PCA is a dimensionality reduction algorithm that transforms a set of correlated variables into uncorrelated components. It effectively addresses multicollinearity by creating orthogonal variables that capture most of the variance in the data. However, dimensionality reduction may discard potentially relevant information, and the principal components can be harder to interpret than the original features. Even so, PCA provides a robust solution to multicollinearity while preserving the maximum variance in the data.