Applying PCA to Logistic Regression to remove Multicollinearity

Multicollinearity is a common issue in regression models, where predictor variables are highly correlated. This can lead to unstable estimates of regression coefficients, making it difficult to determine the effect of each predictor on the response variable. Principal Component Analysis (PCA) is a powerful technique to address this issue by transforming the original correlated variables into a set of uncorrelated variables called principal components. This article explores how PCA can be applied to logistic regression to remove multicollinearity and improve model performance.

Understanding Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, meaning they carry largely the same information about the response variable. This can lead to several problems (a quick numerical check, the variance inflation factor, is sketched after the list below):

  • Unstable Coefficient Estimates: Small changes in the data can lead to large changes in the estimated coefficients.
  • Reduced Model Interpretability: It becomes difficult to determine the individual effect of each predictor.
  • Inflated Standard Errors: This reduces the statistical power of hypothesis tests for the coefficients.
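
Before turning to PCA, it helps to have a quantitative check for the problem. A common diagnostic is the Variance Inflation Factor (VIF); the sketch below (using statsmodels and a small synthetic dataset, neither of which is part of this article's main workflow) flags predictors whose VIF exceeds the usual rule-of-thumb threshold of about 5-10.

Python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# two nearly identical predictors plus one independent predictor
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # almost a copy of x1
x3 = rng.normal(size=200)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# add an intercept column, then compute the VIF of every real predictor
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],
    index=X.columns)
print(vif)   # x1 and x2 get very large VIFs, x3 stays close to 1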

Principal Component Analysis (PCA) for Multicollinearity

PCA removes multicollinearity between features by transforming the highly correlated original variables into a set of uncorrelated variables. It is an unsupervised pre-processing technique that applies an orthogonal linear transformation to the data.

  • The greatest variance of data lies on the first coordinate (first principal component), the second greatest variance on the second coordinate (second principal component), and so on.
  • Here, each principal component is linearly uncorrelated.

How Does PCA Deal with Multicollinearity?

PCA constructs new features (principal components) as a linear combination or mixture of initial variables. Here, each component is uncorrelated, thereby avoiding multicollinearity. The majority of the information within the initial variables is projected onto the first principal component.

  • PCA finds the principal components by identifying the axis that preserves the maximum amount of variance in the training set. Equivalently, it selects the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
  • The first principal component therefore accounts for the maximum amount of variance. PCA then identifies a second axis, orthogonal to the first, that captures the largest amount of the remaining variance, and so on, up to the number of dimensions in the dataset. Since the components (the new features) are orthogonal to each other, PCA effectively deals with multicollinearity.

One can find the principal components of a training set by using a standard matrix factorization technique called Singular Value Decomposition (SVD). PCA assumes that the dataset is centered around the origin; for each principal component it finds a zero-centered unit vector pointing along that component's axis.
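
A minimal NumPy sketch of this idea, on a small synthetic matrix (purely illustrative and not part of the article's iris workflow): the data is centered, factorized with np.linalg.svd, and projected onto the leading components.

Python
import numpy as np

# small synthetic feature matrix with two correlated columns
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 3))
X = np.column_stack([base[:, 0],
                     base[:, 0] + 0.1 * base[:, 1],   # nearly a copy of column 0
                     base[:, 2]])

# 1. center the data around the origin (PCA assumes zero-centered data)
X_centered = X - X.mean(axis=0)

# 2. factorize the centered matrix with SVD
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# the rows of Vt are the zero-centered unit vectors (principal axes)
W2 = Vt[:2].T                  # keep the first two principal components
X_projected = X_centered @ W2  # project the data onto them

# fraction of the variance carried by each component
print(s**2 / np.sum(s**2))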

Detecting and Visualizing Multicollinearity

To better understand multicollinearity, we can make use of the iris dataset. It consists of 3 different types of irises (Setosa, Versicolour, and Virginica) and has 4 features: sepal length, sepal width, petal length, and petal width. 

Let’s load the iris dataset. The code is as follows:

Python
from sklearn import datasets

# load iris dataset
iris = datasets.load_iris()
# features
iris.feature_names

Output:

['sepal length (cm)', 'sepal width (cm)', 
'petal length (cm)', 'petal width (cm)']

Visualizing Correlation with a Scatter-Plot Matrix

We can use a scatter-plot matrix that plots every numerical attribute against every other numerical attribute. If more than one pair of features shows a clear linear upward or downward trend (correlation), we can conclude that there is multicollinearity. Let's visually identify which features are highly correlated. The code is as follows:

Python
import pandas as pd
import seaborn as sns

# convert iris data from numpy array to pandas dataframe 
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
# plot scatter plot to identify correlated features
sns.pairplot(data=iris_df)

Output:

[Figure: scatter-plot matrix of every numerical attribute against every other numerical attribute]

Here, we converted the iris dataset from a NumPy array to a Pandas dataframe, which makes plotting and correlation calculation easier. The dataframe is passed to the pairplot() method to plot every numerical attribute against every other numerical attribute.

Clearly, there is a strong correlation between petal length and petal width. We can also notice a strong correlation between sepal length and petal length.

Calculating the Correlation Value

A scatter plot does not tell us how strongly each pair of features is correlated. To quantify this, we can use the Pandas dataframe's corr() method. Let's look at how much each attribute correlates with the other attributes.

Python
iris_corr = iris_df.corr()
print(iris_corr)

Output:

                   sepal length (cm)  sepal width (cm)  petal length (cm)  \
sepal length (cm)           1.000000         -0.117570           0.871754   
sepal width (cm)           -0.117570          1.000000          -0.428440   
petal length (cm)           0.871754         -0.428440           1.000000   
petal width (cm)            0.817941         -0.366126           0.962865   

                   petal width (cm)  
sepal length (cm)          0.817941  
sepal width (cm)          -0.366126  
petal length (cm)          0.962865  
petal width (cm)           1.000000  

Using the corr() method from the Pandas dataframe, we identified how much each attribute correlates with the other attributes.

  • Petal length and petal width show a strong positive correlation of 0.96.
  • Sepal length and petal length show a positive correlation of 0.87.
  • Sepal length and petal width show a positive correlation of 0.82.

Note: A correlation value close to 1 indicates a strong positive correlation, while a value close to -1 indicates a strong negative correlation.
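
On datasets with many features, scanning the full matrix by eye becomes impractical. Here is a short sketch of one way to list only the pairs whose absolute correlation exceeds a chosen threshold (the 0.8 cutoff is arbitrary), reusing the iris_corr dataframe computed above:

Python
import numpy as np

# keep only the upper triangle so every pair appears exactly once
upper = np.triu(np.ones(iris_corr.shape, dtype=bool), k=1)
corr_pairs = iris_corr.where(upper).stack()

# report the pairs whose absolute correlation exceeds the threshold
print(corr_pairs[corr_pairs.abs() > 0.8])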

Let’s draw a heatmap for a better visualization experience.

Python
sns.heatmap(iris_corr, vmin=-1, vmax=1, annot=True)

Output:

[Figure: correlation heatmap of the iris features]

Steps to Perform PCA for Removing Multicollinearity

We can go through the steps needed to implement PCA (a minimal NumPy sketch of these steps follows the list). They are as follows:

  1. Mean-center or normalize the data so that each feature contributes equally to the analysis.
  2. Compute the covariance matrix to understand how the input features vary from the mean with respect to each other (to identify correlation or inverse correlation).
  3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components. Eigenvectors give the directions of the axes with the most variance, and eigenvalues give the amount of variance along each principal component.
  4. Form the feature vector by discarding the components with low (less significant) eigenvalues.
  5. Reorient the data using the feature vector (from the original axes to the principal components).
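
As a rough illustration of the steps above (not required later, since Scikit-Learn automates all of this), here is how they can be traced in NumPy on the iris features:

Python
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data

# Step 1: mean-center the features (optionally also divide by the std)
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix of the centered features
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 4: keep the eigenvectors with the largest eigenvalues (top 2 here)
order = np.argsort(eig_vals)[::-1]
feature_vector = eig_vecs[:, order[:2]]

# Step 5: reorient the data onto the principal components
X_pca_manual = X_centered @ feature_vector
print(X_pca_manual.shape)   # (150, 2)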

1. Implementing PCA to Remove Multicollinearity

Sklearn provides a handy class to implement PCA, so we don’t need to implement the above steps. Let’s apply principal component analysis (PCA) to the iris dataset.

Python
from sklearn.decomposition import PCA

# applying PCA
pca_iris = PCA(n_components=3)
X_pca = pca_iris.fit_transform(iris.data)

Scikit-Learn's PCA class uses SVD under the hood and centers the data automatically. Since we set n_components to 3, PCA creates 3 new features, each a linear combination of the 4 original features.

Let’s plot the irises across the three PCA dimensions. The code is as follows:

Python
import matplotlib.pyplot as plt

fig = plt.figure(1, figsize=(7, 6))
ax = fig.add_subplot(111, projection="3d", elev=-155, azim=112)
# scatter plot of the data along the first three principal components
sctr = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2],
           c=iris.target, s=45,)
ax.legend(sctr.legend_elements()[0], iris.target_names,
          loc="upper right", title="Classes")

ax.set_xlabel("PC1 (1st Eigenvector)")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("PC2 (2nd Eigenvector)")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("PC3 (3rd Eigenvector)")
ax.zaxis.set_ticklabels([])

plt.show()

Output:

[Figure: 3D scatter plot of the iris samples along the first three principal components]

Here we plotted the samples along the three principal components using Matplotlib's 3D scatter plot.

We can check the explained variance ratio of each principal component.

Python
exp_var_ratio = pca_iris.explained_variance_ratio_
print(exp_var_ratio)

Output:

[0.92461872 0.05306648 0.01710261]

From the output, we can conclude that 92.4% of the dataset's variance lies along the first PC, 5.3% along the second PC, and 1.7% along the third PC. The second and third PCs carry very little information.
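
Instead of fixing n_components by hand, Scikit-Learn's PCA also accepts a float between 0 and 1, in which case it keeps the smallest number of components whose cumulative explained variance reaches that fraction. A small sketch (the 0.95 threshold is just an example):

Python
from sklearn.decomposition import PCA

# keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(iris.data)
print(pca_95.n_components_)   # number of components actually kept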

Let’s plot the explained variance ratio as a bar graph for each principal component.

Python
plt.figure(figsize=(6, 4))
plt.bar(range(3), exp_var_ratio, alpha=0.8,
        align='center')
plt.xticks(range(3), ['PC1', 'PC2', 'PC3'])
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.tight_layout()
plt.show()

Output:

[Figure: bar chart of the explained variance ratio of each principal component]

2. Training Logistic Regression with PCA

We already applied PCA to the iris dataset. Now we can train a model using logistic regression. Let’s split the PCA-applied iris dataset into training and test sets. The code is as follows:

Python
from sklearn.model_selection import train_test_split

# split training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, iris.target, test_size=0.3,
    random_state=20, stratify=iris.target)

Here we used the train_test_split() method from Scikit-Learn to split the dataset into training and test sets. Now we use this data to train a logistic regression model. The code is as follows:

Python
from sklearn.linear_model import LogisticRegression

# logistic regression model
log = LogisticRegression()
log.fit(X_train, y_train)

We have now trained our logistic regression model. Let's check its prediction accuracy on the test data.

Python
from sklearn.metrics import accuracy_score

# predict using test data
prediction = log.predict(X_test)
# calculate score using accuracy metric
ac_score = accuracy_score(y_test, prediction)
print('The accuracy score:', ac_score)

Output:

The accuracy score: 0.9777777777777777
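
For reference, the same PCA-plus-logistic-regression workflow can also be written as a single Scikit-Learn Pipeline. The sketch below is one possible arrangement rather than the article's original code; it adds a StandardScaler step before PCA (in line with step 1 of the PCA procedure) and keeps the PCA fit restricted to the training data.

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3,
    random_state=20, stratify=iris.target)

# chain scaling, PCA and logistic regression into one estimator
pca_logreg = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('logreg', LogisticRegression()),
])
pca_logreg.fit(X_train, y_train)
print(pca_logreg.score(X_test, y_test))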

Conclusion

PCA is a dimensionality reduction algorithm that transforms a set of correlated variables into uncorrelated components. It effectively addresses multicollinearity by creating orthogonal variables that capture most of the data variance.

However, the dimensionality reduction may discard potentially relevant information, and the principal components can be harder to interpret than the original features. Nevertheless, PCA provides a robust solution to multicollinearity while preserving the maximum variance in the data.



