How to find which columns affect a prediction in R

Understanding which features (columns) in a dataset most influence a model’s predictions is crucial for interpreting and trusting the model’s results. This process, usually discussed under the headings of feature importance and model interpretability, helps identify the key factors driving predictions and can reveal useful patterns in the data. In R, feature importance can be assessed in several ways, including built-in model functions, permutation importance, partial dependence plots, and SHAP values. This article walks through each of these methods in the R Programming Language.

Methods to Determine Feature Importance

Here are the main methods covered in this article:

  • Using Built-in Functions for Specific Models
  • Permutation Importance
  • Partial Dependence Plots
  • SHAP Values

Method 1: Using Built-in Functions for Specific Models

Many machine learning models in R have built-in functions to calculate feature importance. For example, the randomForest package provides the importance() function. To demonstrate, we will use the mtcars dataset to build a Random Forest model that predicts miles per gallon (mpg) and then determine which features are most important for that prediction.

Step 1: Load the Required Libraries

First, install (if necessary) and load the required libraries.

R
# Install necessary packages if not already installed
install.packages("randomForest")
install.packages("caret")
install.packages("vip")
install.packages("pdp")
install.packages("iml")

# Load the libraries
library(randomForest)
library(caret)
library(vip)
library(pdp)
library(iml)

Step 2: Load and Inspect the Data

Next, load the built-in mtcars dataset and inspect the first few rows.

R
# Load the dataset
data(mtcars)

# Inspect the first few rows of the dataset
head(mtcars)

Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Step 3: Train a Random Forest Model

Now, train a Random Forest model with importance = TRUE so that importance scores are computed, and view them with importance().

R
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

# View the feature importance
importance(rf_model)

Output:

       %IncMSE IncNodePurity
cyl  11.681887    170.967443
disp 13.707322    247.133710
hp   13.408444    185.917131
drat  5.254248     73.121467
wt   13.966207    261.456359
qsec  3.668855     35.936897
vs    3.566838     28.874233
am    2.484465      9.418205
gear  3.195970     16.836503
carb  5.561653     24.701147
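
In this output, %IncMSE is the permutation-based measure (the increase in out-of-bag mean squared error when a feature’s values are shuffled) and IncNodePurity is the total reduction in node impurity from splits on that feature. To rank the predictors directly, a quick sketch:

R
# Order the importance matrix by the permutation-based %IncMSE column
imp <- importance(rf_model)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]

Here wt, disp, and hp come out on top, which matches the plot produced in the next step.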

Step 4: Plot Feature Importance

Finally, plot the feature importance scores using the vip package.

R
# Plot feature importance using the vip package
vip(rf_model)

Output:

[Figure: Variable importance plot for the Random Forest model, produced by vip(rf_model)]
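
If you prefer to stay within the randomForest package, its varImpPlot() function draws both importance measures side by side and gives essentially the same ranking:

R
# Dot charts of %IncMSE and IncNodePurity from the fitted model
varImpPlot(rf_model, main = "Random Forest variable importance")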

Method 2: Using Permutation Importance

Permutation importance involves shuffling the values of each feature and measuring how much the model’s performance degrades as a result. This method is model-agnostic and can be applied to any machine learning model. Below we use caret with cross-validation; a fully model-agnostic sketch using the iml package follows the output.

R
# Use the caret package to calculate permutation importance
set.seed(123)
control <- trainControl(method = "cv", number = 5)
model <- train(mpg ~ ., data = mtcars, method = "rf", trControl = control, 
                 importance = TRUE)

# Extract and plot variable importance
varImp(model)

Output:

rf variable importance

     Overall
wt    100.00
disp   93.73
hp     88.86
cyl    70.58
carb   31.61
drat   31.02
am     20.94
qsec   20.89
vs     13.04
gear    0.00
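
The caret ranking above comes from the Random Forest’s own permutation measure. For an importance estimate that is model-agnostic in the strict sense, one option is FeatureImp from the iml package loaded earlier; the sketch below (which wraps the caret model in an iml Predictor, the same construct used in Method 4) repeatedly shuffles each column and records the increase in mean squared error.

R
# Wrap the fitted caret model in an iml Predictor object
predictor_rf <- Predictor$new(model, data = mtcars[, -1], y = mtcars$mpg)

# Permutation importance: shuffle each feature and measure the rise in MSE
perm_imp <- FeatureImp$new(predictor_rf, loss = "mse")

# Inspect the scores and plot them
perm_imp$results
perm_imp$plot()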

Method 3: Partial Dependence Plots

Partial Dependence Plots (PDPs) show the marginal effect of a feature on the predicted outcome, averaged over the values of all other features.

R
# Create a PDP for a specific feature
pdp::partial(model, pred.var = "wt", plot = TRUE)

Output:

[Figure: Partial dependence plot of predicted mpg against wt]
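
partial() is not limited to a single feature. Passing two predictors (wt and hp here, picked purely for illustration) produces a two-dimensional partial dependence surface showing their joint effect on predicted mpg:

R
# Joint partial dependence of predicted mpg on wt and hp
# chull = TRUE restricts the grid to the convex hull of the observed values
pdp::partial(model, pred.var = c("wt", "hp"), plot = TRUE, chull = TRUE)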

Method 4: SHAP Values

SHAP (SHapley Additive exPlanations) values provide a unified measure of feature importance and offer insights into how each feature contributes to a prediction.

R
# Create a predictor object using the iml package
predictor <- Predictor$new(model, data = mtcars[, -1], y = mtcars$mpg)

# Calculate SHAP values
shap_values <- Shapley$new(predictor, x.interest = mtcars[1, -1])

# Plot SHAP values
shap_values$plot()

Output:

[Figure: SHAP value plot showing each feature’s contribution to the prediction for the first observation]
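
The plot is backed by a results data frame stored inside the Shapley object, and the same predictor can be reused to explain any other row; a short sketch:

R
# Numeric SHAP contributions (phi) for the explained observation
shap_values$results

# Explain a different observation, e.g. the second row of mtcars
shap_values_2 <- Shapley$new(predictor, x.interest = mtcars[2, -1])
shap_values_2$plot()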

Conclusion

Determining which features affect a prediction is a critical aspect of model interpretability. R provides several methods to assess feature importance, including built-in functions for specific models, permutation importance, partial dependence plots, and SHAP values. Each method offers unique insights into the contribution of features to the model’s predictions. By leveraging these techniques, you can gain a deeper understanding of your models and the underlying data.



