Bee Swarm with SHAP Values for Random Forest in R

SHAP (SHapley Additive exPlanations) values are a powerful technique for interpreting model predictions: they quantify how much each feature pushes a prediction up or down. By visualizing SHAP values with bee swarm plots, one can better understand feature importance and interactions. This article is a step-by-step guide to interpreting Random Forest models with SHAP values, focusing on creating bee swarm plots in the R Programming Language.

Understanding SHAP Values and Bee Swarm Plots

SHAP values provide a unified measure of how much each feature contributes to a machine learning prediction. Key properties of SHAP values include:

  • Additivity: The SHAP values of all features sum to the actual prediction minus the mean (baseline) prediction.
  • Fairness: Each feature's contribution is allocated fairly, by averaging its marginal contribution over all possible coalitions of the other features.
  • Consistency: If a model changes so that a feature's marginal contribution increases (or stays the same), the SHAP value for that feature does not decrease.
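To make the fairness and additivity properties concrete, here is a minimal sketch (the model, instance, and variable names below are illustrative, not from this article) that computes exact Shapley values for a tiny two-feature model by averaging marginal contributions over both join orders:

```r
# Toy model and instance (illustrative values only)
f  <- function(x1, x2) 2 * x1 + 3 * x2   # a simple additive model
x  <- c(x1 = 1, x2 = 2)                  # instance to explain
bg <- c(x1 = 0, x2 = 0)                  # background (mean) feature values

# Value of a coalition S: features in S take the instance's values,
# all other features stay at the background values
v <- function(S) {
  z <- bg
  z[S] <- x[S]
  as.numeric(f(z["x1"], z["x2"]))
}

# Shapley value: average marginal contribution over both join orders
phi1 <- mean(c(v("x1") - v(character(0)),          # x1 joins the empty coalition
               v(c("x1", "x2")) - v("x2")))        # x1 joins {x2}
phi2 <- mean(c(v("x2") - v(character(0)),
               v(c("x1", "x2")) - v("x1")))

# Additivity: the Shapley values sum to prediction minus baseline
phi1 + phi2 == v(c("x1", "x2")) - v(character(0))  # TRUE
```

With two features there are only two join orders, so the averaging is easy to see; real SHAP implementations approximate this average over many coalitions.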

Bee Swarm Plots

A bee swarm plot visualizes the distribution of SHAP values for each feature across all samples. This visualization combines elements of scatter and violin plots, showing both individual points and their density. On a bee swarm plot:

  • Each point corresponds to the SHAP value of one instance for that feature.
  • The color typically encodes the feature's value, from lowest to highest.
  • The horizontal position shows the magnitude and sign of the SHAP value.

Bee swarm plots thus reveal which features drive model predictions and how their values influence those predictions.
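To see what such a plot looks like before touching a real model, here is a quick sketch on simulated SHAP values (the data and feature names below are random, purely for illustration):

```r
library(ggplot2)
library(ggbeeswarm)

set.seed(42)
# Simulated SHAP values for two hypothetical features
sim <- data.frame(
  feature       = rep(c("feature_A", "feature_B"), each = 50),
  shap_value    = c(rnorm(50, mean = 0.2,  sd = 0.10),
                    rnorm(50, mean = -0.1, sd = 0.05)),
  feature_value = runif(100)
)

# One row per instance/feature; overlapping points are spread vertically
ggplot(sim, aes(x = shap_value, y = feature, color = feature_value)) +
  geom_beeswarm() +
  scale_color_gradient(low = "blue", high = "red") +
  labs(x = "SHAP value", y = NULL, color = "Feature value") +
  theme_minimal()
```

Each "swarm" shows how the feature's SHAP values are distributed, and the color gradient hints at whether high or low feature values push predictions up or down.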

Step 1: Preparing Data for SHAP Analysis

Before we can make a bee swarm plot, we need to prepare the data and train a Random Forest model. The steps are:

  • Load the Data: Import the dataset and perform any preprocessing it requires, such as handling missing values, encoding categorical variables, and scaling numerical features.
  • Train the Model: Fit a Random Forest model on the prepared dataset.
  • Compute SHAP Values: Compute SHAP values for the trained model using a library that supports SHAP calculation.

In R, the randomForest package can be used to train the model, and the iml package can be utilized for SHAP values computation.
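If any of these packages are missing, they can be installed from CRAN first (a one-off setup step):

```r
# Install the packages used in this tutorial (only needed once)
install.packages(c("randomForest", "iml", "ggplot2",
                   "dplyr", "ggbeeswarm", "tidyr"))
```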

R
# Load the libraries
library(randomForest)
library(iml)
library(ggplot2)
library(dplyr)
library(ggbeeswarm)
library(tidyr)

# Load the Iris dataset
data(iris)

# Split the data into training and testing sets
set.seed(123)
train_indices <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Train a Random Forest model
rf_model <- randomForest(Species ~ ., data = train_data, importance = TRUE)
print(rf_model)

Output:

Call:
randomForest(formula = Species ~ ., data = train_data, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of error rate: 5.71%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         36          0         0  0.00000000
versicolor      0         29         3  0.09375000
virginica       0          3        34  0.08108108
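The OOB error above is computed on the training data's out-of-bag samples. Before moving on to SHAP values, it can also be worth checking the model on the held-out test set; a small sketch, continuing from the `rf_model` and `test_data` objects created above:

```r
# Predict on the held-out test set and build a confusion matrix
test_pred <- predict(rf_model, newdata = test_data)
confusion <- table(Predicted = test_pred, Actual = test_data$Species)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```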

Step 2: Computing SHAP Values

Next, we compute SHAP values for a single instance from the test set.

R
# Create a Predictor object for the iml package
predictor <- Predictor$new(rf_model, data = test_data %>% select(-Species),
                           y = test_data$Species)

# Compute SHAP values for the first instance in the test set
shapley <- iml::Shapley$new(predictor, x.interest = test_data[1, -5])

# Extract SHAP values
shap_values <- shapley$results

# Ensure column names are unique
shap_values <- as.data.frame(shap_values)
colnames(shap_values) <- make.unique(colnames(shap_values))
print(shap_values)

Output:

        feature      class   phi    phi.var      feature.value
1  Sepal.Length     setosa  0.05 0.04797980   Sepal.Length=5.1
2   Sepal.Width     setosa  0.04 0.03878788    Sepal.Width=3.5
3  Petal.Length     setosa  0.35 0.22979798   Petal.Length=1.4
4   Petal.Width     setosa  0.32 0.21979798    Petal.Width=0.2
5  Sepal.Length versicolor -0.05 0.04797980   Sepal.Length=5.1
6   Sepal.Width versicolor -0.04 0.03878788    Sepal.Width=3.5
7  Petal.Length versicolor -0.12 0.14707071   Petal.Length=1.4
8   Petal.Width versicolor -0.12 0.10666667    Petal.Width=0.2
9  Sepal.Length  virginica  0.00 0.00000000   Sepal.Length=5.1
10  Sepal.Width  virginica  0.00 0.00000000    Sepal.Width=3.5
11 Petal.Length  virginica -0.23 0.17888889   Petal.Length=1.4
12  Petal.Width  virginica -0.20 0.16161616    Petal.Width=0.2

In this table, phi is the estimated Shapley value: the contribution of each feature to the prediction of each class (setosa, versicolor, virginica) for this instance, and phi.var is the variance of that estimate (iml approximates Shapley values by sampling coalitions). Positive phi values push the prediction toward the class, negative values push it away. Here, Petal.Length and Petal.Width contribute most strongly toward predicting setosa for this instance.
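As a sanity check on the additivity property, the phi values can be summed per class (a sketch continuing from the `shap_values` data frame above); each sum should approximate the difference between the predicted probability for this instance and the average predicted probability for that class:

```r
# Sum the Shapley values of all features within each class
aggregate(phi ~ class, data = shap_values, FUN = sum)
```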

Step 3: Preparing SHAP Values and Creating a Bee Swarm Plot

Finally, we plot the Shapley values computed by the iml package as a bee swarm plot, constructed with ggplot2 and ggbeeswarm.

R
# Keep one row per feature/class pair; the phi column holds the SHAP value.
# (The data frame already has a feature column, so no reshaping is needed.)
shap_plot_data <- shap_values %>%
  select(feature, class, phi)

# Create the bee swarm plot
ggplot(shap_plot_data, aes(x = phi, y = feature, color = phi)) +
  geom_beeswarm() +
  labs(title = "Bee Swarm Plot of SHAP Values",
       x = "SHAP Value",
       y = "Feature",
       color = "SHAP Value") +
  theme_minimal()

Output:


Bee Swarm with SHAP Values for Random Forest
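Note that with a single explained instance, the plot shows only a few points per feature. For a fuller bee swarm, the same Shapley computation can be repeated over many test instances. A sketch, continuing from the `predictor` and `test_data` objects above (this loop can be slow, since iml estimates each explanation by sampling):

```r
# Compute SHAP values for every instance in the test set
shap_all <- do.call(rbind, lapply(seq_len(nrow(test_data)), function(i) {
  res <- Shapley$new(predictor, x.interest = test_data[i, -5])$results
  res$instance <- i   # remember which instance each row explains
  res
}))

# Bee swarm of per-instance SHAP values for one class
ggplot(subset(shap_all, class == "setosa"),
       aes(x = phi, y = feature, color = phi)) +
  geom_beeswarm() +
  labs(title = "Bee Swarm of SHAP Values (class: setosa)",
       x = "SHAP Value", y = "Feature", color = "SHAP Value") +
  theme_minimal()
```

With one point per test instance, each feature's swarm now shows the distribution of its contributions across the whole test set, which is the usual form of a SHAP bee swarm plot.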

Conclusion

SHAP values and their bee swarm plots represent a significant step forward in interpretable machine learning, because they provide a principled way of understanding model predictions. These techniques explain intuitively how much each feature contributes to the predictions of complex models such as random forests. With SHAP values, data scientists can gain deep insight into how individual features influence predictions, increasing model transparency and hence trust.




Referred: https://www.geeksforgeeks.org

