Feature importance is a critical concept in machine learning, particularly when using ensemble methods like RandomForestClassifier. It helps in understanding which features contribute the most to the prediction of the target variable. This article delves into how feature importances are determined in RandomForestClassifier, the methods used, and their significance.
Understanding RandomForestClassifier
RandomForestClassifier is an ensemble learning method that constructs multiple decision trees during training and predicts the class favored by the individual trees. Each tree in the forest is trained on a different random (bootstrap) subset of the data, and the final prediction is obtained by majority voting across the trees; the regression counterpart, RandomForestRegressor, averages the trees' predictions instead. It is known for its robustness and its ability to handle large, high-dimensional datasets.
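The way the trees are combined can be checked directly. The sketch below is only an illustration, assuming scikit-learn is installed; the variable names and the small `n_estimators` value are arbitrary choices. Note that scikit-learn combines trees by averaging their predicted class probabilities rather than counting one hard vote per tree, which usually agrees with a majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by every individual tree.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
avg_proba = tree_probas.mean(axis=0)

# The class with the highest averaged probability is the forest's prediction.
manual_pred = forest.classes_[avg_proba.argmax(axis=1)]

matches_proba = np.allclose(avg_proba, forest.predict_proba(X))
agreement = (manual_pred == forest.predict(X)).mean()
print(matches_proba, agreement)
```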
How Feature Importances are Determined for RandomForestClassifier?
Feature importance in RandomForestClassifier can be estimated with several different methods, each of which captures how much a feature contributes to the predictions made by the model.
1. Gini Importance (Mean Decrease in Impurity)
Gini importance, also referred to as mean decrease in impurity (MDI), is the total reduction in node impurity attributable to a feature, weighted by the probability of reaching each node and averaged over all trees within the forest. For classification problems, impurity is measured with the Gini index by default.
Steps to Calculate MDI:
- Compute Node Impurity Decrease: For each node that splits on a feature, calculate the decrease in Gini impurity produced by that split.
- Weight by Node Probability: Multiply the impurity decrease by the probability of reaching that node, that is, the fraction of samples that reach it.
- Aggregate Across Trees: Sum the weighted decreases per feature within each tree, then average these values across all the trees in the forest.
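The steps above can be sketched by walking the fitted trees by hand. This is a minimal illustration rather than the library's actual implementation; `mdi_for_tree` is a hypothetical helper, and the sketch assumes every tree contains at least one split. It relies on the `tree_` attributes that scikit-learn exposes (`impurity`, `weighted_n_node_samples`, `children_left`, `children_right`, `feature`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def mdi_for_tree(fitted_tree, n_features):
    """Normalized mean decrease in impurity for one fitted decision tree."""
    t = fitted_tree.tree_
    n = t.weighted_n_node_samples
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, so no impurity decrease
            continue
        # Impurity decrease of this split, weighted by the share of
        # samples reaching the node and each of its children.
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right]) / n[0]
        importances[t.feature[node]] += decrease
    return importances / importances.sum()


X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree importances, then renormalize so they sum to 1.
per_tree = np.array([mdi_for_tree(est, X.shape[1]) for est in forest.estimators_])
manual_mdi = per_tree.mean(axis=0)
manual_mdi /= manual_mdi.sum()

print(manual_mdi)
print(np.allclose(manual_mdi, forest.feature_importances_))
```

Comparing against `feature_importances_` is a quick sanity check that the hand-rolled weighting matches what the fitted forest reports.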
2. Permutation Importance (Mean Decrease in Accuracy)
Permutation importance, also called mean decrease in accuracy (MDA), evaluates a feature's contribution by randomly permuting its values and measuring how much the model's accuracy changes.
- Train Model: Train the RandomForestClassifier on the original dataset and record its baseline accuracy.
- Permute Feature Values: Randomly permute the values of a single feature across all samples.
- Measure Performance Drop: Evaluate the model on the permuted dataset and measure the drop in accuracy relative to the baseline.
- Repeat: Do the above for every feature, repeating the permutation several times per feature and averaging the drops.
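The loop described above is short enough to write out by hand. The following is a sketch under simple assumptions: the model is scored on the same data it was trained on, accuracy is the metric, and names such as `drops` and `n_repeats` are illustrative choices rather than anything prescribed by a library.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
n_repeats = 10
baseline = model.score(X, y)  # accuracy before any shuffling

drops = np.zeros((X.shape[1], n_repeats))
for j in range(X.shape[1]):
    for r in range(n_repeats):
        X_perm = X.copy()
        # Shuffle one column only, breaking its link with the target.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops[j, r] = baseline - model.score(X_perm, y)

# Average drop in accuracy per feature across the repeats.
manual_perm_importance = drops.mean(axis=1)
print(manual_perm_importance)
```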
3. TreeSHAP (SHapley Additive exPlanations)
TreeSHAP is a technique that calculates consistent feature attributions based on game theory. It computes a Shapley value for each feature, which reflects that feature's contribution to an individual prediction.
- Coalitional Game Theory: All features are treated as players in a cooperative game whose payout is the model's prediction.
- Shapley Values: All possible subsets of features are considered, and for each feature the change in the prediction when it joins a subset is measured; the weighted average of these changes is its marginal contribution.
- Aggregate: The per-sample contributions are aggregated to get the overall feature importance.
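To make the Shapley idea concrete, the sketch below computes exact Shapley values by brute force for a hypothetical two-player game. The `payoffs` table and the player names `"a"` and `"b"` are made up for illustration and are not taken from any model. TreeSHAP reaches the same kind of values for tree ensembles without enumerating every subset, which is what makes it practical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (small games only)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly this coalition precedes p
                # in a uniformly random ordering of the players.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical payoff for each coalition of "features".
payoffs = {
    frozenset(): 0.0,
    frozenset({"a"}): 10.0,
    frozenset({"b"}): 20.0,
    frozenset({"a", "b"}): 50.0,
}

phi = shapley_values(["a", "b"], payoffs.__getitem__)
print(phi)  # {'a': 20.0, 'b': 30.0}
```

The two values sum to 50, the payoff of the full coalition, which is the additivity property that SHAP builds on.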
Determining Feature Importance for RandomForestClassifier: Implementation
The Iris dataset is a classic example dataset in machine learning. A RandomForestClassifier can be applied to classify the species of Iris flowers using features like sepal length, sepal width, petal length, and petal width. A feature importance analysis may then point to petal length and petal width as the most influential features for the classification, which shows what matters most in differentiating the species. Let's implement different techniques to determine the feature importance in RandomForestClassifier.
We can train a model on the Iris dataset using RandomForestClassifier from sklearn. The code is as follows:
Python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Feature names used to label the importance tables below
feature_names = iris.feature_names
# Train RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
Calculate Feature Importance Using Gini Importance
From the fitted random forest classifier we can read the Gini importances from the feature_importances_ attribute. The underlying computation is as follows:
- For each tree in the forest:
  - Compute the decrease in Gini impurity for each node split using each feature.
  - Weight the impurity decrease by the probability of reaching that node.
  - Sum these weighted impurity decreases across all nodes for each feature.
- Average the importance values across all trees to get the Gini Importance for each feature.
Python
import pandas as pd
# Calculate Gini Importance (Mean Decrease in Impurity)
feature_importances = clf.feature_importances_
feature_imp_df = pd.DataFrame({'Feature': feature_names,
                               'Gini Importance': feature_importances})
print(feature_imp_df)
Output:
             Feature  Gini Importance
0  sepal length (cm)         0.106128
1   sepal width (cm)         0.021678
2  petal length (cm)         0.436130
3   petal width (cm)         0.436065
Calculate Feature Importance Using Permutation Importance
To calculate the mean decrease in accuracy (permutation importance), let's make use of the permutation_importance function from sklearn. The procedure is as follows:
- Train the RandomForestClassifier on the original dataset and compute its accuracy.
- For each feature:
  - Randomly permute the values of that feature across all samples.
  - Evaluate the model accuracy on the permuted dataset.
  - Compute the drop in accuracy caused by the permutation for that feature.
- Aggregate the accuracy drops across multiple permutations to get the Permutation Importance for each feature.
Python
from sklearn.inspection import permutation_importance
import pandas as pd
# Calculate Mean Decrease in Accuracy Permutation Importance
result = permutation_importance(clf, X, y, n_repeats=10,
                                random_state=42)
perm_importances = result.importances_mean
perm_imp_df = pd.DataFrame({'Feature': feature_names,
                            'Permutation Importance': perm_importances})
print(perm_imp_df)
Output:
             Feature  Permutation Importance
0  sepal length (cm)                0.014667
1   sepal width (cm)                0.012667
2  petal length (cm)                0.222667
3   petal width (cm)                0.180667
A high permutation importance value indicates that shuffling the values of that feature leads to a large decrease in the model's accuracy. Here, petal length and petal width have by far the largest effect.
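The scores above are computed on the same data the model was trained on. A common variation is to measure permutation importance on held-out data, so that the drop reflects generalization rather than memorization. The sketch below assumes a 70/30 stratified split; the names `held_out_clf` and `held_out_result` are illustrative, and the resulting numbers will differ from the table above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Fit on the training split only.
held_out_clf = RandomForestClassifier(n_estimators=100, random_state=42)
held_out_clf.fit(X_train, y_train)

# Permute each feature in the test split and record the accuracy drop.
held_out_result = permutation_importance(
    held_out_clf, X_test, y_test, n_repeats=10, random_state=42, scoring="accuracy"
)

for name, mean, std in zip(iris.feature_names,
                           held_out_result.importances_mean,
                           held_out_result.importances_std):
    print(f"{name:<20} {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation across repeats alongside the mean helps judge whether a small importance is distinguishable from noise.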
Calculate Feature Importance Using TreeSHAP
Let's calculate TreeSHAP values using the TreeExplainer class from the shap library. The procedure is as follows:
- Compute SHAP values for each feature using TreeSHAP:
  - Treat features as players in a cooperative game where the goal is to predict the target variable.
  - Calculate the marginal contribution (SHAP value) of each feature to the prediction.
- Aggregate the absolute SHAP values across all samples to determine the TreeSHAP importance for each feature.
Python
import shap
import numpy as np
import pandas as pd
# Calculate TreeSHAP (SHapley Additive exPlanations)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)
# Some shap versions return a list with one (n_samples, n_features) array
# per class; stack it so the shape is (n_samples, n_features, n_classes)
if isinstance(shap_values, list):
    shap_values = np.stack(shap_values, axis=-1)
# Mean absolute SHAP value per feature, averaged over samples and classes
shap_summary_list = np.abs(shap_values).mean(axis=(0, 2))
# Create DataFrame using 1D list of SHAP values
shap_summary_df = pd.DataFrame({'Feature': feature_names,
                                'SHAP values': shap_summary_list})
print(shap_summary_df)
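Output:
             Feature  SHAP values
0  sepal length (cm)     0.038240
1   sepal width (cm)     0.006965
2  petal length (cm)     0.202161
3   petal width (cm)     0.199517
The table reports mean absolute SHAP values, so every entry is non-negative and a larger value means a stronger overall influence. For an individual prediction, a positive SHAP value indicates that the feature pushes the prediction toward a class, whereas a negative value indicates that it pushes the prediction away from it.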
Feature Importances: Applications and Examples
Knowing the feature importances is useful for:
- Feature Selection: Identifying the most relevant features makes it possible to reduce the dimensionality of the model, which can improve performance while lowering computational cost. For example, in medical diagnostics, pinpointing the important biomarkers leads to disease prediction models that are both efficient and accurate.
- Model Interpretability: Feature importance shows which features most affect the predictions. This is essential in areas like finance and healthcare, where understanding the decision-making process is as important as the predictions themselves.
- Enhancing Domain Knowledge: The analysis of feature importance often provides insight into latent patterns in the data and contributes to domain knowledge. For example, in environmental science, knowing the impact of various pollutants on air quality can inform policy and mitigation strategies.
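As one way to act on the feature selection point, scikit-learn's SelectFromModel can keep only the features whose importance clears a threshold. The sketch below is illustrative: the `"median"` threshold and the variable names are assumptions chosen for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit a forest inside the selector and keep the features whose
# Gini importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
selector.fit(iris.data, iris.target)

X_reduced = selector.transform(iris.data)
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]

print(X_reduced.shape)
print(kept)
```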
Conclusion
For a RandomForestClassifier, feature importance can be determined in several ways, including Gini importance, permutation importance, and TreeSHAP. Each of the three sheds light on how much an individual feature contributes to the model's performance or predictions. This improves our understanding of the model and supports both feature selection and the growth of domain knowledge. When understood and used correctly, feature importance leads to models that are more accurate, more practical and, above all, easier to interpret.