Feature selection using Decision Tree - Coding

Feature selection using decision trees involves identifying the most important features in a dataset based on their contribution to the decision tree’s performance. The article aims to explore feature selection using decision trees and how decision trees evaluate feature importance.

What is feature selection?

Feature selection involves choosing a subset of important features for building a model. It aims to enhance model performance by reducing overfitting, improving interpretability, and cutting computational complexity.

Need for Feature Selection

Datasets can have hundreds, thousands, or sometimes millions of features in the case of image- or text-based models. If we build an ML model using all the given features, it will lead to model overfitting and ultimately a low-performance rate.

Feature selection helps in:

Increasing model performance.
Increasing model interpretability.
Reducing model complexity.
Enhancing data visualization.

What are decision trees ?

Decision trees are a popular machine learning algorithm used for both classification and regression tasks. They model decisions based on the features of the data and their outcomes.

How do decision trees play a role in feature selection?

Decision trees select the ‘best’ feature for splitting at each node based on information gain.
Information gain measures the reduction in entropy (disorder) in a set of data points.
Features with higher information gain are considered more important for splitting, thus aiding in feature selection.
By recursively selecting features for splitting, decision trees inherently prioritize the most relevant features for the model.

Implementation: Feature Selection using Decision Tree

In this implementation, we are going to discuss a practical approach to feature selection using decision trees, allowing for more efficient and interpretable models by focusing on the most relevant features. You can download the dataset from here.

Step 1: Importing Libraries

We need to import the below libraries for implementing decision trees.

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

Step 2: Dataset Description

Getting data descriptions by df.info().

Python

df = pd.read_csv('apple_quality.csv')
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4001 entries, 0 to 4000
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4001 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB

Step 3: Data Preproccessing

Python3

df = df.dropna()
df.isnull().sum()

Output:

A_id           0
Size           0
Weight         0
Sweetness      0
Crunchiness    0
Juiciness      0
Ripeness       0
Acidity        0
Quality        0
dtype: int64

Step 4: Splitting the data

Splitting the dataset into train and test sets.

Python

X = df.drop(['Quality'], axis = 1)
y = df['Quality']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 8, test_size = 0.3)

Step 5: Scaling the data

Python

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Step 6: Training the Decision Tree Classifier

The DecisionTreeClassifier is trained with a maximum depth of 16 and a random state of 8, which helps control the randomness for reproducibility.

Python3

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=16, random_state=8)
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(X_test_scaled)

Step 7: Feature selection

The feature importances are calculated using the trained classifier, indicating the relative importance of each feature in the model’s decision-making process
A threshold of 0.1 is set to select features with importance greater than this value, potentially reducing the number of features considered for the final model.
The selected_features variable contains the names of the features that meet the importance threshold, which can be used for further analysis or model refinement.

And, then we use only the selected columns.

Python3

# Get feature importances
importances = clf.feature_importances_
 
# Select features with importance greater than a threshold
threshold = 0.1  # Adjust as needed
selected_features = X.columns[importances > threshold]
 
# Use only the selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

Step 8: Train a model using the selected features

Python3

# Train a new model using the selected features
clf_selected = DecisionTreeClassifier(max_depth=16, random_state=8)
clf_selected.fit(X_train_selected, y_train)

Step 9: Comparing the accuracies

Python3

# Make predictions on the test set using the model trained with all features
y_pred_all_features = clf.predict(X_test_scaled)
 
# Calculate the accuracy of the model with all features
accuracy_all_features = accuracy_score(y_test, y_pred_all_features)
print(f"Accuracy with all features: {accuracy_all_features}")
 
# Make predictions on the test set using the model trained with selected features
y_pred_selected_features = clf_selected.predict(X_test_selected)
 
# Calculate the accuracy of the model with selected features
accuracy_selected_features = accuracy_score(y_test, y_pred_selected_features)
print(f"Accuracy with selected features: {accuracy_selected_features}")

Output:

Accuracy with all features: 0.7983333333333333 
Accuracy with selected features: 0.8241666666666667

These accuracy scores provide insights into the performance of the models. The accuracy score represents the proportion of correctly classified instances out of the total instances in the test set.

Comparing the two accuracies:

The model trained with all features achieved an accuracy of approximately 79.83%.
However, after feature selection, the model trained with selected features achieved a higher accuracy of approximately 82.42%.

Conclusion

Feature selection using decision trees offers a powerful and intuitive approach to enhancing model performance and interpretability. Following the outlined steps, we can easily select features using decision trees to build more robust and efficient models for various applications.

Reffered: https://www.geeksforgeeks.org

AI ML DS

Related
Phishing Classification using Ensemble model
Weighted Logistic Regression for Imbalanced Dataset
Custom gradients in TensorFlow
Heart Disease Prediction using Support Vector Machine
Ensemble Learning with SVM and Decision Trees

Type:	Geek
Category:	Coding
Sub Category:	Tutorial
Uploaded by:	Admin
Views:	15