Predict default payments using decision tree in R

Predicting default payments is a common task in finance, where we aim to identify whether a customer is likely to default on their loan based on various attributes. Decision trees are a popular choice for this task due to their interpretability and simplicity. In this article, we will demonstrate how to use decision trees in R to predict default payments.

Decision Tree in R

Decision trees are non-parametric supervised learning models used for classification and regression tasks. They predict the value of a target variable by learning simple decision rules inferred from the data features.
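As a quick illustration of such decision rules, here is a minimal sketch of a classification tree fitted to R's built-in iris data; each node of the printed tree is a simple rule such as `Petal.Length < 2.45`:

```r
# Minimal sketch: a decision tree on R's built-in iris data,
# showing how rpart learns simple if/else rules from the features.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)  # each node prints a rule, e.g. Petal.Length < 2.45

# Predict the class of a new observation (values chosen to look like a setosa)
new_flower <- data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5,
                         Petal.Length = 1.4, Petal.Width = 0.2)
predict(fit, new_flower, type = "class")
```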

Now we will walk through a step-by-step implementation of predicting default payments using decision trees in the R programming language.

Step 1. Load Necessary Libraries

We need the rpart and rpart.plot packages for building and visualizing decision trees.

R
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

Step 2. Load the Dataset

We will use the UCI credit card default dataset, saved as UCI_Credit_Card.csv. Load it into R.

Dataset link: UCI_Credit_Card

R
# Load the dataset
file_path <- "Your path/UCI_Credit_Card.csv"
default_data <- read.csv(file_path)

# Preview the data
head(default_data)

Output:

  ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
1  1     20000   2         2        1  24     2     2    -1    -1    -2    -2
2  2    120000   2         2        2  26    -1     2     0     0     0     2
3  3     90000   2         2        2  34     0     0     0     0     0     0
4  4     50000   2         2        1  37     0     0     0     0     0     0
5  5     50000   1         2        1  57    -1     0    -1     0     0     0
6  6     50000   1         1        2  37     0     0     0     0     0     0
  BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
1      3913      3102       689         0         0         0        0      689
2      2682      1725      2682      3272      3455      3261        0     1000
3     29239     14027     13559     14331     14948     15549     1518     1500
4     46990     48233     49291     28314     28959     29547     2000     2019
5      8617      5670     35835     20940     19146     19131     2000    36681
6     64400     57069     57608     19394     19619     20024     2500     1815
  PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
1        0        0        0        0                          1
2     1000     1000        0     2000                          1
3     1000     1000     1000     5000                          0
4     1200     1100     1069     1000                          0
5    10000     9000      689      679                          0
6      657     1000     1000      800                          0

Step 3. Prepare the Data

Ensure that the target variable (default.payment.next.month) is a factor since it’s a classification problem.

R
# Convert the target variable to a factor
default_data$default.payment.next.month <- as.factor(default_data$default.payment.next.month)
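Since defaults are the minority class, it also helps to check the class balance before modeling. A minimal sketch, using a small toy vector `y` in place of `default_data$default.payment.next.month` so it runs standalone:

```r
# Sketch of a class-balance check; 'y' stands in for the target column
# default_data$default.payment.next.month after the as.factor() conversion.
y <- factor(c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0))

table(y)              # counts per class
prop.table(table(y))  # proportions; a skewed split means accuracy alone can mislead
```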

Step 4. Split the Data

Split the dataset into training and testing sets to evaluate the model’s performance.

R
# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_indices <- sample(1:nrow(default_data), size = 0.7 * nrow(default_data))
train_data <- default_data[train_indices, ]
test_data <- default_data[-train_indices, ]
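A plain random split like the one above does not guarantee that both classes keep their proportions in the training and test sets. As an optional refinement, a stratified split samples within each class; a base-R sketch on the built-in iris data (the same idea applies to the default column):

```r
# Stratified 70/30 split: sample indices within each class separately,
# so class proportions are preserved. Demonstrated on iris (50 rows per class).
set.seed(123)
strat_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, size = 0.7 * length(idx))))
train_strat <- iris[strat_idx, ]
test_strat  <- iris[-strat_idx, ]

table(train_strat$Species)  # 35 of each class: proportions preserved
```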

Step 5. Build the Decision Tree Model

Use the rpart function to build the decision tree model.

R
# Build the decision tree model
tree_model <- rpart(default.payment.next.month ~ ., data = train_data, method = "class")

# Print the model summary
summary(tree_model)

Output:

Call:
rpart(formula = default.payment.next.month ~ ., data = train_data,
    method = "class")
  n= 21000

         CP nsplit rel error    xerror       xstd
1 0.1893567      0 1.0000000 1.0000000 0.01288805
2 0.0100000      1 0.8106433 0.8106433 0.01191465

Variable importance
PAY_0 PAY_4 PAY_5 PAY_3 PAY_6 PAY_2
   86     4     3     2     2     2

Node number 1: 21000 observations,    complexity param=0.1893567
  predicted class=0  expected loss=0.2228095  P(node) =1
    class counts: 16321  4679
   probabilities: 0.777 0.223
  left son=2 (18790 obs) right son=3 (2210 obs)
  Primary splits:
      PAY_0 < 1.5 to the left,  improve=1126.9940, (0 missing)
      PAY_2 < 1.5 to the left,  improve= 854.6981, (0 missing)
      PAY_3 < 1.5 to the left,  improve= 638.6677, (0 missing)
      PAY_4 < 1.5 to the left,  improve= 565.9855, (0 missing)
      PAY_5 < 1   to the left,  improve= 518.8927, (0 missing)
  Surrogate splits:
      PAY_4 < 2.5 to the left,  agree=0.899, adj=0.042, (0 split)
      PAY_5 < 2.5 to the left,  agree=0.899, adj=0.040, (0 split)
      PAY_3 < 3.5 to the left,  agree=0.898, adj=0.028, (0 split)
      PAY_6 < 2.5 to the left,  agree=0.898, adj=0.026, (0 split)
      PAY_2 < 2.5 to the left,  agree=0.897, adj=0.023, (0 split)

Node number 2: 18790 observations
  predicted class=0  expected loss=0.1666312  P(node) =0.8947619
    class counts: 15659  3131
   probabilities: 0.833 0.167

Node number 3: 2210 observations
  predicted class=1  expected loss=0.2995475  P(node) =0.1052381
    class counts:   662  1548
   probabilities: 0.300 0.700
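The CP table in this summary is also what drives pruning: rpart reports the cross-validated error (xerror) at each complexity parameter, and prune() cuts the tree back at a chosen cp. A minimal sketch, shown on the kyphosis data bundled with rpart so it runs standalone:

```r
# Sketch of cost-complexity pruning using rpart's bundled kyphosis data.
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)  # cross-validated error (xerror) at each complexity parameter

# Prune back to the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Pruning trades a little training fit for a simpler tree that generalizes better, which matters on noisy financial data.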

Step 6. Visualize the Decision Tree

Visualize the decision tree using the rpart.plot package.

R
# Visualize the decision tree
rpart.plot(tree_model, type = 2, extra = 104, fallen.leaves = TRUE, 
           main = "Decision Tree for Default Prediction")

Output:

(Plot: decision tree for default prediction)

Step 7. Make Predictions and Evaluate the Model

Use the trained model to make predictions on the test set. Create a confusion matrix and calculate accuracy to evaluate the model.

R
# Make predictions on the test set
predictions <- predict(tree_model, newdata = test_data, type = "class")

# Create a confusion matrix
confusion_matrix <- table(test_data$default.payment.next.month, predictions)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))

Output:

   predictions
       0    1
  0 6752  291
  1 1328  629

[1] "Accuracy: 0.820111111111111"
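With roughly 22% of customers defaulting (see the class counts in the model summary), 82% accuracy is only modestly better than always predicting "no default", so it is worth recomputing precision and recall for the default class from the confusion matrix above:

```r
# Precision and recall for the default class (label 1), using the
# confusion matrix counts printed above (rows = actual, cols = predicted).
cm <- matrix(c(6752, 1328, 291, 629), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

precision <- cm["1", "1"] / sum(cm[, "1"])  # of predicted defaults, how many were real
recall    <- cm["1", "1"] / sum(cm["1", ])  # of real defaults, how many were caught
round(c(precision = precision, recall = recall), 3)
# precision ≈ 0.684, recall ≈ 0.321
```

The low recall shows the single-split tree misses most actual defaulters, which is typical for imbalanced data and a reason to consider class weights or a lower classification threshold.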

Conclusion

In conclusion, decision trees in R provide a powerful and interpretable framework for predicting default payments, offering insights into the factors influencing credit risk assessment. Their ability to handle both numerical and categorical data makes them versatile tools in financial analytics, supporting informed decision-making in risk management and lending practices.



