Predict default payments using decision tree in R

Predicting default payments is a common task in finance, where we aim to identify whether a customer is likely to default on their loan based on various attributes. Decision trees are a popular choice for this task due to their interpretability and simplicity. In this article, we will demonstrate how to use decision trees in R to predict default payments.

Decision Tree in R

Decision trees are non-parametric supervised learning models used for classification and regression tasks. They predict the value of a target variable by learning simple decision rules inferred from the data features.
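As a quick illustration of such decision rules, here is a minimal sketch of a classification tree fitted to R's built-in iris data; each node of the printed tree is a simple rule such as `Petal.Length < 2.45`:

```r
# Minimal sketch: a decision tree on R's built-in iris data,
# showing how rpart learns simple if/else rules from the features.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)  # each node prints a rule, e.g. Petal.Length < 2.45

# Predict the class of a new observation (values chosen to look like a setosa)
new_flower <- data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5,
                         Petal.Length = 1.4, Petal.Width = 0.2)
predict(fit, new_flower, type = "class")
```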

Now we will walk through a step-by-step implementation of predicting default payments using decision trees in the R programming language.

Step 1. Load Necessary Libraries

We need the rpart and rpart.plot packages for building and visualizing decision trees.

R
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

Step 2. Load the Dataset

We will use the UCI credit card default dataset, saved as UCI_Credit_Card.csv. Load it into R.

Dataset link: UCI_Credit_Card

R
# Load the dataset
file_path <- "Your path/UCI_Credit_Card.csv"
default_data <- read.csv(file_path)

# Preview the data
head(default_data)

Output:

  ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
1  1     20000   2         2        1  24     2     2    -1    -1    -2    -2
2  2    120000   2         2        2  26    -1     2     0     0     0     2
3  3     90000   2         2        2  34     0     0     0     0     0     0
4  4     50000   2         2        1  37     0     0     0     0     0     0
5  5     50000   1         2        1  57    -1     0    -1     0     0     0
6  6     50000   1         1        2  37     0     0     0     0     0     0
  BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
1      3913      3102       689         0         0         0        0      689
2      2682      1725      2682      3272      3455      3261        0     1000
3     29239     14027     13559     14331     14948     15549     1518     1500
4     46990     48233     49291     28314     28959     29547     2000     2019
5      8617      5670     35835     20940     19146     19131     2000    36681
6     64400     57069     57608     19394     19619     20024     2500     1815
  PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
1        0        0        0        0                          1
2     1000     1000        0     2000                          1
3     1000     1000     1000     5000                          0
4     1200     1100     1069     1000                          0
5    10000     9000      689      679                          0
6      657     1000     1000      800                          0

Step 3. Prepare the Data

Ensure that the target variable (default.payment.next.month) is a factor since it’s a classification problem.

R
# Convert the target variable to a factor
default_data$default.payment.next.month <- as.factor(default_data$default.payment.next.month)
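Since defaults are the minority class, it also helps to check the class balance before modeling. A minimal sketch, using a small toy vector `y` in place of `default_data$default.payment.next.month` so it runs standalone:

```r
# Sketch of a class-balance check; 'y' stands in for the target column
# default_data$default.payment.next.month after the as.factor() conversion.
y <- factor(c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0))

table(y)              # counts per class
prop.table(table(y))  # proportions; a skewed split means accuracy alone can mislead
```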

Step 4. Split the Data

Split the dataset into training and testing sets to evaluate the model’s performance.

R
# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_indices <- sample(1:nrow(default_data), size = 0.7 * nrow(default_data))
train_data <- default_data[train_indices, ]
test_data <- default_data[-train_indices, ]
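A plain random split like the one above does not guarantee that both classes keep their proportions in the training and test sets. As an optional refinement, a stratified split samples within each class; a base-R sketch on the built-in iris data (the same idea applies to the default column):

```r
# Stratified 70/30 split: sample indices within each class separately,
# so class proportions are preserved. Demonstrated on iris (50 rows per class).
set.seed(123)
strat_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, size = 0.7 * length(idx))))
train_strat <- iris[strat_idx, ]
test_strat  <- iris[-strat_idx, ]

table(train_strat$Species)  # 35 of each class: proportions preserved
```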

Step 5. Build the Decision Tree Model

Use the rpart function to build the decision tree model.

R
# Build the decision tree model
tree_model <- rpart(default.payment.next.month ~ ., data = train_data, method = "class")

# Print the model summary
summary(tree_model)

Output:

Call:
rpart(formula = default.payment.next.month ~ ., data = train_data,
    method = "class")
  n= 21000

         CP nsplit rel error    xerror       xstd
1 0.1893567      0 1.0000000 1.0000000 0.01288805
2 0.0100000      1 0.8106433 0.8106433 0.01191465

Variable importance
PAY_0 PAY_4 PAY_5 PAY_3 PAY_6 PAY_2
   86     4     3     2     2     2

Node number 1: 21000 observations,    complexity param=0.1893567
  predicted class=0  expected loss=0.2228095  P(node) =1
    class counts: 16321  4679
   probabilities: 0.777 0.223
  left son=2 (18790 obs) right son=3 (2210 obs)
  Primary splits:
      PAY_0 < 1.5 to the left,  improve=1126.9940, (0 missing)
      PAY_2 < 1.5 to the left,  improve= 854.6981, (0 missing)
      PAY_3 < 1.5 to the left,  improve= 638.6677, (0 missing)
      PAY_4 < 1.5 to the left,  improve= 565.9855, (0 missing)
      PAY_5 < 1   to the left,  improve= 518.8927, (0 missing)
  Surrogate splits:
      PAY_4 < 2.5 to the left,  agree=0.899, adj=0.042, (0 split)
      PAY_5 < 2.5 to the left,  agree=0.899, adj=0.040, (0 split)
      PAY_3 < 3.5 to the left,  agree=0.898, adj=0.028, (0 split)
      PAY_6 < 2.5 to the left,  agree=0.898, adj=0.026, (0 split)
      PAY_2 < 2.5 to the left,  agree=0.897, adj=0.023, (0 split)

Node number 2: 18790 observations
  predicted class=0  expected loss=0.1666312  P(node) =0.8947619
    class counts: 15659  3131
   probabilities: 0.833 0.167

Node number 3: 2210 observations
  predicted class=1  expected loss=0.2995475  P(node) =0.1052381
    class counts:   662  1548
   probabilities: 0.300 0.700
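The CP table in this summary is also what drives pruning: rpart reports the cross-validated error (xerror) at each complexity parameter, and prune() cuts the tree back at a chosen cp. A minimal sketch, shown on the kyphosis data bundled with rpart so it runs standalone:

```r
# Sketch of cost-complexity pruning using rpart's bundled kyphosis data.
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
printcp(fit)  # cross-validated error (xerror) at each complexity parameter

# Prune back to the cp with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

Pruning trades a little training fit for a simpler tree that generalizes better, which matters on noisy financial data.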

Step 6. Visualize the Decision Tree

Visualize the decision tree using the rpart.plot package.

R
# Visualize the decision tree
rpart.plot(tree_model, type = 2, extra = 104, fallen.leaves = TRUE, 
           main = "Decision Tree for Default Prediction")

Output:

(Plot: decision tree for default prediction)

Step 7. Make Predictions and Evaluate the Model

Use the trained model to make predictions on the test set. Create a confusion matrix and calculate accuracy to evaluate the model.

R
# Make predictions on the test set
predictions <- predict(tree_model, newdata = test_data, type = "class")

# Create a confusion matrix
confusion_matrix <- table(test_data$default.payment.next.month, predictions)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", accuracy))

Output:

   predictions
       0    1
  0 6752  291
  1 1328  629

[1] "Accuracy: 0.820111111111111"
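With roughly 22% of customers defaulting (see the class counts in the model summary), 82% accuracy is only modestly better than always predicting "no default", so it is worth recomputing precision and recall for the default class from the confusion matrix above:

```r
# Precision and recall for the default class (label 1), using the
# confusion matrix counts printed above (rows = actual, cols = predicted).
cm <- matrix(c(6752, 1328, 291, 629), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

precision <- cm["1", "1"] / sum(cm[, "1"])  # of predicted defaults, how many were real
recall    <- cm["1", "1"] / sum(cm["1", ])  # of real defaults, how many were caught
round(c(precision = precision, recall = recall), 3)
# precision ≈ 0.684, recall ≈ 0.321
```

The low recall shows the single-split tree misses most actual defaulters, which is typical for imbalanced data and a reason to consider class weights or a lower classification threshold.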

Conclusion

In conclusion, decision trees in R provide a powerful and interpretable framework for predicting default payments, offering insights into the factors influencing credit risk assessment. Their ability to handle both numerical and categorical data makes them versatile tools in financial analytics, supporting informed decision-making in risk management and lending practices.



