What is the Difference Between Rel Error and X Error in an rpart Decision Tree?

Decision trees are a fundamental tool in the machine learning arsenal, favored for their interpretability and efficacy in handling both numerical and categorical data. The rpart package in R is a popular choice for constructing decision trees, offering a range of functionalities, including performance metrics that assess model quality. Among these, “Relative Error” (Rel Error) and “Cross-Validation Error” (X Error) are pivotal yet often lead to confusion. This article aims to demystify these terms, providing a detailed comparison to enhance understanding.

Understanding rpart Decision Trees

rpart, which stands for Recursive Partitioning, builds decision trees by repeatedly splitting the data into subsets, choosing at each node the variable and split point that produce the most homogeneous child nodes. Splitting continues until a stopping criterion is reached or further splits no longer improve the fit. Deep, complex trees are prone to overfitting, where the model performs well on the training data but poorly on unseen data. To mitigate this, rpart supports strategies such as cost-complexity pruning, which simplifies the model by removing branches that add little predictive power.
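
As a concrete illustration, the growth and pruning behavior of an rpart tree is governed by the arguments of rpart.control(). The following is a minimal sketch; the values shown are simply the package defaults rather than tuned recommendations.

R
# Minimal sketch of the controls that govern tree growth and pruning in rpart.
# The values shown here are the package defaults.
library(rpart)
data(kyphosis)

ctrl <- rpart.control(
  minsplit = 20,   # smallest node that rpart will attempt to split
  cp       = 0.01, # a split must improve the relative fit by at least this factor
  maxdepth = 30,   # maximum depth of any node in the final tree
  xval     = 10    # number of cross-validation folds used to estimate X Error
)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = ctrl)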

What is Rel Error?

Rel Error, in the context of decision trees created by rpart, is the training error expressed relative to the root node. For each row of the complexity parameter (CP) table, it is the resubstitution error of the subtree with that number of splits divided by the error of the root node, so the root always has a rel error of 1 and every value is a proportion of it. This metric is useful for evaluating how much each successive split improves the fit on the training data, tracing the error-reduction trajectory as the tree grows from the root. Because it is computed on the training data alone, rel error can only decrease as the tree grows, so it cannot by itself reveal overfitting.
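
A minimal sketch of how rel error can be read from, and checked against, the fitted object's CP table (using the kyphosis data that ships with rpart; the fitted tree here mirrors the example later in this article):

R
# Minimal sketch: rel error = training error of the subtree / error at the root.
library(rpart)
data(kyphosis)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# The "rel error" column of the CP table; the first row (the root) is always 1.0.
fit$cptable[, "rel error"]

# Manual check for the full tree: misclassification rate of the fitted tree
# divided by the misclassification rate of always predicting the majority class.
tree_err <- mean(predict(fit, type = "class") != kyphosis$Kyphosis)
root_err <- mean(kyphosis$Kyphosis != names(which.max(table(kyphosis$Kyphosis))))
tree_err / root_err  # matches the last row of the "rel error" column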

What is X Error?

X Error, or Cross-Validation Error, measures the model’s ability to generalize to an independent data set not used during the model training process. It is estimated through k-fold cross-validation—typically 10-fold—which involves dividing the data into k subsets. The model is trained on k-1 subsets, and the remaining subset is used as the test set. This process is repeated such that each subset is used as the test set once. The X Error is the average of the errors from each of these test iterations. This metric is crucial for avoiding overfitting by providing a realistic estimate of the model’s performance on new, unseen data.
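
In rpart the number of folds is set through the xval argument of rpart.control() (10 by default). Like rel error, xerror is reported relative to the error at the root node, and xstd gives the standard error of the estimate; because the fold assignment is random, the values change slightly between runs unless a seed is set. A minimal sketch:

R
# Minimal sketch: xerror is estimated by k-fold cross-validation inside rpart().
library(rpart)
data(kyphosis)

set.seed(42)  # fold assignment is random, so fix it for reproducibility
fit_cv <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                method = "class", control = rpart.control(xval = 10))

# printcp() shows CP, nsplit, rel error, xerror and xstd for each subtree size.
printcp(fit_cv)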

Comparison and Differences

While Rel Error and X Error both provide valuable insights into a decision tree’s performance, they serve different purposes and are calculated through distinct methods. Below is a detailed comparison:

  • Purpose: Rel Error assesses the incremental accuracy improvements within the tree, useful for understanding how each node contributes to reducing the overall error. X Error evaluates the model’s predictive performance on data not used during training, essential for assessing its practical applicability.
  • Calculation: Rel Error is straightforward, calculated using the error rates observed within the training data. In contrast, X Error involves a more rigorous and computationally intensive process using cross-validation to simulate model performance on unseen data.
  • Usage: Rel Error is most useful during the tree-building phase, since it tracks how much the training error falls with each additional split. X Error is pivotal in the pruning phase, helping to select the tree complexity (the cp value) that generalizes best beyond the training data; a pruning sketch follows this list.
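
A common pruning recipe, sketched below under the assumption of the kyphosis classification tree used throughout this article, is to look up the cp value whose row of the CP table has the smallest xerror and prune the tree back to that complexity.

R
# Minimal sketch of cost-complexity pruning guided by the cross-validation error.
library(rpart)
data(kyphosis)

set.seed(42)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# cp of the subtree with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Prune the fitted tree back to that complexity and inspect the result
pruned <- prune(fit, cp = best_cp)
printcp(pruned)

A stricter variant, the one-standard-error rule, instead selects the simplest tree whose xerror lies within one xstd of that minimum.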

Practical Application and Example

To demonstrate the practical application of these metrics, consider the following R code snippet that employs the rpart package to build a decision tree:

R
# Load the necessary library
library(rpart)

# Example data
data(kyphosis)

# Build the decision tree
fit <- rpart(Kyphosis ~ Age + Number + Start, data=kyphosis, method="class")

# Summarize the fit to view Rel Error and X Error
summary(fit)

Output:

Call:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis, 
    method = "class")
  n= 81 

          CP nsplit rel error xerror      xstd
1 0.17647059      0 1.0000000      1 0.2155872
2 0.01960784      1 0.8235294      1 0.2155872
3 0.01000000      4 0.7647059      1 0.2155872

Variable importance
 Start    Age Number 
    64     24     12 

Node number 1: 81 observations,    complexity param=0.1764706
  predicted class=absent   expected loss=0.2098765  P(node) =1
    class counts:    64    17
   probabilities: 0.790 0.210 
  left son=2 (62 obs) right son=3 (19 obs)
  Primary splits:
      Start  < 8.5  to the right, improve=6.762330, (0 missing)
      Number < 5.5  to the left,  improve=2.866795, (0 missing)
      Age    < 39.5 to the left,  improve=2.250212, (0 missing)
  Surrogate splits:
      Number < 6.5  to the left,  agree=0.802, adj=0.158, (0 split)

Node number 2: 62 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.09677419  P(node) =0.7654321
    class counts:    56     6
   probabilities: 0.903 0.097 
  left son=4 (29 obs) right son=5 (33 obs)
  Primary splits:
      Start  < 14.5 to the right, improve=1.0205280, (0 missing)
      Age    < 55   to the left,  improve=0.6848635, (0 missing)
      Number < 4.5  to the left,  improve=0.2975332, (0 missing)
  Surrogate splits:
      Number < 3.5  to the left,  agree=0.645, adj=0.241, (0 split)
      Age    < 16   to the left,  agree=0.597, adj=0.138, (0 split)

Node number 3: 19 observations
  predicted class=present  expected loss=0.4210526  P(node) =0.2345679
    class counts:     8    11
   probabilities: 0.421 0.579 

Node number 4: 29 observations
  predicted class=absent   expected loss=0  P(node) =0.3580247
    class counts:    29     0
   probabilities: 1.000 0.000 

Node number 5: 33 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.1818182  P(node) =0.4074074
    class counts:    27     6
   probabilities: 0.818 0.182 
  left son=10 (12 obs) right son=11 (21 obs)
  Primary splits:
      Age    < 55   to the left,  improve=1.2467530, (0 missing)
      Start  < 12.5 to the right, improve=0.2887701, (0 missing)
      Number < 3.5  to the right, improve=0.1753247, (0 missing)
  Surrogate splits:
      Start  < 9.5  to the left,  agree=0.758, adj=0.333, (0 split)
      Number < 5.5  to the right, agree=0.697, adj=0.167, (0 split)

Node number 10: 12 observations
  predicted class=absent   expected loss=0  P(node) =0.1481481
    class counts:    12     0
   probabilities: 1.000 0.000 

Node number 11: 21 observations,    complexity param=0.01960784
  predicted class=absent   expected loss=0.2857143  P(node) =0.2592593
    class counts:    15     6
   probabilities: 0.714 0.286 
  left son=22 (14 obs) right son=23 (7 obs)
  Primary splits:
      Age    < 111  to the right, improve=1.71428600, (0 missing)
      Start  < 12.5 to the right, improve=0.79365080, (0 missing)
      Number < 3.5  to the right, improve=0.07142857, (0 missing)

Node number 22: 14 observations
  predicted class=absent   expected loss=0.1428571  P(node) =0.1728395
    class counts:    12     2
   probabilities: 0.857 0.143 

Node number 23: 7 observations
  predicted class=present  expected loss=0.4285714  P(node) =0.08641975
    class counts:     3     4
   probabilities: 0.429 0.571 

This code builds a decision tree predicting the occurrence of kyphosis from the patient's age, the number of vertebrae involved, and the number of the topmost vertebra operated on. The summary() output begins with the CP table, whose rel error and xerror columns let you evaluate both the internal (training) fit and the cross-validated performance of the model at each tree size.
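
To inspect the same information graphically, plotcp(), which ships with rpart, plots xerror against the complexity parameter, with a reference line drawn one standard error above the minimum. The minimal sketch below assumes the fit object created in the example above is still in the workspace.

R
# Minimal sketch: visualize cross-validated error against tree complexity.
# Assumes `fit` from the example above.
library(rpart)
plotcp(fit)  # xerror for each candidate cp; dotted line marks 1 SE above the minimum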

Conclusion

Understanding the differences between Rel Error and X Error in decision trees built using rpart is crucial for constructing robust models. These metrics not only reflect the internal dynamics and effectiveness of model splits but also safeguard against overfitting by emphasizing generalization. By carefully monitoring both Rel Error and X Error, data scientists can optimize decision trees to achieve the best predictive performance while maintaining model simplicity and interpretability.



