Decision trees are a fundamental tool in the machine learning arsenal, favored for their interpretability and their ability to handle both numerical and categorical data. The rpart package in R is a popular choice for constructing decision trees, offering a range of functionality, including performance metrics that assess model quality. Among these, "Relative Error" (Rel Error) and "Cross-Validation Error" (X Error) are pivotal yet often confused. This article demystifies both terms and compares them in detail.

## Understanding rpart Decision Trees

rpart, which stands for Recursive Partitioning, builds decision trees by splitting the data into subsets on whichever variable yields the most homogeneous child nodes. The tree grows by selecting the best split at each node until a stopping criterion is reached or further splitting is no longer beneficial. Deep trees tend to overfit, performing well on training data but poorly on unseen data. To mitigate this, rpart supports strategies such as tree pruning, which simplifies the model by removing splits that contribute little predictive power.

## What is Rel Error?

In a tree built by rpart, Rel Error is the error of a node relative to the root node: the node's training error divided by the error at the root, expressed as a proportion. The root node therefore always has a Rel Error of 1, and the value can only decrease as splits are added. The metric traces the error-reduction trajectory from the root down to any given node, showing how much each split improves the fit on the training data.

## What is X Error?

X Error, or Cross-Validation Error, measures the model's ability to generalize to data not used during training. It is estimated through k-fold cross-validation, typically 10-fold: the data are divided into k subsets, the model is trained on k-1 of them, and the remaining subset is used as the test set. This is repeated until each subset has served as the test set once, and X Error is the average error across the test folds. The metric is crucial for avoiding overfitting because it provides a realistic estimate of performance on new, unseen data.

## Comparison and Differences

While Rel Error and X Error both describe a decision tree's performance, they serve different purposes and are computed by distinct methods:

- Data used: Rel Error is computed on the training data itself; X Error is estimated on held-out folds via cross-validation.
- Behavior as the tree grows: Rel Error never increases with additional splits; X Error typically falls at first and then rises once the tree starts to overfit.
- Role: Rel Error shows how effective each split is internally; X Error guides pruning and model selection by estimating generalization error.
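Both metrics appear side by side in the complexity-parameter (CP) table of a fitted tree. The sketch below is illustrative only, fitting a small classification tree on the built-in iris data, with seed and object names of our own choosing:

```r
# Illustrative sketch: fit a tree and inspect the CP table, which holds
# both metrics. Dataset, seed, and object names are our own choices.
library(rpart)

set.seed(1)  # xerror depends on the random assignment of cross-validation folds
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(xval = 10))  # 10-fold CV (the default)

printcp(fit)  # columns: CP, nsplit, rel error, xerror, xstd

# "rel error" starts at 1 at the root and can only fall as splits are added;
# "xerror" is the cross-validated estimate and may rise once the tree overfits.
fit$cptable   # the same table as a numeric matrix
```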
## Practical Application and Example

To demonstrate the practical application of these metrics, consider the following R snippet, which uses the rpart package to build a decision tree on the built-in kyphosis data and print its summary:
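```r
# Build the tree; the object name `fit` is illustrative. The Call: line in
# the output below confirms the formula and method.
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# summary() prints the CP table (rel error, xerror, xstd), variable
# importance, and per-node details.
summary(fit)
```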
Output:

```
Call:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
n= 81
CP nsplit rel error xerror xstd
1 0.17647059 0 1.0000000 1 0.2155872
2 0.01960784 1 0.8235294 1 0.2155872
3 0.01000000 4 0.7647059 1 0.2155872
Variable importance
Start Age Number
64 24 12
Node number 1: 81 observations, complexity param=0.1764706
predicted class=absent expected loss=0.2098765 P(node) =1
class counts: 64 17
probabilities: 0.790 0.210
left son=2 (62 obs) right son=3 (19 obs)
Primary splits:
Start < 8.5 to the right, improve=6.762330, (0 missing)
Number < 5.5 to the left, improve=2.866795, (0 missing)
Age < 39.5 to the left, improve=2.250212, (0 missing)
Surrogate splits:
Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split)
Node number 2: 62 observations, complexity param=0.01960784
predicted class=absent expected loss=0.09677419 P(node) =0.7654321
class counts: 56 6
probabilities: 0.903 0.097
left son=4 (29 obs) right son=5 (33 obs)
Primary splits:
Start < 14.5 to the right, improve=1.0205280, (0 missing)
Age < 55 to the left, improve=0.6848635, (0 missing)
Number < 4.5 to the left, improve=0.2975332, (0 missing)
Surrogate splits:
Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split)
Age < 16 to the left, agree=0.597, adj=0.138, (0 split)
Node number 3: 19 observations
predicted class=present expected loss=0.4210526 P(node) =0.2345679
class counts: 8 11
probabilities: 0.421 0.579
Node number 4: 29 observations
predicted class=absent expected loss=0 P(node) =0.3580247
class counts: 29 0
probabilities: 1.000 0.000
Node number 5: 33 observations, complexity param=0.01960784
predicted class=absent expected loss=0.1818182 P(node) =0.4074074
class counts: 27 6
probabilities: 0.818 0.182
left son=10 (12 obs) right son=11 (21 obs)
Primary splits:
Age < 55 to the left, improve=1.2467530, (0 missing)
Start < 12.5 to the right, improve=0.2887701, (0 missing)
Number < 3.5 to the right, improve=0.1753247, (0 missing)
Surrogate splits:
Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split)
Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split)
Node number 10: 12 observations
predicted class=absent expected loss=0 P(node) =0.1481481
class counts: 12 0
probabilities: 1.000 0.000
Node number 11: 21 observations, complexity param=0.01960784
predicted class=absent expected loss=0.2857143 P(node) =0.2592593
class counts: 15 6
probabilities: 0.714 0.286
left son=22 (14 obs) right son=23 (7 obs)
Primary splits:
Age < 111 to the right, improve=1.71428600, (0 missing)
Start < 12.5 to the right, improve=0.79365080, (0 missing)
Number < 3.5 to the right, improve=0.07142857, (0 missing)
Node number 22: 14 observations
predicted class=absent expected loss=0.1428571 P(node) =0.1728395
class counts: 12 2
probabilities: 0.857 0.143
Node number 23: 7 observations
predicted class=present expected loss=0.4285714 P(node) =0.08641975
class counts: 3 4
probabilities: 0.429 0.571
```

This code builds a decision tree predicting the occurrence of kyphosis from the patient's age, the number of vertebrae involved, and the topmost vertebra operated on. The summary() function reports a range of metrics, including Rel Error and X Error, enabling an evaluation of both internal fit and expected out-of-sample performance.

## Conclusion

Understanding the difference between Rel Error and X Error in decision trees built with rpart is crucial for constructing robust models. Rel Error reflects the internal effectiveness of each split, while X Error safeguards against overfitting by emphasizing generalization. By monitoring both, and pruning once X Error stops improving (see the sketch below), data scientists can optimize decision trees for the best predictive performance while maintaining simplicity and interpretability.
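The standard workflow for acting on X Error is to prune at the cp value whose row in the CP table minimizes xerror. A minimal sketch, with object names of our own choosing:

```r
# Sketch: prune the tree back to the complexity that minimizes the
# cross-validation error (xerror) in the CP table.
library(rpart)

set.seed(7)  # xerror varies with the random fold assignment
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

pruned <- prune(fit, cp = best_cp)  # discard splits beyond that complexity
printcp(pruned)
```

A common refinement is the one-standard-error rule: rather than the strict minimum, choose the simplest tree whose xerror lies within one xstd of the minimum, trading a sliver of estimated accuracy for extra simplicity.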