How to Use Repeated Random Training/Test Splits Inside the train() Function in R

When building machine learning models, you need reliable estimates of model performance. One way to get them is to use repeated random training/test splits, also known as repeated holdout or Monte Carlo cross-validation. The caret package in R supports this through its train() function, where the scheme is called leave-group-out cross-validation (method = "LGOCV").

This article will guide you through the steps to use repeated random training/test splits inside the train() function, with a practical example.

Introduction to train()

The train() function from the caret package is a powerful tool for training and tuning machine learning models. It supports various resampling methods, including repeated holdout validation, k-fold cross-validation, and bootstrapping. This flexibility allows you to select the most appropriate resampling technique for your specific needs.
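
As a rough orientation, the resampling schemes mentioned above map onto trainControl() settings along these lines. The object names are hypothetical, and the values chosen for number, repeats, and p are illustrative assumptions rather than recommendations.

```r
library(caret)

# Each object only describes a resampling scheme; nothing is trained here.
ctrl_cv       <- trainControl(method = "cv", number = 10)                       # k-fold CV
ctrl_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  # repeated k-fold CV
ctrl_boot     <- trainControl(method = "boot", number = 25)                     # bootstrap
ctrl_holdout  <- trainControl(method = "LGOCV", p = 0.75, number = 25)          # repeated random splits

# For "LGOCV", p is the training proportion and number is the count of splits
ctrl_holdout$method
ctrl_holdout$p
ctrl_holdout$number
```

Any of these objects can be passed to train() through its trControl argument.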

Repeated Random Training and Test Splits

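Repeated random training/test splits, also known as repeated holdout validation, involve splitting the dataset into training and test sets multiple times, training the model on each training set, and evaluating it on the corresponding test set. Averaging the results across splits gives a more stable estimate of model performance than a single train/test split, whose result depends heavily on which rows happen to land in the test set. Unlike k-fold cross-validation, the test sets are drawn independently, so an observation may appear in several test sets or in none.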
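
The procedure can be sketched by hand in base R, without caret. This is a minimal illustration on a made-up dataset; the variable names, the 75/25 split, and the 25 repetitions are all assumptions chosen for the example.

```r
# Manual repeated holdout in base R (no packages required)
set.seed(42)

n <- 200
x <- rnorm(n)
toy <- data.frame(x = x, y = 3 + 2 * x + rnorm(n))  # hypothetical data

n_splits <- 25      # how many random train/test splits to draw
train_frac <- 0.75  # share of rows used for training in each split

rmse_per_split <- numeric(n_splits)
for (i in seq_len(n_splits)) {
  # Draw a fresh random training set; the remaining rows form the test set
  train_idx <- sample(n, size = floor(train_frac * n))
  fit_i <- lm(y ~ x, data = toy[train_idx, ])
  pred_i <- predict(fit_i, newdata = toy[-train_idx, ])
  rmse_per_split[i] <- sqrt(mean((toy$y[-train_idx] - pred_i)^2))
}

# Average performance and its spread across the splits
mean(rmse_per_split)
sd(rmse_per_split)
```

The standard deviation across splits shows how much a single holdout estimate could have varied by chance.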

Implementation of Repeated Random Training and Test Splits in R

Consider a scenario where you have a dataset containing information about houses, including features like the number of bedrooms, bathrooms, and square footage, and you want to predict the house price. We'll use the train() function to evaluate a linear regression model under resampling.

Step 1: Load Necessary Packages

First, load the caret package. It is the only package this example needs.

# Install caret only if it is missing, then load it
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
library(caret)

Step 2: Generate Example Dataset

Create a synthetic dataset with features such as the number of bedrooms, bathrooms, and square footage, as well as the corresponding house prices.

# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 100

# Generate synthetic data
bedrooms <- sample(1:5, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)
sq_footage <- rnorm(n, mean = 2000, sd = 500)
price <- 100000 + 50000 * bedrooms + 30000 * bathrooms + 100 * sq_footage +
  rnorm(n, mean = 0, sd = 50000)

# Create dataset
house_data <- data.frame(Bedrooms = bedrooms, Bathrooms = bathrooms, 
                         SqFootage = sq_footage, Price = price)
head(house_data)

Output:

  Bedrooms Bathrooms SqFootage    Price
1        3         1  2281.495 562189.5
2        3         3  1813.781 552915.8
3        2         1  2488.487 473166.7
4        2         3  1812.710 394625.9
5        3         2  2526.356 536579.7
6        5         2  1475.411 533047.6

Step 3: Define the Model and Train Control

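Specify the train control parameters, including the resampling method and the number of repeats. The model itself (linear regression) is chosen later, in the call to train(). Note that method = "repeatedcv" requests repeated k-fold cross-validation, which is the scheme that produced the output shown in Steps 4 and 5. To get repeated random training/test splits instead, caret uses method = "LGOCV", where p sets the training proportion and number sets how many splits are drawn.

# Define the resampling scheme
# (despite its name, `model` holds control settings, not a fitted model)
model <- trainControl(
  method = "repeatedcv",  # Repeated k-fold cross-validation
  number = 10,            # Number of folds
  repeats = 5,            # Number of repeats
  verboseIter = TRUE      # Print training log
)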
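
To see what repeated random splits look like, one option is caret's createDataPartition(), which draws random training sets of a given proportion. The sketch below uses a made-up outcome vector; the names y, train_sets, and test_sets are hypothetical, and p = 0.75 and times = 25 are assumed values.

```r
library(caret)

set.seed(123)
y <- rnorm(100, mean = 400000, sd = 80000)  # hypothetical outcome values

# 25 independent random training sets, each holding roughly 75% of the rows
train_sets <- createDataPartition(y, times = 25, p = 0.75, list = TRUE)

# Rows outside a training set form that split's test set
test_sets <- lapply(train_sets, function(idx) setdiff(seq_along(y), idx))

# Training and test sizes of the first few splits
head(lengths(train_sets))
head(lengths(test_sets))

# How often each row was held out across the 25 splits; unlike k-fold CV,
# the counts differ from row to row
table(tabulate(unlist(test_sets), nbins = length(y)))
```

In k-fold cross-validation every row is held out exactly once per repeat, whereas here the number of times a row is tested is left to chance.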

Step 4: Train the Model

Use the train() function to fit the model with the resampling scheme defined in Step 3.

# Train the model using the train() function
set.seed(123)
fit <- train(
  Price ~ .,                # Model formula
  data = house_data,        # Dataset
  method = "lm",            # Linear regression model
  trControl = model,        # Train control parameters
  metric = "RMSE"           # Performance metric
)

Output:

+ Fold01.Rep1: intercept=TRUE 
- Fold01.Rep1: intercept=TRUE 
+ Fold02.Rep1: intercept=TRUE 
- Fold02.Rep1: intercept=TRUE 
+ Fold03.Rep1: intercept=TRUE 
- Fold03.Rep1: intercept=TRUE 
+ Fold04.Rep1: intercept=TRUE 
- Fold04.Rep1: intercept=TRUE 
+ Fold05.Rep1: intercept=TRUE
...
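
For actual repeated random training/test splits, the same model can be fitted with method = "LGOCV". The following is a sketch under stated assumptions: it recreates the Step 2 data, the names holdout_ctrl and fit_lgocv are hypothetical, and 25 splits with 75% of rows for training are illustrative choices. Its printed metrics will differ from the repeated cross-validation output shown in this article.

```r
library(caret)

# Recreate the synthetic house data from Step 2
set.seed(123)
n <- 100
bedrooms <- sample(1:5, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)
sq_footage <- rnorm(n, mean = 2000, sd = 500)
price <- 100000 + 50000 * bedrooms + 30000 * bathrooms + 100 * sq_footage +
  rnorm(n, mean = 0, sd = 50000)
house_data <- data.frame(Bedrooms = bedrooms, Bathrooms = bathrooms,
                         SqFootage = sq_footage, Price = price)

# Repeated random splits: 25 splits, 75% of rows for training in each
# (both values are assumptions chosen for illustration)
holdout_ctrl <- trainControl(method = "LGOCV", p = 0.75, number = 25)

set.seed(123)
fit_lgocv <- train(
  Price ~ .,                 # Model formula
  data = house_data,         # Dataset
  method = "lm",             # Linear regression model
  trControl = holdout_ctrl,  # Repeated random train/test splits
  metric = "RMSE"            # Performance metric
)

print(fit_lgocv)
```

Raising number makes the averaged estimate more stable at the cost of more model fits, while p trades training-set size against test-set size.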

Step 5: Evaluate the Model

Review the model performance metrics, including the RMSE and MAE, to evaluate how well the model performs across different train/test splits.

# Print model summary
print(fit)

# Get model performance metrics
results <- fit$results
print(results)

Output:

Linear Regression 

100 samples
  3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 88, 89, 90, 92, 90, 92, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  50347.17  0.7764902  40486.79

Tuning parameter 'intercept' was held constant at a value of TRUE

  intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
1      TRUE 50347.17 0.7764902 40486.79 10739.16  0.1106781 10816.58

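The output of print(fit) summarizes the model and its resampled performance. The fit$results data frame holds one row per tuning-parameter combination (here a single row), with each metric averaged over all 50 resamples (RMSE, Rsquared, MAE) together with its standard deviation across resamples (RMSESD, RsquaredSD, MAESD). The per-resample values themselves are stored in fit$resample.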
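
The relationship between the per-resample values and the aggregated results can be checked directly. This sketch recreates the Step 2 data and refits the model with the Step 3 settings (minus the training log); the names cv_ctrl and fit_cv are hypothetical.

```r
library(caret)

# Recreate the synthetic house data from Step 2
set.seed(123)
n <- 100
bedrooms <- sample(1:5, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)
sq_footage <- rnorm(n, mean = 2000, sd = 500)
price <- 100000 + 50000 * bedrooms + 30000 * bathrooms + 100 * sq_footage +
  rnorm(n, mean = 0, sd = 50000)
house_data <- data.frame(Bedrooms = bedrooms, Bathrooms = bathrooms,
                         SqFootage = sq_footage, Price = price)

# Same resampling scheme as Step 3, without the training log
cv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(123)
fit_cv <- train(Price ~ ., data = house_data, method = "lm",
                trControl = cv_ctrl, metric = "RMSE")

# One row per resample (10 folds x 5 repeats)
head(fit_cv$resample)
nrow(fit_cv$resample)

# The aggregated values in fit_cv$results are the mean and SD of those rows
c(mean = mean(fit_cv$resample$RMSE), sd = sd(fit_cv$resample$RMSE))
fit_cv$results[, c("RMSE", "RMSESD")]
```

Plotting or summarizing fit_cv$resample is a quick way to spot individual splits on which the model performed unusually badly.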

Conclusion

Repeated random training/test splits give a more stable estimate of model performance than a single train/test split, because the result no longer depends on one particular partition of the data. In caret, train() handles the resampling for you: set the scheme in trainControl() (method = "LGOCV" for repeated random splits, or "repeatedcv" for repeated k-fold cross-validation as in the example above), then read the averaged metrics and their standard deviations from the fitted object.

Referred: https://www.geeksforgeeks.org
