How to create a Naive Bayes classifier in R for numerical and categorical variables

Naive Bayes classifiers are simple yet powerful probabilistic classifiers based on Bayes’ theorem. They are particularly useful for large datasets and have applications in various domains, including text classification, spam detection, and medical diagnosis. This article will guide you through the process of creating a Naive Bayes classifier in R that can handle both numerical and categorical variables.

Understanding Naive Bayes

Naive Bayes classifiers assume that the features (predictors) are conditionally independent given the class label. Despite this “naive” assumption, they often perform surprisingly well in practice. The key idea is to calculate the posterior probability for each class and then select the class with the highest probability.
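Under the conditional-independence assumption, the posterior for a class C given features x₁, …, xₙ can be written as (the standard Naive Bayes formulation):

```latex
P(C \mid x_1, \ldots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The classifier evaluates this expression for every class and predicts the class with the largest value.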

Below is a step-by-step guide to creating a Naive Bayes classifier for numerical and categorical variables in the R programming language.

Step 1: Install and Load Required Packages

First, ensure that you have the e1071 package installed, as it provides an implementation of the Naive Bayes classifier.

R
install.packages("e1071")
library(e1071)

Step 2: Prepare Your Data

For demonstration purposes, we’ll use a sample dataset. Here’s an example dataset that includes both numerical and categorical variables:

R
# Sample data
data <- data.frame(
  age = c(25, 30, 45, 35, 50, 23, 37, 61, 22, 42),
  income = c('high', 'high', 'medium', 'low', 'low', 'medium', 'medium', 'high', 'low', 'medium'),
  student = c('no', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes'),
  credit_rating = c('fair', 'excellent', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'fair', 'excellent', 'fair'),
  buys_computer = c('no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes')
)
head(data)
# Convert categorical variables to factors
data$income <- as.factor(data$income)
data$student <- as.factor(data$student)
data$credit_rating <- as.factor(data$credit_rating)
data$buys_computer <- as.factor(data$buys_computer)

Output:

  age income student credit_rating buys_computer
1  25   high      no          fair            no
2  30   high      no     excellent            no
3  45 medium      no          fair           yes
4  35    low     yes          fair           yes
5  50    low      no     excellent           yes
6  23 medium     yes     excellent            no

Step 3: Split the Data into Training and Testing Sets

Splitting the data helps in evaluating the performance of the model. We’ll use 70% of the data for training and 30% for testing.

R
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(data), 0.7 * nrow(data))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
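With only 10 rows, a plain random split can leave one class underrepresented in the training set. An alternative is a stratified split that preserves the class proportions; a sketch using `createDataPartition` (this assumes the caret package is installed, and reuses `data` from Step 2):

```r
# Optional: stratified 70/30 split that preserves class proportions
# (assumes the caret package is installed; `data` comes from Step 2)
library(caret)
set.seed(123)
idx <- createDataPartition(data$buys_computer, p = 0.7, list = FALSE)
train_data <- data[idx, ]
test_data  <- data[-idx, ]
table(train_data$buys_computer)  # both classes represented
```

For the rest of this article we stick with the simple `sample()` split shown above.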

Step 4: Train the Naive Bayes Model

Use the naiveBayes function from the e1071 package to train the model.

R
model <- naiveBayes(buys_computer ~ ., data = train_data)
print(model)

Output:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
       no       yes
0.5714286 0.4285714

Conditional probabilities:
     age
Y         [,1]     [,2]
  no  34.75000 17.74589
  yes 36.33333 12.50333

     income
Y          high       low    medium
  no  0.7500000 0.0000000 0.2500000
  yes 0.0000000 0.3333333 0.6666667

     student
Y            no       yes
  no  0.7500000 0.2500000
  yes 0.3333333 0.6666667

     credit_rating
Y     excellent      fair
  no  0.5000000 0.5000000
  yes 0.3333333 0.6666667
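Notice that some conditional-probability cells are exactly zero (for example, P(income = low | no) = 0). A zero cell forces the whole product for that class to zero whenever a test row has that level. The `laplace` argument of `naiveBayes` adds pseudo-counts to the categorical tables to avoid this; a sketch, assuming `train_data` from Step 3:

```r
# Add-one (Laplace) smoothing for the categorical tables;
# numeric predictors such as age are unaffected.
model_smooth <- naiveBayes(buys_computer ~ ., data = train_data, laplace = 1)
model_smooth$tables$income  # no zero cells any more
```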

Step 5: Make Predictions

Use the trained model to make predictions on the test data.

R
predictions <- predict(model, test_data)
print(predictions)

Output:

[1] yes yes yes
Levels: no yes
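By default `predict` returns hard class labels. Passing `type = "raw"` returns the posterior probability of each class instead, which is useful for inspecting how confident the model is or for applying a custom decision threshold. This assumes `model` and `test_data` from the earlier steps:

```r
# Posterior class probabilities instead of hard labels
probs <- predict(model, test_data, type = "raw")
print(probs)    # one row per test case, one column per class ("no", "yes")
rowSums(probs)  # each row sums to 1
```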

Step 6: Evaluate the Model

Evaluate the performance of the model by comparing the predictions with the actual class labels.

R
# Confusion matrix
confusion_matrix <- table(predictions, test_data$buys_computer)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))

Output:

predictions no yes
        no   0   0
        yes  0   3

[1] "Accuracy: 1"
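Accuracy alone can be misleading on a test set this small (all three test rows happen to belong to the "yes" class). Per-class metrics such as precision and recall can be read off the same confusion matrix; a sketch assuming `confusion_matrix` from the code above, treating "yes" as the positive class:

```r
# Precision and recall for the "yes" class
# (rows of confusion_matrix are predictions, columns are actual labels)
tp <- confusion_matrix["yes", "yes"]
fp <- confusion_matrix["yes", "no"]
fn <- confusion_matrix["no", "yes"]
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
print(paste("Precision:", round(precision, 2), "Recall:", round(recall, 2)))
```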

The naiveBayes function in the e1071 package can handle both numerical and categorical variables. Numerical variables are assumed to follow a Gaussian (normal) distribution, while categorical variables are handled by calculating the frequency of each category given the class label.
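For a numeric predictor, the fitted model stores the per-class mean and standard deviation (the two columns of the `age` table in the output of Step 4), and likelihoods are evaluated with the normal density. This can be checked by hand with `dnorm`, assuming `model` from Step 4:

```r
# Stored Gaussian parameters for age: column 1 = mean, column 2 = sd
params <- model$tables$age
# Likelihood of observing age = 40 under each class
dnorm(40, mean = params["no", 1],  sd = params["no", 2])
dnorm(40, mean = params["yes", 1], sd = params["yes", 2])
```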

Conclusion

Creating a Naive Bayes classifier in R to handle both numerical and categorical variables involves:

  • Installing and loading the e1071 package.
  • Preparing your data by converting categorical variables to factors.
  • Splitting the data into training and testing sets.
  • Training the model using the naiveBayes function.
  • Making predictions and evaluating the model’s performance.

Naive Bayes classifiers are robust and efficient, making them a great choice for various classification tasks. By following the steps outlined in this article, you can implement a Naive Bayes classifier in R for datasets with mixed types of variables.




Referred: https://www.geeksforgeeks.org