Multinomial Naive Bayes Classifier in R - Coding

The Multinomial Naive Bayes (MNB) classifier is a popular machine learning algorithm, especially useful for text classification tasks such as spam detection, sentiment analysis, and document categorization. In this article, we cover the basics of the MNB classifier and how to implement it in R.

What is Naive Bayes?

Naive Bayes is a family of simple probabilistic classifiers based on Bayes’ theorem with the “naive” assumption of independence between every pair of features. Despite its simplicity, it often performs surprisingly well for many tasks, particularly those involving text.

Multinomial Naive Bayes

The Multinomial Naive Bayes classifier is specifically designed for handling discrete data. It is most commonly used for document classification problems, where the frequency of each word (i.e., a discrete count of the words) is used as a feature.

How Does Multinomial Naive Bayes Work?

The algorithm calculates the probability of each word given each class from the training data. It also calculates the prior probability of each class.
For a new document, it calculates the product of the probabilities of the words in the document given each class and the prior probability of each class. The class with the highest product is chosen as the predicted class.
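
The calculation described above can be sketched in a few lines of base R. Everything here is illustrative: the toy count matrix, the labels, and the function names train_mnb and predict_mnb are assumptions made for this example and do not come from any package. The sketch adds log probabilities instead of multiplying raw probabilities, which ranks the classes identically but avoids numeric underflow on long documents.

```r
# Toy document-term counts: rows are documents, columns are words (hypothetical data).
counts <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 2, 1,
    0, 1, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("free", "win", "meeting", "lunch"))
)
labels <- factor(c("spam", "spam", "ham", "ham"))

# Estimate log priors and Laplace-smoothed log P(word | class).
train_mnb <- function(x, y, alpha = 1) {
  classes <- levels(y)
  log_prior <- log(table(y) / length(y))
  # Total count of each word within each class (classes x words).
  word_totals <- t(sapply(classes, function(k) colSums(x[y == k, , drop = FALSE])))
  smoothed <- (word_totals + alpha) / (rowSums(word_totals) + alpha * ncol(x))
  list(classes = classes, log_prior = log_prior, log_lik = log(smoothed))
}

# Score each class as log prior + sum(count * log P(word | class)); pick the largest.
predict_mnb <- function(model, newx) {
  scores <- newx %*% t(model$log_lik)
  scores <- sweep(scores, 2, as.numeric(model$log_prior), "+")
  factor(model$classes[max.col(scores, ties.method = "first")],
         levels = model$classes)
}

model <- train_mnb(counts, labels)
new_docs <- rbind(c(1, 1, 0, 0), c(0, 0, 1, 1))
predict_mnb(model, new_docs)
```

With this toy data, the first new document (one "free", one "win") is assigned to spam and the second (one "meeting", one "lunch") to ham. The alpha argument is the Laplace smoothing constant; it keeps a word that never appeared in a class from forcing that class's probability to zero.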

We now walk through a step-by-step implementation of a Multinomial Naive Bayes classifier in the R programming language.

Dataset: Spam (spam.csv)

Step 1: Load Necessary Packages and Dataset

e1071: Provides functions for Naive Bayes classification (naiveBayes function).
tm: Used for text mining tasks, such as creating a corpus and preprocessing text data.
caret: Provides functions for data splitting (createDataPartition) and model evaluation (confusionMatrix).
The read.csv call then loads the dataset from the specified path; adjust the path to wherever spam.csv is saved on your machine.

library(e1071)
library(tm)
library(caret)

sms_data <- read.csv("C:\\Users\\Tonmoy\\Downloads\\Dataset\\spam.csv",
                     stringsAsFactors = FALSE)

Step 2: Prepare the Data

Renames the columns of the dataset to “type” and “text” for easier reference.

colnames(sms_data) <- c("type", "text")

Step 3: Text Preprocessing

Re-encodes the “text” column from “latin1” to “UTF-8” with iconv; sub = "" drops any characters that cannot be converted.
Create Corpus: Converts the “text” column into a text corpus using tm::Corpus.
Preprocess Text: Applies several transformations (tolower, removePunctuation, removeNumbers, removeWords, stripWhitespace) to clean and standardize the text data for analysis.
Converts the preprocessed corpus into a Document-Term Matrix (DTM) where rows represent documents (text messages) and columns represent terms (words).
Converts the Document-Term Matrix (DTM) into a regular matrix (dtm_matrix) suitable for modeling purposes.

sms_data$text <- iconv(sms_data$text, from = "latin1", to = "UTF-8", sub = "")

corpus <- Corpus(VectorSource(sms_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
dtm_matrix <- as.matrix(dtm)
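
To make the structure of a Document-Term Matrix concrete, the following base R sketch builds one by hand for three made-up messages. It is only an illustration of the idea, not what tm does internally: the messages, the tokenize helper, and the toy_dtm name are assumptions, and unlike the pipeline above it does not remove stop words.

```r
# Hypothetical messages standing in for sms_data$text.
messages <- c("Free entry, win a prize now",
              "Are we meeting for lunch",
              "Win win WIN")

# Lowercase, drop everything except letters and spaces, then split on spaces.
tokenize <- function(text) {
  cleaned <- gsub("[^a-z ]", "", tolower(text))
  tokens <- strsplit(cleaned, " +")[[1]]
  tokens[tokens != ""]
}
tokens <- lapply(messages, tokenize)

# Vocabulary = sorted unique terms across all messages.
vocab <- sort(unique(unlist(tokens)))

# One row per message, one column per term, each cell a term count.
toy_dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = vocab))),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab
toy_dtm
```

Each row sums to the number of tokens in its message, and most cells are zero. On a real corpus the vocabulary runs to thousands of terms, which is why the dense matrix produced by as.matrix can become large.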

Step 4: Split the Data

Splits the dataset into training and testing sets using caret::createDataPartition.
Selects rows from dtm_matrix based on train_index to create train_data and test_data.
Converts train_labels and test_labels to factors to ensure they have the same levels for classification.

set.seed(123)
train_index <- createDataPartition(sms_data$type, p = 0.75, list = FALSE)
train_data <- dtm_matrix[train_index, ]
test_data <- dtm_matrix[-train_index, ]
train_labels <- as.factor(sms_data$type[train_index])
test_labels <- as.factor(sms_data$type[-train_index])
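
When the outcome is a factor, createDataPartition samples within each class so that both splits keep roughly the original class proportions. The base R sketch below imitates that idea on made-up labels so the effect is easy to inspect. It is an approximation for illustration, not caret's actual implementation, and the names toy_labels and toy_train_index are hypothetical.

```r
set.seed(123)

# Hypothetical imbalanced labels: 80 "ham" and 20 "spam".
toy_labels <- factor(rep(c("ham", "spam"), times = c(80, 20)))

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(toy_labels), toy_labels)
toy_train_index <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the training and test portions.
prop.table(table(toy_labels[toy_train_index]))
prop.table(table(toy_labels[-toy_train_index]))
```

Running the same two prop.table checks on train_labels and test_labels is a quick way to confirm that the real split preserved the ham/spam ratio.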

Step 5: Train the Multinomial Naive Bayes Classifier

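Trains a Naive Bayes classifier (mnb_model) with e1071::naiveBayes using the training data (train_data) and corresponding labels (train_labels), requesting Laplace smoothing (laplace = 1). Note that naiveBayes models numeric columns, such as these word counts, with a per-class Gaussian distribution, and the laplace argument only affects categorical (factor) predictors, so this call does not fit a true multinomial model.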

mnb_model <- naiveBayes(train_data, train_labels, laplace = 1)
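
One workaround for the way naiveBayes handles numeric columns (it fits a Gaussian per class, and laplace touches only factor columns) is to recode each count as a "No"/"Yes" factor that records whether the word occurs at all. This yields a Bernoulli-style Naive Bayes rather than a multinomial one, so treat it as a sketch of an alternative, not as the method used above. The toy data and the convert_counts helper are hypothetical, and the model is fitted only if e1071 is installed.

```r
# Hypothetical count matrix standing in for train_data.
toy_counts <- matrix(c(2, 0, 1,
                       0, 3, 0,
                       1, 0, 2,
                       0, 1, 0),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(NULL, c("free", "meeting", "win")))
toy_labels <- factor(c("spam", "ham", "spam", "ham"))

# Recode every column as a factor with fixed levels "No" / "Yes".
# Fixed levels keep training and test data consistent even if a word
# never occurs in one of them.
convert_counts <- function(x) {
  cols <- lapply(seq_len(ncol(x)), function(j) {
    factor(ifelse(x[, j] > 0, "Yes", "No"), levels = c("No", "Yes"))
  })
  names(cols) <- colnames(x)
  data.frame(cols, check.names = FALSE)
}
toy_factors <- convert_counts(toy_counts)

# With factor predictors, laplace = 1 smooths the per-class frequency tables.
if (requireNamespace("e1071", quietly = TRUE)) {
  toy_model <- e1071::naiveBayes(toy_factors, toy_labels, laplace = 1)
  print(predict(toy_model, toy_factors))
}
```

Applied to the tutorial's data, the same helper would be called on both train_data and test_data before fitting and predicting. With several thousand terms this produces a very wide data frame, so restricting the vocabulary to reasonably frequent terms first is usually worthwhile.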

Step 6: Make Predictions

Uses the trained model (mnb_model) to make predictions (test_pred) on the test data (test_data).

test_pred <- predict(mnb_model, test_data)

Step 7: Evaluate the Model

Computes the confusion matrix (conf_matrix) to evaluate the performance of the Naive Bayes classifier on the test data (test_pred vs. test_labels).
Prints the confusion matrix along with metrics such as accuracy, Kappa, sensitivity, specificity, and the positive and negative predictive values.

conf_matrix <- confusionMatrix(test_pred, test_labels)
print(conf_matrix)

Output:

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham     0    0
      spam 1206  186
                                          
               Accuracy : 0.1336          
                 95% CI : (0.1162, 0.1526)
    No Information Rate : 0.8664          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.1336          
             Prevalence : 0.8664          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : ham

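Because “ham” is the positive class, the table reads as follows. No “ham” message was correctly predicted as “ham” (0 true positives). All 186 “spam” messages were correctly predicted as “spam” (true negatives). All 1206 “ham” messages were incorrectly predicted as “spam” (false negatives). No “spam” message was predicted as “ham” (0 false positives).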

Accuracy: 13.36% of all predictions were correct (186 of 1,392).
95% Confidence Interval (CI): True accuracy likely falls between 11.62% and 15.26%.
No Information Rate (NIR): A model would achieve 86.64% accuracy by always predicting the majority class (“ham”).
Kappa Statistic: Indicates no agreement beyond chance (Kappa = 0).
Sensitivity (True Positive Rate): 0% sensitivity indicates that no “ham” message (the positive class) was correctly identified.
Specificity (True Negative Rate): 100% specificity indicates that all “spam” messages were correctly identified.
Positive Predictive Value (Precision): Not a Number (NaN), because the model made no positive (“ham”) predictions.
Negative Predictive Value (NPV): 13.36% of the messages predicted as “spam” were actually spam.
Prevalence: 86.64% of messages were “ham”.
Detection Rate: 0% of all messages were correctly identified as “ham”.
Balanced Accuracy: 50%, the average of sensitivity and specificity, which is what a classifier that always predicts a single class obtains.
Positive Class: “ham” was treated as the positive class when calculating the metrics.
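
These metrics can be recomputed by hand from the four cell counts, which is a useful sanity check when reading confusionMatrix output. The sketch below copies the counts from the table above and uses the standard textbook definitions with “ham” as the positive class; the variable names are chosen for this example.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(0, 1206, 0, 186), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

tp <- cm["ham", "ham"]    # ham predicted as ham
fn <- cm["spam", "ham"]   # ham predicted as spam
fp <- cm["ham", "spam"]   # spam predicted as ham
tn <- cm["spam", "spam"]  # spam predicted as spam

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)   # 0 / 0, which R evaluates to NaN
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced), 4)
```

The accuracy works out to 186 / 1392, about 0.1336, and the NaN for the positive predictive value comes directly from dividing zero by zero.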

Conclusion

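The Multinomial Naive Bayes classifier provides a straightforward approach to text classification tasks like spam detection in R, but its performance depends heavily on how the features are modeled and on the quality and balance of the dataset. In this case, the classifier labeled every test message as “spam”, so it failed to recognize any “ham” message. Because all predictions went to the minority class, class imbalance alone is unlikely to explain the result; a more probable cause is that e1071::naiveBayes treats the sparse word counts as Gaussian features. Improving its effectiveness might involve representing the features categorically or using a true multinomial implementation, addressing the imbalance, refining text preprocessing, or exploring more advanced algorithms.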

Multinomial Naive Bayes Classifier in R: FAQs

What is the Multinomial Naive Bayes classifier used for?

The Multinomial Naive Bayes classifier is primarily used for text classification tasks, such as spam email detection, sentiment analysis, and categorizing news articles.

How does the Multinomial Naive Bayes classifier handle text data?

It treats each word’s frequency as a feature and assumes independence between the occurrences of different words, making it efficient for large text datasets.

What are the key advantages of using the Multinomial Naive Bayes classifier?

It is simple, fast, and works well with high-dimensional data like text. It also performs efficiently even with a relatively small amount of training data.

What are the main challenges or limitations of the Multinomial Naive Bayes classifier?

It assumes independence between features (words), which may not always hold true in real-world text data. It can also be sensitive to the presence of irrelevant or redundant features.

How can I improve the performance of a Multinomial Naive Bayes classifier in R?

You can improve performance by preprocessing text data effectively (e.g., removing stop words, stemming/lemmatizing), addressing class imbalance if present, and tuning parameters such as Laplace smoothing.

Why does the confusion matrix show NaN for Positive Predictive Value (Precision)?

NaN (Not a Number) appears when the classifier makes no positive predictions, so the precision calculation divides zero by zero. In the output above, “ham” is the positive class and no message was predicted as “ham”.

How should I interpret the accuracy and other metrics in the confusion matrix output?

Accuracy measures the overall correctness of predictions. However, in imbalanced datasets, metrics like sensitivity (true positive rate) and specificity (true negative rate) are more informative about the classifier’s performance.

What are some alternative classifiers I can use for text classification in R?

Besides Multinomial Naive Bayes, other popular classifiers for text include Support Vector Machines (SVM), Logistic Regression, and ensemble methods like Random Forests or Gradient Boosting Machines.

Referred: https://www.geeksforgeeks.org
