Email Spam Detection using CatBoost

In today’s digital age, email remains one of the most widely used communication mediums. However, the prevalence of spam emails poses significant challenges for both individuals and organizations. Spam emails not only clutter inboxes but can also contain malicious links or attachments that pose security risks. To combat this, machine learning techniques have been increasingly employed to develop effective spam detection systems. One such powerful tool is CatBoost, a gradient boosting algorithm that excels in handling categorical data. This article delves into the application of CatBoost for email spam detection, highlighting its features, advantages, and implementation.

Table of Contents

Understanding CatBoost
Email Spam Detection with CatBoost

1. Data Preparation
2. Train a CatBoost Model
3. Evaluate the Model

Advantages of Using CatBoost for Spam Detection

Understanding CatBoost

CatBoost, short for “Categorical Boosting,” is an open-source gradient boosting library developed by Yandex. It is designed to handle both numerical and categorical features efficiently, making it particularly suitable for real-world datasets that often contain categorical variables. Unlike many gradient boosting implementations, CatBoost does not require extensive preprocessing of categorical data, such as one-hot encoding, which simplifies data preparation and often improves model performance.

Key Features of CatBoost

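Handling Categorical Features: CatBoost encodes categorical features automatically using ordered target statistics. Each category value is replaced by a statistic of the target that is computed only from the rows preceding the current one in a random permutation of the training data, which avoids leaking a row's own label into its features. A related technique, Ordered Boosting, applies the same permutation idea to the boosting steps themselves to reduce overfitting.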
Gradient Boosting: CatBoost employs gradient boosting, an ensemble learning technique that combines weak prediction models, typically decision trees, to create a robust predictive model. This iterative process focuses on correcting errors made by previous models, enhancing overall accuracy.
Regularization: To prevent overfitting, CatBoost incorporates L2 regularization, which adds a penalty term to the loss function. This helps control the complexity of the model and improves its generalization ability.
Symmetric Decision Trees: CatBoost builds balanced trees that are symmetric in structure, reducing prediction time and acting as a form of regularization to prevent overfitting.
High Performance: CatBoost is optimized for speed and memory efficiency, making it suitable for large datasets. It also supports GPU training and distributed training across multiple machines, which speeds up model training on big data.
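
The idea behind CatBoost's categorical encoding can be illustrated with a simplified ordered target statistic. This is only a sketch: the function name `ordered_target_stats`, the `prior` and `weight` parameters, and the sample data are assumptions made for illustration, and CatBoost's internal implementation differs in detail (for example, it works with several permutations).

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode each category using only the targets of rows seen earlier
    in a random permutation (illustrative, not CatBoost's exact formula)."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random processing order

    sums, counts = {}, {}  # running target sum and row count per category
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of the targets observed so far for this category
        encoded[i] = (seen_sum + weight * prior) / (seen_count + weight)
        # Only now add the current row's own target to the running totals
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded

# Hypothetical sender domains with spam (1) / ham (0) labels
domains = ["promo.biz", "corp.com", "promo.biz", "corp.com", "promo.biz"]
labels = [1, 0, 1, 0, 1]
print(ordered_target_stats(domains, labels))
```

Because a row's own label never enters its encoding, the first occurrence of any category simply receives the prior, and later occurrences move toward the observed spam rate for that category.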

Email Spam Detection with CatBoost

Spam email detection means identifying unwanted or irrelevant emails and separating them from genuine ones, so that recipients are not exposed to the same unnecessary messages again and again. It is a text classification problem: a machine learning algorithm is trained on messages that are labeled with one of two classes, wanted (ham) or unwanted (spam).

Email Spam Features:

Presence of urgency or all caps: Spammers often use pressure tactics, such as the word “urgent” or writing the entire email in uppercase, to compel recipients to click on links.
Sender and recipient information: Examining the sender and recipient addresses can reveal irregularities that are common across spam campaigns.
Body text analysis: Scan the email body for words that appear frequently in spam (e.g., prescription, free, click, access). Promotional phrasing, such as offers of something “free” or invitations to competitions, is a further distinguishing signal.
Attachment presence: Note whether the email carries an attachment, especially when it comes from an unknown sender, since many of the spammiest messages come from unknown senders.
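
The signals listed above can be turned into model features with a few lines of Python. The sketch below is a minimal illustration: the function name `extract_features`, its parameters, and the feature names are hypothetical, and the keyword list simply reuses the example words from the text.

```python
import re

# Example spam words taken from the list above; a real system would use a larger list
SPAM_WORDS = {"prescription", "free", "click", "access"}

def extract_features(subject, body, sender, has_attachment=False):
    """Build a small dictionary of hand-crafted spam signals for one email."""
    text = f"{subject} {body}"
    letters = [ch for ch in text if ch.isalpha()]
    # Share of alphabetic characters that are uppercase (all-caps signal)
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "has_urgent": int("urgent" in words),
        "upper_ratio": round(upper_ratio, 3),
        "spam_word_count": sum(w in SPAM_WORDS for w in words),
        # Categorical feature: the part of the address after the last '@'
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),
        "has_attachment": int(has_attachment),
    }

# Hypothetical email used only to exercise the function
features = extract_features(
    subject="URGENT offer",
    body="Click here for FREE access",
    sender="promo@Example.com",
    has_attachment=True,
)
print(features)
```

A feature such as `sender_domain` is categorical, which is the kind of column CatBoost can consume without one-hot encoding.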

Installing CatBoost

pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.2/98.2 MB 7.5 MB/s eta 0:00:00
Requirement already satisfied: graphviz in /usr/local/lib/python3.10/dist-packages (from catboost) (0.20.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from catboost) (3.7.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.10/dist-packages (from catboost) (1.25.2)
Requirement already satisfied: pandas>=0.24 in /usr/local/lib/python3.10/dist-packages (from catboost) (2.0.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from catboost) (1.11.4)
Requirement already satisfied: plotly in /usr/local/lib/python3.10/dist-packages (from catboost) (5.15.0)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from catboost) (1.16.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=0.24->catboost) (2024.1)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (4.51.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (24.0)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->catboost) (3.1.2)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly->catboost) (8.3.0)
Installing collected packages: catboost
Successfully installed catboost-1.2.5

Let’s build a spam detection model with CatBoost in the following steps:

1. Data Preparation

The first step in building an email spam detection model is data preparation. This involves collecting a dataset of emails labeled as spam or non-spam. The dataset should include various features such as the email’s subject, body, sender information, and metadata.

Text Preprocessing: Emails are primarily text-based, so preprocessing steps like tokenization, stop word removal, and stemming are essential. Tokenization breaks down the email content into individual words or tokens. Stop word removal eliminates common words that do not contribute to the classification, and stemming reduces words to their root forms.
Feature Extraction: Extract relevant features from the email content. This can include the frequency of certain words, the presence of specific keywords, and metadata such as the sender’s domain. CatBoost accepts numerical and categorical features like these directly, without extensive preprocessing. In this article, the message text itself is converted to numeric TF-IDF features before training.
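
The preprocessing steps described above can be sketched in plain Python. The stop word list here is a tiny hypothetical sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as the Porter algorithm; neither is what a production pipeline would use.

```python
import re

# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "and", "is", "you", "your", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("Claiming your FREE prizes in the lottery"))
```

Stems produced this way are not always dictionary words; what matters for classification is that related forms of a word collapse to the same token.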

1.1 Loading the Dataset

Once the data is loaded and prepared, the next step is to train the CatBoost model. The overall process involves the following steps:

Splitting the Data: Divide the dataset into training and validation sets. The training set is used to build the model, while the validation set evaluates its performance.
Initializing CatBoost: Create a CatBoostClassifier and set the main parameters, such as the number of iterations, the learning rate, and the depth of the trees.
Convert Data to CatBoost Format: CatBoost’s native data container is the Pool, which bundles the feature matrix with the labels and optional metadata such as the indices of categorical features. Building a Pool explicitly is optional: the fit method also accepts arrays, DataFrames, and sparse matrices and wraps them internally, which is the route taken in this article’s code.

Python

import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset. The file has a header row ('v1' = label, 'v2' = message text)
# followed by mostly empty extra columns, so only the first two columns are kept.
data = pd.read_csv('spam.csv', encoding='latin-1', usecols=[0, 1])
data.head(4)

Output:

    v1    v2
0    ham    Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1    ham    Ok lar... Joking wif u oni...
2    spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3    ham    U dun say so early hor... U c already then say...

1.2 Text Preprocessing

Before any model is trained, the loaded table is cleaned: the columns are renamed to label and text, rows with missing values are dropped, and the string labels are encoded as integers (ham as 0, spam as 1). The CatBoostClassifier itself is created and fitted in step 2, where hyperparameters such as the learning rate, tree depth, and regularization strength can be adjusted to fine-tune performance.

Python

# Assuming the CSV has columns 'v1' for labels and 'v2' for text
# Rename columns for convenience
if data.columns[0] == 'v1' and data.columns[1] == 'v2':
    data = data.rename(columns={'v1': 'label', 'v2': 'text'})
else:
    # Handle the case where the column names might be different or have extra spaces
    data.columns = ['label', 'text']
# Drop any rows with missing values
data = data.dropna(subset=['label', 'text'])
# Encode labels: 'ham' -> 0, 'spam' -> 1
data['label'] = data['label'].str.strip().map({'ham': 0, 'spam': 1})

1.3 Feature Extraction and Splitting into Training and Test Sets

Python

# Extract features and labels
X = data['text']
y = data['label']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the shapes of the training and testing sets
print("Shape of the training set:", X_train.shape, y_train.shape)
print("Shape of the testing set:", X_test.shape, y_test.shape)

Output (for the full 5,572-message spam.csv):

Shape of the training set: (4457,) (4457,)
Shape of the testing set: (1115,) (1115,)

Python

# Vectorize the text data using TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
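
What the vectorizer computes can be sketched in plain Python. The version below assumes scikit-learn's default TfidfVectorizer settings (smoothed IDF and L2 normalization) but uses simple whitespace tokenization, so it is a teaching sketch, not a drop-in replacement. The function name `tfidf` and the sample messages are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical two-message corpus
vectors = tfidf(["free prize now", "see you now"])
print(vectors[0])
```

Terms that occur in every message, like "now" in this toy corpus, receive a lower weight than terms that are specific to one message, which is what makes words such as "free" stand out in spam.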

2. Train a CatBoost Model

Python

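# Create the CatBoost model; verbose=100 logs training progress every 100 iterations
model = CatBoostClassifier(iterations=1000, depth=6, learning_rate=0.1, loss_function='Logloss', verbose=100)
# Train on the TF-IDF features of the training set
model.fit(X_train_tfidf, y_train)
# Predict class labels (0 = ham, 1 = spam) for the test set
y_pred = model.predict(X_test_tfidf)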

3. Evaluate the Model

After fitting the model on the training set, generate predictions for the test set with the predict method. Then estimate the model’s performance with metrics such as accuracy, precision, recall, and F1-score, each of which compares the predicted labels with the true labels in the test set.
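
To make these metrics concrete, the sketch below computes them directly from the counts of true and false positives and negatives, treating spam (1) as the positive class. The function name `binary_metrics` and the six labels are hypothetical and serve only as an illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: three spam messages, one of which the model misses
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

In this hypothetical case precision is 1.0 because no ham message was flagged as spam, while recall is 2/3 because one of the three spam messages slipped through. For a spam filter, the two errors have different costs, so it is worth looking at both rather than at accuracy alone.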

Python

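from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluate the model by comparing predictions with the true test labels
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Report each metric rounded to four decimal places
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')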

Output:

Accuracy: 0.8333
Precision: 1.0000
Recall: 0.6667
F1 Score: 0.8000

Advantages of Using CatBoost for Spam Detection

Efficiency: CatBoost handles categorical features, such as a sender’s domain, without extensive preprocessing, which saves time and computational resources on large datasets.
Accuracy: Gradient boosting combined with CatBoost’s regularization techniques tends to produce accurate models that generalize well to unseen data.
Ease of Use: CatBoost’s automatic handling of categorical features and built-in support for missing values simplify the model-building process, making it accessible to users with varying levels of expertise.
Scalability: CatBoost’s support for distributed training and its optimized performance make it suitable for big data applications, ensuring scalability for large-scale spam detection systems.

Conclusion

Email spam detection is a critical task in maintaining the security and efficiency of digital communication. CatBoost, with its robust handling of categorical data and powerful gradient boosting framework, offers an effective solution for building accurate and efficient spam detection models. By leveraging CatBoost’s features, organizations can enhance their email security, reduce the risk of malicious content, and improve overall productivity. As spam detection techniques continue to evolve, CatBoost remains a valuable tool in the fight against unwanted and harmful emails. By following the steps outlined in this article, you can implement a CatBoost-based spam detection system that not only improves email management but also provides a robust defense against the ever-growing threat of spam.

Referred: https://www.geeksforgeeks.org
