Machine learning (ML) is a vast field, with certain algorithms often overshadowing others due to their widespread use and popularity. However, some lesser-known algorithms offer unique advantages and deserve more attention. This article delves into some underrated ML algorithms, with a special focus on Nu-Support Vector Classification (NuSVC). We will explore its theoretical foundations, practical implementations, and provide a complete code example.
Understanding Nu-Support Vector Classification (NuSVC)
The Nu-Support Vector Classification (NuSVC) algorithm is a variant of the traditional Support Vector Classification (SVC). Unlike SVC, which uses a parameter C to control the trade-off between margin maximization and classification error, NuSVC uses the parameter ν (nu), which represents an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
- This provides a more intuitive way to control the number of support vectors and errors.
- NuSVC can be more flexible in adjusting the model complexity and is useful when the number of training samples is small.
- It also allows for the construction of non-linear decision boundaries using kernel functions, such as linear, polynomial, and RBF (Radial Basis Function) kernels.
This makes NuSVC suitable for a wide range of classification problems, particularly those involving complex, high-dimensional data; the short sketch below illustrates the basic interface.
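As a quick illustration, here is a minimal sketch that fits NuSVC on a small synthetic dataset; the dataset and the value nu=0.5 are illustrative assumptions, not tuned choices.
Python
# a minimal NuSVC sketch on synthetic data (illustrative, not tuned)
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# generate a small two-class toy problem
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=42)

# nu replaces SVC's C: it upper-bounds the fraction of training errors
# and lower-bounds the fraction of support vectors
clf_toy = NuSVC(nu=0.5, kernel='rbf')
clf_toy.fit(X_toy, y_toy)
print(clf_toy.score(X_toy, y_toy))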
Advantages of NuSVC
- Intuitive Control: The ν (nu) parameter in NuSVC provides a straightforward way to control the fraction of support vectors and training errors. This allows for more flexibility in handling datasets with varying levels of noise and outliers.
- Robustness to Outliers: NuSVC is more robust to outliers and noisy data due to its ability to adjust the number of support vectors based on the parameter ν. This makes it a better choice when dealing with datasets that have a large number of outliers.
When to Use NuSVC
- Handling Noisy Data: Use NuSVC when dealing with datasets that contain a significant amount of noise or outliers. Its ability to adjust the number of support vectors makes it more effective on such data.
- Flexibility in Margin Control: Use NuSVC when you need more flexibility in controlling the margin. The ν parameter allows more nuanced control over the margin, as the sketch below demonstrates.
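To make this concrete, the hedged sketch below (on assumed synthetic, noisy data) shows that the fraction of support vectors grows with ν, since ν is a lower bound on that fraction.
Python
# demonstrating nu as a lower bound on the fraction of support vectors
# (synthetic data with injected label noise; values are illustrative)
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     flip_y=0.1,  # inject label noise
                                     random_state=0)

for nu in [0.1, 0.3, 0.5]:
    model = NuSVC(nu=nu).fit(X_demo, y_demo)
    frac_sv = model.support_vectors_.shape[0] / len(X_demo)
    print(f"nu={nu}: fraction of support vectors = {frac_sv:.2f}")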
Implementing Nu-Support Vector Classification in Python
To implement NuSVC, we will use the scikit-learn library, a popular ML library in Python. Below is a step-by-step guide to implementing NuSVC for a classification task.
Step 1: Load the dataset and import the necessary libraries
Python
# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# loading the dataset
df = pd.read_csv('/content/disease_growth_level.csv')

# first view of the dataset
df.head(10)
Output:
First preview of the data
Step 2: Introspect the dataframe
We'll analyze the dataset using the following techniques:
- Shape of the dataframe.
- Checking for null values.
- Columns of the dataframe.
- Information about the columns.
- Summary statistics of the dataframe.
- Target variable distribution.
Python
# shape of the dataframe
df.shape

# checking for null values, then dropping them
df.isnull().sum()
df.dropna(inplace=True)

# checking for duplicates, then dropping them
df.duplicated().sum()
df.drop_duplicates(keep='first', inplace=True)

# columns
df.columns

# information about the columns
df.info()

# summary statistics
df.describe()

# target variable distribution
df['disease growth possibility level'].value_counts()
Output:
Here we drop the null values, as their proportion is small and this is real-world data.
Step 3: Checking for outliers in the dataset
For more details, please refer to the articles on Handling Outliers and Outlier Treatment.
Python
# boxplots of the numeric columns to spot outliers
sns.boxplot(df)
Output:
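If the boxplots reveal outliers, one common treatment is IQR-based capping. Below is a hedged sketch, assuming the numeric columns temperature, humidity, and ph seen in df.info(); adjust the column list to your data.
Python
# IQR-based capping sketch: clip values outside the 1.5*IQR whiskers
# (column names are assumptions based on this dataset)
for col in ['temperature', 'humidity', 'ph']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = df[col].clip(lower, upper)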
Step 4: Feature Engineering
For more details, please refer to the article on Feature Engineering.
- Extracting the time column into hours and minutes using pandas.
- Dropping unnecessary columns.
- Encoding object-type columns into numerical values.
Python
# extracting the time column into separate columns for hours and minutes
df['time'] = pd.to_datetime(df['time'])
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute

# the time column is no longer needed, so drop it
df.drop('time', axis=1, inplace=True)

# dropping the date column from the dataframe (not relevant in this scenario)
df.drop('date', axis=1, inplace=True)

# feature encoding using LabelEncoder
# note: fit_transform refits the encoder on each column, so one instance
# can be reused, but its final fitted mapping reflects only the last column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['ventilation'] = le.fit_transform(df['ventilation'])
df['light_intensity'] = le.fit_transform(df['light_intensity'])
df['disease growth possibility level'] = le.fit_transform(df['disease growth possibility level'])
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7124 entries, 0 to 7123
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 7124 non-null object
1 time 7124 non-null object
2 temperature 7124 non-null float64
3 humidity 7124 non-null int64
4 ventilation 7118 non-null object
5 light_intensity 7118 non-null object
6 ph 7118 non-null float64
7 disease growth possibility level 7124 non-null object
dtypes: float64(2), int64(1), object(5)
memory usage: 445.4+ KB
Step 5: Feature Split and Handling the Imbalanced Dataset
For more details, please refer to the article on Handling Imbalanced Datasets.
We handle the imbalanced dataset using the imblearn library.
Python
# independent and dependent feature split, X and y
X = df.drop('disease growth possibility level', axis=1)
y = df['disease growth possibility level']

# imblearn for handling imbalance in the dataset
# SMOTETomek combines SMOTE oversampling with Tomek-link undersampling
from imblearn.combine import SMOTETomek
smk = SMOTETomek(random_state=42)
X, y = smk.fit_resample(X, y)
Output:
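To verify the resampling, you can inspect the class distribution after fit_resample; a minimal check, assuming y is the pandas Series defined above:
Python
# sanity check: after SMOTETomek the classes should be (near-)balanced
print(y.value_counts())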
Step 6: Implementing the NuSVC Classification Algorithm
- Train-test split.
- Standard scaler for scaling the data.
- Model building.
- Model training.
- Building a predictive model.
- Accuracy metrics: confusion matrix, accuracy score, classification report.
- Pickle the model into a .pkl file.
Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# standard scaling: fit on the training data only, then transform both splits
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# NuSVC model (default nu=0.5, RBF kernel)
from sklearn.svm import NuSVC
clf = NuSVC()
clf.fit(X_train, y_train)

# prediction
y_pred = clf.predict(X_test)

# classification metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# predictive model
# here we use a row from the dataframe to check whether the model predicts correctly
input_data = (22.40, 82, 1, 0, 6.9, 8, 30)
input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
# the model was trained on scaled features, so scale the input the same way
input_data_scaled = sc.transform(input_data_reshaped)
prediction = clf.predict(input_data_scaled)
print(prediction)

# pickling the model
import pickle
pickle.dump(clf, open('model_mushroom.pkl', 'wb'))
Output:
You can download the .pkl file from the Files section in the left-side panel.
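To reuse the model later, load it back from the pickle file; a minimal sketch (note that the fitted StandardScaler should be pickled and reloaded the same way if you intend to predict on raw, unscaled inputs):
Python
# loading the pickled model back for reuse
import pickle

with open('model_mushroom.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

# predict on already-scaled data, e.g. the first few test rows
print(loaded_clf.predict(X_test[:5]))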
Choosing the Right Parameters
- Nu Parameter: This parameter controls the trade-off between the number of support vectors and margin errors. It must lie in the interval (0, 1]; a higher ν value permits a larger fraction of margin errors and results in more support vectors, while a lower ν enforces a stricter fit with fewer support vectors.
- Kernel: The choice of kernel (linear, polynomial, RBF) depends on the nature of the data. Linear kernels are suitable for linearly separable data, while non-linear kernels can handle more complex relationships.
- Gamma: This parameter defines how far the influence of a single training example reaches: a low value means 'far' and a high value means 'close'. The tuning sketch below shows one way to search over these parameters.
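Here is a hedged sketch of searching over these parameters with GridSearchCV; the grid values are illustrative assumptions, not recommendations.
Python
# cross-validated grid search over nu, kernel, and gamma
# (some nu values may be infeasible for a given class balance;
# GridSearchCV records a NaN score for failed fits by default)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

param_grid = {
    'nu': [0.1, 0.3, 0.5],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
}

search = GridSearchCV(NuSVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)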
Conclusion
Nu-Support Vector Classification (NuSVC) offers a flexible and robust alternative to traditional SVC, particularly in scenarios involving noisy data and the need for nuanced margin control. By understanding its theoretical foundations and practical implementations, you can leverage NuSVC to tackle complex classification problems effectively. The code example above demonstrates how to implement NuSVC using Python's scikit-learn library, making it accessible for practical applications.