KNN: k-Nearest Neighbour Algorithm in R From Scratch - Coding

In this article, we discuss what the KNN algorithm is, how it can be coded in the R programming language, its applications, and its advantages and disadvantages.

kNN algorithm in R

KNN stands for K-nearest neighbours. It is a supervised learning algorithm that can be used for both classification and regression tasks, and it is one of the simplest algorithms used in machine learning, data analytics, and data science. The KNN algorithm assigns labels to the test data based on the class labels of the nearest points in the training data. It is called a lazy learning algorithm because no model is built in advance: the training data is simply stored, and all the computation happens at prediction time. KNN can be applied to both categorical and numerical data. In this article we discuss the KNN algorithm in detail and how it can be implemented in the R programming language.

Let us now discuss the steps of the KNN algorithm and how a class label is assigned to a test data point based on the training dataset.

Input: the training dataset and the test data point.
Select the value of K (i.e., the number of nearest neighbours to be considered).
Calculate the Euclidean distance from the test data point to every training data point, where Euclidean distance = √((x2-x1)^2+(y2-y1)^2).
Identify the K training data points nearest to the test data point.
If K=1, assign the test data point the class label of its nearest training data point.
If K>1, assign the test data point the predominant class label among its K nearest training data points.

Example of KNN Algorithm

Let us now work through an example of the K Nearest Neighbour algorithm.

The table below represents the training dataset. The first column is the serial number, the second column is the number of pages in a book, the third column is the cost of the book, and the fourth column is the class of the book. Each book is categorized as either white or black and white based on its number of pages and its cost.

SI. No	Number of Pages	Cost of Book	Class
1	167	51	White
2	182	62	Black and white
3	176	69	Black and white
4	173	64	Black and white
5	172	65	Black and white
6	174	56	White
7	169	58	Black and white
8	173	57	Black and white
9	170	55	Black and white

The class labels in the training dataset above are white and black and white. They are assigned based on the cost of the book and its number of pages.

Test Data : SI.No:10 Number of pages : 170 Cost of book : 57 Class ?

For the above test data we need to identify the class label by using the training data set as represented in the table with the help of KNN algorithm steps.

Let us set the value of K to 3 (i.e., K=3).

Euclidean distance=√((x2-x1)^2+(y2-y1)^2 )

Now let us calculate the Euclidean distance from the test data point to every training data point. The fourth column in the table below shows the Euclidean distance calculated with the formula above, where x2,y2 is the test data point (i.e., x2 = number of pages of the test data, y2 = cost of book of the test data) and x1,y1 is a training data point (i.e., x1 = number of pages of the training data, y1 = cost of book of the training data).

SI.No	Number of Pages	Cost of Book	Euclidean Distance	Class
1	167	51	√((170-167)^2+(57-51)^2 ) =√((3)^2+(6)^2 ) = √(9+36) =√45 = 6.7	White
2	182	62	13	Black and white
3	176	69	13.4	Black and white
4	173	64	7.6	Black and white
5	172	65	8.2	Black and white
6	174	56	4.1	White
7	169	58	1.4	Black and white
8	173	57	3	Black and white
9	170	55	2	Black and white
10	170	57	–	?
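The distance column can be checked with a few lines of R. This is a minimal sketch that types the training table in by hand; the variable names (pages, cost, testPages, testCost) are illustrative and are not part of the original dataset file.

```r
# Training data copied from the table above
pages <- c(167, 182, 176, 173, 172, 174, 169, 173, 170)
cost  <- c(51, 62, 69, 64, 65, 56, 58, 57, 55)

# Test data point (serial number 10)
testPages <- 170
testCost  <- 57

# Euclidean distance from the test point to every training point
distances <- sqrt((testPages - pages)^2 + (testCost - cost)^2)
round(distances, 1)
# [1]  6.7 13.0 13.4  7.6  8.2  4.1  1.4  3.0  2.0
```

Because the subtraction is vectorized, one expression computes all nine distances at once.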

Now let us sort the table by Euclidean distance in ascending order, so that the nearest training data points come first. The rearranged table is shown below.

Si.No	Number of Pages	Cost of Book	Euclidean Distance	Class
1	169	58	1.4	Black and white
2	170	55	2	Black and white
3	173	57	3	Black and white
4	174	56	4.1	White
5	167	51	6.7	White
6	173	64	7.6	Black and white
7	172	65	8.2	Black and white
8	182	62	13	Black and white
9	176	69	13.4	Black and white
10	170	57	–	?

The table below marks the neighbours that are included for different values of K.

SI.No	Number of Pages	Cost of Book	Euclidean Distance	Class	K
1	169	58	1.4	Black and white	k=1
2	170	55	2	Black and white	k=2
3	173	57	3	Black and white	k=3
4	174	56	4.1	White
5	167	51	6.7	White
6	173	64	7.6	Black and white
7	172	65	8.2	Black and white
8	182	62	13	Black and white
9	176	69	13.4	Black and white
10	170	57	–	?	–

The given value of K is 3. For K=3 the class labels of the three nearest neighbours are black and white, black and white and black and white. Black and white is therefore the predominant class label for K=3, and it is assigned to the test data point.
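The majority vote can be sketched in R as follows. The vector of labels is typed in from the sorted table, and the helper name majorityLabel is hypothetical. Note that which.max() returns the first label when two labels have equal counts, which is an arbitrary tie-breaking choice made for this sketch.

```r
# Class labels of the training points, ordered by increasing distance
sortedLabels <- c("Black and white", "Black and white", "Black and white",
                  "White", "White", "Black and white", "Black and white",
                  "Black and white", "Black and white")

# Return the most frequent label among the k nearest neighbours
majorityLabel <- function(labels, k) {
  votes <- table(labels[1:k])
  names(votes)[which.max(votes)]
}

majorityLabel(sortedLabels, 3)
# [1] "Black and white"
```

For this dataset the vote gives the same label for K = 1, 3 and 5, because at most two of the five nearest neighbours are white.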

Step by step explanation of the KNN algorithm code from scratch

Let us now implement the above example in R from scratch.

Taking Data Set as input

In the code below we use an external dataset. It has 10 observations and 4 variables: Serial Number, Number of Pages, Cost of Book and Class. The value of Class is either white or black and white, depending on the number of pages and the cost of the book.

We can load the data with the function read.csv(). R also ships with some built-in datasets, but in the code below we use the external dataset linked underneath. Many more datasets are available on websites such as https://www.kaggle.com.

Dataset Link: Example Dataset

R

dataFrame<-read.csv("https://media.geeksforgeeks.org/wp-content/uploads/20240112220442/exampleData.csv")
dataFrame

Output:

   S.No Number.of.Pages Cost.of.Book           Class
1     1             167           51           White
2     2             182           62 Black and White
3     3             176           69 Black and White
4     4             173           64 Black and White
5     5             172           65 Black and White
6     6             174           56           White
7     7             169           58 Black and White
8     8             173           57 Black and White
9     9             170           55 Black and White
10   10             170           57 Black and White

We use the last row of the data as the test data and the remaining rows as the training data. We then predict the class of the test data with the KNN algorithm. The code below splits the dataset into training data and test data.

R

# creating training data (all rows except the last one)
trainData <- dataFrame[1:(nrow(dataFrame)-1),]
trainData
# creating test data (the last row)
testData <- dataFrame[nrow(dataFrame),]
testData

Output:

  S.No Number.of.Pages Cost.of.Book           Class
1    1             167           51           White
2    2             182           62 Black and White
3    3             176           69 Black and White
4    4             173           64 Black and White
5    5             172           65 Black and White
6    6             174           56           White
7    7             169           58 Black and White
8    8             173           57 Black and White
9    9             170           55 Black and White
testData
   S.No Number.of.Pages Cost.of.Book           Class
10   10             170           57 Black and White

We can inspect and analyze the data by using functions like str() and summary() in R.

R

summary(dataFrame)

Output:

      S.No       Number.of.Pages  Cost.of.Book               Class  
 Min.   : 1.00   Min.   :167.0   Min.   :51.00   Black and White:8  
 1st Qu.: 3.25   1st Qu.:170.0   1st Qu.:56.25   White          :2  
 Median : 5.50   Median :172.5   Median :57.50                      
 Mean   : 5.50   Mean   :172.6   Mean   :59.40                      
 3rd Qu.: 7.75   3rd Qu.:173.8   3rd Qu.:63.50                      
 Max.   :10.00   Max.   :182.0   Max.   :69.00

The summary() function gives summary statistics for every column of the data.

R

str(dataFrame)

Output:

'data.frame':    10 obs. of  4 variables:
 $ S.No           : int  1 2 3 4 5 6 7 8 9 10
 $ Number.of.Pages: int  167 182 176 173 172 174 169 173 170 170
 $ Cost.of.Book   : int  51 62 69 64 65 56 58 57 55 57
 $ Class          : Factor w/ 2 levels "Black and White",..: 2 1 1 1 1 2 1 1 1 1

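The str() function in R displays the internal structure of an object. For a data frame it reports the number of observations (rows) and variables (columns), and for each column its name, its type and its first few values. In R 4.0.0 and later, read.csv() reads text columns as character vectors by default, so Class may be shown as chr rather than Factor unless stringsAsFactors = TRUE is passed.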

Selecting the value of K and Calculation of euclidean distance

In the KNN algorithm the predicted class depends on the value of K. A common rule of thumb is to choose K close to the square root of the number of observations. The dataset used here has 10 observations, 9 of which are used for training, so we take K as 3. The code below stores this value in the variable k.
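As a rough sketch, the square-root rule of thumb can be written in R as below. It is only a heuristic starting point, the variable names are illustrative, and moving an even result to the next odd number is an assumption made here so that a vote between two classes cannot tie.

```r
nTrain <- 9                      # number of training observations in the example
k <- round(sqrt(nTrain))         # square-root rule of thumb
if (k %% 2 == 0) k <- k + 1      # prefer an odd k for two classes
k
# [1] 3
```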

√((x2-x1)^2+(y2-y1)^2 ) is the formula for the Euclidean distance, where x1,y1 is a training data point and x2,y2 is the test data point. We calculate the Euclidean distance from the test data point to each training data point. The code below defines a function that calculates the Euclidean distance.

R

k<-3
#function for calculating the euclidean distance
euclideanDistance=function(x,y){
  #checking whether x and y have the same number of values
  if(length(x)==length(y))
  {
    sqrt(sum((x-y)^2))
  }
  else
  {
    stop('x and y should have the same number of values')
  }
}
euclideanDistance(9:15,16:22)

Output:

[1] 18.52026

Above, we created a function that calculates the distance between two points. To test it, we called the function with two values x and y, which must be numeric vectors of equal length.
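The result of this test call can be verified by hand. Every pair of elements in 9:15 and 16:22 differs by 7 and there are seven pairs, so the sum of squared differences is 7 × 49 = 343 and the distance is √343. The same arithmetic in R:

```r
x <- 9:15
y <- 16:22

x - y                 # every difference is -7
sum((x - y)^2)        # 7 * 49 = 343
sqrt(sum((x - y)^2))  # 18.52026
```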

Complete implementation of KNN algorithm

R

#function for calculating the euclidean distance
euclideanDistance=function(x,y){
  #checking whether x and y have the same number of values
  if(length(x)==length(y))
  {
    sqrt(sum((x-y)^2))
  }
  else
  {
    stop('x and y should have the same number of values')
  }
}

#function to find the K nearest neighbours
nearestNeighbours=function(trainData,testData,k,funct,s=NULL)
{
  #checking whether both datasets have the same number of columns
  if(ncol(trainData)!=ncol(testData))
  {
    stop('trainData and testData should have the same number of columns')
  }
  #distance from the test point to every row of the training data
  if(is.null(s))
  {
    distance=apply(trainData,1,funct,testData)
  }
  else
  {
    distance=apply(trainData,1,funct,testData,s)
  }

  #getting closest neighbours
  distances=sort(distance)[1:k]
  neighbour_res=which(distance %in% distances)

  #more than k rows are returned when distances are tied
  if(length(neighbour_res)!=k)
  {
    warning('Some training points are tied at the same distance')
  }
  result=list(neighbour_res,distances)
  return(result)
}

#Accessing the data
dataFrame=read.csv("https://media.geeksforgeeks.org/wp-content/uploads/20240112220442/exampleData.csv")
#creating train data (all rows except the last one)
trainData=dataFrame[1:(nrow(dataFrame)-1),]
#creating test data (the last row)
testData=dataFrame[nrow(dataFrame),]
#calling nearestNeighbours function on the two feature columns
#(Number.of.Pages and Cost.of.Book); S.No is only an identifier
res=nearestNeighbours(trainData[,2:3],testData[,2:3],3,euclideanDistance)[[1]]
as.matrix(trainData[res,1:3])
#creating a prediction function that returns the most frequent class
knnPrediction=function(trainData, variable)
{
  interData=table(trainData[,variable])
  predicted=interData[interData==max(interData)]
  return(predicted)
}
#calling knnPrediction() function
knnPrediction(trainData[res,],'Class')

Output:

  S.No Number.of.Pages Cost.of.Book
7    7             169           58
8    8             173           57
9    9             170           55
Black and White 
              3

The test data point is predicted as black and white, because all three of its nearest neighbours in the training data belong to that class.
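The nearestNeighbours() function above selects neighbours with which(distance %in% ...), which can return more than k rows when several training points are tied at the same distance. One possible alternative, shown here only as a sketch, is to rank the distances with order() and keep the first k positions. The function name kNearestIndices and the sample distances are hypothetical.

```r
# Return the positions of the k smallest distances (ties broken by position)
kNearestIndices <- function(distance, k) {
  order(distance)[seq_len(k)]
}

# Example distances in which two points are tied at 2.0
d <- c(6.7, 2.0, 13.4, 2.0, 1.4)
kNearestIndices(d, 2)
# [1] 5 2
```

With the same input, which(d %in% sort(d)[1:2]) returns three positions (2, 4 and 5). The trade-off is that order() silently prefers the tied point that appears first in the data.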

Step by Step Implementation of the KNN Algorithm Using R Packages

Installing Packages

To implement the KNN algorithm with the ready-made knn() function, we need to install some packages: class, ggplot2, caret and GGally.

Process to install packages in RStudio

We can install packages in RStudio in two ways:

From the menu: open RStudio → click on Tools → click on Install Packages → type the package name in the Install Packages dialog → click on Install.

From the console: open RStudio → in the console type install.packages("package_name").
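All four packages can also be installed in one call. The sketch below first checks which of them are missing; requireNamespace() and install.packages() are standard R functions, while the variable names are illustrative. The install step is guarded by interactive() here, which is a choice made for this sketch so that the snippet does nothing when it is run as a non-interactive script.

```r
pkgs <- c("class", "caret", "ggplot2", "GGally")

# Find the packages that are not installed yet
isInstalled <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- pkgs[!isInstalled]

# Install only what is missing (needs an internet connection)
if (length(missingPkgs) > 0 && interactive()) {
  install.packages(missingPkgs)
}
```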

Importing Packages

In order to work with the KNN algorithm we need to load the installed packages into our script. We load packages into an R script with the function library(). The lines below load the packages class, caret, ggplot2 and GGally. The purpose of each package is discussed below.

class – A package for classification that includes functions such as knn(), knn.cv() and reduce.nn(). In this article we load it to use the function knn().
caret – A package for working with classification as well as regression problems. In this article it provides the function confusionMatrix().
ggplot2 – A package for creating graphics. It is used for data visualization.
GGally – A package that extends ggplot2 with additional plot types, which reduces the amount of code needed for some common plots.
library() – A function in R that loads the specified package into the current session. Each call loads one package, so we call library() once per package. The syntax is library(package_name).

R

library(class)
library(caret)
library(ggplot2)
library(GGally)

Accessing/Importing Dataset

After loading the required packages we need to load the data into the R script. We can do this in two ways. Let us discuss each of them.

We can load a built-in dataset with the function data(). R ships with about a hundred built-in datasets. The code below loads the built-in dataset iris, which has 150 rows and 5 columns, using data().

R

data(iris)
iris

Output:

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa

The iris data is now available in our R script. Only the first five of its 150 rows are shown above.

We can also load data with the function read.csv(), which returns the data as a data frame. We can download datasets from the kaggle.com website or other sources, or we can create our own data.
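As an illustration, the sketch below writes a small data frame to a temporary CSV file and reads it back with read.csv(). The file and its contents are made up for this example. stringsAsFactors = TRUE is passed so that the Class column is read as a factor regardless of the R version.

```r
# Create a small example file (hypothetical data)
books <- data.frame(Pages = c(167, 182), Cost = c(51, 62),
                    Class = c("White", "Black and White"))
csvPath <- tempfile(fileext = ".csv")
write.csv(books, csvPath, row.names = FALSE)

# Read the file back into a data frame
booksFromFile <- read.csv(csvPath, stringsAsFactors = TRUE)
str(booksFromFile)
```

read.csv() accepts a URL in place of a local path as well, which is how the book dataset was loaded earlier.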

Normalization

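In the KNN algorithm we rescale the variables so that they are all on the same scale. This matters because KNN is based on distances, so a variable with a large range would otherwise dominate the distance calculation. We can bring the variables to the same scale with normalization (rescaling to the range 0 to 1) or standardization. Rescaling is useful when the ranges of the variables differ a lot; it is not always necessary.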

R

normal_frame<-function(a){
  return((a-min(a))/(max(a)-min(a)))
}
iris_new_frame<-as.data.frame(lapply(iris[,-5],normal_frame))
summary(iris_new_frame)

Output:

  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000

We can observe that after normalization every variable ranges from 0 to 1.
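Standardization, the alternative to min-max normalization, rescales each variable to mean 0 and standard deviation 1 instead of the range 0 to 1. A minimal sketch using the base R function scale() follows; the name iris_std_frame is illustrative.

```r
data(iris)

# Centre each numeric column to mean 0 and scale it to standard deviation 1
iris_std_frame <- as.data.frame(scale(iris[, -5]))

round(colMeans(iris_std_frame), 10)   # all (numerically) zero
sapply(iris_std_frame, sd)            # all equal to 1
```

Which rescaling works better depends on the data; the rest of this article continues with the min-max normalized frame.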

Creating test and training data

KNN is a supervised learning algorithm, so it needs labelled training data, and we need separate test data to evaluate it. We now divide the available data into training data and test data, using 70% of the rows for training and the remaining 30% for testing. Here we create two training sets and two test sets. The first pair (train_iris and test_iris) contains the normalized features without the class column (i.e., the Species column). The second pair (train_iris_ran and test_iris_ran) contains only the class labels (i.e., the Species column) for the same rows.

R

set.seed(1234)
data_ran<-sample(1:nrow(iris_new_frame),size = nrow(iris_new_frame)*0.7,replace = FALSE)
train_iris<-iris_new_frame[data_ran,]
test_iris<-iris_new_frame[-data_ran,]
 
train_iris_ran<-iris[data_ran,5]
test_iris_ran<-iris[-data_ran,5]
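A quick sanity check on the split, sketched below, confirms the sizes of the two sets and shows how many flowers of each species landed in the training set. The snippet repeats the split on the raw iris data so that it runs on its own. The per-species counts are not listed here because they depend on the R version (the default sampling method changed in R 3.6.0).

```r
data(iris)
set.seed(1234)
data_ran <- sample(1:nrow(iris), size = nrow(iris) * 0.7, replace = FALSE)

length(data_ran)               # 105 training rows
nrow(iris) - length(data_ran)  # 45 test rows
table(iris[data_ran, 5])       # species counts in the training set
```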

Model Creating

We create the KNN model in R with the function knn(), as shown in the code below. The arguments of knn() are the training data (train), the test data (test), the class labels of the training data (cl; in this dataset the class variable is Species, the fifth column) and the value of K (k). knn() returns the predicted class for each row of the test data.

R

knnModel<-knn(train=train_iris,test=test_iris,cl=train_iris_ran,k=13)
summary(knnModel)

Output:

    setosa versicolor  virginica 
        16         16         13

Performance of model

We evaluate the performance of the model by calculating its accuracy. Accuracy tells us what percentage of the test flowers were assigned the correct species based on the sepal length, sepal width, petal length and petal width. The code below calculates the accuracy of the model.

R

accuracy<-100*sum(test_iris_ran==knnModel)/NROW(test_iris_ran)
accuracy

Output:

[1] 95.55556
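Because the accuracy depends on K, it can be useful to repeat the evaluation for several values. The sketch below is one way to do that with knn() from the class package. It rebuilds the normalized data and the split so that it runs on its own, the helper name accuracyForK and the candidate values of K are arbitrary, and the resulting accuracies are not listed because they depend on the random split.

```r
library(class)

# Rebuild the normalized data and the 70/30 split used above
data(iris)
normal_frame <- function(a) (a - min(a)) / (max(a) - min(a))
iris_new_frame <- as.data.frame(lapply(iris[, -5], normal_frame))
set.seed(1234)
data_ran <- sample(1:nrow(iris_new_frame), size = nrow(iris_new_frame) * 0.7)

# Accuracy (in percent) on the test set for a given K
accuracyForK <- function(k) {
  pred <- knn(train = iris_new_frame[data_ran, ],
              test  = iris_new_frame[-data_ran, ],
              cl    = iris[data_ran, 5], k = k)
  100 * mean(pred == iris[-data_ran, 5])
}

kValues <- c(1, 5, 9, 13)
accuracies <- sapply(kValues, accuracyForK)
data.frame(k = kValues, accuracy = accuracies)
```

Comparing K on the test set like this is only a quick check; choosing K on the same data that is used to report the final accuracy makes that accuracy optimistic.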

We can also obtain further performance measures by creating the confusion matrix for the model. In R we can create the confusion matrix with the function confusionMatrix(). This function is available only when the caret package is installed and loaded.

R

table(knnModel,test_iris_ran)
confusionMatrix(table(knnModel,test_iris_ran))

Output:

            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12
> confusionMatrix(table(knnModel,test_iris_ran))
Confusion Matrix and Statistics
            test_iris_ran
knnModel     setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         15         1
  virginica       0          1        12
Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3556          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.933           
                                          
 Mcnemar's Test P-Value : NA              
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9375           0.9231
Specificity                 1.0000            0.9655           0.9688
Pos Pred Value              1.0000            0.9375           0.9231
Neg Pred Value              1.0000            0.9655           0.9688
Prevalence                  0.3556            0.3556           0.2889
Detection Rate              0.3556            0.3333           0.2667
Detection Prevalence        0.3556            0.3556           0.2889
Balanced Accuracy           1.0000            0.9515           0.9459
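The headline numbers in this report can be reproduced by hand from the confusion matrix. The sketch below types in the counts from the output above (rows are predictions, columns are the true species) and recomputes the accuracy and the per-class sensitivity in base R; the variable names are illustrative.

```r
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(16,  0,  0,
                0, 15,  1,
                0,  1, 12),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = species, actual = species))

# Accuracy: correctly classified flowers divided by all test flowers
sum(diag(cm)) / sum(cm)              # 43 / 45 = 0.9556

# Sensitivity per class: correct predictions divided by the true class total
round(diag(cm) / colSums(cm), 4)     # 1.0000 0.9375 0.9231
```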

Visualization

R

ggplot(aes(Sepal.Length,Petal.Width),data=iris)+
geom_point(aes(color=factor(Species)))

Output:

[Figure: kNN algorithm in R from scratch. Scatter plot of Sepal.Length against Petal.Width, with points coloured by Species.]

Applications of KNN Algorithm

KNN algorithm is used for classifying images in image recognition.
KNN algorithm can be used in text categorization tasks.
It is useful for the detection of spam messages and spam mails.
KNN algorithm can also be used for stock prediction, house price prediction, weather prediction, market segmentation and real estate.
KNN algorithm can be used for the identification of fraudulent activities in financial transactions.
It can be used for the detection of unusual network traffic patterns.
KNN algorithm can be used in drug discovery and disease diagnosis.
It is helpful in the recognition of handwriting and face patterns.
KNN algorithm is useful in robotics, for example in robot navigation and motion planning.

Advantages of KNN Algorithm

KNN algorithm is a simple algorithm.
It is an easy algorithm to implement.
KNN algorithm is a lazy learning algorithm. It doesn’t have a training phase.
Because KNN is a lazy learning algorithm and does all its work at prediction time, new training data can be added without retraining, which makes it suitable for dynamic and changing datasets.
KNN algorithm is versatile. It can be applied to both regression and classification problems.
KNN algorithm can deal with both qualitative and quantitative data (i.e., categorical and numerical data), provided a suitable distance measure is used.
With a sufficiently large value of K, a single outlier has limited influence on a prediction, because the label is decided by a majority vote.
KNN algorithm can capture complex patterns, because its predictions follow the local structure of the data.

Disadvantages of KNN Algorithm

Prediction is computationally expensive, because the distance from the test point to every training point has to be calculated.
It requires more space, because the whole training dataset has to be stored.
The performance of the algorithm decreases as the number of dimensions of the dataset increases.
The performance of the algorithm also depends on the value of K. A small value of K makes the predictions sensitive to noise, while a large value of K smooths over local patterns and can hide small classes.
This algorithm is sensitive to noisy data, outliers and irrelevant features, especially when K is small.

Conclusion

In this article we have learned about the KNN algorithm and the steps to implement it. We have also learned how to implement the KNN algorithm in the R programming language, and we have looked at its applications, advantages and disadvantages.

Referred: https://www.geeksforgeeks.org
