In this article, we discuss what the KNN algorithm is, how it is coded in the R programming language, and its applications, advantages, and disadvantages.

kNN algorithm in R

KNN stands for K-nearest neighbours. It is a supervised learning algorithm that can be used for both classification and regression tasks, and it is one of the simplest algorithms applied in machine learning, data analytics, and data science. The KNN algorithm assigns a label to each test data point based on the class labels of the closest points in the training data set. It is called a lazy learning algorithm because no model is built during training: the training data is simply stored, and all computation happens at prediction time. KNN can be applied to both categorical and numerical data.

Let us now discuss the steps for implementing the KNN algorithm and how to assign a class label to a test data point based on the training dataset.
Example of KNN Algorithm

Let us now work through an example of the K Nearest Neighbour algorithm. The table below represents the training dataset. The first column is the serial number, the second column is the number of pages in a book, the third column is the cost of the book, and the fourth column is the class of the book. The class names are white and black and white. A book is categorized as white or black and white based on its number of pages and its cost.
The above table represents the training dataset, whose class labels are white and black and white. The class labels are assigned based on the cost of the book and the number of pages.

Test data: S.No: 10, Number of pages: 170, Cost of book: 57, Class: ?

For the above test data, we need to identify the class label from the training data set in the table by following the steps of the KNN algorithm. Let us set K to 3 (i.e., k = 3).

Euclidean distance = √((x2 - x1)^2 + (y2 - y1)^2)

Now let us calculate the Euclidean distance from the test data point to every training data point. The fourth column in the table below holds the Euclidean distance computed with the formula above, where (x2, y2) is the test data point (x2 = number of pages, y2 = cost of book of the test data) and (x1, y1) is a training data point (x1 = number of pages, y1 = cost of book of the training data).
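As a small illustration of the distance formula, the R sketch below computes the Euclidean distance between the test point from the text (170 pages, cost 57) and one training point. The training point (166 pages, cost 54) and the function name `euclidean_distance` are made-up placeholders, not values from the article's table.

```r
# Euclidean distance between two points given as numeric vectors.
euclidean_distance <- function(p, q) {
  sqrt(sum((p - q)^2))
}

test_point  <- c(pages = 170, cost = 57)  # test data point from the text
train_point <- c(pages = 166, cost = 54)  # hypothetical training point (assumption)

d <- euclidean_distance(train_point, test_point)
print(d)  # sqrt(4^2 + 3^2) = 5
```

Repeating this calculation for every training row gives the distance column described above.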
Now let us sort the table by distance in ascending order, so that the training points closest to the test point come first. The table below shows the rows rearranged in that order.
The table below shows the class labels for different values of k.
The given value of k is 3. For k = 3, we look at the class labels of the three nearest training points. Based on the training dataset, we assign the class label black and white to the test data point, as it is the predominant class label among these three neighbours. The class label for the test data point is therefore shown in the table below along with the training dataset.

Step-by-step explanation of the KNN algorithm code from scratch

Let us now implement the above example in R from scratch.

Taking the dataset as input

The code below uses an external dataset with 10 observations and 4 variables: Serial Number, Number of Pages, Cost of Book, and Class. The value of Class is either white or black and white, depending on the number of pages and the cost of the book. We can load the data with the function read.csv(). R also ships with some built-in datasets, but here we use an external one. Many datasets can be found through Google or on sites such as kaggle.com, and can be downloaded from https://www.kaggle.com.

Dataset Link: Example Dataset
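The loading step might look like the following sketch. It reads the data from an inline string with `read.csv(text = ...)` instead of a file path so that the block stands on its own. Only the last row (S.No 10, 170 pages, cost 57, class unknown) comes from the example; rows 1 to 9 and the variable names `csv_text` and `books` are made-up placeholders, not the article's dataset.

```r
# Placeholder CSV content: only row 10 comes from the text; rows 1-9 are
# invented for illustration and are NOT the article's real dataset.
csv_text <- "S.No,Number of Pages,Cost of Book,Class
1,100,30,white
2,120,35,white
3,110,32,white
4,130,40,white
5,150,50,black and white
6,160,55,black and white
7,180,60,black and white
8,175,58,black and white
9,140,45,white
10,170,57,"

# read.csv() replaces the spaces in the header with dots (Number.of.Pages, ...).
# na.strings = "" turns the empty Class of the test row into NA.
books <- read.csv(text = csv_text, na.strings = "")

head(books)  # first six rows
```

With a real file, the `text` argument would be replaced by the path to the downloaded CSV.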
Output: S.No Number.of.Pages Cost.of.Book Class

We have set aside the last row of the data as the test data and kept the remaining rows as the training data. We predict the class of this test row with the KNN algorithm. The code below splits the dataset into training data and test data.
Output: S.No Number.of.Pages Cost.of.Book Class

We can inspect and analyze the data with functions such as str() and summary() in R.
Output: S.No Number.of.Pages Cost.of.Book Class

The summary() function gives a summary of the whole dataset.
Output: 'data.frame': 10 obs. of 4 variables:

The str() function in R displays the internal structure of an object. For a data frame, it reports the number of observations and variables, the column names, the type of each column, and the first few values.

Selecting the value of K and calculating the Euclidean distance

In the KNN algorithm, we predict the class of a data point based on the value of K. A common rule of thumb is to set K to the square root of the number of observations. Our data has 10 observations, so K is √10 ≈ 3.16, which we round to 3. We can also store the value of k in code.

√((x2-x1)^2+(y2-y1)^2) is the formula for the Euclidean distance, where (x1, y1) is a training data point and (x2, y2) is the test data point. We compute the Euclidean distance from the test data point to each training data point. The code below defines the function that calculates the Euclidean distance.
Output: [1] 18.52026

Above, we created a function to calculate the distance between two points. To check that the function works, we called it with the values x and y, where x and y are data frames of equal length.

Complete implementation of KNN algorithm
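A from-scratch implementation along the lines described above could be sketched as follows. This is an illustration under assumptions, not the article's original code: the training rows are invented placeholders (only the test row of 170 pages and cost 57 comes from the text), and the helper names `euclidean_distance` and `knn_predict` are hypothetical.

```r
# Hypothetical stand-in for the book dataset: only the test row
# (170 pages, cost 57) comes from the text; the training rows are invented.
train <- data.frame(
  Number.of.Pages = c(100, 120, 110, 130, 150, 160, 180, 175, 140),
  Cost.of.Book    = c(30, 35, 32, 40, 50, 55, 60, 58, 45),
  Class           = c("white", "white", "white", "white",
                      "black and white", "black and white",
                      "black and white", "black and white", "white")
)
test <- data.frame(Number.of.Pages = 170, Cost.of.Book = 57)

# Euclidean distance between two numeric vectors of equal length.
euclidean_distance <- function(p, q) sqrt(sum((p - q)^2))

# Predict the class of one test row by majority vote among the k nearest rows.
knn_predict <- function(train, test_row, k) {
  features  <- c("Number.of.Pages", "Cost.of.Book")
  q         <- unlist(test_row[features])
  distances <- apply(train[, features], 1, euclidean_distance, q = q)
  nearest   <- order(distances)[seq_len(k)]   # indices of the k smallest distances
  votes     <- table(train$Class[nearest])    # count labels among the neighbours
  names(votes)[which.max(votes)]              # most frequent label (first on ties)
}

k <- round(sqrt(nrow(train) + nrow(test)))    # sqrt(10) rounds to 3
predicted <- knn_predict(train, test, k)
print(predicted)  # "black and white" for these placeholder rows
```

Sorting with `order()` corresponds to arranging the distance table in ascending order, and `table()` performs the majority vote among the k nearest rows.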
Output: S.No Number.of.Pages Cost.of.Book

The test row above is predicted as black and white based on the training data.

Step by Step Explanation of the KNN algorithm

Installing Packages

To implement the KNN algorithm in R, we need to install some packages, including class, ggplot2, caret, and GGally. We can install packages in RStudio in two ways:
Importing Packages

To work with the KNN algorithm, we need to load the installed packages into our script with the function library(). The lines below load the packages class, caret, ggplot2, and GGally, each of which serves a different purpose. The purpose of each package is discussed below.
Accessing/Importing Dataset

After loading the required packages, we need to load the data into the R script. We can load the data in two ways. Let us discuss each of them.
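As one illustration, a dataset that ships with R can be loaded directly. The column names in the output that follows match the built-in iris data frame, so this sketch assumes iris is the dataset being used.

```r
# Load the built-in iris dataset (150 flowers, 4 measurements + Species).
data(iris)

head(iris)  # first six rows
str(iris)   # dimensions and column types
```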
Output: Sepal.Length Sepal.Width Petal.Length Petal.Width Species

We have now loaded the data into our R script.
Normalization

In the KNN algorithm, we use normalization to bring all variables of the data onto the same scale, because distances are otherwise dominated by the variables with the largest values. We can rescale the data by either normalization or standardization. Normalization matters most when the variables differ widely in range; it is not always necessary.
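A common choice is min-max normalization, which rescales each column to the range 0 to 1. The sketch below assumes that this is the kind of normalization meant here and applies it to the four numeric columns of the built-in iris data; the names `normalize` and `iris_norm` are illustrative.

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1].
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply it to the four numeric columns of iris; Species (column 5) is left out.
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))

summary(iris_norm)  # every column now has minimum 0 and maximum 1
```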
Output: Sepal.Length Sepal.Width Petal.Length Petal.Width

We observe that the normalization function produces output in which all variables lie on the same scale.

Creating test and training data

KNN is a supervised learning algorithm, so it needs labelled training data, and test data on which to evaluate its predictions. Supervised learning algorithms learn from previously available labelled data. We now divide our data into training data and test data, using 70% of the data for training and the remaining 30% for testing. Here we create two training sets and two test sets. The first pair excludes the class column (i.e., the Species column). The second pair includes the class column (i.e., the Species column).
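The 70/30 split described above can be sketched as follows. The seed value and all variable names are assumptions chosen for illustration, and the second pair is reduced here to the Species labels alone, since that is all the classifier needs from it.

```r
set.seed(123)  # arbitrary seed so the random split is repeatable (assumption)

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))

# Randomly pick 70% of the row indices for training.
train_idx <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))

# First pair: features only (no Species column).
train_x <- iris_norm[train_idx, ]
test_x  <- iris_norm[-train_idx, ]

# Second pair: the class labels for the same rows.
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

dim(train_x)  # 105 rows, 4 columns
dim(test_x)   # 45 rows, 4 columns
```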
Model Creation

We create the KNN model in R with the function knn(). In the code below, knn() is given the training data set, the test data set, the class labels of the training data (in this dataset, the class variable is Species, in the fifth column), and the value of K.
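A call to knn() from the class package could look like the sketch below. The text does not state the value of K used for iris, so k is set here by the square-root rule of thumb mentioned earlier; that choice, the seed, and the variable names are assumptions.

```r
library(class)  # provides knn(); 'class' is a recommended package bundled with R

set.seed(123)   # arbitrary seed for a repeatable split (assumption)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))
train_idx <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))

k <- round(sqrt(nrow(iris)))  # rule of thumb: sqrt(150) rounds to 12 (assumption)

# knn() takes the training features, the test features, the training labels,
# and k, and returns a factor of predicted labels for the test rows.
pred <- knn(train = iris_norm[train_idx, ],
            test  = iris_norm[-train_idx, ],
            cl    = iris$Species[train_idx],
            k     = k)

head(pred)
```

Because KNN is a lazy learner, knn() does not return a fitted model object; it returns the predictions for the test rows directly.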
Output: setosa versicolor virginica

Performance of model

We evaluate the performance of the model by calculating its accuracy. Accuracy tells us how often we correctly predict the species from the sepal length, sepal width, petal length, and petal width. The code below shows how to calculate the accuracy of the model.
Output: [1] 95.55556

We can also obtain further performance measures by creating the confusion matrix for the model. In R, we can create the confusion matrix with the function confusionMatrix(). This function is available only when the caret package is installed and loaded.
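The evaluation step can be sketched with base R alone: table() builds the confusion matrix, and the accuracy is the share of its diagonal. The seed, the value of k, and the variable names are assumptions, so the accuracy printed by this sketch depends on the random split and need not equal the figure quoted above. The call to caret's confusionMatrix() runs only if caret is installed.

```r
library(class)

set.seed(123)  # arbitrary seed for a repeatable split (assumption)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], normalize))
train_idx <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))

pred <- knn(train = iris_norm[train_idx, ],
            test  = iris_norm[-train_idx, ],
            cl    = iris$Species[train_idx],
            k     = 12)             # k chosen by the square-root rule (assumption)
actual <- iris$Species[-train_idx]

# Confusion matrix: rows are predictions, columns are the true species.
conf_mat <- table(Predicted = pred, Actual = actual)
print(conf_mat)

# Accuracy (%) = correctly classified test rows / all test rows * 100.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) * 100
print(accuracy)

# With caret installed, confusionMatrix() reports further statistics.
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred, reference = actual))
}
```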
Output: test_iris_ran

Visualization
Output: (figure: kNN algorithm in R from scratch)

Applications of KNN Algorithm
Advantages of KNN Algorithm
Disadvantages of KNN Algorithm
Conclusion

In this article, we learned about the KNN algorithm and the steps to implement it in the R programming language. We also covered its applications, advantages, and disadvantages.
Referred: https://www.geeksforgeeks.org