In this article, we discuss what noisy data is and perform classification on a large and noisy dataset with the R programming language.

What is noisy data?
Noise in data refers to random or irrelevant information that interferes with the analysis or interpretation of the data. It comes in different forms, including errors, inconsistencies, outliers, and irrelevant features, all of which make it harder to extract meaningful insights or build accurate models.

Methods to identify noise in a dataset
Identifying noise in a dataset involves various techniques, depending on the nature of the data and the specific types of noise present.
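As one illustration of such a technique, the sketch below flags numeric outliers with the interquartile-range (IQR) rule. It is not taken from the original article: the helper name `flag_outliers_iqr`, the multiplier of 1.5, and the sample values are all assumptions chosen for illustration.

```r
# Flag values lying more than k * IQR below the first quartile or above
# the third quartile. `flag_outliers_iqr` is a hypothetical helper, not
# part of any package.
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  # Missing values are reported as not flagged
  !is.na(x) & (x < lower | x > upper)
}

# Small made-up vector with one implausible reading
readings <- c(9.4, 9.3, 8.2, 8.7, 9.2, 7.7, 95.0)
outlier_mask <- flag_outliers_iqr(readings)
readings[outlier_mask]
```

Whether a flagged value is genuine noise or a rare but valid observation still has to be judged against domain knowledge.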
When classifying a large and noisy dataset in R, some techniques, such as the Random Forest used in this article, can handle both the scale of the data and the noise effectively.
Here we use a real dataset, the “Weather History” dataset. Dataset link: weatherHistory
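A loading step that would produce output of the kind shown next might look like the following sketch. The helper name `load_weather` and the file name `weatherHistory.csv` are assumptions, since the original listing is not preserved. `stringsAsFactors = TRUE` is set because the `str()` output in this article shows text columns as factors, which is no longer the default in R 4.0 and later.

```r
# Read the Weather History CSV into a data frame.
# `load_weather` and the file name "weatherHistory.csv" are assumptions;
# point `path` at wherever the downloaded file actually lives.
load_weather <- function(path) {
  # stringsAsFactors = TRUE turns text columns into factors
  read.csv(path, stringsAsFactors = TRUE)
}

csv_path <- "weatherHistory.csv"
if (file.exists(csv_path)) {
  weather_data <- load_weather(csv_path)
  print(dim(weather_data))   # number of rows and columns
  print(head(weather_data))  # first six rows
  str(weather_data)          # column types
}
```

Note that `read.csv()` rewrites column names into syntactically valid R names, which is why a header such as `Temperature (C)` appears as `Temperature..C.` in the output.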
Output: [1] 96453 12
Formatted.Date Summary Precip.Type Temperature..C.
1 2006-04-01 00:00:00.000 +0200 Partly Cloudy rain 9.472222
2 2006-04-01 01:00:00.000 +0200 Partly Cloudy rain 9.355556
3 2006-04-01 02:00:00.000 +0200 Mostly Cloudy rain 9.377778
4 2006-04-01 03:00:00.000 +0200 Partly Cloudy rain 8.288889
5 2006-04-01 04:00:00.000 +0200 Mostly Cloudy rain 8.755556
6 2006-04-01 05:00:00.000 +0200 Partly Cloudy rain 9.222222
Apparent.Temperature..C. Humidity Wind.Speed..km.h. Wind.Bearing..degrees.
1 7.388889 0.89 14.1197 251
2 7.227778 0.86 14.2646 259
3 9.377778 0.89 3.9284 204
4 5.944444 0.83 14.1036 269
5 6.977778 0.83 11.0446 259
6 7.111111 0.85 13.9587 258
Visibility..km. Loud.Cover Pressure..millibars. Daily.Summary
1 15.8263 0 1015.13 Partly cloudy throughout the day.
2 15.8263 0 1015.63 Partly cloudy throughout the day.
3 14.9569 0 1015.94 Partly cloudy throughout the day.
4 15.8263 0 1016.41 Partly cloudy throughout the day.
5 15.8263 0 1016.51 Partly cloudy throughout the day.
6 14.9569 0 1016.66 Partly cloudy throughout the day.
'data.frame': 96453 obs. of 12 variables:
$ Formatted.Date : Factor w/ 96429 levels "2006-01-01 00:00:00.000 +0100",..: 2160 2161 2162 2163 2164
$ Summary : Factor w/ 27 levels "Breezy","Breezy and Dry",..: 20 20 18 20 18 20 20 20 20 20 ...
$ Precip.Type : Factor w/ 3 levels "null","rain",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Temperature..C. : num 9.47 9.36 9.38 8.29 8.76 ...
$ Apparent.Temperature..C.: num 7.39 7.23 9.38 5.94 6.98 ...
$ Humidity : num 0.89 0.86 0.89 0.83 0.83 0.85 0.95 0.89 0.82 0.72 ...
$ Wind.Speed..km.h. : num 14.12 14.26 3.93 14.1 11.04 ...
$ Wind.Bearing..degrees. : num 251 259 204 269 259 258 259 260 259 279 ...
$ Visibility..km. : num 15.8 15.8 15 15.8 15.8 ...
$ Loud.Cover : num 0 0 0 0 0 0 0 0 0 0 ...
$ Pressure..millibars. : num 1015 1016 1016 1016 1017 ...
$ Daily.Summary : Factor w/ 214 levels "Breezy and foggy starting in the evening
First load the necessary libraries: randomForest for modeling and ggplot2 for visualization. Then read the weather dataset from a CSV file and explore its structure using str().

Temperature Distribution visualization
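A histogram of the temperature column could be drawn along these lines. This is only a sketch: the helper name `plot_temperature`, the bin width, and the colours are assumptions, and it expects a data frame named `weather_data` with a `Temperature..C.` column to be present in the session.

```r
# Histogram of the temperature column using ggplot2.
# `plot_temperature` is a hypothetical helper; bin width and colours are
# arbitrary choices, not taken from the original article.
plot_temperature <- function(df) {
  ggplot2::ggplot(df, ggplot2::aes(x = Temperature..C.)) +
    ggplot2::geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
    ggplot2::labs(title = "Temperature Distribution",
                  x = "Temperature (C)", y = "Frequency")
}

# Draw the plot only when the data and ggplot2 are both available
if (exists("weather_data") && requireNamespace("ggplot2", quietly = TRUE)) {
  print(plot_temperature(weather_data))
}
```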
Output: [Figure: temperature distribution plot]

Classification on a large and noisy dataset with R
Data preprocessing on a large and noisy dataset
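The preprocessing could be sketched as below. The helper name `prepare_weather` is an assumption, and the column names follow the `str()` output shown earlier (`Formatted.Date`, `Daily.Summary`). The final line counts the missing values that remain, which is one plausible source of a `[1] 0` result.

```r
# Preprocessing sketch; `prepare_weather` is a hypothetical helper.
prepare_weather <- function(df) {
  # Treat the target as a categorical variable
  df$Summary <- as.factor(df$Summary)
  # Drop the timestamp and the free-text daily summary
  keep <- !(names(df) %in% c("Formatted.Date", "Daily.Summary"))
  df <- df[, keep, drop = FALSE]
  # Remove rows that contain any missing value
  na.omit(df)
}

if (exists("weather_data")) {
  weather_data <- prepare_weather(weather_data)
  print(sum(is.na(weather_data)))  # count of remaining missing values
}
```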
Output: [1] 0
Convert the ‘Summary’ column to a factor (categorical variable). Remove unnecessary columns (‘Formatted Date’, ‘Daily Summary’). Handle missing values by removing rows with any missing data.

Split the dataset into training and testing sets
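A minimal sketch of the split and the model fit follows. The original listing is not preserved, so the helper names `split_train_test` and `fit_forest`, the seed, and the exact set of predictors are assumptions. The sketch also expects a preprocessed data frame named `weather_data`. `ntree = 500` is the `randomForest()` default.

```r
# Split rows at random into training and test sets.
# `split_train_test`, the seed and `fit_forest` are illustrative assumptions.
split_train_test <- function(df, train_frac = 0.8, seed = 123) {
  set.seed(seed)  # fixes the random split (this resets the global RNG state)
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  test_idx <- setdiff(seq_len(nrow(df)), train_idx)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Fit a Random Forest with Summary as target and all other columns as predictors.
fit_forest <- function(train, ntree = 500) {
  # randomForest() rejects factor levels with no observations, so rebuild
  # the factor to keep only the classes present in the training rows
  train$Summary <- factor(train$Summary)
  randomForest::randomForest(Summary ~ ., data = train, ntree = ntree)
}

if (exists("weather_data") && requireNamespace("randomForest", quietly = TRUE)) {
  parts <- split_train_test(weather_data)
  rf_model <- fit_forest(parts$train)
  print(summary(rf_model))
}
```

Fitting 500 trees on tens of thousands of rows can take a while; lowering `ntree` is a reasonable way to get a quick first result.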
Output: Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 77162 factor numeric
err.rate 14000 -none- numeric
confusion 756 -none- numeric
votes 2083374 matrix numeric
oob.times 77162 -none- numeric
classes 27 -none- character
importance 8 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 77162 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
Split the dataset into training and testing sets (80% training, 20% testing). Train a Random Forest model using the training data (randomForest() function), specifying the target variable (‘Summary’) and all other columns as predictors.

Predict on the test set
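Prediction and the accuracy calculation might be sketched as follows. The helper name `classification_accuracy` is an assumption, as are the object names `rf_model` (a fitted randomForest model) and `parts` (a list holding the `train` and `test` data frames), which are expected to exist in the session already.

```r
# Share of predictions that match the true labels.
# `classification_accuracy` is a hypothetical helper; labels are compared as
# character strings so that differing factor level sets do not raise an error.
classification_accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Assumes `rf_model` and `parts` already exist in the session
if (exists("rf_model") && exists("parts")) {
  predictions <- predict(rf_model, newdata = parts$test)
  cat("Accuracy:", classification_accuracy(predictions, parts$test$Summary), "\n")
}
```

With 27 classes of very different frequencies, overall accuracy hides per-class behaviour, so a confusion matrix built with `table(predictions, parts$test$Summary)` is a useful complement.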
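Output: Accuracy: 0.5496864
Make predictions on the test set using the trained model (predict() function) and compare them with the true labels to compute the accuracy. Here the model classifies about 55% of the test observations correctly.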

Conclusion
In short, classifying large and noisy datasets in R requires preprocessing to handle missing values and noise, selecting robust algorithms like Random Forest or SVMs, evaluating performance using metrics and visualizations, and ensuring model stability. These steps are crucial for accurate classification despite the challenges posed by noisy data.
Referred: https://www.geeksforgeeks.org