![]() |
An outlier is a data point that differs significantly from other data points. This significant difference can arise due to many circumstances, be it an experimental error or mistake or a difference in measurements. In this article, we will review one of the types of outliers: global outliers. In data analysis, it is essential to comprehend and recognize global outliers. Understanding the overall distribution of the data and spotting any outliers both depend heavily on visual inspection. Additionally, the visualization sheds light on the potential effects of outliers on the relationship between characteristics and the target variable. What is an outlier?Data points in a dataset that substantially differ from the rest of the data are called outliers. Any kind of data, including time series, categories, and numerical data, may contain them. It is crucial to comprehend and manage outliers properly since they can significantly affect statistical analysis and machine learning models. Reason For OutliersOutliers in a dataset can occur for several reasons.
Global OutlierGlobal Outlier (also referred to as Point Anomaly), is when a single data point or observation is very different than the usual pattern. For example, consider a scenario where 98 out of 100 scores lie between 200 and 350, but the remaining 2 points have values of 600 and 720. In this case, the data point with a value of 720 stands out as a potential global outlier. Such data points usually stand out to other data points. Those outliers are at the extremes of the mappings, irregulars in the observations. Global outliers will do the same thing as any other outlier, i.e., it will be responsible for skewing the data distribution and affect the model performance of the machine learning model is it getting used by. But handling them with absolute care is necessary because sometimes those outliers can be crucial in identifying any trend. Global Outliers Detection MethodsThere are different different methods to detect and remove the outliers. Some of them are as follows: 1. Distance based Outlier Detections:Distance-based outlier detection methods identify outliers in a dataset based on the distances between data points. These methods rely on the assumption that outliers are far away from the majority of the data points. Here are some common distance-based outlier detection techniques:
2. Z-ScoresGlobal outliers can be found mathematically using a variety of statistical techniques. Using z-scores, which calculate how much a data point deviates from the dataset mean in terms of standard deviations, is one method. The following formula is used to determine a data point x’s z-score: where 3. Interquartile Range (IQR)The interquartile range (IQR), which calculates the spread of the middle 50% of the data, is an alternative method.
Managing Global outlierWhen global outliers are found, there are various methods for managing them. Removing the outliers from the dataset is one strategy, but caution must be used to prevent eliminating legitimate data points. Altering the data using methods like winsorizing, which swaps out extreme values with less extreme ones, or log transformations, which can lessen the effect of outliers on statistical analysis and machine learning models, is an additional strategy. Importance of Detecting OutlierMachine learning models and statistical analysis are susceptible to major disruptions from outliers. In statistical analysis, for instance, anomalies have the potential to distort the mean and standard deviation, resulting in imprecise estimations of central tendency and variability. Outliers in machine learning models can skew the findings by exerting an excessive amount of influence on the model’s predictions. Outlier Detection is an important process in identifying the patterns and the “story” a dataset holds. Some of the important significance of Outlier Detection is as follows:
Implementation of Global OutlierLet’s illustrate this with an example We will do a demonstration on the California Housing dataset. We will first load the necessary libraries needed. Python
The provided code snippet imports necessary libraries such as numpy, pandas, and matplotlib for data manipulation and visualization. It also imports the Loading DatasetAfter we have loaded the libraries, we need to load the California housing dataset and do a describe on the dataset. Describe is a great method to find any anomalies in the dataset as it gives the measure of central tendency about the data observations. Python
Output: MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
AveOccup Latitude Longitude
count 20640.000000 20640.000000 20640.000000
mean 3.070655 35.631861 -119.569704
std 10.386050 2.135952 2.003532
min 0.692308 32.540000 -124.350000
25% 2.429741 33.930000 -121.800000
50% 2.818116 34.260000 -118.490000
75% 3.282261 37.710000 -118.010000
max 1243.333333 41.950000 -114.310000
In order to help with the identification of possible outliers and the evaluation of the general qualities of the data, this summary offers a preliminary grasp of the distribution and features of the dataset. Here, we can see that the describe function provides some really interesting facts about the dataset, which are summarised below:
Here, it is safe to assume that these columns in our dataframe have Global Outliers. Using Plots and Visualizations to detect global outliersThe describe method is not much evidence, and is not sufficient as a data professional. Hence, we will now plot the graphs for each column in our dataset. For that, we will use all the columns except “Latitude” and “Longitude” since they will not be of relevant use here. Python
Output:
Here, in the plot above, we can clearly see that some datapoints are far from the other datapoints. This holds true for our analysis from describe() where we can that AveBedrms, AveOccup and AveRooms have significantly different values between their measures of central tendency and Quartile measures. This does not mean that they are observational mistakes. This could mean that there was a house listed which was a mansion with more than 50 rooms, or which had more than 20 bed rooms or had more than 100 occupants (because it was a hostel or servants had the same place of living). These things would need further survey and investigation. But until then, we can be sure to brand these outliers as Global Outliers. Outlier Detection Using BoxplotPython3
Output: ![]() Box Plot Here, we visualize the outliers using box plots. Outlier detection using Z-ScorePython3
Python3
Output:
Outlier detection using IQR (Interquartile Range)Python3
Output:
Results for 'MedInc' column:
IQR: 2.1802999999999995
Upper Bound: 8.013849999999998
Lower Bound: -0.7073499999999995
Results for 'HouseAge' column:
IQR: 19.0
Upper Bound: 65.5
Lower Bound: -10.5
Results for 'AveRooms' column:
IQR: 1.6116969213354757
Upper Bound: 8.469926334384166
Lower Bound: 2.023138649042263
Results for 'AveBedrms' column:
IQR: 0.0934531669715235
Upper Bound: 1.2397058168079962
Lower Bound: 0.8658931489219022
Results for 'Population' column:
IQR: 938.0
Upper Bound: 3132.0
Lower Bound: -620.0
Results for 'AveOccup' column:
IQR: 0.8525687229665166
Upper Bound: 4.561116868480889
Lower Bound: 1.1508419766148226
ConclusionGlobal outliers in data analysis can have a big effect on how a dataset is interpreted and modeled. For statistical studies and machine learning models to be accurate and reliable, global outliers must be recognized and managed. In this case, the code snippet that is provided provides a useful visual aid for illustrating how various attributes in the California housing dataset relate to the target variable, which is likely home prices. The algorithm enables a clear understanding of the relationship between specific features and house prices by generating scatter plots for each feature against the target variable. Finding possible outliers and comprehending the data’s overall distribution depend on this visual examination. It also sheds light on how outliers might affect how characteristics and the target variable relate to one another. All things considered, the code snippet-enabled visualization is a useful tool for learning about the dataset and making defensible choices about how to handle global outliers. |
Reffered: https://www.geeksforgeeks.org
AI ML DS |
Type: | Geek |
Category: | Coding |
Sub Category: | Tutorial |
Uploaded by: | Admin |
Views: | 14 |