Detecting outliers when fitting data with nonlinear regression

Nonlinear regression models complex relationships between variables, but outliers can distort the fit, leading to inaccurate parameter estimates and unreliable predictions. Detecting and managing outliers is therefore crucial for a robust analysis. This article covers three groups of techniques for identifying outliers in nonlinear regression: visual inspection, statistical diagnostics, and robust fitting methods.

Table of Contents

Understanding Nonlinear Regression

What is Nonlinear Regression?
Importance of Outlier Detection

Methods for Detecting Outliers in Nonlinear Regression

1. Visual Inspection
2. Statistical Methods
3. Robust Regression Methods

Detecting Outliers With Nonlinear Regression: Practical Example

Initial Analysis
Applying Robust Methods
Comparing Results
Applying Robust Regression Models

Understanding Nonlinear Regression

What is Nonlinear Regression?

Nonlinear regression is a form of regression analysis in which observational data is modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. Linear regression, by contrast, requires the model to be linear in its parameters. It can still describe curves (a polynomial in x is linear in its coefficients), but it cannot fit a model such as y = a * sin(b * x), where the parameter b sits inside a nonlinear function.
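As a minimal sketch of the distinction above, the following fits a model that is nonlinear in one of its parameters using scipy.optimize.curve_fit. The sine model, the synthetic data, and the starting guess are illustrative assumptions, not part of the original example.

```python
import numpy as np
from scipy.optimize import curve_fit

def sine_model(x, amplitude, frequency):
    """Illustrative model: nonlinear in `frequency`, linear in `amplitude`."""
    return amplitude * np.sin(frequency * x)

# Synthetic data generated from known parameters (amplitude 2.5, frequency 1.5)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = sine_model(x, 2.5, 1.5) + rng.normal(0, 0.5, x.size)

# Nonlinear least squares is iterative and needs a starting guess (p0);
# a guess far from the truth can converge to a local minimum.
params, covariance = curve_fit(sine_model, x, y, p0=[2.0, 1.4])
amplitude_hat, frequency_hat = params
std_errors = np.sqrt(np.diag(covariance))

print(f"amplitude = {amplitude_hat:.3f} (se {std_errors[0]:.3f})")
print(f"frequency = {frequency_hat:.3f} (se {std_errors[1]:.3f})")
```

Unlike a linear fit, there is no closed-form solution here, which is why the starting values matter.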

Importance of Outlier Detection

Outliers are data points that deviate significantly from the overall pattern of data. In the context of nonlinear regression, outliers can have a disproportionate influence on the model, leading to biased parameter estimates and poor predictive performance. Detecting and appropriately handling outliers is essential to maintain the integrity of the regression analysis.

Methods for Detecting Outliers in Nonlinear Regression

1. Visual Inspection

Scatter Plots: Scatter plots are a simple yet effective way to visually inspect data for potential outliers. By plotting the dependent variable against the independent variable(s), you can identify points that fall far from the expected relationship.
Residual Plots: Residual plots display the residuals (differences between observed and predicted values) against the independent variable or fitted values. Outliers often manifest as points with large residuals that deviate significantly from the rest of the data.
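The two plots described above can be sketched as follows for a nonlinear fit. The sine model, the three injected outliers, and the output file name are assumptions made for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def sine_model(x, amplitude, frequency):
    return amplitude * np.sin(frequency * x)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 60)
y = sine_model(x, 2.5, 1.5) + rng.normal(0, 0.5, x.size)
planted = [10, 30, 50]            # hypothetical positions of injected outliers
y[planted] += [8.0, -8.0, 9.0]

params, _ = curve_fit(sine_model, x, y, p0=[2.0, 1.5])
fitted = sine_model(x, *params)
residuals = y - fitted            # observed minus predicted

fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: raw data with the fitted curve
ax_fit.scatter(x, y, s=15)
ax_fit.plot(x, fitted, color="red")
ax_fit.set(xlabel="x", ylabel="y", title="Data and fitted curve")

# Right panel: residuals against fitted values, with a zero reference line
ax_res.scatter(fitted, residuals, s=15)
ax_res.axhline(0, color="gray", linestyle="--")
ax_res.set(xlabel="Fitted value", ylabel="Residual", title="Residuals vs fitted")

fig.savefig("fit_and_residuals.png")  # hypothetical file name
```

In an interactive session, plt.show() can replace the savefig call. Points that sit far from the zero line in the right panel are the candidates to examine further.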

2. Statistical Methods

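Studentized Residuals: Studentized residuals are residuals divided by an estimate of their standard deviation. In linear regression, externally studentized residuals (where the point in question is left out of the variance estimate) follow a t-distribution; in nonlinear regression they are computed from a linear approximation of the model at the fitted parameters, so this holds only approximately. Points whose absolute studentized residual exceeds a chosen threshold (e.g., 2 or 3) are flagged as potential outliers.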
Cook’s Distance: Cook’s Distance measures the influence of each data point on the fitted values. Points with a Cook’s Distance greater than a certain threshold (commonly 4/n, where n is the number of data points) are considered influential and potential outliers.
Hadi’s Potential: Hadi’s Potential is a measure that combines leverage and residuals to identify influential points. It is particularly useful in nonlinear regression where leverage alone might not be sufficient to detect outliers.
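For a nonlinear model, these diagnostics are usually obtained from a linear approximation: the Jacobian of the model at the fitted parameters plays the role of the design matrix. The sketch below follows that approach with scipy.optimize.least_squares. The model, data, and thresholds are illustrative assumptions, and the last quantity is the potential function h / (1 - h) as it is commonly defined in Hadi's influence measure, not a library-provided statistic.

```python
import numpy as np
from scipy.optimize import least_squares

def model(params, x):
    amplitude, frequency = params
    return amplitude * np.sin(frequency * x)

def residual_fn(params, x, y):
    return model(params, x) - y

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 80)
y = model([2.5, 1.5], x) + rng.normal(0, 0.5, x.size)
y[20] += 6.0                                   # one planted outlier

fit = least_squares(residual_fn, x0=[2.0, 1.5], args=(x, y))
n, p = fit.jac.shape                           # Jacobian at the solution: n rows, p columns
resid = -fit.fun                               # observed minus fitted

# Leverage: diagonal of J (J^T J)^-1 J^T, computed stably through a QR factorization
Q, _ = np.linalg.qr(fit.jac)
leverage = np.sum(Q**2, axis=1)

# Internally and externally studentized residuals
s2 = np.sum(resid**2) / (n - p)
student_int = resid / np.sqrt(s2 * (1 - leverage))
student_ext = student_int * np.sqrt((n - p - 1) / (n - p - student_int**2))

# Cook's distance and the potential function h / (1 - h)
cooks = student_int**2 * leverage / (p * (1 - leverage))
potential = leverage / (1 - leverage)

flagged = np.where((np.abs(student_ext) > 3) | (cooks > 4 / n))[0]
print("Flagged indices:", flagged)
```

Because the leverages come from a local linearization, they are approximations; for strongly curved models they are best treated as a screening tool rather than a formal test.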

3. Robust Regression Methods

Least Absolute Deviations (LAD): Least Absolute Deviations minimizes the sum of absolute residuals rather than the sum of squared residuals. This method is less sensitive to outliers and provides a robust alternative to ordinary least squares (OLS) regression.
M-Estimation: M-Estimation generalizes maximum likelihood estimation by using a loss function that reduces the influence of outliers. Common M-estimators include Huber’s T and Tukey’s Biweight.
Least Trimmed Squares (LTS): Least Trimmed Squares regression minimizes the sum of the h smallest squared residuals, where h is chosen between n/2 and n. The n - h largest residuals, which are the ones most likely to come from outliers, do not enter the objective at all, so the method gives robust parameter estimates in the presence of outliers.
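For a model that is genuinely nonlinear in its parameters, one way to apply the M-estimation idea is the loss argument of scipy.optimize.least_squares. The sketch below compares the default squared loss with the "huber" and "soft_l1" losses (the latter is a smooth approximation of absolute-value loss, not exact LAD). The model, the planted outliers, and f_scale=1.0 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

def residual_fn(params, x, y):
    amplitude, frequency = params
    return amplitude * np.sin(frequency * x) - y

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 2.5 * np.sin(1.5 * x) + rng.normal(0, 0.5, x.size)
y[[10, 52, 93]] += 10.0          # planted outliers near the peaks of the curve

start = [2.0, 1.5]

# Ordinary nonlinear least squares (loss="linear" is the default)
ols_fit = least_squares(residual_fn, start, args=(x, y))

# Robust losses: residuals larger than roughly f_scale are down-weighted
huber_fit = least_squares(residual_fn, start, args=(x, y), loss="huber", f_scale=1.0)
soft_l1_fit = least_squares(residual_fn, start, args=(x, y), loss="soft_l1", f_scale=1.0)

for name, fit in [("OLS", ols_fit), ("Huber", huber_fit), ("soft L1", soft_l1_fit)]:
    print(f"{name:8s} amplitude = {fit.x[0]:.3f}, frequency = {fit.x[1]:.3f}")
```

The f_scale value sets the residual size at which a point starts to be treated as suspicious, so it should be on the order of the expected noise level.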

Detecting Outliers With Nonlinear Regression: Practical Example

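Data Description: Consider a dataset involving the fraction of breast cancer patients with metastases as the response variable and tumor size as the predictor variable. The code below does not load real patient data: it generates a synthetic noisy sine curve, appends three artificial outliers, and labels the columns tumor_size and metastasis_fraction to illustrate the impact of outliers on the regression analysis.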

Initial Analysis

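Fit a Nonlinear Model: Fit a curved model to the data using ordinary least squares regression. The code below uses a quadratic in tumor size, which captures curvature while remaining linear in its parameters.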
Visual Inspection: Use scatter plots and residual plots to identify potential outliers.

Applying Robust Methods

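Robust Regression: Apply robust regression methods such as LAD, M-Estimation, and LTS to fit the model. The code later in this example fits the first two: LAD as median regression and M-Estimation with the Huber norm.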
Outlier Detection: Use statistical methods like Studentized Residuals, Cook’s Distance, and Hadi’s Potential to identify outliers.
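Least Trimmed Squares can be approximated with random starts followed by concentration steps (refit on the h points with the smallest squared residuals until the subset stops changing). The following is a minimal sketch of that idea for a quadratic model; the function name lts_fit, the choice h = 75, and the synthetic data are assumptions for illustration, and the search is a heuristic rather than an exact LTS solver.

```python
import numpy as np

def lts_fit(X, y, h, n_starts=50, max_steps=20, seed=0):
    """Approximate LTS fit: random elemental starts plus concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_objective = None, np.inf
    for _ in range(n_starts):
        # Start from a random subset of p points (an exact fit)
        subset = rng.choice(n, size=p, replace=False)
        for _ in range(max_steps):
            beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            squared = (y - X @ beta) ** 2
            # Concentration step: keep the h points with the smallest squared residuals
            new_subset = np.sort(np.argsort(squared)[:h])
            if np.array_equal(new_subset, subset):
                break
            subset = new_subset
        objective = np.sort(squared)[:h].sum()
        if objective < best_objective:
            best_beta, best_objective = beta, objective
    return best_beta, best_objective

# Synthetic quadratic data with a cluster of outliers at the left edge
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)
true_curve = 1.0 + 0.5 * x - 0.2 * x**2
y = true_curve + rng.normal(0, 0.3, x.size)
y[:10] += 15.0

X = np.column_stack([np.ones_like(x), x, x**2])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_lts, lts_objective = lts_fit(X, y, h=75)

print("OLS coefficients:", np.round(beta_ols, 3))
print("LTS coefficients:", np.round(beta_lts, 3))
```

A smaller h tolerates more contamination but uses less of the clean data, so it trades efficiency for robustness.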

Comparing Results

Parameter Estimates: Compare parameter estimates from ordinary least squares regression and robust regression methods.
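Model Fit: Evaluate the goodness-of-fit measures (e.g., R-squared, mean squared error) for models fitted with and without the outliers. When the error is computed on data that still contains the outliers, OLS scores best by construction because it minimizes squared error directly, so it is also worth comparing the fits on the points that were not flagged.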
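A small sketch of this comparison, assuming the outliers have already been flagged (here they are planted at known positions, which is an assumption for illustration): fit the same quadratic with and without the flagged points, then score both fits on all points and on the unflagged points only.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 100)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(0, 0.3, x.size)

# Hypothetical flags: in practice these would come from the diagnostics
is_outlier = np.zeros(x.size, dtype=bool)
is_outlier[[5, 50, 95]] = True
y[is_outlier] += 12.0

def fit_and_score(x_fit, y_fit, x_eval, y_eval):
    """Fit a quadratic on (x_fit, y_fit) and return coefficients and MSE on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_fit, y_fit, deg=2)   # highest-degree coefficient first
    mse = np.mean((y_eval - np.polyval(coeffs, x_eval)) ** 2)
    return coeffs, mse

clean = ~is_outlier
coef_all, mse_all_on_all = fit_and_score(x, y, x, y)
_, mse_all_on_clean = fit_and_score(x, y, x[clean], y[clean])
coef_clean, mse_clean_on_clean = fit_and_score(x[clean], y[clean], x[clean], y[clean])

print("Coefficients, all points:     ", np.round(coef_all, 3))
print("Coefficients, outliers removed:", np.round(coef_clean, 3))
print(f"MSE (fit on all, scored on all):     {mse_all_on_all:.3f}")
print(f"MSE (fit on all, scored on clean):   {mse_all_on_clean:.3f}")
print(f"MSE (fit on clean, scored on clean): {mse_clean_on_clean:.3f}")
```

Scoring on the unflagged points separates two effects: how much the outliers inflate the error directly, and how much they pull the fitted curve away from the bulk of the data.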

Python

# Step 1: Import the required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.robust.robust_linear_model import RLM
from statsmodels.robust.norms import HuberT
from sklearn.metrics import mean_squared_error

# Step 2: Create or load the dataset
np.random.seed(0)
n = 100
x = np.linspace(0, 10, n)
y = 2.5 * np.sin(1.5 * x) + np.random.normal(0, 0.5, n)
# Adding some outliers
x_outliers = np.append(x, [1, 2, 3])
y_outliers = np.append(y, [10, -10, 12])
data = pd.DataFrame({'tumor_size': x_outliers, 'metastasis_fraction': y_outliers})

# Step 3: Fit a curved model using ordinary least squares regression
# The squared term lets OLS capture curvature; the quadratic model is still
# linear in its parameters, so this is polynomial (not truly nonlinear) regression
data['tumor_size_squared'] = data['tumor_size'] ** 2
ols_model = smf.ols('metastasis_fraction ~ tumor_size + tumor_size_squared', data=data).fit()

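# Step 4: Visual inspection using scatter plots and residual plots
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.scatterplot(x='tumor_size', y='metastasis_fraction', data=data)
# The outliers were appended at the end, so sort by tumor_size before drawing
# the fitted curve; otherwise the line doubles back on itself
order = np.argsort(data['tumor_size'].to_numpy())
plt.plot(data['tumor_size'].iloc[order], ols_model.fittedvalues.iloc[order], color='red')
plt.title('Scatter Plot with OLS Fit')

plt.subplot(1, 2, 2)
# Plot the OLS residuals directly against the fitted values
plt.scatter(ols_model.fittedvalues, ols_model.resid)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()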

Output:

Scatter Plot with OLS fit

Applying Robust Regression Models

Python

# Step 5: Apply robust regression methods
# Robust regression using Least Absolute Deviations (LAD)
lad_model = smf.quantreg('metastasis_fraction ~ tumor_size + tumor_size_squared', data=data).fit(q=0.5)

# Robust regression using M-Estimation with HuberT norm
rlm_huber = RLM(data['metastasis_fraction'], sm.add_constant(data[['tumor_size', 'tumor_size_squared']]), M=HuberT()).fit()

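# Step 6: Identify outliers using statistical methods
# Compute the influence diagnostics once and reuse them
influence = ols_model.get_influence()

# Internally studentized residuals
data['studentized_residuals'] = influence.resid_studentized_internal

# Cook's Distance (first element of the returned tuple holds the distances)
data['cooks_distance'] = influence.cooks_distance[0]

# Hadi's Potential is not directly available in statsmodels, so leverage
# (the hat-matrix diagonal) is used here as a simple stand-in
data['leverage'] = influence.hat_matrix_diag

# Mark potential outliers. The Cook's Distance cutoff 4/n uses the actual number
# of observations (103 including the added outliers), not the original n = 100
n_obs = len(data)
outlier_indices = data[(np.abs(data['studentized_residuals']) > 2) | (data['cooks_distance'] > 4 / n_obs) | (data['leverage'] > 0.2)].index
outliers = data.loc[outlier_indices]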

# Step 7: Compare results from different methods
# OLS model
print("OLS Model Summary:")
print(ols_model.summary())

# LAD model
print("\nLAD Model Summary:")
print(lad_model.summary())

# RLM model with HuberT norm
print("\nRLM Model (HuberT) Summary:")
print(rlm_huber.summary())

# Goodness-of-fit measures
print("\nGoodness-of-Fit Measures:")
print(f"OLS Mean Squared Error: {mean_squared_error(data['metastasis_fraction'], ols_model.fittedvalues)}")
print(f"LAD Mean Squared Error: {mean_squared_error(data['metastasis_fraction'], lad_model.fittedvalues)}")
print(f"RLM (HuberT) Mean Squared Error: {mean_squared_error(data['metastasis_fraction'], rlm_huber.fittedvalues)}")

# Plotting the outliers
plt.figure(figsize=(12, 6))
sns.scatterplot(x='tumor_size', y='metastasis_fraction', data=data)
sns.scatterplot(x='tumor_size', y='metastasis_fraction', data=outliers, color='red')
plt.title('Scatter Plot with Outliers Highlighted')
plt.show()

Output:

OLS Model Summary:
                             OLS Regression Results                            
===============================================================================
Dep. Variable:     metastasis_fraction   R-squared:                       0.125
Model:                             OLS   Adj. R-squared:                  0.107
Method:                  Least Squares   F-statistic:                     7.134
Date:                 Tue, 30 Jul 2024   Prob (F-statistic):            0.00127
Time:                         21:15:45   Log-Likelihood:                -237.57
No. Observations:                  103   AIC:                             481.1
Df Residuals:                      100   BIC:                             489.0
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              2.6754      0.709      3.772      0.000       1.268       4.083
tumor_size            -1.2510      0.332     -3.769      0.000      -1.910      -0.593
tumor_size_squared     0.1196      0.032      3.712      0.000       0.056       0.183
==============================================================================
Omnibus:                       35.806   Durbin-Watson:                   1.658
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              299.629
Skew:                           0.732   Prob(JB):                     8.64e-66
Kurtosis:                      11.226   Cond. No.                         142.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

LAD Model Summary:
                          QuantReg Regression Results                          
===============================================================================
Dep. Variable:     metastasis_fraction   Pseudo R-squared:               0.1515
Model:                        QuantReg   Bandwidth:                       2.178
Method:                  Least Squares   Sparsity:                        5.458
Date:                 Tue, 30 Jul 2024   No. Observations:                  103
Time:                         21:15:45   Df Residuals:                      100
                                         Df Model:                            2
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              3.0050      0.785      3.828      0.000       1.447       4.563
tumor_size            -1.6699      0.367     -4.545      0.000      -2.399      -0.941
tumor_size_squared     0.1686      0.036      4.729      0.000       0.098       0.239
======================================================================================

RLM Model (HuberT) Summary:
                     Robust linear Model Regression Results                    
===============================================================================
Dep. Variable:     metastasis_fraction   No. Observations:                  103
Model:                             RLM   Df Residuals:                      100
Method:                           IRLS   Df Model:                            2
Norm:                           HuberT                                         
Scale Est.:                        mad                                         
Cov Type:                           H1                                         
Date:                 Tue, 30 Jul 2024                                         
Time:                         21:15:45                                         
No. Iterations:                     11                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  2.5250      0.540      4.674      0.000       1.466       3.584
tumor_size            -1.2436      0.253     -4.919      0.000      -1.739      -0.748
tumor_size_squared     0.1208      0.025      4.923      0.000       0.073       0.169
======================================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .

Goodness-of-Fit Measures:
OLS Mean Squared Error: 5.900980188225848
LAD Mean Squared Error: 6.09596526991115
RLM (HuberT) Mean Squared Error: 5.909805102901932

Scatter Plot with Outliers Highlighted

Conclusion

Detecting and managing outliers is an essential part of nonlinear regression analysis. Combining visual inspection, statistical diagnostics, and robust regression techniques gives parameter estimates that are less sensitive to a few aberrant points. In the example above, the three fits give similar mean squared errors on the full data, which is expected because OLS minimizes that quantity directly; the difference between the methods shows up in the coefficients and in which points are flagged. Methods not covered here, such as the ROUT method and Monte Carlo simulation, can extend the analysis further. Properly addressing outliers leads to more trustworthy models and better decisions based on the data.

Referred: https://www.geeksforgeeks.org
