Generalized Least Squares (GLS) is a statistical technique used to estimate the unknown parameters in a linear regression model. It extends the Ordinary Least Squares (OLS) method by addressing situations where the assumptions of OLS are violated, specifically when there is heteroscedasticity or autocorrelation in the error terms. GLS is crucial in providing more accurate and reliable estimates in such scenarios. This article provides a detailed exploration of the GLS method, its underlying principles, and its implementation in Python.
What is Generalized Least Squares (GLS)?
Generalized Least Squares (GLS) is an extension of the Ordinary Least Squares (OLS) regression method used to estimate the unknown parameters in a linear regression model. GLS is particularly useful when the assumptions of OLS, such as homoscedasticity (constant variance of errors) and absence of autocorrelation (independence of errors), are violated. By addressing heteroscedasticity and autocorrelation in the error terms, GLS provides more accurate and reliable parameter estimates.
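Formally, if the errors have covariance matrix Ω, the GLS estimator takes the standard textbook form

\hat{\beta}_{GLS} = (X^{\top} \Omega^{-1} X)^{-1} X^{\top} \Omega^{-1} y

OLS is the special case Ω = σ²I; when Ω correctly describes the error structure, this estimator is the minimum-variance linear unbiased estimator.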
Why Use GLS?
- In real-world data analysis, the assumptions of OLS are often violated. When this happens, the OLS coefficient estimates remain unbiased (under exogeneity) but become inefficient, and the usual standard errors and hypothesis tests are no longer valid.
- GLS corrects for these issues by transforming the data so that the transformed errors satisfy the OLS assumptions, ensuring that the estimates remain the Best Linear Unbiased Estimators (BLUE). The sketch below illustrates this transformation.
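To make the "transforming the data" idea concrete, here is a minimal NumPy sketch (separate from the worked examples below) that assumes, purely for illustration, that the error covariance matrix omega is known. GLS is then just OLS applied to "whitened" data:

Python

import numpy as np

np.random.seed(0)
n = 50
X = np.column_stack([np.ones(n), np.random.rand(n)])
beta_true = np.array([1.0, 0.5])

# Assumed known heteroscedastic covariance: error variance grows across observations
omega = np.diag(np.linspace(0.5, 2.0, n))
y = X @ beta_true + np.random.randn(n) * np.sqrt(np.diag(omega))

# Whitening: pre-multiply X and y by L^{-1}, where omega = L @ L.T (Cholesky factor)
L = np.linalg.cholesky(omega)
X_w = np.linalg.solve(L, X)
y_w = np.linalg.solve(L, y)

# OLS on the whitened data is exactly the GLS estimate
beta_gls = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
print(beta_gls)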
Implementing GLS in Python
Python offers several libraries for implementing GLS, with statsmodels being one of the most popular. statsmodels provides comprehensive tools for statistical modeling, including GLS.
To install statsmodels, you can use pip:

pip install statsmodels

Example 1: Implementing GLS with statsmodels
Let's walk through an example using a synthetic dataset whose errors are autocorrelated.
1. Generating the Data
Python

import numpy as np
import statsmodels.api as sm

np.random.seed(0)
n = 100
X = np.random.rand(n, 2)
X = sm.add_constant(X)  # Adds a constant term to the predictors

# Generate autocorrelated errors with an AR(1)-style recursion (rho = 0.5)
errors = np.random.randn(n)
for i in range(1, n):
    errors[i] += 0.5 * errors[i-1]

beta = np.array([1, 0.5, -0.3])
y = X @ beta + errors
2. Fitting an OLS Model
First, we fit an OLS model to obtain the residuals.
Python
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())
Output:
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.018
Method:                 Least Squares   F-statistic:                    0.1307
Date:                Wed, 17 Jul 2024   Prob (F-statistic):              0.878
Time:                        11:30:25   Log-Likelihood:                -147.01
No. Observations:                 100   AIC:                             300.0
Df Residuals:                      97   BIC:                             307.8
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0183      0.296      3.435      0.001       0.430       1.607
x1            -0.0074      0.380     -0.019      0.985      -0.762       0.748
x2            -0.1917      0.375     -0.511      0.610      -0.936       0.553
==============================================================================
Omnibus:                        3.211   Durbin-Watson:                   1.053
Prob(Omnibus):                  0.201   Jarque-Bera (JB):                2.550
Skew:                           0.339   Prob(JB):                        0.279
Kurtosis:                       3.389   Cond. No.                         5.58
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

3. Estimating the Covariance Matrix
For this first example we take a simple route and approximate the error covariance with a diagonal matrix built from the squared OLS residuals, i.e. a heteroscedasticity correction. (Estimating a full AR(1) covariance by regressing the residuals on their lagged values is shown in Example 2 below.)
Python

residuals = ols_model.resid
# Approximate each error's variance by its squared OLS residual
cov_matrix = np.diag(residuals**2)
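Because this sigma is diagonal, the GLS fit below is equivalent to weighted least squares. As an optional sanity check (not part of the original example), the same coefficients can be recovered with sm.WLS using the inverse variances as weights:

Python

# A diagonal sigma makes GLS equivalent to WLS with weights = 1 / variances
wls_model = sm.WLS(y, X, weights=1.0 / residuals**2).fit()
print(wls_model.params)  # matches the GLS coefficients fitted in the next step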
4. Fitting the GLS Model
Finally, we fit the GLS model using the constructed covariance matrix.
Python
gls_model = sm.GLS(y, X, sigma=cov_matrix).fit()
print(gls_model.summary())
Output:
GLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.120
Model:                            GLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     6.633
Date:                Wed, 17 Jul 2024   Prob (F-statistic):            0.00200
Time:                        11:30:25   Log-Likelihood:                -81.748
No. Observations:                 100   AIC:                             169.5
Df Residuals:                      97   BIC:                             177.3
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0210      0.035     29.584      0.000       0.952       1.089
x1            -0.0338      0.054     -0.628      0.531      -0.141       0.073
x2            -0.1543      0.044     -3.482      0.001      -0.242      -0.066
==============================================================================
Omnibus:                      866.058   Durbin-Watson:                   1.353
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               16.264
Skew:                           0.200   Prob(JB):                     0.000294
Kurtosis:                       1.065   Cond. No.                         8.08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2: Implementing Feasible Generalized Least Squares (FGLS)
In many practical situations, the exact form of the covariance matrix Ω is unknown. Feasible Generalized Least Squares (FGLS) is an iterative method that estimates Ω from the data and refines the GLS estimates.
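In the example below, the errors are modeled as an AR(1) process, for which the covariance matrix has the familiar Toeplitz structure (up to a constant scale factor, which does not affect the GLS coefficients):

\Omega_{ij} \propto \rho^{|i-j|}

This is exactly the matrix that rho**toeplitz(np.arange(n)) builds in the code.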
Steps for FGLS:
- Fit an initial OLS model and obtain residuals.
- Estimate the covariance matrix Ω from the residuals.
- Fit a GLS model using the estimated Ω.
- Iterate the process to refine the estimates.
Python

import numpy as np
import statsmodels.api as sm
from scipy.linalg import toeplitz

# Step 1: Generate sample data or use a dataset
np.random.seed(0)
n = 100
X = np.random.rand(n, 2)
X = sm.add_constant(X)  # Adds a constant term to the predictors

# Generate autocorrelated errors with an AR(1)-style recursion (rho = 0.5)
errors = np.random.randn(n)
for i in range(1, n):
    errors[i] += 0.5 * errors[i-1]

beta = np.array([1, 0.5, -0.3])
y = X @ beta + errors

# Fit initial OLS model
ols_model = sm.OLS(y, X).fit()
ols_resid = ols_model.resid

# Step 2: Estimate rho by regressing residuals on their lagged values,
# then build the AR(1) covariance matrix with entries rho**|i - j|
resid_fit = sm.OLS(ols_resid[1:], sm.add_constant(ols_resid[:-1])).fit()
rho = resid_fit.params[1]
sigma = rho ** toeplitz(np.arange(len(ols_resid)))

# Step 3: Fit GLS model using the estimated covariance matrix
gls_model = sm.GLS(y, X, sigma=sigma)
gls_results = gls_model.fit()

# Step 4: Iterate to refine the estimates
for _ in range(5):
    ols_resid = gls_results.resid
    resid_fit = sm.OLS(ols_resid[1:], sm.add_constant(ols_resid[:-1])).fit()
    rho = resid_fit.params[1]
    sigma = rho ** toeplitz(np.arange(len(ols_resid)))
    gls_model = sm.GLS(y, X, sigma=sigma)
    gls_results = gls_model.fit()

print(gls_results.summary())
Output:
GLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.022
Model:                            GLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.073
Date:                Wed, 17 Jul 2024   Prob (F-statistic):              0.346
Time:                        11:35:00   Log-Likelihood:                -134.18
No. Observations:                 100   AIC:                             274.4
Df Residuals:                      97   BIC:                             282.2
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9643      0.300      3.212      0.002       0.368       1.560
x1             0.2923      0.313      0.935      0.352      -0.328       0.913
x2            -0.3527      0.340     -1.038      0.302      -1.027       0.322
==============================================================================
Omnibus:                        0.332   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.847   Jarque-Bera (JB):                0.478
Skew:                           0.114   Prob(JB):                        0.787
Kurtosis:                       2.750   Cond. No.                         3.05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Evaluating GLS Models
Evaluating the performance of GLS models involves several diagnostic checks:
- Residual plots: after fitting, the residuals should look roughly homoscedastic and uncorrelated.
- Durbin-Watson statistic: a value near 2 indicates little remaining autocorrelation (it improves from 1.053 in the initial OLS fit to 1.982 after FGLS above).
- Information criteria: compare the log-likelihood, AIC, and BIC against the OLS baseline; lower AIC/BIC indicates a better fit.
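As a minimal sketch (reusing ols_model and gls_results from the FGLS example above), the latter two checks can be run like this:

Python

from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson near 2 suggests little remaining autocorrelation
print("Durbin-Watson (GLS residuals):", durbin_watson(gls_results.resid))

# Compare information criteria against the OLS baseline; lower is better
print("OLS AIC:", ols_model.aic, " GLS AIC:", gls_results.aic)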
Conclusion
Generalized Least Squares (GLS) is a powerful tool for regression analysis, especially when dealing with heteroscedasticity and autocorrelation in the error terms. By transforming the data to meet the OLS assumptions, GLS provides more accurate and reliable parameter estimates.