Missing data is a common issue in statistical analysis and can bias results if it is not handled properly. Full Information Maximum Likelihood (FIML) is a widely used method for dealing with missing data, particularly when the data are missing at random (MAR). FIML uses all available data to estimate parameters and, when the MAR assumption holds and the model is correctly specified, yields consistent and asymptotically efficient estimates without the need for imputation. This article explains how to implement FIML for handling missing data in the R Programming Language.
What is Full Information Maximum Likelihood (FIML)?
FIML is an estimation method that uses every observed value in a dataset to estimate model parameters, even when some values are missing. Rather than dropping incomplete cases, it computes each case's likelihood from the variables observed for that case and maximizes the sum of these casewise log-likelihoods, so partially observed cases still contribute information.
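To make the idea concrete, the sketch below computes the observed-data log-likelihood that FIML maximizes, assuming multivariate normality. The function name `fiml_loglik` and the inputs `Y`, `mu`, and `Sigma` are illustrative and not part of any package; lavaan performs an equivalent computation internally, using the means and covariances implied by the model.

```r
# Observed-data (casewise) log-likelihood under a multivariate normal model.
# Y:     numeric matrix with NA for missing values
# mu:    mean vector (hypothetical input)
# Sigma: covariance matrix (hypothetical input)
fiml_loglik <- function(Y, mu, Sigma) {
  total <- 0
  for (i in seq_len(nrow(Y))) {
    obs <- !is.na(Y[i, ])          # variables observed for case i
    if (!any(obs)) next            # a fully missing case carries no information
    d <- Y[i, obs] - mu[obs]       # deviation from the mean, observed part only
    S <- Sigma[obs, obs, drop = FALSE]
    log_det <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
    quad <- as.numeric(crossprod(d, solve(S, d)))
    total <- total - 0.5 * (sum(obs) * log(2 * pi) + log_det + quad)
  }
  total
}

# Toy example: the second case is missing its second variable
Y <- rbind(c(1.0, 2.0),
           c(0.5, NA),
           c(-1.0, 0.3))
mu <- c(0, 1)
Sigma <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
fiml_loglik(Y, mu, Sigma)
```

The second row contributes only the univariate normal density of its first variable, which is how an incomplete case still adds information to the estimate.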
Implementing FIML in R involves four steps:
- Install and Load Necessary Packages: Ensure you have the required packages installed and loaded.
- Load the Data: Import your dataset into R.
- Induce Missing Data: For demonstration purposes, create missing data in the dataset.
- Fit a Model Using FIML: Use structural equation modeling (SEM) with the lavaan package to fit a model using FIML.
Step 1: Install and Load Necessary Packages
Install the lavaan package, which supports FIML for handling missing data.
R
# Install lavaan only if it is not already available, then load it
if (!requireNamespace("lavaan", quietly = TRUE)) {
  install.packages("lavaan")
}
library(lavaan)
Step 2: Load the Data
For demonstration purposes, we’ll use the built-in HolzingerSwineford1939 dataset from the lavaan package.
R
# Load the lavaan package
library(lavaan)
# Load the example dataset
data("HolzingerSwineford1939")
head(HolzingerSwineford1939)
Output:
  id sex ageyr agemo  school grade       x1   x2    x3       x4   x5        x6
1  1   1    13     1 Pasteur     7 3.333333 7.75 0.375 2.333333 5.75 1.2857143
2  2   2    13     7 Pasteur     7 5.333333 5.25 2.125 1.666667 3.00 1.2857143
3  3   2    13     1 Pasteur     7 4.500000 5.25 1.875 1.000000 1.75 0.4285714
4  4   1    13     2 Pasteur     7 5.333333 7.75 3.000 2.666667 4.50 2.4285714
5  5   2    12     2 Pasteur     7 4.833333 4.75 0.875 2.666667 4.00 2.5714286
6  6   2    14     1 Pasteur     7 5.333333 5.00 2.250 1.000000 3.00 0.8571429
        x7   x8       x9
1 3.391304 5.75 6.361111
2 3.782609 6.25 7.916667
3 3.260870 3.90 4.416667
4 3.000000 5.30 4.861111
5 3.695652 6.30 5.916667
6 4.347826 6.65 7.500000
Step 3: Induce Missing Data
To demonstrate FIML, we’ll artificially introduce some missing values into the dataset.
R
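# Introduce missing data for demonstration.
# The indicators are stored as x1 to x9: x1 is the visual perception test, x2 is the cubes test.
HolzingerSwineford1939$x1[1:10] <- NA   # visual perception, rows 1-10
HolzingerSwineford1939$x2[11:20] <- NA  # cubes, rows 11-20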
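Blanking out fixed rows is convenient, but it does not resemble a realistic missingness mechanism. As a hedged alternative, the sketch below induces MAR missingness, where the chance that x1 is missing depends on the fully observed x4. The copy `hs_mar`, the logistic form, and its coefficients are arbitrary choices made for illustration only.

```r
library(lavaan)
data("HolzingerSwineford1939")

# Work on a copy so the data used in the main walkthrough stays untouched
hs_mar <- HolzingerSwineford1939

set.seed(123)  # the draw below is random, so fix the seed for reproducibility
# Probability that x1 is missing rises with the observed value of x4 (MAR)
p_miss <- plogis(-2 + 0.5 * (hs_mar$x4 - mean(hs_mar$x4)))
hs_mar$x1[runif(nrow(hs_mar)) < p_miss] <- NA

# Count missing values per indicator and list the distinct missingness patterns
indicators <- paste0("x", 1:9)
colSums(is.na(hs_mar[, indicators]))
unique(is.na(hs_mar[, indicators]))
```

Because the missingness depends only on an observed variable, this is the situation in which FIML is expected to perform well.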
Step 4: Fit a Model Using FIML
Specify and fit a confirmatory factor analysis (CFA) model using the lavaan package. By default, lavaan handles missing data by listwise deletion, so FIML must be requested explicitly with missing = "fiml".
R
# Specify a CFA model
model <- '
visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
'
# Fit the model using FIML (requested explicitly; lavaan's default is listwise deletion)
fit <- cfa(model, data = HolzingerSwineford1939, missing = "fiml")
# Summarize the model fit
summary(fit, fit.measures = TRUE, standardized = TRUE)
Output:
lavaan 0.6.17 ended normally after 35 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        30

  Number of observations                           301
  Number of missing patterns                         1

Model Test User Model:

  Test statistic                                85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               918.852
  Degrees of freedom                                36
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.931
  Tucker-Lewis Index (TLI)                       0.896

  Robust Comparative Fit Index (CFI)             0.931
  Robust Tucker-Lewis Index (TLI)                0.896

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -3737.745
  Loglikelihood unrestricted model (H1)      -3695.092

  Akaike (AIC)                                7535.490
  Bayesian (BIC)                              7646.703
  Sample-size adjusted Bayesian (SABIC)       7551.560

Root Mean Square Error of Approximation:

  RMSEA                                          0.092
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.114
  P-value H_0: RMSEA <= 0.050                    0.001
  P-value H_0: RMSEA >= 0.080                    0.840

  Robust RMSEA                                   0.092
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.114
  P-value H_0: Robust RMSEA <= 0.050             0.001
  P-value H_0: Robust RMSEA >= 0.080             0.840

Standardized Root Mean Square Residual:

  SRMR                                           0.060

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Latent Variables:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual =~
    x1                 1.000                               0.900    0.772
    x2                 0.554    0.109    5.066    0.000    0.498    0.424
    x3                 0.729    0.117    6.220    0.000    0.656    0.581
  textual =~
    x4                 1.000                               0.990    0.852
    x5                 1.113    0.065   17.128    0.000    1.102    0.855
    x6                 0.926    0.056   16.481    0.000    0.917    0.838
  speed =~
    x7                 1.000                               0.619    0.570
    x8                 1.180    0.150    7.851    0.000    0.731    0.723
    x9                 1.082    0.195    5.543    0.000    0.670    0.665

Covariances:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  visual ~~
    textual            0.408    0.080    5.124    0.000    0.459    0.459
    speed              0.262    0.055    4.735    0.000    0.471    0.471
  textual ~~
    speed              0.173    0.049    3.518    0.000    0.283    0.283

Intercepts:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .x1                 4.936    0.067   73.473    0.000    4.936    4.235
   .x2                 6.088    0.068   89.855    0.000    6.088    5.179
   .x3                 2.250    0.065   34.579    0.000    2.250    1.993
   .x4                 3.061    0.067   45.694    0.000    3.061    2.634
   .x5                 4.341    0.074   58.452    0.000    4.341    3.369
   .x6                 2.186    0.063   34.667    0.000    2.186    1.998
   .x7                 4.186    0.063   66.766    0.000    4.186    3.848
   .x8                 5.527    0.058   94.854    0.000    5.527    5.467
   .x9                 5.374    0.058   92.546    0.000    5.374    5.334

Variances:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .x1                 0.549    0.119    4.612    0.000    0.549    0.404
   .x2                 1.134    0.104   10.875    0.000    1.134    0.821
   .x3                 0.844    0.095    8.881    0.000    0.844    0.662
   .x4                 0.371    0.048    7.739    0.000    0.371    0.275
   .x5                 0.446    0.058    7.703    0.000    0.446    0.269
   .x6                 0.356    0.043    8.200    0.000    0.356    0.298
   .x7                 0.799    0.088    9.130    0.000    0.799    0.676
   .x8                 0.488    0.092    5.321    0.000    0.488    0.477
   .x9                 0.566    0.091    6.250    0.000    0.566    0.558
    visual             0.809    0.150    5.404    0.000    1.000    1.000
    textual            0.979    0.112    8.729    0.000    1.000    1.000
    speed              0.384    0.092    4.168    0.000    1.000    1.000

The summary contains the following elements:
- Model Fit Indices: The summary provides various fit indices (e.g., CFI, TLI, RMSEA) to assess how well the model fits the data.
- Standardized Estimates: The standardized coefficients for each path in the model.
- Parameter Estimates: The estimated values for each parameter in the model along with their standard errors and p-values.
- Significant Parameters: Parameters with p-values less than a chosen significance level (e.g., 0.05) are considered statistically significant.
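- Model Fit: Good model fit is usually indicated by CFI and TLI values close to 1 (0.95 or higher is a common guideline), RMSEA values below about 0.05 to 0.06, and SRMR values below about 0.08. These are rules of thumb, not strict cutoffs.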
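When only a few of these quantities are needed, they can be pulled out of the fitted object instead of being read from the printed summary. The sketch below uses lavaan's fitMeasures() and parameterEstimates() on the dataset as loaded, with no missing values induced; the object name `est` and the choice of indices are illustrative.

```r
library(lavaan)
data("HolzingerSwineford1939")

model <- '
visual  =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed   =~ x7 + x8 + x9
'
fit <- cfa(model, data = HolzingerSwineford1939, missing = "fiml")

# Selected fit indices as a named numeric vector
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# Factor loadings with their standardized values, as a data frame
est <- parameterEstimates(fit, standardized = TRUE)
est[est$op == "=~", c("lhs", "op", "rhs", "est", "se", "pvalue", "std.all")]
```

Working with the returned vector and data frame makes it easier to tabulate results or compare several models programmatically.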
The output shows that a confirmatory factor analysis (CFA) model with three latent variables (visual, textual, speed) was estimated by maximum likelihood (ML) and converged after 35 iterations. Note that the output reports one missing-data pattern, which means the fitted data contained no missing values in x1 to x9; when NAs are present in the indicators, lavaan reports several patterns, and the estimates will differ somewhat from those shown here. The fit indices are mixed: CFI (0.931) is acceptable by looser conventions, TLI (0.896) falls just below 0.90, SRMR (0.060) is acceptable, and RMSEA (0.092, 90% confidence interval 0.071 to 0.114) is above the 0.08 level often taken to signal mediocre fit. All factor loadings are statistically significant, as are the covariances among the latent variables, indicating that the three factors are positively related. Overall, the model captures the main structure of the data, but the RMSEA and TLI suggest room for improvement.
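To see what FIML changes relative to lavaan's default listwise deletion, the two approaches can be fitted side by side. This is a minimal sketch that assumes missing values are placed in the indicators x1 and x2 of a copy named `hs`; the object names are illustrative.

```r
library(lavaan)
data("HolzingerSwineford1939")

# Assumption: ten values of x1 and ten values of x2 are set to NA
hs <- HolzingerSwineford1939
hs$x1[1:10] <- NA
hs$x2[11:20] <- NA

model <- '
visual  =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed   =~ x7 + x8 + x9
'

fit_fiml     <- cfa(model, data = hs, missing = "fiml")
fit_listwise <- cfa(model, data = hs, missing = "listwise")

# Number of cases each fit actually uses
lavInspect(fit_fiml, "nobs")
lavInspect(fit_listwise, "nobs")

# Missingness patterns and pairwise coverage seen by the FIML fit
lavInspect(fit_fiml, "patterns")
lavInspect(fit_fiml, "coverage")
```

With 20 incomplete rows, the listwise fit drops those cases while the FIML fit keeps all of them, which is where the gain in information comes from.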
Conclusion
Full Information Maximum Likelihood (FIML) is a powerful method for handling missing data in statistical models. The lavaan package in R makes it straightforward to use FIML in structural equation modeling by setting missing = "fiml". By following the steps outlined in this guide, you can use FIML to handle missing data and, provided the MAR assumption is plausible, obtain reliable parameter estimates in your analysis.