In R, when dealing with under-dispersed count data, the appropriate modeling approach often involves using models that can accommodate count data distributions where the variance is less than the mean. This situation is opposite to overdispersed data, where the variance exceeds the mean.
Understanding Underdispersed Count DataUnderdispersion of count data is represented by situations where the variance is less than the mean; in other words, there is less variation of values from the mean than that predicted for a Poisson distribution. This can appear in different branches like ecology, epidemiology, and manufacturing when the rate of events is bounded or managed. Here are some suitable modeling techniques for under-dispersed count data in R.
- Poisson Regression with a Log Link
- Negative Binomial Regression
- Zero-Inflated Models
Let’s create a hypothetical dataset for modeling under-dispersed count data in the R Programming Language. We’ll generate a dataset with counts and two predictor variables. Here’s how you can create the dataset and complete code for analysis:
Step 1: Generating the DatasetWe’ll create a dataset with counts (counts ), and two predictor variables (predictor1 and predictor2 ). This dataset will serve as an example for our modeling.
R
# Create a dataset with counts and predictors
set.seed(123) # Set seed for reproducibility
# Number of observations
n <- 100
# Generate predictors
predictor1 <- rnorm(n, mean = 10, sd = 2)
predictor2 <- rbinom(n, size = 1, prob = 0.5)
# Generate counts (underdispersed)
lambda <- 5 - 0.2 * predictor1 + 0.8 * predictor2
counts <- rpois(n, lambda)
# Create the dataset
count_data <- data.frame(counts = counts, predictor1 = predictor1,
predictor2 = predictor2)
# Show the structure and summary of the dataset
summary(count_data)
Output:
counts predictor1 predictor2 Min. : 0.00 Min. : 5.382 Min. :0.00 1st Qu.: 2.00 1st Qu.: 9.012 1st Qu.:0.00 Median : 3.00 Median :10.124 Median :0.00 Mean : 3.38 Mean :10.181 Mean :0.48 3rd Qu.: 4.00 3rd Qu.:11.384 3rd Qu.:1.00 Max. :10.00 Max. :14.375 Max. :1.00 Step 2: Exploratory Data AnalysisBefore modeling, it’s essential to understand the distribution and characteristics of the dataset:
R
# Plot histogram of counts
hist(count_data$counts, breaks = 10, main = "Histogram of Counts", xlab = "Counts")
Output:
 Appropriate Model for Underdispersed Count Data in R Step 3: Modeling Underdispersed Count DataNow, let’s fit different models to the dataset:
1. Poisson Regression with Log LinkPoisson regression assumes that the mean and variance of the count data are equal, which makes it suitable for overdispersed data by default. However, in cases of underdispersion, you can use a log link function to model the count data. This approach helps in adjusting the mean-variance relationship appropriately.
R
# Fit Poisson regression with log link
model_poisson <- glm(counts ~ predictor1 + predictor2, family = poisson(link = "log"),
data = count_data)
# Summary of the model
summary(model_poisson)
Output:
Call: glm(formula = counts ~ predictor1 + predictor2, family = poisson(link = "log"), data = count_data)
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.12913 0.31071 3.634 0.000279 *** predictor1 -0.00762 0.02991 -0.255 0.798944 predictor2 0.32002 0.10989 2.912 0.003587 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 122.51 on 99 degrees of freedom Residual deviance: 113.95 on 97 degrees of freedom AIC: 409.93
Number of Fisher Scoring iterations: 5 2. Negative Binomial RegressionNegative binomial regression is commonly used when count data exhibit overdispersion, but it can also be adapted for underdispersed data by adjusting the dispersion parameter. In R, the glm.nb function from the MASS package is often used for negative binomial regression.
R
# Fit negative binomial regression
library(MASS)
model_nb <- glm.nb(counts ~ predictor1 + predictor2, data = count_data)
# Summary of the model
summary(model_nb)
Output:
Call: glm.nb(formula = counts ~ predictor1 + predictor2, data = count_data, init.theta = 29.80974951, link = log)
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.126458 0.328195 3.432 0.000599 *** predictor1 -0.007348 0.031609 -0.232 0.816190 predictor2 0.319844 0.115880 2.760 0.005778 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(29.8097) family taken to be 1)
Null deviance: 110.36 on 99 degrees of freedom Residual deviance: 102.68 on 97 degrees of freedom AIC: 411.28
Number of Fisher Scoring iterations: 1
Theta: 29.8 Std. Err.: 40.1
2 x log-likelihood: -403.282 3. Zero-Inflated Poisson (ZIP) ModelZero-inflated models (zero-inflated Poisson or zero-inflated negative binomial) are appropriate when count data exhibit excess zeros and may also show underdispersion. These models account for the excess zeros through a mixture of two processes: one that generates zeros and another that generates counts.
R
# Fit zero-inflated Poisson model
library(pscl)
model_zip <- zeroinfl(counts ~ predictor1 + predictor2 | predictor1 + predictor2,
dist = "poisson", data = count_data)
# Summary of the model
summary(model_zip)
Output:
Call: zeroinfl(formula = counts ~ predictor1 + predictor2 | predictor1 + predictor2, data = count_data, dist = "poisson")
Pearson residuals: Min 1Q Median 3Q Max -1.9738 -0.5431 -0.2075 0.6634 3.6695
Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 1.136891 0.319096 3.563 0.000367 *** predictor1 -0.008332 0.030385 -0.274 0.783926 predictor2 0.323535 0.114325 2.830 0.004655 **
Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -4.174 49.047 -0.085 0.932 predictor1 -0.340 2.131 -0.160 0.873 predictor2 1.960 34.581 0.057 0.955 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 98 Log-likelihood: -202 on 6 Df Step 4: Model Comparison and SelectionCompare the models using AIC or BIC to select the best-fitting model:
R
# Compare models using AIC
AIC(model_poisson, model_nb, model_zip)
# Compare models using BIC
BIC(model_poisson, model_nb, model_zip)
Output:
df AIC model_poisson 3 409.9344 model_nb 4 411.2815 model_zip 6 415.9002
df BIC model_poisson 3 417.7499 model_nb 4 421.7022 model_zip 6 431.5313 - Among the models evaluated, the Poisson regression with a log link (
model_poisson ) has the lowest AIC value, suggesting it provides the best fit to the data while considering model complexity. - Consistent with AIC, the Poisson regression with a log link (
model_poisson ) also shows the lowest BIC value, reinforcing its suitability as the preferred model for this dataset considering both fit and complexity.
ConclusionThis dataset and the accompanying R code provide a practical example of how to handle and model underdispersed count data in R. By following these steps, you can analyze your own datasets and choose the appropriate statistical model based on the characteristics of your data and research objectives. Adjust the code as needed for your specific analysis and explore further extensions or refinements based on additional variables or modeling techniques.
|