Hong Kong Metropolitan University
School of Science and Technology
BSc (Hon) in Data Science and Artificial Intelligence
COMP S461F Data Science Final Project Report
2025-26 · Group 1
Supervisor: Dr. Chan Moon Tong Tony
April 2026
In count data modelling, excessive zeros and overdispersion appear frequently in real-life data and significantly degrade the goodness-of-fit of standard Poisson regression models (P). To address these two issues, regression models with different specifications of P have been derived, such as negative binomial regression models (NB), zero-inflated regression models, and Conway-Maxwell Poisson regression models (CMP). NB and CMP introduce an additional dispersion parameter to account for overdispersion, while zero-inflated regression models introduce an additional component modelling the probability of excessive zeros. The performance of these models is illustrated in an application to count data from a study on hemophilia patients, which exhibits both overdispersion and excess zeros. The results show that the zero-inflated Poisson regression model (ZIP) and the zero-inflated negative binomial regression model (ZINB) are the best models for handling overdispersion and zero-inflation, while the more complex and flexible zero-inflated Conway-Maxwell Poisson regression model (ZICMP) underperformed.
Many medical studies involve real-life data whose outcome variables are zero-inflated and overdispersed counts, such as the number of deaths or the number of patients in a particular period of time (Lee et al., 2023). These data are non-negative integers and often have a small mean and a heavily right-skewed distribution that does not follow a normal distribution. Failure to select suitable models for analyzing this kind of data can lead to incorrect results and misleading findings that ultimately provide poor recommendations to the healthcare system in terms of patient management and treatment strategies.
Therefore, it is important to know which models are best for data with count outcomes before analyzing them. In this research project, we studied a dataset collected from a study on hemophilia death factors, with the count of deaths as the outcome variable and various potential factors as independent variables, such as whether the patient has been diagnosed with HIV, the blood clotting level of the patient, and the age of the patient. According to Berry et al. (1992) and the World Health Organization, hemophilia is a rare bleeding disorder in which the blood does not clot properly due to a deficiency in specific clotting proteins. This leads to prolonged bleeding after injuries or surgery, or spontaneous bleeding into joints and muscles, which can cause chronic pain and joint damage.
The harmfulness of hemophilia motivates us to find the best model for our data, one that can provide concrete discoveries about the causes of death for hemophilia patients and also offer a reliable data modelling method for future researchers in related fields. To analyze the data and understand the relationships between the outcome “death” and the potential factors, one approach is to identify statistical models that fit the data well and can break down the complex components or patterns underlying the data.
The primary objective of this project is to identify the best approach for modelling hemophilia count data with overdispersion and zero-inflation.
To achieve this, we divided our project into the following steps:
This study revisited CMP and ZICMP, two relatively new and unexplored count data modelling methods, and compared their performance to classical methods such as ZIP and ZINB in handling overdispersion and zero-inflation in count data. The results of this study provide future researchers with a trustworthy example of the application of CMP and ZICMP, and thus a reliable approach to modelling count data with these models.
This would further enhance the quality of their research outcomes, since mistakes such as using an incorrect model, or model misspecification when handling overdispersion or zero-inflation, can be prevented by following the correct modelling procedures. Better decisions in real-life practice can then be made based on the correct conclusions of such studies. Given how common count data are in real life, the impact of this study is significant.
Section 1 introduces the project background with important fundamental knowledge. Section 2 provides a methodological review of project related literature. Section 3 provides the methodology of this project. Section 4 provides results with interpretation. Section 5 provides discussion on the results obtained and brief conclusion on our project.
Bladen et al. (2013) studied factors that potentially influence the Haemophilia Joint Health Score (HJHS) of young hemophilia patients, such as age, prophylaxis, history of high-titre inhibitors (HTI), and bleeding events. The data were collected from medical and physiotherapy notes of boys with severe hemophilia aged 4–18 years, of whom 53% had a HJHS of zero. Bladen et al. used ZIP to account for the excessive zeros and successfully modelled the data, which led to valuable findings and conclusions.
Jones et al. (2025) studied the association between insufficient or absent clotting factor VIII, the deficiency underlying hemophilia A, and bleeding outcomes, using data collected from the Cost of Haemophilia in Europe: a Socioeconomic Survey (CHESS). Because the data exhibit a large proportion of individuals experiencing zero bleeds, Jones et al. selected generalized Poisson regression models, including P, NB, ZIP, and ZINB, to explore the association between factor activity levels (FALs) and annual bleeding rate (ABR), adjusting for age, BMI, HIV, HBV, and HCV.
The application of count data modelling can be traced back as early as 1898, when Bortkiewicz conducted a study on the annual number of deaths in the Prussian army from horse kicks. He utilized the Poisson distribution, which was derived by Poisson in 1837. Greenwood and Yule (1920) further generalized the Poisson distribution and derived the negative binomial distribution.
Mullahy (1986) proposed a modified count data model, termed hurdle models, that separated the standard Poisson model into two parts. Lambert (1992) proposed zero-inflated Poisson regression (ZIP). Greene (1994) further explored the possibility of specifying an alternative distribution to ZIP, such as NB, which forms a zero-inflated negative binomial model (ZINB).
The general form of MLR is
\[Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon \tag{1}\]
where \(Y\) is the dependent variable, \(X_1,X_2,\cdots,X_p\) are the independent variables or predictors, \(\beta_0\) is the intercept, \(\beta_1,\beta_2,\cdots,\beta_p\) are the regression coefficients or regression parameters, and \(\varepsilon\) is the error term.
The general form of Poisson regression models (P) can be written as
\[\ln{\mu}=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p \tag{2}\]
where the dependent variable Y, given independent variables \(X_1,X_2,\cdots,X_p\), follows a Poisson distribution with mean \(\mu\). Because the logarithm of the conditional mean is linear in the parameters, this model is also called a log-linear model. The probability mass function (pmf) of the Poisson distribution is given by
\[P\left(Y=y\right)=\frac{e^{-\mu}\mu^y}{y!},\quad y=0,1,2,\ldots \tag{3}\]
where \(P\left(Y=y\right)\) refers to the probability that the outcome variable Y equals y, the number of event occurrences. An important assumption of P is that \(E(Y)=Var(Y)=\mu\). This assumption is often violated by real-life data.
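As a quick numerical check of the equidispersion assumption, the sketch below (plain Python, with an illustrative mean chosen near the 0.215 observed later for the death outcome) evaluates the Poisson pmf of equation 3 and confirms that the implied mean and variance coincide:

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    # Poisson pmf of equation 3: P(Y = y) = exp(-mu) * mu^y / y!
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 0.215  # illustrative mean, close to the death outcome's observed mean
support = range(50)  # truncated support; the tail mass is negligible here
mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
print(abs(mean - var) < 1e-9)  # True: E(Y) = Var(Y) under the Poisson model
```

Real-life counts whose sample variance clearly exceeds the sample mean already hint that this assumption will fail.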
Generalized linear models (GLM) were first proposed by Nelder and Wedderburn in 1972, then formalized by McCullagh and Nelder (1989). McCullagh (1989) introduced GLM as a generalization of classical linear models that handles many special cases, such as logit and probit models for quantal responses, or log-linear models for counts. Beginning with standard Gaussian linear regression, as shown in equation 1, the form of GLM remains basically similar to the form of MLR, with some specification of the random component Y. Assume the random component Y has independent normal distributions with \(\mathrm{E}\left(Y\right)=\mu\) and constant variance \(\sigma^2\); the systematic component is formed by the covariates \(X_1,X_2,\cdots,X_p\), which produce a linear predictor \(\eta\) so that
\[\eta=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p \tag{4}\]
where the link between the random component and the systematic component is \(\mu=\eta\). This can be written as \(\eta_i=g\left(\mu_i\right)\), where \(g\left(\cdot\right)\) is called the link function. This allows the random component to come from a non-Gaussian distribution such as the Poisson distribution. In modelling counts with a Poisson distribution, since counts based on independence in cross-classified data lead naturally to multiplicative effects (McCullagh, 1989), the model can be expressed in GLM form with the log link, where \(\eta=\ln{\mu}\), or equivalently its inverse \(\mu=e^\eta\). Therefore, the general model form of GLM with a log link function is
\[\mu=e^{\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p} \tag{5}\]
It is worth noting that this model form is identical to the inverse of equation 2, which means it is functionally identical to P. The robustness of GLM leads to its frequent appearance in statistical software such as R and SAS. In this project, all models, including P, NB, CMP, and their zero-inflated counterparts, will be applied in the form of a GLM with a log link function.
The model form of NB is identical to equation 5 because it also belongs to the log-linear model family; the difference is that Y is assumed to follow a negative binomial distribution. The negative binomial distribution is based on the Poisson distribution but allows the variance to be greater than the mean by adding a heterogeneity parameter \(k\) to its pmf. The pmf of the negative binomial distribution is given by
\[P\left(Y=y\right)=\frac{\Gamma\left(y+k\right)}{\Gamma\left(k\right)\,y!}\left(\frac{k}{k+\mu}\right)^k\left(\frac{\mu}{k+\mu}\right)^y \tag{6}\]
where \(\Gamma\left(a\right)=\int_{0}^{\infty}{e^{-t}t^{a-1}dt}\). With larger \(k\), the distribution leans closer to the Poisson distribution. The relationship between the mean \(\mu\) and the variance is given by
\[Var\left(Y\right)=\mu+\frac{\mu^2}{k} \tag{7}\]
where \(\mu^2/k\) is the extra variance that the Poisson distribution fails to capture. Figure 1 displays overdispersion using a distribution simulated from the negative binomial distribution. NB therefore serves as an alternative to Poisson for handling overdispersion: the additional parameter \(k\) in its pmf, shown in equation 6, allows a more flexible variance-mean relationship, so it can capture overdispersed count distributions. However, NB is unable to handle underdispersion, which leaves room for improvement. Still, NB has become one of the most popular approaches to modelling real-life count data, because underdispersion is rarely seen in real-life counts.
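The extra \(\mu^2/k\) term can be verified numerically. The sketch below uses arbitrary illustrative values of \(\mu\) and \(k\) (not fitted estimates from this project) to evaluate the negative binomial pmf of equation 6 and check that its variance equals \(\mu+\mu^2/k\):

```python
import math

def nb_pmf(y: int, mu: float, k: float) -> float:
    # negative binomial pmf of equation 6 with mean mu and heterogeneity k,
    # computed in log space via lgamma for numerical stability
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

mu, k = 0.215, 0.5            # illustrative values, not fitted estimates
support = range(200)          # truncated support; tail mass is negligible
mean = sum(y * nb_pmf(y, mu, k) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, k) for y in support)
print(var > mean)             # True: variance exceeds the mean by mu^2/k
```

Smaller \(k\) inflates the variance further, while letting \(k\) grow recovers the Poisson limit.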
The model form of CMP is identical to equation 5, as it also belongs to the log-linear model family. The difference between CMP and other log-linear models lies in the pmf. The pmf of CMP is given by
\[P\left(Y=y\right)=\frac{\lambda^y}{\left(y!\right)^\nu}\frac{1}{Z\left(\lambda,\nu\right)}\]
where \(\lambda\) is the rate parameter under the Poisson model with \(\lambda=E\left(Y^\nu\right)\), and \(\nu\) is the dispersion parameter with \(\nu \geq 0\): \(\nu = 1\) denotes equidispersion, \(\nu > 1\) denotes underdispersion, and \(\nu < 1\) denotes overdispersion. \(Z\left(\lambda,\nu\right)\) is a normalizing constant given by
\[Z\left(\lambda,\nu\right)=\sum_{s=0}^{\infty}\frac{\lambda^s}{\left(s!\right)^\nu}\]
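A minimal sketch of the CMP pmf, truncating the infinite series for \(Z(\lambda,\nu)\); the truncation length and parameter values below are arbitrary choices for illustration. Setting \(\nu=1\) recovers the ordinary Poisson pmf:

```python
import math

def cmp_z(lam: float, nu: float, terms: int = 200) -> float:
    # normalizing constant Z(lambda, nu), truncated series in log space
    return sum(math.exp(s * math.log(lam) - nu * math.lgamma(s + 1))
               for s in range(terms))

def cmp_pmf(y: int, lam: float, nu: float) -> float:
    # CMP pmf: lambda^y / (y!)^nu, divided by the normalizing constant
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1)) / cmp_z(lam, nu)

lam = 1.3
poisson = math.exp(-lam) * lam**2 / math.factorial(2)  # Poisson pmf at y = 2
print(abs(cmp_pmf(2, lam, 1.0) - poisson) < 1e-9)      # True: nu = 1 is Poisson
```

Working in log space avoids overflowing \((s!)^\nu\) for large \(s\), which is the usual practical hurdle with the CMP normalizing constant.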
Lambert (1992) studied the number of defects per area in a manufacturing process using data from a soldering experiment at AT&T Bell Laboratories. In the data, a high proportion of soldered areas had no defects, leading Lambert to introduce a generalized Poisson regression model to handle excess zeros: the zero-inflated Poisson regression model (ZIP). The zero-inflated model specifies
\[P\left(Y=y\right)=\begin{cases}\pi+\left(1-\pi\right)f_2\left(0\right), & y=0\\ \left(1-\pi\right)f_2\left(y\right), & y>0\end{cases}\]
where \(f_2(y)\) is the base count density and \(\pi\) represents the probability that a zero is a “structural zero” rather than a “sampling zero”: the former refers to zeros that cannot be captured by the base distribution and are inflated, while the latter refers to zeros that belong to the base distribution. A binary censoring indicator \(d_i\), with \(d_i=1\) if \(y_i=0\) and \(d_i=0\) otherwise, allows the likelihood of the zero-inflated models to be written compactly.
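The mixture above can be sketched with a Poisson base density \(f_2\); the values of \(\mu\) and \(\pi\) below are arbitrary illustrations, not fitted values:

```python
import math

def zip_pmf(y: int, mu: float, pi: float) -> float:
    # structural zero with probability pi, otherwise a Poisson(mu) draw
    f2 = math.exp(-mu) * mu**y / math.factorial(y)
    return pi * (y == 0) + (1 - pi) * f2

mu, pi = 1.5, 0.4                     # illustrative, not fitted, values
p0_zip = zip_pmf(0, mu, pi)           # zero-inflated probability of a zero
p0_poisson = math.exp(-mu)            # base Poisson probability of a zero
print(p0_zip > p0_poisson)            # True: the zero mass is inflated
```

The probabilities still sum to one, since the inflation only moves mass onto \(y=0\).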
Residuals measure the distance between the fitted values and the actual values of the dependent variable (Cameron & Trivedi, 2013). The Pearson residual \(p_i\) can be written as
\[p_i=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\omega}_i}}\]
where \(\hat{\omega}_i\) is an estimate of the variance \(\omega_i\) of \(y_i\). The sum of squares of \(p_i\) gives the Pearson statistic \(P\):
\[P=\sum_{i=1}^{n}p_i^2=\sum_{i=1}^{n}\frac{\left(y_i-\hat{\mu}_i\right)^2}{\hat{\omega}_i}\]
It is the standard measure of goodness-of-fit for any model of \(y_i\) with mean \(\mu_i\) and variance \(\omega_i\). In practice, \(P\) is compared with the degrees of freedom \((n-k)\), where \(n\) is the number of observations and \(k\) is the number of parameters in the model. In the GLM literature it is standard to interpret \(P > (n-k)\) as evidence of overdispersion. However, this interpretation assumes a correct specification of \(\mu_i\), i.e. of the model form, so \(P > (n-k)\) could also indicate model misspecification. Workie and Azene (2021) used Pearson residuals to test for overdispersion; in their case the Pearson statistic divided by the degrees of freedom was higher than 1, which indicated overdispersion.
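The dispersion check can be sketched in a few lines. The counts, fitted means, and parameter count below are hypothetical; for a Poisson model, \(\hat{\omega}_i=\hat{\mu}_i\):

```python
# hypothetical observed counts and Poisson fitted means (omega_i = mu_i)
y = [0, 0, 3, 1, 0, 5, 0, 2]
mu_hat = [0.8, 0.5, 1.2, 1.0, 0.6, 1.5, 0.4, 1.0]

pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu_hat))
k = 2                                  # hypothetical number of parameters
ratio = pearson / (len(y) - k)         # Pearson statistic over (n - k)
print(ratio > 1)                       # True here: evidence of overdispersion
```

A ratio close to 1 is consistent with equidispersion; values well above 1, as in this toy example, point to overdispersion (or misspecification of the mean).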
The deviance residual, or deviance, as defined by Cameron and Trivedi (2013), is given by
\[D\left(\mathbf{y},\hat{\boldsymbol{\mu}}\right)=2\left\{\ell\left(\mathbf{y};\mathbf{y}\right)-\ell\left(\hat{\boldsymbol{\mu}};\mathbf{y}\right)\right\}\]
where \(\hat{\boldsymbol{\mu}}\) and \(\mathbf{y}\) are the \(n\times1\) vectors with \(i^{th}\) entries \(\hat{\mu}_i\) and \(y_i\), respectively. The deviance is the difference between the maximum log-likelihood achievable and the log-likelihood of the fitted model. It is another measure of goodness-of-fit, but it is restricted to GLM. For the method of obtaining the log-likelihood, see section 2.4.2.
Likelihood provides a direct measure of the model’s goodness-of-fit to the data and is given by \(L(\boldsymbol{\beta};\mathbf{y})=\prod_{i=1}^{n}f(y_i\mid\boldsymbol{\beta})\), where \(f(y_i\mid\boldsymbol{\beta})\) is the base density for observation \(y_i\) evaluated at parameters \(\boldsymbol{\beta}\). For example, the likelihood function for the Poisson model can be written as
\[L\left(\boldsymbol{\beta};\mathbf{y}\right)=\prod_{i=1}^{n}\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}\]
where \(n\) refers to the number of observations. It is worth noting that the likelihood function is simply a product of the probabilities of each \(y_i\), where each probability is calculated from the Poisson pmf given in equation 3. Likelihood evaluates the model’s fit to a given dataset, while probability predicts the chance of an event given a fixed probability distribution. In practice, however, the log-likelihood is used more often than the likelihood. The log-likelihood function for Poisson can be written as
\[\ell\left(\boldsymbol{\beta};\mathbf{y}\right)=\sum_{i=1}^{n}\left\{y_i\ln{\mu_i}-\mu_i-\ln{\left(y_i!\right)}\right\}\]
Since multiplying probabilities over many samples may result in extremely small likelihood values, applying the logarithm turns the product into a summation and produces a more stable value that is also more intuitive to interpret.
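The numerical advantage of the log form can be seen directly. A sketch with hypothetical counts and fitted means, comparing the product form against the summation form:

```python
import math

# hypothetical counts and fitted Poisson means
y = [0, 1, 0, 2, 0, 0, 1]
mu = [0.3, 0.6, 0.2, 0.9, 0.4, 0.1, 0.5]

# product of pmf values: underflows toward 0.0 as n grows
likelihood = math.prod(math.exp(-m) * m**yi / math.factorial(yi)
                       for yi, m in zip(y, mu))
# summation form stays well scaled; lgamma(y + 1) = ln(y!)
loglik = sum(yi * math.log(m) - m - math.lgamma(yi + 1)
             for yi, m in zip(y, mu))
print(abs(math.log(likelihood) - loglik) < 1e-12)  # the two forms agree here
```

With thousands of observations, as in this project’s data, the raw product would underflow to zero while the summation remains perfectly usable.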
Cameron and Trivedi (2013) also state that, provided the limit of \(\ell/n\) is maximized at the true parameter value, the maximum likelihood estimate (MLE) \(\hat{\theta}_{ML}\) is the solution to the first-order conditions
\[\frac{\partial\ell}{\partial\boldsymbol{\theta}}=\sum_{i=1}^{n}\frac{\partial\ln{f_i}}{\partial\boldsymbol{\theta}}=\mathbf{0}\]
where \(f_i=f(y_i\mid\mathbf{x}_i,\boldsymbol{\theta})\) is the conditional pmf of the observed count \(y_i\), given predictor vector \(\mathbf{x}_i\) and parameter vector \(\boldsymbol{\theta}\), and \(\partial \ell/\partial \boldsymbol{\theta}\) is a \(q\times1\) vector where \(q\) is the number of model parameters.
After obtaining the MLE of the parameters \(\hat{\theta}_{ML}\), the standard errors of the model can be obtained. For example, for the Poisson MLE \(\hat{\boldsymbol{\beta}}\):
\[Var\left(\hat{\boldsymbol{\beta}}\right)=\left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i^\prime\right)^{-1}\]
where the variance matrix of \(\hat{\boldsymbol{\beta}}\) is obtained by inverting the Fisher information matrix \(\sum_{i}\mu_i\mathbf{x}_i\mathbf{x}_i^\prime\), in which \(\mathbf{x}_i\mathbf{x}_i^\prime\) is the product of the \(p\times1\) predictor vector \(\mathbf{x}_i\) and its \(1\times p\) transpose. The maximum likelihood standard errors (MLSE) for Poisson are then the square roots of the diagonal entries of the variance matrix, that is \(SE(\hat{\beta}_j)=\sqrt{Var(\hat{\beta}_j)}\).
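For intuition, in an intercept-only Poisson model (\(\mathbf{x}_i=1\) for all \(i\)) the Fisher information collapses to \(\sum_i\mu_i=n\mu\), so the standard error has a closed form. A sketch with hypothetical counts:

```python
import math

# hypothetical counts; intercept-only Poisson, so mu_i = exp(beta0)
y = [0, 1, 0, 2, 0, 0, 1, 3]
n, ybar = len(y), sum(y) / len(y)

beta0_hat = math.log(ybar)       # MLE of the intercept: log of the sample mean
se = 1 / math.sqrt(n * ybar)     # square root of the inverted Fisher information
print(round(beta0_hat, 3), round(se, 3))
```

With covariates the same computation requires inverting a \(p\times p\) matrix, but the structure is identical.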
After obtaining the MLSE for the model, a Wald test can be conducted to determine the statistical significance of each model parameter. The z-value is obtained by
\[z=\frac{\hat{\beta}_j}{SE\left(\hat{\beta}_j\right)}\]
Then the p-value can be obtained by \(p\text{-value}=2\times[1-\Phi(|z|)]\), where \(\Phi(|z|)\) is the standard normal cumulative distribution function (CDF), giving the probability of being less than the absolute z-value under a standard normal distribution. We then compare the p-value to a given significance level \(\alpha\), such as 0.05. If the p-value is smaller than \(\alpha=0.05\), we conclude that the coefficient is statistically significant at the 5% level. For a coefficient that is insignificant, we may consider removing the corresponding variable from the full model. However, statistical insignificance does not mean the variable has no effect on the outcome variable.
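The test is easy to compute with the standard normal CDF expressed through math.erf. The coefficient and standard error below are hypothetical, not fitted values from this project:

```python
import math

def wald_p_value(beta_hat: float, se: float) -> float:
    # two-sided Wald test: p = 2 * (1 - Phi(|z|))
    z = abs(beta_hat / se)
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

p = wald_p_value(1.20, 0.25)   # hypothetical coefficient: z = 4.8
print(p < 0.05)                # True: significant at the 5% level
```

A coefficient whose magnitude is small relative to its standard error (say \(z=0.4\)) would instead give a large p-value and be a candidate for removal during backward selection.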
The main weakness of using likelihood as the sole measure for model performance comes from ignoring model complexity (Cameron & Trivedi, 2013). Since adding more model parameters almost always increases the goodness-of-fit, it can lead to overfitting and failure to generalize the data pattern. To account for this weakness, information criteria like AIC and BIC add penalties based on the number of parameters in the model or the sample size of the data.
One of the most widely used information criteria is the Akaike Information Criterion (AIC), first proposed by Akaike (1974). The AIC is calculated by
\[AIC=-2\ell+2k\]
where \(\ell\) is the log-likelihood and \(k\) is the number of adjustable parameters in the model. By taking \(k\) into account, AIC provides a robust measure of model performance while remaining fair to simpler models.
An alternative criterion is the Bayesian Information Criterion (BIC), first proposed by Schwarz (1978) as a large-sample approximation to the Bayes factor. The BIC is calculated by
\[BIC=-2\ell+k\ln{N}\]
where \(N\) is the sample size. The BIC also takes the sample size of the data into account, which further favours simple models on large samples, on the assumption that a larger sample size gives complex models more of an advantage. It is therefore worth noting that BIC almost always imposes a heavier penalty on complex models than AIC, since \(\ln(N)>2\) whenever \(N\geq 8\), and in real-life data the sample size is almost always larger than that.
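Both criteria are one-liners. As a sanity check, plugging the full Poisson log-likelihood from the results tables into the formulas with \(N=2143\) rows and \(k=7\) parameters (our inferred parameter count, an assumption) approximately reproduces the reported AIC and BIC up to rounding of the log-likelihood:

```python
import math

def aic(loglik: float, k: int) -> float:
    return -2 * loglik + 2 * k

def bic(loglik: float, k: int, n: int) -> float:
    return -2 * loglik + k * math.log(n)

# full Poisson: reported log-likelihood -933.253 on n = 2143 rows;
# k = 7 is our inferred parameter count (an assumption)
print(round(aic(-933.253, 7), 3), round(bic(-933.253, 7, 2143), 3))
print(math.log(2143) > 2)  # True, so BIC penalizes harder than AIC here
```

Because \(\ln(2143)\approx 7.67\), each extra parameter costs almost four times as much under BIC as under AIC for this dataset.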
Kuha (2004) and Aho (2014) studied the differences between AIC and BIC but did not give a clear preference toward either of them. Kuha (2004) mentioned that the two most iconic penalized model selection criteria are AIC and BIC, although many other criteria exist and most of them are modifications or generalizations of AIC and BIC. In practice, both criteria can be used to provide a two-perspective way of evaluating the model: by goodness-of-fit and by simplicity.
The data for this project were collected from a study on potential death factors in hemophilia patients. Each row represents a group of patients defined by the potential factors, and the outcome variable death is the death count for that group within a certain observation time.
The groups are defined by three categorical independent variables: hiv with 2 categories, factor with 5 categories, and age with 14 categories. An offset variable py represents the exposure time for a particular group in the study. The original data contain 2144 rows.
We transformed age into a numerical variable using a midpoint approximation, replacing each age interval with its midpoint.
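The midpoint transformation itself is a one-liner. The age bands below are purely hypothetical, since the actual boundaries of the 14 categories are not reproduced in this report excerpt:

```python
# hypothetical age bands (category -> (lower, upper) in years);
# the report's actual 14 categories are not reproduced here
age_bands = {1: (0, 4), 2: (5, 9), 3: (10, 14)}

# replace each categorical level with the midpoint of its interval
age_midpoint = {cat: (lo + hi) / 2 for cat, (lo, hi) in age_bands.items()}
print(age_midpoint[2])  # 7.0
```

This keeps age as a single numeric predictor instead of 13 dummy variables, at the cost of assuming a roughly linear effect within each band.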
Univariate analysis includes summary statistics for the outcome and predictors. For count outcomes or predictors, these include the minimum, maximum, mean, variance, and proportion of zeros.
Since the logarithm of py is used as an offset, it is necessary to check that all py values are larger than 0.
Overdispersion can be tested by computing the dispersion ratio using Pearson statistic or deviance.
We conduct backward selection based on inference on full zero-inflated models ZIP, ZINB, and ZICMP.
Log-likelihood, AIC, and BIC are used to compare all full and adjusted models.
By checking all py values, it was found that row 378, the group with hiv = 1, factor = 2, and age = 14, has py = 0. Since it is unreasonable for the exposure time to be zero and \(\ln(0)\) is undefined, row 378 was removed.
The outcome variable death has values ranging from 0 to 6, with mean 0.215 and variance 0.371. The proportion of zeros in death is large: \(\frac{1832}{2143}\approx 85\%\) of all rows are zeros.
| death | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 1832 | 212 | 62 | 28 | 6 | 2 | 1 |
| Percentage | 85% | 10% | 3% | 1% | <1% | <1% | <1% |
The Spearman correlation matrix shows that hiv has a significant monotonic increasing correlation with death.
The Pearson statistic ratio is 1.406, which is significantly larger than 1, suggesting overdispersion. Cameron & Trivedi’s overdispersion test also provides inferential evidence for overdispersion.
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full ZIP | -892.252 | 1812.504 | 1891.884 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Full ZINB | -890.777 | 1811.553 | 1896.603 |
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full CMP | -928.847 | 1873.693 | 1919.053 |
| Full ZICMP | -898.828 | 1827.649 | 1912.699 |
The zero-inflated models outperform their corresponding standard models, confirming that the data exhibits zero-inflation.
CMP outperformed P across all three evaluators but failed to outperform NB. ZICMP outperformed CMP, but its BIC advantage was very small, and it still underperformed NB in terms of BIC.
The factor predictor in the zero components showed clear statistical insignificance. One adjusted model was therefore constructed from each full zero-inflated model: ZIP-1, ZINB-1, and ZICMP-1.
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Full CMP | -928.847 | 1873.693 | 1919.053 |
| Full ZIP | -892.252 | 1812.504 | 1891.884 |
| Full ZINB | -890.777 | 1811.553 | 1896.603 |
| Full ZICMP | -898.828 | 1827.649 | 1912.699 |
| ZIP-1 | -895.252 | 1810.504 | 1867.204 |
| ZINB-1 | -893.877 | 1809.754 | 1872.123 |
| ZICMP-1 | -905.743 | 1833.486 | 1895.855 |
The full ZINB obtained the best likelihood, ZINB-1 the lowest AIC, and ZIP-1 the lowest BIC. Therefore, the full ZINB, ZIP-1, and ZINB-1 are all strong models for practical use.
For the hemophilia data, ZICMP did not perform better than ZINB or ZIP either before or after parameter reduction. One possible issue is that more complex models may have limited sensitivity in separating overdispersion from zero-inflation.
The results show a positive and significant relationship between death count and HIV. Older age and HIV-positive status are associated with higher death risk.
Future work may compare quasi-Poisson (NB1), NB2, and zero-inflated geometric models, and also further study computational adjustments for ZICMP.
This study provides a complete example of count data modelling, from testing for overdispersion and zero-inflation to model selection, evaluation, and comparison. The adjusted ZINB achieved the best AIC score, while the adjusted ZIP achieved the best BIC score, showing that the adjusted ZIP and ZINB are the best models for this task.
We would like to thank Dr. Chan Moon Tong Tony for his continuous support, patient guidance, and insightful feedback throughout this project.