Hong Kong Metropolitan University
School of Science and Technology
BSc (Hon) in Data Science and Artificial Intelligence
COMP S461F Data Science Final Project Report
2025-26 · Group 1
Supervisor: Dr. Chan Moon Tong Tony
April 2026
In count data modelling, excessive zeros and overdispersion appear frequently in real-life data and significantly degrade the goodness-of-fit of standard Poisson regression models (P). To address these two issues, regression models with different specifications of P have been derived, such as negative binomial regression models (NB), zero-inflated regression models, and Conway-Maxwell Poisson regression models (CMP). NB and CMP introduce an additional dispersion parameter to account for overdispersion, while zero-inflated regression models introduce an additional component modelling the probability of excessive zeros. The performance of these models is illustrated in an application to count data from a study on hemophilia patients, which exhibits both overdispersion and excess zeros. The results show that the zero-inflated Poisson regression model (ZIP) and the zero-inflated negative binomial regression model (ZINB) are the best models for handling overdispersion and zero-inflation, while the more complex and flexible zero-inflated Conway-Maxwell Poisson regression model (ZICMP) underperformed.
Many medical studies involve real-life data whose outcome variables are zero-inflated and overdispersed counts, such as the number of deaths or the number of patients in a particular period of time (Lee et al., 2023). These data are non-negative integers and often have a small mean and a heavily right-skewed distribution that does not follow a normal distribution. Failure to select suitable models for analyzing this kind of data can lead to incorrect results and misleading findings that ultimately provide poor recommendations to the healthcare system in terms of patient management and treatment strategies.
Therefore, it is important to know which models are best for data with count outcomes before analyzing them. In this research project, we studied a dataset collected from a study on hemophilia death factors, with the count of deaths as the outcome variable and various potential factors as independent variables, such as whether the patient has been diagnosed with HIV, the blood clotting level of the patient, and the age of the patient. According to Berry et al. (1992) and the World Health Organization, hemophilia is a rare bleeding disorder in which the blood does not clot properly due to a deficiency in specific clotting proteins. This leads to prolonged bleeding after injuries or surgery, or spontaneous bleeding into joints and muscles, which can cause chronic pain and joint damage.
The harmfulness of hemophilia motivates us to find the best model for our data, one that can provide concrete discoveries about the causes of death for hemophilia patients and also offer a reliable data modelling method for future researchers in related fields. To analyze the data and understand the relationships between the outcome “death” and the potential factors, one approach is to identify statistical models that fit the data well and can break down the complex components or patterns underlying the data.
The primary objective of this project is to identify the best approach for modelling hemophilia count data with overdispersion and zero-inflation.
To achieve this, we divided our project into the following steps:
This study revisited CMP and ZICMP, two relatively new and unexplored count data modelling methods, and compared their performance to classical methods such as ZIP and ZINB in handling overdispersion and zero-inflation in count data. The results of this study provide future researchers with a trustworthy example of the application of CMP and ZICMP, and thus a reliable approach to modelling count data with these models.
This would further enhance the quality of their research outcomes, since mistakes such as using an incorrect model, or model misspecification when handling overdispersion or zero-inflation, can be prevented by following the correct modelling procedures. Better decisions in real-life practice can then be made based on the correct conclusions of such studies. Given how common count data are in real life, the impact of this study is significant.
Section 1 introduces the project background with important fundamental knowledge. Section 2 provides a methodological review of project related literature. Section 3 provides the methodology of this project. Section 4 provides results with interpretation. Section 5 provides discussion on the results obtained and brief conclusion on our project.
Bladen et al. (2013) studied factors that potentially influence the Haemophilia Joint Health Score (HJHS) of young hemophilia patients, such as age, prophylaxis, history of high-titre inhibitors (HTI), and bleeding events. The data were collected from medical and physiotherapy notes of boys with severe hemophilia aged 4–18 years, of whom 53% had a HJHS of zero. Bladen et al. used ZIP to account for the excessive zeros and successfully modelled the data, which led to valuable findings and conclusions.
Jones et al. (2025) studied the association between insufficient or absent clotting factor VIII, the deficiency underlying hemophilia A, and bleeding outcomes, using data collected from the Cost of Haemophilia in Europe: a Socioeconomic Survey (CHESS). Because the data exhibit a large proportion of individuals experiencing zero bleeds, Jones et al. selected generalized Poisson regression models, including P, NB, ZIP, and ZINB, to explore the association between factor activity levels (FALs) and annual bleeding rate (ABR), adjusting for age, BMI, HIV, HBV, and HCV.
The application of count data modelling can be traced back as early as 1898, when Bortkiewicz conducted a study on the annual number of deaths in the Prussian army from horse kicks. He utilized the Poisson distribution, which was derived by Poisson in 1837. Greenwood and Yule (1920) further generalized the Poisson distribution and derived the negative binomial distribution.
Mullahy (1986) proposed a modified count data model, termed hurdle models, that separated the standard Poisson model into two parts. Lambert (1992) proposed zero-inflated Poisson regression (ZIP). Greene (1994) further explored the possibility of specifying an alternative distribution to ZIP, such as NB, which forms a zero-inflated negative binomial model (ZINB).
The general form of MLR is
\[Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon \tag{1}\]
where \(Y\) is the dependent variable, \(X_1,X_2,\cdots,X_p\) are the independent variables or predictors, \(\beta_0\) is the intercept, \(\beta_1,\beta_2,\cdots,\beta_p\) are the regression coefficients or regression parameters, and \(\varepsilon\) is the error term.
The general form of Poisson regression models (P) can be written as
\[\ln{\mu}=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p \tag{2}\]
where the dependent variable Y, given independent variables \(X_1,X_2,\cdots,X_p\), follows a Poisson distribution with mean \(\mu\). Because the logarithm of the conditional mean is linear in the parameters, this model is also called a log-linear model. The probability mass function (pmf) of the Poisson distribution is given by
\[P\left(Y=y\right)=\frac{e^{-\mu}\mu^y}{y!},\quad y=0,1,2,\ldots \tag{3}\]
where \(P\left(Y=y\right)\) refers to the probability that the outcome variable Y equals y, the number of event occurrences. An important assumption of P is that \(E(Y)=Var(Y)=\mu\). This assumption is often violated by real-life data.
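As a quick numerical check of the equidispersion assumption, the sketch below (plain Python, with an illustrative mean chosen near the 0.215 observed later for the death outcome) evaluates the Poisson pmf of equation 3 and confirms that the implied mean and variance coincide:

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    # Poisson pmf of equation 3: P(Y = y) = exp(-mu) * mu^y / y!
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 0.215  # illustrative mean, close to the death outcome's observed mean
support = range(50)  # truncated support; the tail mass is negligible here
mean = sum(y * poisson_pmf(y, mu) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in support)
print(abs(mean - var) < 1e-9)  # True: E(Y) = Var(Y) under the Poisson model
```

Real-life counts whose sample variance clearly exceeds the sample mean already hint that this assumption will fail.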
Generalized linear models (GLM) were first proposed by Nelder and Wedderburn in 1972, then formalized by McCullagh and Nelder (1989). McCullagh (1989) introduced GLM as a generalization of classical linear models that handles many special cases, such as logit and probit models for quantal responses, or log-linear models for counts. Beginning with standard Gaussian linear regression, as shown in equation 1, the form of GLM remains basically similar to the form of MLR, with some specification of the random component Y. Assume the random component Y has independent normal distributions with \(\mathrm{E}\left(Y\right)=\mu\) and constant variance \(\sigma^2\); the systematic component is formed by the covariates \(X_1,X_2,\cdots,X_p\), which produce a linear predictor \(\eta\) so that
\[\eta=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p \tag{4}\]
where the link between the random component and the systematic component is \(\mu=\eta\). This can be written as \(\eta_i=g\left(\mu_i\right)\), where \(g\left(\cdot\right)\) is called the link function. This allows the random component to come from a non-Gaussian distribution such as the Poisson distribution. In modelling counts with a Poisson distribution, since counts based on independence in cross-classified data lead naturally to multiplicative effects (McCullagh, 1989), the model can be expressed in GLM form with the log link, where \(\eta=\ln{\mu}\), or equivalently its inverse \(\mu=e^\eta\). Therefore, the general model form of GLM with a log link function is
\[\mu=e^{\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p} \tag{5}\]
It is worth noting that this model form is identical to the inverse of equation 2, which means it is functionally identical to P. The robustness of GLM leads to its frequent appearance in statistical software such as R and SAS. In this project, all models, including P, NB, CMP, and their zero-inflated counterparts, will be applied in the form of a GLM with a log link function.
The model form of NB is identical to equation 5 because it also belongs to the log-linear model family; the difference is that Y is assumed to follow a negative binomial distribution. The negative binomial distribution is based on the Poisson distribution but allows the variance to be greater than the mean by adding a heterogeneity parameter \(k\) to its pmf. The pmf of the negative binomial distribution is given by
\[P\left(Y=y\right)=\frac{\Gamma\left(y+k\right)}{\Gamma\left(k\right)\,y!}\left(\frac{k}{k+\mu}\right)^k\left(\frac{\mu}{k+\mu}\right)^y \tag{6}\]
where \(\Gamma\left(a\right)=\int_{0}^{\infty}{e^{-t}t^{a-1}dt}\). With larger \(k\), the distribution leans closer to the Poisson distribution. The relationship between the mean \(\mu\) and the variance is given by
\[Var\left(Y\right)=\mu+\frac{\mu^2}{k} \tag{7}\]
where \(\mu^2/k\) is the extra variance that the Poisson distribution fails to capture. Figure 1 displays overdispersion using a distribution simulated from the negative binomial distribution. NB therefore serves as an alternative to Poisson for handling overdispersion: the additional parameter \(k\) in its pmf, shown in equation 6, allows a more flexible variance-mean relationship, so it can capture overdispersed count distributions. However, NB is unable to handle underdispersion, which leaves room for improvement. Still, NB has become one of the most popular approaches to modelling real-life count data, because underdispersion is rarely seen in real-life counts.
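The extra \(\mu^2/k\) term can be verified numerically. The sketch below uses arbitrary illustrative values of \(\mu\) and \(k\) (not fitted estimates from this project) to evaluate the negative binomial pmf of equation 6 and check that its variance equals \(\mu+\mu^2/k\):

```python
import math

def nb_pmf(y: int, mu: float, k: float) -> float:
    # negative binomial pmf of equation 6 with mean mu and heterogeneity k,
    # computed in log space via lgamma for numerical stability
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

mu, k = 0.215, 0.5            # illustrative values, not fitted estimates
support = range(200)          # truncated support; tail mass is negligible
mean = sum(y * nb_pmf(y, mu, k) for y in support)
var = sum((y - mean) ** 2 * nb_pmf(y, mu, k) for y in support)
print(var > mean)             # True: variance exceeds the mean by mu^2/k
```

Smaller \(k\) inflates the variance further, while letting \(k\) grow recovers the Poisson limit.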
The model form of CMP is identical to equation 5, as it also belongs to the log-linear model family. The difference between CMP and other log-linear models lies in the pmf. The pmf of CMP is given by
\[P\left(Y=y\right)=\frac{\lambda^y}{\left(y!\right)^\nu}\frac{1}{Z\left(\lambda,\nu\right)}\]
where \(\lambda\) is the rate parameter under the Poisson model with \(\lambda=E\left(Y^\nu\right)\), and \(\nu\) is the dispersion parameter with \(\nu \geq 0\): \(\nu = 1\) denotes equidispersion, \(\nu > 1\) denotes underdispersion, and \(\nu < 1\) denotes overdispersion. \(Z\left(\lambda,\nu\right)\) is a normalizing constant given by
\[Z\left(\lambda,\nu\right)=\sum_{s=0}^{\infty}\frac{\lambda^s}{\left(s!\right)^\nu}\]
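A minimal sketch of the CMP pmf, truncating the infinite series for \(Z(\lambda,\nu)\); the truncation length and parameter values below are arbitrary choices for illustration. Setting \(\nu=1\) recovers the ordinary Poisson pmf:

```python
import math

def cmp_z(lam: float, nu: float, terms: int = 200) -> float:
    # normalizing constant Z(lambda, nu), truncated series in log space
    return sum(math.exp(s * math.log(lam) - nu * math.lgamma(s + 1))
               for s in range(terms))

def cmp_pmf(y: int, lam: float, nu: float) -> float:
    # CMP pmf: lambda^y / (y!)^nu, divided by the normalizing constant
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1)) / cmp_z(lam, nu)

lam = 1.3
poisson = math.exp(-lam) * lam**2 / math.factorial(2)  # Poisson pmf at y = 2
print(abs(cmp_pmf(2, lam, 1.0) - poisson) < 1e-9)      # True: nu = 1 is Poisson
```

Working in log space avoids overflowing \((s!)^\nu\) for large \(s\), which is the usual practical hurdle with the CMP normalizing constant.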
Lambert (1992) studied the number of defects per area in a manufacturing process using data from a soldering experiment at AT&T Bell Laboratories. In the data, a high proportion of soldered areas had no defects, leading Lambert to introduce a generalized Poisson regression model to handle excess zeros: the zero-inflated Poisson regression model (ZIP). The zero-inflated model specifies
\[P\left(Y=y\right)=\begin{cases}\pi+\left(1-\pi\right)f_2\left(0\right), & y=0\\ \left(1-\pi\right)f_2\left(y\right), & y>0\end{cases}\]
where \(f_2(y)\) is the base count density and \(\pi\) represents the probability that a zero is a “structural zero” rather than a “sampling zero”: the former refers to zeros that cannot be captured by the base distribution and are inflated, while the latter refers to zeros that belong to the base distribution. A binary censoring indicator \(d_i\), with \(d_i=1\) if \(y_i=0\) and \(d_i=0\) otherwise, allows the likelihood of the zero-inflated models to be written compactly.
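The mixture above can be sketched with a Poisson base density \(f_2\); the values of \(\mu\) and \(\pi\) below are arbitrary illustrations, not fitted values:

```python
import math

def zip_pmf(y: int, mu: float, pi: float) -> float:
    # structural zero with probability pi, otherwise a Poisson(mu) draw
    f2 = math.exp(-mu) * mu**y / math.factorial(y)
    return pi * (y == 0) + (1 - pi) * f2

mu, pi = 1.5, 0.4                     # illustrative, not fitted, values
p0_zip = zip_pmf(0, mu, pi)           # zero-inflated probability of a zero
p0_poisson = math.exp(-mu)            # base Poisson probability of a zero
print(p0_zip > p0_poisson)            # True: the zero mass is inflated
```

The probabilities still sum to one, since the inflation only moves mass onto \(y=0\).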
Residuals measure the distance between the fitted values and the actual values of the dependent variable (Cameron & Trivedi, 2013). The Pearson residual \(p_i\) can be written as
\[p_i=\frac{y_i-\hat{\mu}_i}{\sqrt{\hat{\omega}_i}}\]
where \(\hat{\omega}_i\) is an estimate of the variance \(\omega_i\) of \(y_i\). The sum of squares of \(p_i\) gives the Pearson statistic \(P\):
\[P=\sum_{i=1}^{n}p_i^2=\sum_{i=1}^{n}\frac{\left(y_i-\hat{\mu}_i\right)^2}{\hat{\omega}_i}\]
It is the standard measure of goodness-of-fit for any model of \(y_i\) with mean \(\mu_i\) and variance \(\omega_i\). In practice, \(P\) is compared with the degrees of freedom \((n-k)\), where \(n\) is the number of observations and \(k\) is the number of parameters in the model. In the GLM literature it is standard to interpret \(P > (n-k)\) as evidence of overdispersion. However, this interpretation assumes a correct specification of \(\mu_i\), i.e. of the model form, so \(P > (n-k)\) could also indicate model misspecification. Workie and Azene (2021) used Pearson residuals to test for overdispersion; in their case the Pearson statistic divided by the degrees of freedom was higher than 1, which indicated overdispersion.
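The dispersion check can be sketched in a few lines. The counts, fitted means, and parameter count below are hypothetical; for a Poisson model, \(\hat{\omega}_i=\hat{\mu}_i\):

```python
# hypothetical observed counts and Poisson fitted means (omega_i = mu_i)
y = [0, 0, 3, 1, 0, 5, 0, 2]
mu_hat = [0.8, 0.5, 1.2, 1.0, 0.6, 1.5, 0.4, 1.0]

pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu_hat))
k = 2                                  # hypothetical number of parameters
ratio = pearson / (len(y) - k)         # Pearson statistic over (n - k)
print(ratio > 1)                       # True here: evidence of overdispersion
```

A ratio close to 1 is consistent with equidispersion; values well above 1, as in this toy example, point to overdispersion (or misspecification of the mean).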
The deviance residual, or deviance, as defined by Cameron and Trivedi (2013), is given by
\[D\left(\mathbf{y},\hat{\boldsymbol{\mu}}\right)=2\left\{\ell\left(\mathbf{y};\mathbf{y}\right)-\ell\left(\hat{\boldsymbol{\mu}};\mathbf{y}\right)\right\}\]
where \(\hat{\boldsymbol{\mu}}\) and \(\mathbf{y}\) are the \(n\times1\) vectors with \(i^{th}\) entries \(\hat{\mu}_i\) and \(y_i\), respectively. The deviance is the difference between the maximum log-likelihood achievable and the log-likelihood of the fitted model. It is another measure of goodness-of-fit, but it is restricted to GLM. For the method of obtaining the log-likelihood, see section 2.4.2.
Likelihood provides a direct measure of the model’s goodness-of-fit to the data and is given by \(L(\boldsymbol{\beta};\mathbf{y})=\prod_{i=1}^{n}f(y_i\mid\boldsymbol{\beta})\), where \(f(y_i\mid\boldsymbol{\beta})\) is the base density for observation \(y_i\) evaluated at parameters \(\boldsymbol{\beta}\). For example, the likelihood function for the Poisson model can be written as
\[L\left(\boldsymbol{\beta};\mathbf{y}\right)=\prod_{i=1}^{n}\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}\]
where \(n\) refers to the number of observations. It is worth noting that the likelihood function is simply a product of the probabilities of each \(y_i\), where each probability is calculated from the Poisson pmf given in equation 3. Likelihood evaluates the model’s fit to a given dataset, while probability predicts the chance of an event given a fixed probability distribution. In practice, however, the log-likelihood is used more often than the likelihood. The log-likelihood function for Poisson can be written as
\[\ell\left(\boldsymbol{\beta};\mathbf{y}\right)=\sum_{i=1}^{n}\left\{y_i\ln{\mu_i}-\mu_i-\ln{\left(y_i!\right)}\right\}\]
Since multiplying probabilities over many samples may result in extremely small likelihood values, applying the logarithm turns the product into a summation and produces a more stable value that is also more intuitive to interpret.
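The numerical advantage of the log form can be seen directly. A sketch with hypothetical counts and fitted means, comparing the product form against the summation form:

```python
import math

# hypothetical counts and fitted Poisson means
y = [0, 1, 0, 2, 0, 0, 1]
mu = [0.3, 0.6, 0.2, 0.9, 0.4, 0.1, 0.5]

# product of pmf values: underflows toward 0.0 as n grows
likelihood = math.prod(math.exp(-m) * m**yi / math.factorial(yi)
                       for yi, m in zip(y, mu))
# summation form stays well scaled; lgamma(y + 1) = ln(y!)
loglik = sum(yi * math.log(m) - m - math.lgamma(yi + 1)
             for yi, m in zip(y, mu))
print(abs(math.log(likelihood) - loglik) < 1e-12)  # the two forms agree here
```

With thousands of observations, as in this project’s data, the raw product would underflow to zero while the summation remains perfectly usable.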
Cameron and Trivedi (2013) also state that, provided the limit of \(\ell/n\) is maximized at the true parameter value, the maximum likelihood estimate (MLE) \(\hat{\theta}_{ML}\) is the solution to the first-order conditions
\[\frac{\partial\ell}{\partial\boldsymbol{\theta}}=\sum_{i=1}^{n}\frac{\partial\ln{f_i}}{\partial\boldsymbol{\theta}}=\mathbf{0}\]
where \(f_i=f(y_i\mid\mathbf{x}_i,\boldsymbol{\theta})\) is the conditional pmf of the observed count \(y_i\), given predictor vector \(\mathbf{x}_i\) and parameter vector \(\boldsymbol{\theta}\), and \(\partial \ell/\partial \boldsymbol{\theta}\) is a \(q\times1\) vector where \(q\) is the number of model parameters.
After obtaining the MLE of the parameters \(\hat{\theta}_{ML}\), the standard errors of the model can be obtained. For example, for the Poisson MLE \(\hat{\boldsymbol{\beta}}\):
\[Var\left(\hat{\boldsymbol{\beta}}\right)=\left(\sum_{i=1}^{n}\mu_i\mathbf{x}_i\mathbf{x}_i^\prime\right)^{-1}\]
where the variance matrix of \(\hat{\boldsymbol{\beta}}\) is obtained by inverting the Fisher information matrix \(\sum_{i}\mu_i\mathbf{x}_i\mathbf{x}_i^\prime\), in which \(\mathbf{x}_i\mathbf{x}_i^\prime\) is the product of the \(p\times1\) predictor vector \(\mathbf{x}_i\) and its \(1\times p\) transpose. The maximum likelihood standard errors (MLSE) for Poisson are then the square roots of the diagonal entries of the variance matrix, that is \(SE(\hat{\beta}_j)=\sqrt{Var(\hat{\beta}_j)}\).
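For intuition, in an intercept-only Poisson model (\(\mathbf{x}_i=1\) for all \(i\)) the Fisher information collapses to \(\sum_i\mu_i=n\mu\), so the standard error has a closed form. A sketch with hypothetical counts:

```python
import math

# hypothetical counts; intercept-only Poisson, so mu_i = exp(beta0)
y = [0, 1, 0, 2, 0, 0, 1, 3]
n, ybar = len(y), sum(y) / len(y)

beta0_hat = math.log(ybar)       # MLE of the intercept: log of the sample mean
se = 1 / math.sqrt(n * ybar)     # square root of the inverted Fisher information
print(round(beta0_hat, 3), round(se, 3))
```

With covariates the same computation requires inverting a \(p\times p\) matrix, but the structure is identical.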
After obtaining the MLSE for the model, a Wald test can be conducted to determine the statistical significance of each model parameter. The z-value is obtained by
\[z=\frac{\hat{\beta}_j}{SE\left(\hat{\beta}_j\right)}\]
Then the p-value can be obtained by \(p\text{-value}=2\times[1-\Phi(|z|)]\), where \(\Phi(|z|)\) is the standard normal cumulative distribution function (CDF), giving the probability of being less than the absolute z-value under a standard normal distribution. We then compare the p-value to a given significance level \(\alpha\), such as 0.05. If the p-value is smaller than \(\alpha=0.05\), we conclude that the coefficient is statistically significant at the 5% level. For a coefficient that is insignificant, we may consider removing the corresponding variable from the full model. However, statistical insignificance does not mean the variable has no effect on the outcome variable.
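The test is easy to compute with the standard normal CDF expressed through math.erf. The coefficient and standard error below are hypothetical, not fitted values from this project:

```python
import math

def wald_p_value(beta_hat: float, se: float) -> float:
    # two-sided Wald test: p = 2 * (1 - Phi(|z|))
    z = abs(beta_hat / se)
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at |z|
    return 2 * (1 - phi)

p = wald_p_value(1.20, 0.25)   # hypothetical coefficient: z = 4.8
print(p < 0.05)                # True: significant at the 5% level
```

A coefficient whose magnitude is small relative to its standard error (say \(z=0.4\)) would instead give a large p-value and be a candidate for removal during backward selection.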
The main weakness of using likelihood as the sole measure for model performance comes from ignoring model complexity (Cameron & Trivedi, 2013). Since adding more model parameters almost always increases the goodness-of-fit, it can lead to overfitting and failure to generalize the data pattern. To account for this weakness, information criteria like AIC and BIC add penalties based on the number of parameters in the model or the sample size of the data.
One of the most widely used information criteria is the Akaike Information Criterion (AIC), first proposed by Akaike (1974). The AIC is calculated by
\[AIC=-2\ell+2k\]
where \(\ell\) is the log-likelihood and \(k\) is the number of adjustable parameters in the model. By taking \(k\) into account, AIC provides a robust measure of model performance while remaining fair to simpler models.
An alternative criterion is the Bayesian Information Criterion (BIC), first proposed by Schwarz (1978) as a large-sample approximation to the Bayes factor. The BIC is calculated by
\[BIC=-2\ell+k\ln{N}\]
where \(N\) is the sample size. The BIC also takes the sample size of the data into account, which further favours simple models on large samples, on the assumption that a larger sample size gives complex models more of an advantage. It is therefore worth noting that BIC almost always imposes a heavier penalty on complex models than AIC, since \(\ln(N)>2\) whenever \(N\geq 8\), and in real-life data the sample size is almost always larger than that.
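Both criteria are one-liners. As a sanity check, plugging the full Poisson log-likelihood from the results tables into the formulas with \(N=2143\) rows and \(k=7\) parameters (our inferred parameter count, an assumption) approximately reproduces the reported AIC and BIC up to rounding of the log-likelihood:

```python
import math

def aic(loglik: float, k: int) -> float:
    return -2 * loglik + 2 * k

def bic(loglik: float, k: int, n: int) -> float:
    return -2 * loglik + k * math.log(n)

# full Poisson: reported log-likelihood -933.253 on n = 2143 rows;
# k = 7 is our inferred parameter count (an assumption)
print(round(aic(-933.253, 7), 3), round(bic(-933.253, 7, 2143), 3))
print(math.log(2143) > 2)  # True, so BIC penalizes harder than AIC here
```

Because \(\ln(2143)\approx 7.67\), each extra parameter costs almost four times as much under BIC as under AIC for this dataset.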
Kuha (2004) and Aho (2014) studied the differences between AIC and BIC but did not give a clear preference toward either of them. Kuha (2004) mentioned that the two most iconic penalized model selection criteria are AIC and BIC, although many other criteria exist and most of them are modifications or generalizations of AIC and BIC. In practice, both criteria can be used to provide a two-perspective way of evaluating the model: by goodness-of-fit and by simplicity.
The data for this project were collected from a study on potential death factors in hemophilia patients. Each row represents a group of patients defined by the potential factors, and the outcome variable death is the death count for that group within a certain observation time.
The groups are defined by three categorical independent variables: hiv with 2 categories, factor with 5 categories, and age with 14 categories. An offset variable py represents the exposure time for a particular group in the study. The original data contain 2144 rows.
We transformed age into a numerical variable using a midpoint approximation, replacing each age interval with its midpoint.
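The midpoint transformation itself is a one-liner. The age bands below are purely hypothetical, since the actual boundaries of the 14 categories are not reproduced in this report excerpt:

```python
# hypothetical age bands (category -> (lower, upper) in years);
# the report's actual 14 categories are not reproduced here
age_bands = {1: (0, 4), 2: (5, 9), 3: (10, 14)}

# replace each categorical level with the midpoint of its interval
age_midpoint = {cat: (lo + hi) / 2 for cat, (lo, hi) in age_bands.items()}
print(age_midpoint[2])  # 7.0
```

This keeps age as a single numeric predictor instead of 13 dummy variables, at the cost of assuming a roughly linear effect within each band.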
Univariate analysis includes summary statistics for the outcome and predictors. For count outcomes or predictors, these include the minimum, maximum, mean, variance, and proportion of zeros.
Since the logarithm of py is used as an offset, it is necessary to check that all py values are larger than 0.
Overdispersion can be tested by computing the dispersion ratio using Pearson statistic or deviance.
We conduct backward selection based on inference on full zero-inflated models ZIP, ZINB, and ZICMP.
Log-likelihood, AIC, and BIC are used to compare all full and adjusted models.
By checking all py values, it was found that row 378, the group with hiv = 1, factor = 2, and age = 14, has py = 0. Since it is unreasonable for the exposure time to be zero and \(\ln(0)\) is undefined, row 378 was removed.
The outcome variable death has values ranging from 0 to 6, with mean 0.215 and variance 0.371. The proportion of zeros in death is large: \(\frac{1832}{2143}\approx 85\%\) of all rows are zeros.
| death | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Count | 1832 | 212 | 62 | 28 | 6 | 2 | 1 |
| Percentage | 85% | 10% | 3% | 1% | <1% | <1% | <1% |
The Spearman correlation matrix shows that hiv has a significant monotonic increasing correlation with death.
The Pearson statistic ratio is 1.406, which is significantly larger than 1, suggesting overdispersion. Cameron & Trivedi’s overdispersion test also provides inferential evidence for overdispersion.
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full ZIP | -892.252 | 1812.504 | 1891.884 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Full ZINB | -890.777 | 1811.553 | 1896.603 |
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full CMP | -928.847 | 1873.693 | 1919.053 |
| Full ZICMP | -898.828 | 1827.649 | 1912.699 |
The zero-inflated models outperform their corresponding standard models, confirming that the data exhibits zero-inflation.
CMP outperformed P across all three evaluators but failed to outperform NB. ZICMP outperformed CMP, but its BIC advantage was very small, and it still underperformed NB in terms of BIC.
The factor predictor in the zero components showed clear statistical insignificance. One adjusted model was therefore constructed from each full zero-inflated model: ZIP-1, ZINB-1, and ZICMP-1.
| Model | Log-likelihood | AIC | BIC |
|---|---|---|---|
| Full Poisson | -933.253 | 1880.515 | 1920.195 |
| Full NB | -918.211 | 1852.422 | 1897.782 |
| Full CMP | -928.847 | 1873.693 | 1919.053 |
| Full ZIP | -892.252 | 1812.504 | 1891.884 |
| Full ZINB | -890.777 | 1811.553 | 1896.603 |
| Full ZICMP | -898.828 | 1827.649 | 1912.699 |
| ZIP-1 | -895.252 | 1810.504 | 1867.204 |
| ZINB-1 | -893.877 | 1809.754 | 1872.123 |
| ZICMP-1 | -905.743 | 1833.486 | 1895.855 |
The full ZINB obtained the best likelihood, ZINB-1 the lowest AIC, and ZIP-1 the lowest BIC. Therefore, the full ZINB, ZIP-1, and ZINB-1 are all strong models for practical use.
For the hemophilia data, ZICMP did not perform better than ZINB or ZIP either before or after parameter reduction. One possible issue is that more complex models may have limited sensitivity in separating overdispersion from zero-inflation.
The results show a positive and significant relationship between death count and HIV. Older age and HIV-positive status are associated with higher death risk.
Future work may compare quasi-Poisson (NB1), NB2, and zero-inflated geometric models, and also further study computational adjustments for ZICMP.
This study provides a complete example of count data modelling, from testing for overdispersion and zero-inflation to model selection, evaluation, and comparison. The adjusted ZINB achieved the best AIC score, while the adjusted ZIP achieved the best BIC score, showing that the adjusted ZIP and ZINB are the best models for this task.
We would like to thank Dr. Chan Moon Tong Tony for his continuous support, patient guidance, and insightful feedback throughout this project.