(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Nowcasting the 2022 mpox outbreak in England [1]

Christopher E. Overton (Department of Mathematical Sciences, University of Liverpool, Liverpool, United Kingdom; UK Health Security Agency, Data Science and Analytics, London; Department of Mathematics)

Date: 2023-10

In May 2022, a cluster of mpox cases was detected in the UK that could not be traced to recent travel history from an endemic region. Over the following months, the outbreak grew, with over 3000 total cases reported in the UK, and similar outbreaks occurring worldwide. These outbreaks appeared linked to sexual contact networks between gay, bisexual and other men who have sex with men. Following the COVID-19 pandemic, local health systems were strained, so effective surveillance for mpox was essential for managing public health policy. However, the mpox outbreak in the UK was characterised by substantial delays in the reporting of the symptom onset date and specimen collection date for confirmed positive cases. These delays led to substantial backfilling in the epidemic curve, making it challenging to interpret the epidemic trajectory in real time. Many nowcasting models exist to tackle this challenge in epidemiological data, but existing tools lacked the flexibility required for this setting. We have developed a nowcasting model using generalised additive models that makes novel use of individual-level patient data to correct the mpox epidemic curve in England. The aim of this model is to correct for backfilling in the epidemic curve and provide real-time characteristics of the state of the epidemic, including the real-time growth rate. This model benefited from close collaboration with individuals involved in collecting and processing the data, enabling temporal changes in the reporting structure to be built into the model, which improved the robustness of the nowcasts generated. The resulting model accurately captured the true shape of the epidemic curve in real time.

During 2022, outbreaks of mpox, the disease caused by the monkeypox virus, occurred simultaneously in multiple non-endemic countries, including England. These outbreaks were distinct from historic outbreaks, with a majority of cases in gay, bisexual and other men who have sex with men, and in individuals without recent travel histories to endemic countries. To inform public health policy, it is essential to understand the number of new cases and the growth rate of the outbreak in real time. However, the outbreak was characterised by long delays between individuals developing symptoms (or taking a test) and being reported as a positive case. This creates a biased picture of the outbreak, where observed real-time cases underestimate the true extent of the outbreak. We developed a mathematical model that accounts for these reporting delays to estimate the true shape of the epidemic curve in real time. The modelled outputs accurately capture the true shape of the epidemic and provide improved real-time insight over the raw data. This model was used continuously throughout the outbreak response in the UK to provide insight to the incident management team at the UK Health Security Agency.

Data Availability: Training data for the modified data used to inform the best fitting models is available at https://github.com/OvertonC/nowcasting_the_2022_mpox_outbreak_in_england . This data has had dates removed to improve anonymity. This data enables the non-parametric models to be used, and can also be used as training data for future development of nowcasting models. Code for running all models is available at https://github.com/OvertonC/nowcasting_the_2022_mpox_outbreak_in_england . Individual-level data on the reporting delay used to inform the parametric models has not been made available due to potential identifiability. An application for data access can be made to the UK Health Security Agency. Data requests can be made to the Office for Data Release ( https://www.gov.uk/government/publications/accessing-ukhsa-protected-data/accessing-ukhsa-protected-data ) and by contacting [email protected] . All requests to access data are reviewed by the Office for Data Release and are subject to strict confidentiality provisions in line with the requirements of: the common law duty of confidentiality, data protection legislation (including the General Data Protection Regulation), Caldicott principles, the Information Commissioner’s statutory data sharing code of practice, and the national data opt-out programme.

In this paper, we develop a novel nowcasting model using generalised additive models, implemented in the mgcv [ 15 ] package in R [ 16 ]. We propose a range of models, ranging from fully non-parametric to hybrid models using a mixture of parametric and non-parametric aspects, which we compare using coverage, bias, and weighted interval scores using the scoringutils [ 17 ] package. We then discuss a range of challenges that have emerged in real-time during the nowcasting of the mpox outbreak, and describe the solutions we have applied.

Another challenge arises when the ascertainment rate changes over time. For example, the ascertainment rate of symptom onset dates is likely to be high at the start of an outbreak, when public health teams can dedicate substantial time to each individual case. However, as case numbers rise, it may not be feasible to continue such an intensive individual-level response. This can lead to the ascertainment rate of symptom onset data declining as case numbers rise, which artificially dampens the epidemic curve. Adjusting for this would require methods that perform nowcasting and correct for changing case ascertainment simultaneously. This is an important area for future nowcasting research, but we do not account for it here due to the time pressures involved in delivering an operational product.

A further challenge in real-time nowcasting arises when the reporting structure changes. Nowcasting fits a relationship to the historic reporting structure and assumes this remains the same going forwards. Time-varying reporting structures can be fitted, but this requires sufficient data to pick up recent changes in the trend. Therefore, at times when the reporting structure changes, the nowcast can be compromised. To get around this, we modify the historic data so that it reflects the current reporting structure. This involves discussing details of the data reporting changes with those involved in data collection and processing, to understand how the reporting structure is changing for the recent data. This close collaboration with those collecting and curating the data is vital for building nowcasting models in real time as part of an operational outbreak response, and we show that not taking this knowledge into account reduces the performance of our models.

There are a range of nowcasting tools and packages available [6-11]. However, in any nowcasting problem, the unique characteristics of the local setting and data processing mean that a general tool may not be suitable. For example, through working directly with data processing teams at UKHSA, we were able to understand the data reporting cycle and its temporal changes, which it is important to build directly into the model. Also, having access to individual-level data enabled direct estimation of the reporting delay distributions, so it was useful to have flexible models into which these distributions could be incorporated directly. In addition, computational tractability and reliability were vital, to ensure modelling products could be delivered every week. Another challenge was that many of the existing packages require a Bayesian implementation, which could not be run on the IT systems used to access the data, hindering the development of the efficient modelling pipelines essential for operational product delivery. Therefore, we developed a nowcasting model that makes novel use of individual-level patient data using hierarchical generalised additive models (GAMs), based on the Chain Ladder nowcasting method [12]. These models have a similar structure to GAM-based nowcasting models used in the German COVID-19 nowcasting hub [13], but we consider both parametric and non-parametric approaches to the reporting delay distribution, making optimal use of the available individual-level patient data. These models are easy to implement and run, though not as theoretically robust as the Bayesian approaches. However, the developed model has sufficient flexibility to be modified in real time as the data landscape evolves. This made the model robust to changes in data processing, which allowed continuous delivery of the operational nowcasting product throughout the mpox outbreak, with the outputs regularly included in the UKHSA monkeypox technical briefing [14].

As part of the surveillance aspect of the public health response in the UK, it is vital to understand the shape of the epidemic curve, to ensure public health teams are aware of the current epidemic trajectory and case burden, and to facilitate evaluation of interventions in real time. Ideally, when monitoring an epidemic curve, one would look at the infection date of individuals. However, infection date is rarely observed directly. Instead, other variables need to be used as a proxy for infection date, each of which is lagged relative to infection. The shorter the lag, the more useful the proxy, since the shape of the epidemic curve will be less perturbed and the curve will be more closely aligned temporally with infections. However, events that occur closer to the time of infection may be more affected by reporting delays, in that the delay from the event occurring to it being reported to public health teams may be longer. This can bias the epidemic curve when using these proxy events in real time. Nowcasting is an area of research that attempts to correct for this reporting delay bias when reproducing the epidemic curve.

The detection of cases in May 2022 quickly showed signs atypical of previous mpox outbreaks, since multiple cases were detected in many countries simultaneously, and most cases had no known travel links to endemic regions. The majority of cases in this outbreak occurred in gay, bisexual and other men who have sex with men (GBMSM). Case questionnaire data suggest close sexual contact, involving prolonged skin-to-skin contact, as a likely driver of transmission. The international dispersion of the virus resulted in the largest outbreak of mpox reported outside of Africa, and in July 2022 the WHO declared the outbreak a Public Health Emergency of International Concern (PHEIC) [5].

Mpox, a disease caused by the monkeypox virus [1], was first discovered in 1958 in research monkeys that showed signs of a poxvirus [2]. In 1970, the first documented human cases of mpox were detected in the Democratic Republic of the Congo (DRC) [3]. Since then, the disease has become endemic in the DRC and spread to other Central and West African countries. Despite this endemicity, outbreaks of mpox have often been zoonotic in origin [4]. Prior to 2022, cases and transmission of mpox had occurred outside of Central and West African countries, but these usually consisted of a primary case with recent travel history to endemic regions, followed by self-limiting local outbreaks. In May 2022, cases of mpox were detected in multiple non-endemic countries.

The second issue concerns nowcasting by specimen date. As the epidemic progressed, additional testing laboratories were opened, and the test processing procedure changed. The main change increased the processing requirements before a test could be confirmed: as of 30 June 2022, an extra day of processing was added before results enter the UKHSA database. In addition, as the outbreak progressed, weekend processing was changed to allow for reduced staffing levels. This led to tests no longer being processed on a Sunday, which changed the weekly reporting cycle. This mainly affects specimens collected on Fridays and Saturdays, which have approximately one extra day of delay relative to earlier samples. Specimens collected on other days may also have been affected, but in general the reporting labs intended to process specimens within two days of collection, so Fridays and Saturdays are most sensitive to this change.

In addition to the data coverage issues (Section 2.2), there are other limitations to the data which create challenges for the nowcasting. The first issue concerns nowcasting by symptom onset date. When nowcasting an epidemic curve, it is essential to have reliable estimates of the delay from the event of interest occurring to this event being reported in the data. However, sometimes individual cases may be reported before data are available on all events of interest, with these data being added to the record over time. This means that the “date reported” variable for an individual may not correspond to the “date event reported” for that individual. This presents a challenge for nowcasting, since it means that the historic delays from “event” to “date reported” may not reflect the actual distribution of delays from “event” to “date event reported”, which governs how the epidemic curve evolves in real time. One solution to this is to have daily timestamps of the database, which track the state of data in real time. However, often these may not be available due to database design constraints, which is the case for the UK mpox data.

In the nowcasting, we consider two key variables by which the epidemic curve will be visualised. These are specimen date, the date a positive swab is taken, and symptom onset date, the date a positive individual develops symptoms. Specimen date has a high completion rate, being above 90% for most of the outbreak ( Fig 1(A)) . At the start of the outbreak, symptom onset date also had a high completion rate. However, this relied on labour intensive questionnaires from local health protection teams. As the outbreak grew in size, the policy changed and questionnaires became optional. This led to a sharp drop in coverage, from over 90% before June 2022 to around 50% from mid July 2022 ( Fig 1(B)) . Over June 2022, the coverage was continually declining, which will bias the shape of the observed epidemic curve by symptom onset date during June.

The mpox linelist comprised 3,817 cases as of 31 December 2022. In this paper, we focus on the acute phase of the outbreak, running the modelling until 31 August 2022, by which point 3,484 cases had been reported. These data are biased towards males and the age groups 25-34 and 35-44 (Table 1). We remove individuals who reported recent foreign travel (964 as of 31 December 2022), since many of these individuals may have been infected abroad.

Data on daily cases are obtained from the mpox case linelist. The linelist is compiled Monday to Friday by the Outbreak Surveillance Team and the South East and London Field Service team, using testing data from MOLIS (Molecular Laboratory Information System) and SGSS (Second Generation Surveillance System), combined with operational data from HPZone collected by Health Protection Teams. The linelist provides information about cases such as symptom onset date (the date of the first symptom, from HPZone), specimen date (the date the specimen was taken for the test, from MOLIS/SGSS) and the date confirmed cases were reported to the mpox Incident Management Team (IMT).

3. Methods

3.1. What is nowcasting? Nowcasting aims to estimate the number of events occurring on a given day that are yet to be reported [18]. Reported data can often be considered as a “reporting triangle”. That is, for data sufficiently long ago, we have subsequent reports starting on that day and running up to today. For each subsequent day, we have one fewer day of reporting, up until the most recent day, where we only have data reported on that day. Therefore, the reported data forms a triangle. Nowcasting aims to complete this reporting triangle by turning it into a reporting square, where all observation days have the full range of subsequently reported values.
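
The reporting triangle can be constructed directly from a line list. The paper's implementation uses R with mgcv; the following is a minimal Python sketch, with an illustrative record format (pairs of event date and report date) that is an assumption, not the UKHSA schema:

```python
from datetime import date

def reporting_triangle(records, as_of, max_delay):
    """Build a reporting triangle: counts[o][d] = number of events that
    occurred on day o and were reported d days later, as known on `as_of`.
    `records` is a list of (event_date, report_date) pairs."""
    counts = {}
    for event_date, report_date in records:
        d = (report_date - event_date).days
        if 0 <= d <= max_delay and report_date <= as_of:
            counts.setdefault(event_date, [0] * (max_delay + 1))[d] += 1
    return counts

# Toy data: events on two days, reported with varying delays.
records = [
    (date(2022, 7, 1), date(2022, 7, 2)),  # delay 1
    (date(2022, 7, 1), date(2022, 7, 4)),  # delay 3
    (date(2022, 7, 3), date(2022, 7, 4)),  # delay 1
    (date(2022, 7, 3), date(2022, 7, 8)),  # delay 5, not yet reported on 2022-07-05
]
tri = reporting_triangle(records, as_of=date(2022, 7, 5), max_delay=7)
```

For recent event dates, only the first few delay cells are observed; the remaining cells are the missing corner of the triangle that the nowcast must fill in.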

3.2. Nowcasting model The number of events occurring on a given day, y(o), is the sum over all possible reporting delays of the number of events y_d(o) that occurred on day o and were reported after d days,

y(o) = Σ_{d=0}^{∞} y_d(o).

The summand is equal to the sum over all events that occurred on day o of the probability that an event is reported after d days, i.e.

y_d(o) = y(o) f_θ(d),

where f_θ(d) is the probability that an event is reported after d days. Since both the epidemic and the reporting process are stochastic, we assume there is some error in the reported data. This error is proportional to the magnitude of the observations, since the stochasticity acts on an individual basis, so we assume the observations are given by y_d(o) × ϵ. Since we are modelling an epidemic, we assume that the number of events each day follows an exponential function in time. However, instead of assuming a fixed exponential rate, we allow the exponential rate to be a smooth function of time. That is, the number of events on day o is given by

y(o) = exp(s(o)),

for a smooth function s(·). For the probability of a reporting delay of length d, f_θ(d), we can consider a few approaches:

1. The probability can be fit independently for each delay length, for example using a generalised linear model.
2. The probability can be assumed to follow a smooth function of the delay length, which can be fit using a spline.
3. A parametric model for the reporting delay can be assumed.

In this paper, we focus on approaches (2) and (3), since it is reasonable to assume that the reporting delay distribution is a smooth function of the delay. Approach (1) is the approach used in the Mack Chain Ladder method [12]. To allow the reporting delay distributions to change over time, we consider a maximum value, T, for the reporting delay. Above this value, we assume the probability of an event being reported to be zero. This means that the model is fitted using data from the last T days, so it reflects the reporting delay over this period rather than over the whole outbreak. In addition to capturing time-varying reporting trends, this assumption also reduces the computational time of the model, which makes it more efficient for operational use. Therefore, we have

y(o) = Σ_{d=0}^{T} y_d(o).

When selecting a suitable value for T, we need to ensure that it is sufficiently large to capture all likely reporting delays. Additionally, to aid computational time, we only fit data from the last T days, rather than only truncating the delay distribution at T. Therefore, we also want T to be large enough for the model to identify trends in the data.
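
To illustrate the role of the completion probabilities, the crudest possible point nowcast divides each partial count by the probability that a report has arrived by now, F(D) = Σ_{d≤D} f_θ(d). The GAM performs this correction jointly with smoothing and uncertainty quantification; this hypothetical sketch shows only the core arithmetic:

```python
def naive_nowcast(partial_count, days_since_event, f):
    """Scale a partially observed count by the expected proportion reported
    so far, F(D) = sum_{d <= D} f(d). A crude point nowcast, ignoring the
    smoothing and error structure that the full model provides."""
    F = sum(f[: days_since_event + 1])
    return partial_count / F

# Assumed delay distribution with maximum delay T = 3.
f = [0.4, 0.3, 0.2, 0.1]
# 7 cases reported within 1 day of the event: expect roughly 7 / 0.7 = 10 in total.
estimate = naive_nowcast(7, 1, f)
```

This also shows why recent days are so uncertain: for days_since_event = 0, the observed count is divided by a small F, so small stochastic fluctuations in reporting are amplified.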

3.3. Smooth non-parametric model Under the assumption that the reporting delay can be described by a smooth function of the delay, we assume that f_θ(d) ≈ exp(s(d)), which leads to

y_d(o) = exp(s_1(o) + s_2(d)) × ϵ.

To solve this, we can use a generalised additive model with a negative binomial error structure,

log E[y_d(o)] = s_1(o) + s_2(d) + wday(o) + wday(d),

where s_1(·) and s_2(·) are smoothers. Here, wday(o) and wday(d) are random effects capturing the day-of-week cycle in the epidemic curve and in the reporting trends, respectively. A negative binomial error structure is chosen because the daily case counts are integer valued and overdispersed (since they are generated from a stochastic process). For the smoothers, we need to specify the type and number of basis functions. We use penalised cubic regression splines, and the number of basis functions depends on the length of the data series. For the epidemic curve (event date smoother), we add one basis function for every τ dates, which effectively allows the epidemic to change shape every τ days. We consider two values, τ ∈ {7, 14}, and evaluate performance to select the optimal number of basis functions. When selecting τ, the model trades off flexibility (smaller τ) against sensitivity to the part of the data with substantial backfilling (larger τ). Since the last part of the data is highly sensitive to backfilling, we do not want too much flexibility, otherwise the model may overreact to stochastic underreporting. For the reporting delay smoother, we add 10 basis functions across the range of permitted values. Fewer than 10 led to the model fitting the data poorly, and more led to overfitting. The resulting model, with splines used for both the epidemic curve term (s_1(o)) and the reporting delay term (s_2(d)), is a simplified version of the model described in [19], with penalised cubic regression splines used instead of P-splines. However, instead of extrapolating the reporting delays outside of the observation window, we truncate using the maximum possible reporting delay, as described in Section 3.2.
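
Fitting a model of this form requires the reporting triangle in long format, with one row per (event day, delay) cell and unobserved cells omitted rather than treated as zeros. A Python sketch with toy data (the actual fit uses penalised splines in mgcv; the dict-of-lists triangle format here is illustrative):

```python
from datetime import date

def triangle_to_long(counts, max_delay, as_of):
    """Flatten a reporting triangle into (event_day, delay, count) rows.
    Cells whose report date would fall after `as_of` are not yet observed,
    so they are omitted rather than recorded as zero counts."""
    rows = []
    for o, by_delay in counts.items():
        for d in range(max_delay + 1):
            if (as_of - o).days >= d:
                rows.append((o, d, by_delay[d]))
    return rows

# Toy triangle: counts[event_day][delay], as known on 2022-07-06.
tri = {
    date(2022, 7, 3): [1, 0, 2, 0],
    date(2022, 7, 5): [2, 1, 0, 0],
}
rows = triangle_to_long(tri, max_delay=3, as_of=date(2022, 7, 6))
```

Dropping the unobserved cells, rather than filling them with zeros, is what lets the fitted surface extrapolate into the missing corner of the triangle instead of being dragged towards zero.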

3.4. Parametric reporting delay model Alternatively, if the reporting delay is known to be smooth and unimodal, it may be well approximated by a parametric distribution, as is often done when describing time-delay distributions [20,21]. An advantage of this is that biases such as right-truncation or right-censoring can be explicitly corrected in the parametric survival model. In this case, we determine f_θ(d) before fitting the nowcasting model, and the parametric distribution can then be added to the model, giving

log E[y_d(o)] = s_1(o) + wday(o) + wday(d) + log f_θ(d).

The parametric distribution can be incorporated into the model as either a fixed effect, a random effect, or an offset. We opt for an offset, since this directly links the model to the probabilistic derivation in Section 3.2. To estimate the delay distribution and correct for right-truncation, we apply the method from [22], which uses maximum likelihood estimation to find the best-fitting distribution. For each case, we record two data points: the date of the event (S) and the date of report (E). Therefore, to estimate f_θ(·), we need to model the likelihood of observing the data conditional on the report arriving before the final observation date, D,

L(θ; S, E) = g_θ(E − S) / G_θ(D − S),

where g_θ(·) is the probability density function of the reporting delay distribution, and G_θ(·) is the cumulative distribution function. We assume that the reporting delay follows a Weibull distribution. To calculate the probability of cases being reported after d days, f_θ(d), we discretise the fitted distribution,

f_θ(d) = G_θ(d + 1) − G_θ(d).

To fit this distribution, we use maximum likelihood estimation. We only consider the central estimate of the distribution, rather than propagating distribution uncertainty through the nowcasting model. Extensions using a Bayesian version of this method, such as [23], could be applied to propagate the distribution uncertainty correctly, for example by bootstrapping the nowcasting model across posterior samples of the reporting delay. However, such methods were not applicable, since Bayesian packages could not be used on the IT systems used to access the data. Additionally, the model uncertainty was found to be well calibrated using only the central estimate, so the additional uncertainty would not substantially improve performance. This model extends existing GAM-based nowcasting models [19,24] through the explicit incorporation of the right-truncation-corrected reporting delay distribution, f_θ(d), instead of fitting the reporting delay within the nowcasting model. This allows the model to make optimal use of individual-level data describing the reporting process.
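
The right-truncation-corrected fit can be sketched as follows, assuming scipy is available. The likelihood divides the density of each observed delay by the probability that the report could have arrived before the data cut-off; the discretisation in `delay_pmf` is one plausible choice rather than necessarily the paper's exact form:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def fit_truncated_weibull(delays, time_to_cutoff):
    """Right-truncation-corrected Weibull MLE (sketch of the idea in the text).
    `delays` are observed event-to-report delays; `time_to_cutoff` is, for each
    case, the time from the event date to the end of the observation window."""
    delays = np.asarray(delays, dtype=float)
    time_to_cutoff = np.asarray(time_to_cutoff, dtype=float)

    def nll(params):
        shape, scale = np.exp(params)  # log-parameterised to keep both positive
        log_g = weibull_min.logpdf(delays, shape, scale=scale)
        log_G = weibull_min.logcdf(time_to_cutoff, shape, scale=scale)
        return -np.sum(log_g - log_G)  # density / truncation probability

    res = minimize(nll, x0=np.log([1.5, 5.0]), method="Nelder-Mead")
    return np.exp(res.x)  # (shape, scale)

def delay_pmf(shape, scale, T):
    """Discretise the fitted distribution: f(d) = G(d + 1) - G(d),
    renormalised over the permitted delays 0..T."""
    d = np.arange(T + 1)
    f = weibull_min.cdf(d + 1.0, shape, scale=scale) - weibull_min.cdf(d, shape, scale=scale)
    return f / f.sum()
```

With a cut-off far beyond all delays the correction term vanishes and this reduces to an ordinary Weibull MLE; near the cut-off, the denominator upweights long delays that are systematically missing from the observed data.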

3.5. Adjusting for data limitations and changes to the reporting cycle In Section 2.3, data limitations are discussed that could bias the nowcasting. Where these limitations affect all days equivalently, they can be mitigated by changing the data to reflect the true data reporting cycle, leaving the model structure unchanged. The aim of these changes is to make the historic data reflect current reporting practices, so that the model can estimate an appropriate reporting delay distribution. To reflect the changes to specimen processing, we amend the data prior to 30 June 2022 to have an extra day between specimen collection and reporting. For symptom onset date, we do not apply this change to the reporting cycle, since the symptom onset curve is less sensitive to it than the specimen date curve. However, we still need to modify the symptom onset data to reflect the difference between the date the case was reported and the date the symptom onset date was reported. Early analysis showed that in the majority of cases the symptom onset date was added after the case was reported. Therefore, the "reported date" will be earlier than the "symptom onset date reported date"; the latter is the variable needed for nowcasting, since it reflects the delay to which the most recent data points are exposed. Through this early analysis, the average reporting delay was found to be approximately two days longer than that suggested by the "reported date". Therefore, we consider a modified version of the data, where the "reported date" is set two days later than the recorded "reported date". Whilst the symptom onset date and 30 June 2022 specimen date corrections affect all days of the week equally, the data processing changes on 6 July 2022 introduced a more complicated change, altering the reporting cycle for tests processed over the weekend. To account for this, we altered the historic data where the specimen date was a Friday or Saturday, adding an extra day to the corresponding date of report. However, this resulted in specimens collected on a Friday or Saturday having a different reporting cycle to the other weekdays. To adjust for this in the non-parametric model, we added independent reporting delay splines to the GAM, based on whether the specimen date was a Friday or Saturday, or not. We incorporated these as fully independent smoothers, with no shared trend or smoothness [25]. This results in the model

log E[y_d(o)] = s_1(o) + s_{2,FS(o)}(d) + wday(o) + wday(d),

where FS(o) is an indicator of whether day o is a Friday or Saturday, or another day of the week, and the s_{2,FS}(·) are independent reporting delay smoothers. This model could be adjusted to use smoothers with either shared trends or shared smoothness, if appropriate. In the parametric model, we fitted f_θ(·) independently depending on whether the event date was a Friday or Saturday, or another day of the week. This is conceptually similar to the non-parametric approach. When evaluating model performance, we consider models with the amended data and models with the original data, to check whether the data-driven modifications improve model accuracy. These data limitations demonstrate the importance of communicating with data collection and processing teams when developing nowcasting methods, to ensure any large changes or complications in the data stream can be modelled appropriately.
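
As a concrete illustration of this kind of data modification, the historic Friday and Saturday specimens can be shifted before the reporting triangle is built. The rule follows the text, but the function and its handling of the change date are a simplified, hypothetical sketch:

```python
from datetime import date, timedelta

CHANGE_DATE = date(2022, 7, 6)  # date the weekend processing change took effect

def amend_report_date(specimen_date, report_date):
    """Shift historic report dates so that old records mimic the current
    reporting cycle: specimens taken on a Friday (weekday 4) or Saturday (5)
    before the change get one extra day of reporting delay."""
    if specimen_date < CHANGE_DATE and specimen_date.weekday() in (4, 5):
        return report_date + timedelta(days=1)
    return report_date
```

Applying such a shift to the historic records, rather than changing the model, keeps a single consistent reporting delay distribution across the training window.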

3.6. Regional hierarchical model The initial outbreak was centred in London. As the outbreak spread, cases started occurring in the other regions of England, but at much lower numbers. The geographic range of the spread means that a single epidemic curve for the whole of England may no longer be reliable, as it could conceal regional differences. Therefore, we constructed a hierarchical model [25] to allow the epidemic curve to vary across regions. Since the regional case counts are far lower than London's, we assume that the reporting delay distribution is consistent across the country, so that all regions follow the same reporting cycle. This allows the smaller regions to borrow national information when parameterising the reporting delay, avoiding excessive uncertainty. We then fit independent smoothers to the epidemic curve for each region, to allow regional variation. We also incorporate a national epidemic curve to account for correlation between regions. This creates a hierarchical structure where regions share some trend but have different smoothness. This results in the model

log E[y_d(o, r)] = β_{1,r} + s_1(o) + s_{1,r}(o) + s_2(d) + wday(o) + wday(d),

where β_{1,r} is a region-specific adjustment to the intercept, s_1(·) is the shared national epidemic curve smoother, and the s_{1,r}(·) are independent region-specific smoothers. For operational purposes, we opted for a two-region breakdown, London and non-London, but the method could be applied to smaller regions.

3.7. Confidence intervals and prediction intervals There are two methods for quantifying uncertainty in nowcasting models: confidence intervals and prediction intervals. Confidence intervals capture the uncertainty about the expected future trend, but do not capture noise in the future data. Prediction intervals capture both uncertainty in the trend and noise in the data, by also including uncertainty in the data generating process. That is, prediction intervals indicate the interval in which the observed future values are expected to fall. Since the mpox data are noisy, and case numbers relatively low, we focus on prediction intervals, since these provide the range of extreme data values that we might observe, which is useful for surveillance. To calculate prediction intervals, we generate posterior samples from the fitted model, using the Metropolis-Hastings sampler provided by the gam.mh function in mgcv. For each posterior sample of the model coefficients, we calculate the expected value of the model and convert back to the response scale. Taking these expected values, we can simulate realisations of the data generating process. In this model, we assume that the data are sampled from a negative binomial distribution. In mgcv, the negative binomial distribution is parameterised in terms of the mean, μ, and variance, σ², such that μ = E[y] and σ² = μ + μ²/θ. In this formulation, the θ parameter is also fitted. Therefore, given the expected values of the model, μ, and the fitted θ (sometimes referred to as the size parameter), we can simulate realisations of the model from a negative binomial distribution with these parameters. For each posterior sample of the coefficients, we draw a random realisation from the negative binomial model. Combining these across all posterior samples, we can calculate the prediction intervals of the model. The full posterior prediction sample is also extracted for growth rate calculations.
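
The simulation step can be sketched in a few lines of Python. numpy parameterises the negative binomial by (n, p), so matching the (μ, θ) form requires n = θ and p = θ/(θ + μ); this is a standard conversion, and the code is an illustration rather than the paper's mgcv pipeline:

```python
import numpy as np

def prediction_interval(mu_samples, theta, level=0.95, rng=None):
    """Prediction intervals from posterior samples of the mean (sketch).
    mu_samples: array of shape (n_samples, n_days) of posterior expected values.
    theta: fitted negative binomial size parameter, so Var = mu + mu^2 / theta."""
    rng = np.random.default_rng(rng)
    # numpy's negative_binomial takes (n, p); setting n = theta and
    # p = theta / (theta + mu) gives mean mu and variance mu + mu^2 / theta.
    p = theta / (theta + mu_samples)
    draws = rng.negative_binomial(theta, p)  # one realisation per posterior sample
    lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2], axis=0)
    return lo, hi
```

Because each posterior sample of μ generates its own negative binomial draw, the resulting quantiles combine trend uncertainty and observation noise, which is exactly the distinction between confidence and prediction intervals described above.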

3.8. Model scoring To assess the optimal model structures, models are scored using the scoringutils [17] package, which produces a wide range of scoring metrics. We focus on three of these metrics when comparing the models: calibration (95% coverage), the proportion of observations that fall within the prediction intervals, which should be close to the nominal level of those intervals; bias, the probability that the predictions overestimate or underestimate the true value; and the interval score, a weighted measure of distance between the predicted distribution and the true observation [26]. To score the models, the predictions are compared to the observed data once sufficient time has passed to assume the data are complete. The scoring is performed over a range of lead times from 0 to 8 days. A lead time of 0 days means the model is predicting the values for the most recent date, and a lead time of 8 days means it is predicting the values for 8 days before the most recent date. For longer lead times, accuracy is expected to be higher, since the data are more complete and the nowcasting models need only make minor adjustments. For shorter lead times, the models require very large adjustments to the observed data, so are likely to have reduced accuracy.
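
The interval score for a central (1 − α) prediction interval [l, u] and observation y is (u − l), plus a penalty of (2/α)(l − y) if y falls below the interval and (2/α)(y − u) if it falls above; the weighted interval score averages this over several α levels. scoringutils computes this internally, but a hypothetical standalone version makes the penalty structure explicit:

```python
def interval_score(y, lower, upper, alpha):
    """Interval score for a central (1 - alpha) prediction interval:
    interval width plus penalties, scaled by 2 / alpha, for observations
    falling outside the interval. Lower scores are better."""
    score = upper - lower
    if y < lower:
        score += (2 / alpha) * (lower - y)
    if y > upper:
        score += (2 / alpha) * (y - upper)
    return score
```

The score rewards narrow intervals but sharply penalises misses, so a well-calibrated model cannot game it by quoting either very wide or overconfident intervals.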

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011463

Content license: Creative Commons Attribution (CC BY 4.0).
