(C) PlosOne
This story was originally published on plosone.org. The content has not been altered[1]
Licensed under Creative Commons Attribution (CC BY) license .
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
--------------------
Do psychiatric diseases follow annual cyclic seasonality?
['Hanxin Zhang', 'Committee On Genetics', 'Genomics', 'Systems Biology', 'The University Of Chicago', 'Chicago', 'Illinois', 'United States Of America', 'Department Of Medicine', 'Institute Of Genomics']
Date: None
Additionally, we modeled the disease seasonalities (and trends) in Sweden based on their national registry, which incorporates almost all 9 million Swedes. Although there is no exact enrollment information supplied, we can safely assume static enrollment in years because Swedish patients were disenrolled only if they died or left Sweden. In other words, unlike the US dataset, the Swedish database is immune to the “asynchronous enrollment” problem.
All these abovementioned factors influence the raw observations of diagnoses rates over time. One naive approach is to estimate the raw trend, treating the population as a whole, and fitting a line of DR, which we then calculate as the total number of diagnoses over the total number of enrollments across the whole database. This produces results somewhat discordant with the findings of previous published studies. Therefore, we designed a Bayesian multilevel model that addresses the issues discussed above ( Fig 2 ) and which allowed us to infer latent disease trends for any specific age and sex group.
First, although MarketScan follows population health statistics for over a decade (2003 to 2014), most of the patients are “visible” to insurance records for a shorter time interval; patients were only enrolled in the insurance records for a few years, a few months, or, sometimes, even a few weeks, leading to “asynchronous enrollment.” Due to heterogeneous insurance recruitment practices, groups of people simultaneously joining an insurer by no means resemble a random sample from the general US population. Second, possibly due to changes in coding standards, the dataset contains annual shifts (systematic jumps) of DRs at the beginning of every year for population strata of consistent composition. We observed these shifts for a subset of diseases. Third, some US holidays result in a general disruption of both health practice and reporting, and these disruptions are visible in the raw disease prevalence plots. Fourth, we noticed that most diseases manifest annual periodic prevalence fluctuations. For example, skin infections are on the rise in the summer, while upper respiratory system infections are more frequent in the winter. Fifth, disease prevalence seasonalities and trends vary greatly across sex and age-groups. Lastly, the data contain stochastic noise (temporal fluctuations in the recorded disease diagnoses) and, possibly, diagnosis encoding errors.
Although the database is one of the largest and most comprehensive collections of US insurance claims, it was built by merging asynchronous subsets collected by multiple private health insurers. As a result, the data are characterized by some properties that complicate our analysis ( Fig 1 ). We identified at least 6 such issues and will briefly explain what they are, along with the efforts we made to address them in our analysis.
The institutional review board (IRB) at the University of Chicago determined that the study is IRB exempt, given that patient data in both countries were preexisting and de-identified.
The primary goal of this study was to model disease seasonality (and trends) in the US in recent years. To probe this question, we made use of 2 large, country-scale datasets, the IBM Watson Health MarketScan dataset [ 18 ] containing the insurance claim records of over 150 million unique U.S. citizens and the Swedish National Health Register [ 19 ], detailing the health dynamics of virtually all Swedes—over 11 million unique people visible in the data. The US data cover the time interval between 2003 and 2014, while the Swedish data encompass an interval between 1980 and 2013.
Methods and techniques
We modeled male and female trends separately. To make corrections for asynchronous patient enrollments, we first grouped all 150 million enrollees in the US database by (1) their enrollment dates (starts and ends); (2) patients’ age at the middle of enrollment; and (3) patients’ sex.
We then identified a collection of nearly a million enrollment range-, age-, and sex-specific population strata. Each stratum was characterized by a unique enrollment interval, for example, January 1, 2003 to December 31, 2004. We also placed different sexes into separate strata and further subdivided patients by age-groups. Specifically, we used the following approximate, decade-long age subdivision: 0 to 10, 11 to 20, 21 to 30, 31 to 40, 41 to 50, and 51 to 65. Claims that occurred at over 65 years old were discarded because the majority of those enrollees supposedly switched to Medicare, and the remaining records for patients over 65 were likely to be erroneous or nonrepresentative.
For each population stratum, assuming a latent linear trend and a yearly repeated disease prevalence seasonality, we defined a model and estimated its parameters. However, this approach would become practically intractable if we were to fit this model on nearly 1 million population strata simultaneously. Therefore, we reduced the number of population strata by merging them into a smaller number of bigger strata, based on the proximity of enrollment boundaries, using K-means clustering. We pooled population strata with close enrollment boundaries together. In this way, we obtained around 600 “softbounded” population strata (100 for each age-group) for each sex. Each composite stratum is a combination of hundreds of raw, “hard-bounded” populations. These composite strata vary slightly in terms of date of enrollment beginning and end, typically in the range of a few weeks, and are rather homogeneous inside the shared enrollment window. Considering the Bayesian consistency of our model, a more robust and powerful way would be to cluster and merge population strata using a Bayesian Gaussian mixture model. However, such a method would soon exhaust a computer’s random access memory because of the need to track a large number of variables. It is well known that the K-means method is equivalent to the Gaussian mixture model’s hard EM (expectation–maximization) implementation in some limiting formulation. Here, we argue that K-means clustering, chosen for its scalability, is good enough to mimic the behavior of the Bayesian Gaussian mixture model and accomplish population number reduction.
For each disease, we decomposed the DR trend, for a given softbound population stratum i of age-group j for a sex-specific condition, into 4 parts: a linear trend, possible shifts at the beginning of every year, a seasonality term modeling periodic patterns, and an error term incorporating all other effects. (1) where (2) (3) (4) (5)
In the above equations, y i,j (t) is the DR of a softbound population stratum i of age-group j at time point (week) t. Parameters α i,j and β i,j are the intercept and the slope, respectively, of the latent linear trend. Moreover, 1(condition) is an indicator function that evaluates to 1 only if the input condition is true. s k and γ i,j,k are the kth separation and shift, respectively. The separations are when the shift could happen, and we assumed they are all year starts (s 1 = January 1, 2003, s 2 = January 1, 2004 …).
Note that for the Swedish database, all enrollees are visible from the start, so there is no “asynchronous enrollment” problem and only 1 all-inclusive population stratum was considered for each age-group and sex.
We used a Fourier series with period W = 365.25/7 weeks to model the potential seasonality of some conditions. p i,j,n and q i,j,n are harmonic bases.
The traditional parametrization of the “seasonality term” (a Fourier series) is convenient for the estimation phase of analysis. However, to interpret estimates, it is intuitive to use the following re-parametrization of the Fourier term: (6) where (7) (8)
A i,j,n and ϕ i,j,n in Eq 6 are amplitude and phases for the n’s harmonic. Arctan2 corresponds to a 2-argument arctangent function.
We estimated all parameters under a Bayesian framework. We sampled the prior values of α i,j and β i,j from skew normal distributions with age-group–specific locations, scales, and shapes. A skew normal distribution density function is defined in the following way: (9) where ϕ(x) and Φ(x) are density and cumulative distribution functions, respectively, for a standard normal distribution. Our choice of prior distribution was motivated by an analysis of the parameter estimate distributions for various groups of patients—they indeed resemble the skew normal shape.
(10)
(11)
To allow information flow through different age-groups, we sampled the location parameters from a zero-mean Gaussian process prior: (12) (13) where kα(j, j′) and kβ(j, j′) are exponentiated quadratic kernels with scale and length drawn from flat hyperpriors. The prior of the Gaussian process “linked” different age-groups within a unified estimation procedure and allowed information about disease trend flow across age-groups—by assuming that similar age-groups share similar trend parameters.
We drew the scale parameters of α i,j and β i,j from flat, half-Cauchy hyperpriors, and we restricted the shape parameters and by zero-mean Laplace distributions so that the scale parameter would not compete with the shape parameter. In our experiments, we found that a skew normal with a large shape could behave similarly as a skew normal with a large scale. This pathological behavior would result in inefficient sampling.
We sampled the population as well as age-specific shifts γ i,j,k from a zero-mean Laplace distribution, thus incorporating our prior belief that shifts should not mask the linear trend effect. We sampled the bases for seasonality from zero-mean normal distributions.
Finally, to offset the effect of holidays and celebrations, we applied a holiday-smooth function that took the average DRs around US federal holidays and Easters/Good Fridays. We overcame the presence of outliers caused by other unknown forces using a Student t distribution sampling: (14) where the location parameter to . and are scales and degrees of freedoms sampled from flat half-Cauchy hyperpriors. Fig 1B illustrates the outcome of the holiday-smooth function.
We approximated the model using a No-U-Turn sampler [20] initialized by variational inference [45]. In general, for one sex-specific condition, it would take hundreds of CPU hours to attain a reasonable estimation due to the high-dimensional searching space—we were sampling thousands of parameters simultaneously. We applied the model to 33 neuropsychiatric and 47 infectious conditions of 2 sexes and tried to reproduce and make corrections for trends in different age-groups.
Once we obtained the estimation of harmonic bases p i,j,n and q i,j,n as in Expression (4), we computed the posterior harmonic base estimates for the whole population as (15) (16) where w i is the weight according to the size of population stratum i, so and could be interpreted as the estimate of nth harmonic bases for age-group j for the whole population.
After Bayesian inference, we obtained all the posterior parameter distributions, which enabled us to estimate the annual seasonality free from the influence of trends, sudden shifts, and noises such as holiday effects. For each age-/sex-specific condition, we divided the raw DR seasonality to its time-average DR over 570 weeks (starting from January 1, 2003, Expressions (17) and (18)) and obtained the relative fluctuation in percentage, as shown in the main figures.
(17)
(18)
We attempted to correct baseline fluctuation of all medical visits by deducting s(t) from the s all (t) that represents the yearly variation of all conditions and diseases (S1 Fig and Fig 2, Step 2 lower center plot): (19)
This procedure also revealed disease trends. However, as we carefully examined these estimates, it was clear that they might not reflect real disease trends over time simply because we estimated the trends with cohorts of quasi-static enrollments—the same people joined and left and their age went up accordingly. The change of age drastically impacted our disease trend estimations. For example, we observed that incidents of some infectious diseases went down because the prevalence of some pediatric infections decreases as children grow older. By contrast, many cardiovascular condition trends are positive because older people are more prone to them.
Fig 2 summarizes our model, where data corresponding to a “raw” trend are an input to our model. A “raw” trend is deconvoluted into trends within hundreds of population strata based on enrollment dates (the left panel and the center panel). The model fits each population stratum separately, but still allows certain information shared across population strata, in a hierarchical framework (the center panel). Finally, we make corrections and estimate the seasonalities and trends for specific age-groups (the right panel).
Lastly, it is worth mentioning that we dropped all higher-order harmonics in the Fourier series after the first 5 (N = 5) for approximation based on model selection results (Eqs 4 and 6). We tested N = 5, 15, and 25 to find the best approximation model. To evaluate the model, we computed the sum of Watanabe–Akaike information criteria (WAIC) [46] over 33 neuropsychiatric and 47 infectious diseases in the 2 sexes and found that N = 5 was simple and good enough to model the seasonality (S13 Fig).
The Bayesian procedure we designed helped to mitigate multiple confounding factors with a multilevel model, but it could also be problematic, given its complexity. First, we could not certify the convergence of the MCMC because we were not estimating one single parameter, whereby a diagnostic statistic like Gelman–Rubin [47] or Geweke [48] would have been applicable to determine the mixing of that parameter. The approximation of each disease’s seasonality involved thousands of parameters, making it difficult to determine how many iterations were needed to reach a stationary point. To alleviate this concern, we employed the No-U-Turn sampler, which is able to mix an MCMC process rapidly and reliably [20]. More importantly, we inspected the disease trend’s posterior expectation curves and seasonality, restored from the posterior estimation of parameters (like the green lines on Fig 2, Step 2, upper panel) and confirmed that they were aligned with the input raw trend and seasonality. Collectively, using all the available tools, the intrinsic seasonality is reflected in the results insofar as we are able.
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001347
(C) GlobalVoices
Licensed under Creative Commons Attribution 3.0 Unported (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/