(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Non-equivalent, but still valid: Establishing the construct validity of a consumer fitness tracker in persons with multiple sclerosis [1]

Ashley Polhemus (Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich), Chloé Sieber (Institute for Implementation Science in Health), Christina Haag, Ramona Sylvester

Date: 2023-02

Tools for monitoring daily physical activity (PA) are desired by persons with multiple sclerosis (MS). However, current research-grade options are not suitable for longitudinal, independent use due to their cost and user experience. Our objective was to assess the validity of step counts and PA intensity metrics derived from the Fitbit Inspire HR, a consumer-grade PA tracker, in 45 persons with MS (median age: 46, IQR: 40–51) undergoing inpatient rehabilitation. The population had moderate mobility impairment (median EDSS 4.0, range 2.0–6.5). We assessed the validity of Fitbit-derived PA metrics (step count, total time in PA, time in moderate to vigorous PA (MVPA)) during scripted tasks and free-living activity at three levels of data aggregation (minute, daily, and average PA). Criterion validity was assessed through agreement with manual counts and multiple methods for deriving PA metrics via the Actigraph GT3X. Convergent and known-groups validity were assessed via relationships with reference standards and related clinical measures. Fitbit-derived step count and time in PA, but not time in MVPA, exhibited excellent agreement with reference measures during scripted tasks. During free-living activity, step count and time in PA correlated moderately to strongly with reference measures, but agreement varied across metrics, data aggregation levels, and disease severity strata. Time in MVPA agreed only weakly with reference measures. However, Fitbit-derived metrics were often as different from reference measures as reference measures were from each other. Fitbit-derived metrics consistently exhibited similar or stronger evidence of construct validity than reference standards. Fitbit-derived PA metrics are not equivalent to existing reference standards. However, they exhibit evidence of construct validity. Consumer-grade fitness trackers such as the Fitbit Inspire HR may therefore be suitable as a PA tracking tool for persons with mild or moderate MS.

Physical activity (PA) is an important aspect of health and well-being. However, PA is often reduced in persons with multiple sclerosis (MS), a neurodegenerative autoimmune disease which affects physical function, motor control, and energy levels. It is of public health interest to increase PA behavior in this population. However, valid and user-friendly methods for tracking PA are required to quantify PA behavior during patients' daily lives. So-called "research-grade" wearable devices are used for short-term measurements (for example, 7 days), but offer poor user experience and are therefore not suitable for longer-term PA tracking. It is therefore increasingly common for MS researchers to use "consumer-grade" devices such as Fitbits. However, high-quality evidence of their validity in MS populations is limited. In this study, we compared PA metrics derived from a Fitbit device to multiple validated research-grade methods. While the PA metrics derived from each method were not equivalent, all exhibited similar evidence of validity. In some cases, the Fitbit outperformed research-grade methods. We posit that PA metrics derived from the Fitbit are suitable for long-term PA tracking in MS populations, and that the resulting longitudinal data have the potential to advance our understanding of real-world PA behavior in MS populations.

Copyright: © 2023 Polhemus et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Multiple sclerosis (MS) is a neurodegenerative autoimmune disease which affects physical and cognitive function, motor control, and energy levels. Physical activity (PA) is often reduced in persons with MS (PwMS) [1, 2], though it is known to aid in symptom and fatigue management [3–5] and is perceived as an important part of health care by PwMS [6, 7]. Managing appropriate amounts of PA is often difficult for PwMS, as overexertion can cause severe short-term fatigue or symptom exacerbations before the benefits of PA are realized [8–10]. To enable the best health outcomes, tools for managing PA and fatigue are desired by PwMS [11].

For such tools to be effective, they must reliably and conveniently track PA over long periods of time, yielding clinically or personally meaningful information. Consumer-grade PA trackers such as wrist-worn Fitbits are therefore gaining popularity in this population, and have already been used to generate PA outcomes in several large cohort and interventional studies [12–14]. They are easy to use, engaging, inexpensive, and provide meaningful PA metrics which are interpretable within the context of public health guidelines [15]. In addition, these devices enable users to interact with their own data, set goals, and review progress over time. These features promote long-term engagement with remote monitoring technologies [16, 17]. The resulting rich, longitudinal data could provide insights into PA behavior not captured by traditional periodic or questionnaire-based PA metrics.

However, only limited evidence of validity is available for any Fitbit device in MS populations. Existing validation studies have primarily been conducted in healthy adults, and three recent systematic reviews of such studies cautiously support the validity of Fitbit-derived PA metrics [18–21]. However, validation studies also suggest that these metrics' accuracy decreases at low activity intensities [20], at slow walking speeds [18, 22–24], and with the use of walking aids [25]. Not only do PwMS walk more slowly than healthy controls, they also exhibit abnormal gait patterns [26, 27] and frequently adopt walking aids as their MS progresses [28, 29]. It is plausible that these factors affect the validity of Fitbit-derived PA metrics in PwMS. To date, validation studies in PwMS are limited to step count and do not address the other PA metrics produced by these trackers [30, 31]. Given the expanding use of wrist-worn Fitbits to track PA in MS, a thorough evaluation of their validity in this population is warranted.

In this study, we aimed to expand and update existing evidence on the validity of wrist-worn Fitbit devices in MS populations. We assessed the construct validity of three PA metrics (step count, time spent in PA, and time spent in moderate to vigorous PA (MVPA)) derived from the Fitbit Inspire HR. We did this by comparing Fitbit-derived PA metrics to multiple reference measures (Table 1) and systematically triangulating evidence of their criterion validity, convergent validity, and known-groups validity (Fig 1). This validation study evaluated PA metrics according to validation best practices, accounting for the known shortcomings of existing reference measures [32].

Materials and methods

Objective

The objective of this study was to assess the construct validity of physical activity (PA) metrics derived from the Fitbit Inspire HR, a consumer-grade fitness tracker. Construct validity is the extent to which an index measure (the instrument under study) measures the theoretical construct it is supposed to measure [46]. Several sub-types of validity comprise construct validity [47]. In this study, we assess Fitbit-derived PA measures in terms of their criterion validity, known-groups validity, and convergent validity. Criterion validity refers to an instrument's ability to measure the concept it purports to measure, and is typically assessed through correlations and agreement with a well-validated reference standard, or "criterion measure" [48]. Known-groups validity is the ability of an instrument to discriminate between groups of individuals which are known to differ from each other, such as disease severity strata [49]. Finally, convergent validity refers to a measure's ability to demonstrate an expected relationship with other theoretically related, clinically relevant constructs [50]; it is often assessed through correlation and other association measures.

This validation study was conducted as part of BarKA-MS, a cohort study on the barriers and facilitators to PA in PwMS [51]. It expands upon best practices developed by Johnston et al. [32], who propose a six-step framework for designing and reporting validation studies of consumer wearables: 1) target population, 2) index measure (the measure being validated), 3) testing conditions, 4) criterion measure (the reference standard), 5) data processing methods, and 6) statistical analysis.

Target population

Our target population was ambulatory PwMS. We recruited a convenience sample of PwMS undergoing elective inpatient neurorehabilitation at the Kliniken Valens between January and November 2021. Participants were eligible if they 1) had a confirmed diagnosis of MS according to their referring physician, 2) were 18 years of age or older, 3) had reduced walking ability but were able to walk independently with or without an assistive device, 4) had access to WiFi and a mobile device in the rehabilitation center and at home, 5) were willing to wear study devices to measure their PA, and 6) were able to answer study questionnaires in German. The BarKA-MS study comprised two phases (in the clinic and at home). The first phase lasted one to three weeks, depending on the length of the rehabilitation stay, and the second phase lasted four weeks. We set a target sample size of 45 participants based on the expected rate of enrollment at Kliniken Valens in the first half of 2021. The recruitment window was then extended due to slower than expected enrollment during the COVID-19 pandemic. The ethics committee of the canton of Zurich approved the study protocol (BASEC-no. 2020-02350) and all participants provided written informed consent.

Index measure

Our index measures, i.e. the measures we aimed to validate, were step count, time in PA, and time in MVPA derived from the Fitbit Inspire HR. The Fitbit Inspire HR is a consumer PA tracker which is worn on the wrist and measures step count, PA intensity, sleep, heart rate, and other fitness metrics at up to minute-level granularity. Participants were given a Fitbit Inspire HR and instructed to wear it on their non-dominant wrist during the day, and at night if desired, throughout the course of the study. The accompanying mobile application was installed on each participant's mobile device, and each participant was given a de-identified, pre-configured study account. Alerts and daily goals were either turned off or set to a minimum, and participants were encouraged to leave these settings off for the duration of the study. Minute-level data were collected and stored through the Fitabase platform (Fitabase, San Diego, California), a cloud-based study management platform which provides industry-standard security measures such as encryption, password protection, and access logs. All participants consented to the privacy statements and settings associated with these platforms.

Testing conditions

According to Johnston et al.'s framework, index measures were compared to criterion measures during laboratory evaluation (i.e., controlled walking tests), semi-free-living evaluation (i.e., scripted assessments which simulate various free-living activities), and free-living evaluation (i.e., daily living 'in the wild') [32]. For brevity, we refer to laboratory and semi-free-living evaluations together as 'scripted tasks.'

Laboratory evaluation

Rehabilitation schedule permitting, PA metrics were assessed manually, via the Fitbit, and via criterion measures during a 6-Minute Walk Test [52] in participants' final week at the clinic. Criterion measures are described in greater detail in the next section. All participants were instructed to cover as much distance as possible in six minutes; rests were allowed. Participants rested in a seated position for at least three minutes immediately prior to and following the test to allow for confirmation of timestamp alignment between devices.

Semi-free-living evaluation

A sub-sample of participants also completed an assessment comprising five scripted tasks designed to replicate movement patterns regularly encountered in daily life. PA metrics were assessed via the Fitbit and via criterion measures (see below) during these tasks. The semi-free-living evaluation consisted of:

Walking with postural transitions: Participants repeatedly rose from a seated position, walked approximately five meters to an examination bed, lay supine for three seconds, returned to the chair, and sat for three seconds. This task was designed to assess the effect of short walking bouts interrupted by postural transitions.

Simulated cleaning: Participants repeatedly carried a series of glasses, cups, saucers, and towels from one table to another nearby table. During each repetition, participants unfolded and re-folded the towels. This task simulated light PA with short walking bouts in a confined space, frequent direction changes, and weight shifting between legs. We designed this task to simulate working in a kitchen or tidying a room.

Sit to stand: In this task, participants repeatedly rose from and returned to a seated position. This activity further tested how postural transitions are characterized by index and criterion measures.

Wheelchair push: Participants propelled themselves around a circular track in a wheelchair with the Fitbit worn on the outermost wrist to assess how manual wheelchair propulsion, and more generally upper body activity, is characterized.

Stair climb and descent: In this task, participants repeatedly walked up and down two flights of stairs to assess step count accuracy during stair climbing and descent.

These activities were selected and designed in collaboration with subject matter experts at the rehabilitation facility. Each semi-free-living evaluation lasted approximately 30 minutes. Participants were instructed to complete each task at a pace they could maintain safely for three minutes and to use their preferred walking aids. Rests were allowed. Participants rested in a seated position for at least three minutes immediately prior to and following each task to enable confirmation of timestamp alignment and to mitigate fatigue effects.

Free-living evaluation

For the purposes of this evaluation, participants wore both the Fitbit and a criterion measure (Actigraph GT3X, see below) under free-living conditions for approximately 14 days. This two-week period comprised their final week in the rehabilitation clinic and the following week in their home environment. Participants occasionally wore the devices longer if the rehabilitation period was unexpectedly extended. After participants had worn the devices at home for seven days, they logged the dates they had worn them and returned the Actigraph GT3X to investigators by mail. Participants continued to wear the Fitbit as part of the BarKA-MS cohort study.

Criterion measures

Average manual step counts were considered the criterion measure for assessing Fitbit's step count algorithm during scripted tasks. Tasks were video-recorded, and two assessors manually counted steps according to a validated standard operating procedure (S1 Text). Several additional criterion measures were derived from the Actigraph GT3X (Manufacturing Technology, Inc., FL, USA), a research-grade accelerometer which has been validated in PwMS [53, 54]. Actigraph devices were initialized in Actilife 6.0 with a sampling rate of 30 Hz and worn on the right hip. Multiple data processing methods exist to derive PA metrics in this population (Table 1) [38, 44, 55]; these methods use different data (i.e., 1-dimensional vs. 3-dimensional movement) and processing methods (i.e., standard vs. highly sensitive filtering) to calculate PA metrics.

However, the Fitbit is not expected to agree exactly with any of the criterion measures derived from the Actigraph GT3X (Table 1). The Actigraph measures were derived and validated for wear on the hip [35, 38, 44], whereas the Fitbit is wrist-worn. The Actigraph GT3X-based methods derive PA metrics from an accelerometer only [35, 38, 44], whereas the factors which influence Fitbit's PA classification are not publicly available, though support resources suggest that movement intensity, heart rate, and breathing rate may influence its PA estimation [43]. Finally, Actigraph-derived measures are non-equivalent with each other [56]. Treating any single Actigraph method as the sole criterion measure could therefore introduce criterion standard bias [57]. We therefore opted to assess the metrics derived from the Fitbit through triangulation [58] in an agreement validation study [57] and through an assessment of construct validity. Criterion measures for step count, time in PA, and time in MVPA were derived from the Actigraph through multiple established methods (Table 1).
Two Actigraph-based methods were used to derive step count (referred to as Actigraph (Standard) and Actigraph (LFE)) [59], two methods were used to derive time in PA (Actigraph (Vert) [35] and Actigraph (VM) [38]), and three methods were used to derive time in MVPA (Actigraph (Uniform) [44], Actigraph (Severity) [44], and Actigraph (Sasaki) [38]).

Construct validity was further evaluated by quantifying the relationship between PA metrics and theoretically related clinical assessments. Convergent validity was assessed through associations with patient-reported outcomes and clinical outcome measures. Patient-reported outcomes included the MS Walking Scale-12 (MSWS-12), a patient-reported measure of walking ability and its impact on daily activities [60, 61], and the International PA Questionnaire (IPAQ), a self-assessment of PA during the previous seven days [62]. Clinical measures included the Expanded Disability Status Scale (EDSS) [63], the 10-meter Gait Speed test (10mGS) [64], and the 6-Minute Walk Test (6MWT) [65]. These measures were assessed during the last week of rehabilitation, except for the IPAQ, which was reported by participants following the free-living assessment. Known-groups validity was assessed by comparing PA metrics between subgroups defined by established cutoffs of clinical scales. Disease severity strata were defined as mild (EDSS < 4.0), moderate (EDSS 4.0–5.5), and severe (EDSS 6.0–6.5) body function impairment, consistent with previous studies [44].
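The Actigraph-based intensity methods referenced above are, at their core, cut-point classifiers: each minute epoch of activity counts is mapped to an intensity category by comparing it against published thresholds. The following is an illustrative Python sketch of this idea (the study's processing was done in R, and the threshold values here are placeholders chosen for illustration, not necessarily those of any specific method in Table 1):

```python
# Illustrative cut-point classification of minute-level activity counts.
# THRESHOLDS ARE PLACEHOLDERS for illustration; each published Actigraph
# method (Vert, VM, Uniform, Severity, Sasaki, ...) defines its own values.

SEDENTARY_MAX = 100   # counts/min at or below this -> sedentary (placeholder)
MVPA_MIN = 2690       # counts/min at or above this -> MVPA (placeholder)

def classify_epoch(counts_per_min: int) -> str:
    """Map one minute epoch of activity counts to an intensity category."""
    if counts_per_min <= SEDENTARY_MAX:
        return "sedentary"
    if counts_per_min >= MVPA_MIN:
        return "MVPA"
    return "light"

def time_in_mvpa(epochs: list[int]) -> int:
    """Total minutes classified as MVPA in a sequence of minute epochs."""
    return sum(1 for c in epochs if classify_epoch(c) == "MVPA")
```

Because the methods differ in which axes they use (vertical vs. vector magnitude) and where they place these thresholds, the same raw data can yield different time-in-intensity totals, which is one reason the Actigraph-derived measures are non-equivalent with each other.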

Data processing

Actigraph data were uploaded to Actilife, filtered to remove non-human movement artifact with both the standard filter and the low frequency extension (LFE), aggregated into one-minute epochs, and exported for further processing. Step count, PA intensity (sedentary behavior, LPA, MVPA), and heart rate data derived from the Fitbit Inspire HR were calculated according to Fitbit's proprietary algorithms and extracted in one-minute epochs. All processing was conducted in R, version 4.1.0, in the RStudio environment, version 1.4.1717. Validated algorithms (Table 1) were applied to derive PA intensity and step count.

Non-wear time was defined as 30 minutes of continuous zeros with a 2-minute spike tolerance [66]. For the Actigraph, this definition referred to epochs with zeros in the x, y, and z axes; for the Fitbit, it referred to epochs with zero step count, a sedentary PA categorization, and no registered heart rate. Wear periods shorter than 10 minutes were discarded to reduce false positives in wear time resulting from short spikes. Days with at least 10 hours of wear time during waking hours were considered valid [67], and participants with at least two valid days were included in this analysis [68]. Epochs in which both devices were worn during waking hours (6AM to 11PM) on valid days were included in aggregation and analysis. Data categorized as non-wear time and epochs which occurred on non-valid days were removed. The day participants left the clinic and traveled home was excluded from this analysis, as it did not represent 'normal' activity. To limit the effects of differential wear patterns on agreement analyses, only minutes during which both the Fitbit and the Actigraph were worn were included in data aggregation and further analysis.

Data aggregation

For each method, PA data were then aggregated into three levels of granularity for agreement analysis: 'epoch-level', 'daily', and 'average' PA. Epoch-level data were used to evaluate absolute agreement between PA metrics over short periods of time and during diverse activities of daily living. Timestamp alignment within one minute was confirmed according to visit notes, videos, and manual inspection for each participant. Minute-level step counts were aggregated into 5-minute epochs. An agreement window of plus or minus one minute was applied in a pairwise fashion to minute-level PA intensity metrics. This window accounted for the effects of timestamp misalignment and the potential dependency of Fitbit's PA algorithm on heart rate. An epoch was considered in agreement if Fitbit-derived PA intensity yielded the same categorization as Actigraph-derived PA intensity within plus or minus one minute of the Actigraph's timestamp. Daily PA metrics were calculated by summing all included minute-level data per patient per day. Days in both the rehab setting and the home setting were included in analyses at the daily level of aggregation. Average PA metrics were calculated for the home environment only by averaging each participant's daily PA metrics over all valid days, consistent with previous PA study outcomes in MS populations [40, 69].

Data labeling

Data collected during laboratory and semi-free-living evaluations were extracted and labeled by consulting visit notes, video timestamps, and manual inspection. Manual and device-derived step counts were calculated for each scripted task and for the rests between tasks.
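The non-wear and valid-day rules described in the data processing section can be sketched as follows. This is an illustrative Python version under simplified assumptions (a single counts-per-minute series; the study's actual processing was done in R and applied the device-specific zero definitions described above):

```python
# Sketch of the non-wear rule: a run of at least `window` minutes of zero
# activity is flagged as non-wear, tolerating up to `spike_tolerance`
# non-zero minutes inside the run. Simplified for illustration.

def flag_non_wear(counts, window=30, spike_tolerance=2):
    """Return a boolean list marking each minute as non-wear (True)."""
    n = len(counts)
    non_wear = [False] * n
    i = 0
    while i < n:
        if counts[i] != 0:
            i += 1
            continue
        # Extend a candidate non-wear run starting at zero minute i.
        j, spikes = i, 0
        while j < n:
            if counts[j] == 0:
                j += 1
            elif spikes < spike_tolerance:
                spikes += 1
                j += 1
            else:
                break
        # Trim trailing spike minutes so the run ends on a zero minute.
        while j > i and counts[j - 1] != 0:
            j -= 1
        if j - i >= window:
            for k in range(i, j):
                non_wear[k] = True
        i = j
    return non_wear

def is_valid_day(counts, min_wear_hours=10):
    """A day counts as valid when wear time reaches the minimum (10 h)."""
    wear_minutes = sum(1 for nw in flag_non_wear(counts) if not nw)
    return wear_minutes >= min_wear_hours * 60
```

A real pipeline would additionally discard wear periods shorter than 10 minutes and restrict the series to waking hours, as described above.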

Statistical analysis

Agreement of categorical data was assessed through a multi-level implementation of Fleiss' kappa assuming participant-level random effects [70]. Differences in PA categorizations during individual scripted tasks were identified through Fisher exact tests. Kruskal-Wallis tests, Wilcoxon signed-rank tests, Pearson product-moment correlation coefficients (Pearson's r), and Lin's concordance correlation coefficients (CCC) [71] evaluated the differences, correlations, and absolute agreement between measures for continuous and count data. Bland-Altman plots [72] visualized the mean bias and limits of agreement at the daily level. At the epoch and daily levels, Pearson's r, CCC, and Bland-Altman statistics were adjusted for patient-level random effects according to the procedures defined by Parker et al. [73]. Pearson's r was selected because it could be adjusted for patient-level random effects, and data were visually assessed for approximately normal distributions. Confidence intervals were derived through bootstrapping. In sensitivity analyses, these analyses were repeated for each disease severity stratum. For data collected during scripted tasks, this analysis was conducted for all scripted tasks together, accounting for task-level random effects as described by Parker et al. [73]. Wilcoxon-Mann-Whitney tests and Wilcoxon effect sizes [74] quantified the existence and magnitude of differences across known groups. Pearson's r quantified the relationships between average PA metrics and clinical measures.
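For readers unfamiliar with the agreement statistics named above, the unadjusted versions are straightforward to compute. The sketch below shows the Bland-Altman mean bias with 95% limits of agreement, and Lin's CCC, in illustrative Python; it deliberately omits the patient-level random-effects adjustment of Parker et al. and the bootstrapped confidence intervals used in the study:

```python
# Unadjusted agreement statistics for paired index/criterion measurements.
import statistics

def bland_altman(index_vals, criterion_vals):
    """Return (mean bias, lower 95% LoA, upper 95% LoA) for paired data."""
    diffs = [a - b for a, b in zip(index_vals, criterion_vals)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson's r, the CCC penalizes both location shifts
    (mean difference) and scale differences between the two measures.
    """
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.pvariance(x), statistics.pvariance(y)
    cov = statistics.mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

A CCC of 1 indicates perfect concordance (all points on the identity line); systematic over- or under-counting by the index measure lowers the CCC even when the correlation is high, which is why both statistics are reported.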

---
[1] Url: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000171
