The accuracy of wearables employing photoplethysmography (PPG)-based heart rate sensors may vary significantly between specific devices and activity types, but the sensors’ performance does not appear to be significantly affected by the skin tone of the wearer, according to a new study published in NPJ Digital Medicine.
The investigation, conducted by Duke University researchers among 53 individuals, highlighted consistent over-reporting of heart rate during low-intensity activity, accuracy and output differences between consumer- and research-grade devices, and other concerns that are “especially important for researchers and clinicians to be aware of when choosing devices for clinical research and clinical decision support,” the researchers wrote in the study.
“We started this study because we were seeing some evidence, both in research and anecdotally, that indicated that wearable devices weren’t working as well for people with darker skin tones,” Jessilyn Dunn, an assistant professor of biomedical engineering at Duke and an author on the paper, said in a statement from the university. “People would compare a reading on a chest strap to their smart watch and get different heart rate values. The companies that manufacture these devices don’t put out any metrics about how well they work across skin tones, so we wanted to collect evidence about how well they work and identify potential circumstances where they may not work well.”
The study also explored whether these types of devices and sensors could inadvertently be contributing to inequality in healthcare innovation.
“To date, no studies have systematically validated wearables under various movement conditions across the complete range of skin tones, and particularly on skin tones at the darkest end of the spectrum,” the researchers wrote. “To our knowledge, this is the first reported characterization of wearable sensors across the complete range of skin tones. Validation of wearable devices during activity and across all skin tones is critical to enabling their equitable use in clinical and research applications.”
TOPLINE DATA
Skin tone was not found to be a statistically significant driver of either mean directional error or mean absolute error across the devices collectively. While at-rest participants with the second-darkest skin tone logged the greatest mean directional error and those with the lightest had the lowest, at-rest absolute error was greatest among those with the darkest skin tone and lowest among those with the second-darkest skin tone. There was also no difference in heart rate variability between skin tones.
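For readers unfamiliar with the two error measures, the sketch below shows one common way to compute them against a reference signal. It is an illustration only, not the researchers' code: the function name, the sample readings, the sign convention (device minus ECG, so a positive value means over-reporting) and the use of a sample standard deviation are all assumptions.

```python
from statistics import stdev


def error_metrics(device_bpm, ecg_bpm):
    """Return (mean directional error, mean absolute error,
    standard deviation of absolute error), all in bpm.

    Assumption: error = device reading minus ECG reference,
    so a positive directional error means the device over-reports.
    """
    if len(device_bpm) != len(ecg_bpm) or len(device_bpm) < 2:
        raise ValueError("need two equal-length series with at least two samples")

    # Signed error keeps direction; errors of opposite sign can cancel out.
    errors = [d - e for d, e in zip(device_bpm, ecg_bpm)]
    # Absolute error ignores direction, so nothing cancels.
    abs_errors = [abs(x) for x in errors]

    mde = sum(errors) / len(errors)
    mae = sum(abs_errors) / len(abs_errors)
    return mde, mae, stdev(abs_errors)


# Hypothetical paired readings (bpm), invented for illustration.
ecg = [60.0, 62.0, 65.0, 70.0]
watch = [63.0, 60.0, 69.0, 69.0]

mde, mae, sd = error_metrics(watch, ecg)
print(f"MDE={mde:.1f} bpm, MAE={mae:.1f} bpm, SD={sd:.2f}")
```

Because signed errors can offset one another, a device can show a small directional error while still having a large absolute error, which is why the two measures can rank skin tone groups differently.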
The researchers did identify significant differences in accuracy across skin tones when looking at specific devices; however, this effect stemmed from the performance of the Biovotion wearable, which showed lower resting heart rates and higher active heart rates.
In contrast to skin tone, the researchers did see major differences in accuracy during physical activity and when comparing the readings of specific devices.
At rest, the Xiaomi Miband 3 recorded the highest mean absolute error among consumer devices (10.2 bpm) while the Apple Watch 4 recorded the lowest (4.4 bpm). For research wearables, the Biovotion Everion was highest (16.5 bpm) while the Empatica E4 was lowest (11.3 bpm). For standard deviation of the mean absolute error, included to measure accuracy consistency, the Fitbit Charge 2 (7.3) and Apple Watch 4 (2.7) represented the full range across consumer devices while the Empatica E4 (8.0) and the Biovotion Everion (6.4) did so for research devices.
During physical activity, mean absolute error among consumer devices was greatest in the Xiaomi Miband 3 (13.8 bpm) and lowest in the Apple Watch 4 (4.6 bpm), while for research devices the Biovotion Everion (19.8 bpm) and Empatica E4 (12.8 bpm) marked the high and low ends. Standard deviations for these categories ranged between the Garmin Vivosmart 3 (9.2) and the Apple Watch 4 (3.0) for consumer devices, and between the Empatica E4 (8.5) and Biovotion Everion (5.3) for research devices.
The specific types of activities also drove some differences in sensor accuracy.
“We found that there was a bigger drop in accuracy during activities that involved wrist motion that could introduce motion artifacts, like typing, and we saw a drop in accuracy during deep breathing, which could indicate the devices locking onto cyclic behavior, like breathing, rather than heart rate,” Dunn said.
Across all these measurements, the Apple Watch 4 was the top performer, followed by the Garmin Vivosmart 3.
“We were initially surprised that the commercial devices were more accurate, but they also have huge user bases, so they’re able to use lots of data to clean up their signals and improve their algorithms,” she said. “The research wearables are just using raw data, which is important for researchers and clinicians to be aware of.”
HOW IT WAS DONE
Researchers enrolled participants across a range of skin tones who did not have skin conditions or sensitivities, and were not currently taking medications or substances that could affect heart rate.
These participants wore commercial (Apple Watch 4, Fitbit Charge 2, Garmin Vivosmart 3 and Xiaomi Miband 3) and research wearables (Empatica E4, Biovotion Everion) on each wrist, as well as an ECG patch that was used as the baseline for heart rate measurements. Each was asked to perform tasks that included timed walks, a typing task and sitting at rest.
THE LARGER TREND
The devices being measured by the Duke researchers are increasingly finding themselves at the heart of healthcare research. Following the large-scale Apple Heart Study conducted by Stanford Medicine researchers, Apple has gone on to announce open participation in a handful of new health trials run in partnership with major universities and health groups. Fitbit, meanwhile, has long seen its activity trackers employed by trials looking for inexpensive, user-friendly sensors.
Researchers have previously questioned whether the activity trackers in wearables and other smart devices are reliable enough for clinical use. More recently, that conversation has turned toward whether light-based sensors, computer vision and machine learning could be introducing unintended biases to digital health efforts.
IN CONCLUSION
“While the research-grade wearables are the only wearables that provide users with raw data that can be used to visualize PPG waveforms and calculate [heart rate variability], the [heart rate] measurements tended to be less accurate than consumer-grade wearables. This is especially important for researchers and clinicians to be aware of when choosing devices for clinical research and clinical decision support. It is our hope that this analysis framework can act as a guide for researchers, clinicians and health consumers to evaluate such tradeoffs when exploring potential wearable devices for use in a clinical study, digital biomarker development, clinical practice, or in personal health monitoring,” the researchers wrote.