Week 266
Aging, Everyday Symptom Assessment, LLM benchmarks, Bioelectronics
In Week #266 of the Doctor Penguin newsletter, the following papers caught our attention:
1. Aging. Most research on sleep and aging has focused on the brain, leaving open the question of whether disrupted sleep affects biological aging more broadly across the body.
The MULTI Consortium developed “Sleep Chart,” a framework mapping the association between self-reported sleep duration and 23 biological aging clocks derived from MRI, plasma proteomics, and metabolomics, spanning nine organ systems. Each clock measures the biological age gap (BAG): the biological age predicted by machine learning models trained on organ-specific data minus the person’s chronological age, with a positive gap indicating an organ is aging faster than expected. The central finding is a consistent U-shaped pattern, with both short (under 6 hours) and long (over 8 hours) sleep associated with higher biological age gaps across the brain and body, while the range in which biological age gaps are minimized falls between 6.4 and 7.8 hours, varying by organ and sex. These clocks further revealed that abnormal sleep duration is associated with increased risk of systemic diseases, including depression, diabetes, and cardiovascular disease, as well as all-cause mortality, with short sleep showing particularly broad, multi-organ genetic correlations. Notably, long and short sleep are both linked to late-life depression but through different pathways. Long sleep appears to contribute to depression partly through accelerated biological aging, while for short sleepers, the link to depression is less well explained by biological aging, suggesting other mechanisms are at play. Importantly, there is no strong evidence that sleep duration causally drives disease. Reverse causality remains possible, meaning the underlying disease may itself influence sleep patterns, so these findings should not be interpreted as causal.
Read Paper | Nature
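The biological age gap in the Sleep Chart summary above can be made concrete with a small calculation. This is a minimal sketch of the definition only, not the consortium's pipeline: the `ClockReading` type, the `biological_age_gap` function, and all numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClockReading:
    """One organ-specific aging-clock result for one person (hypothetical type)."""
    organ: str
    predicted_age: float      # biological age estimated by an organ-specific model
    chronological_age: float  # actual age in years


def biological_age_gap(reading: ClockReading) -> float:
    """BAG = predicted biological age - chronological age.

    A positive value means the organ looks older than expected for the
    person's age; a negative value means it looks younger.
    """
    return reading.predicted_age - reading.chronological_age


# Made-up readings for a single 55-year-old, for illustration only.
readings = [
    ClockReading("brain", predicted_age=58.0, chronological_age=55.0),
    ClockReading("heart", predicted_age=53.5, chronological_age=55.0),
]

gaps = {r.organ: biological_age_gap(r) for r in readings}
for organ, gap in gaps.items():
    print(f"{organ}: BAG = {gap:+.1f} years")
```

In this toy example the brain clock gives a positive gap (aging faster than expected) and the heart clock a negative one, which is why a person can have different gaps across organ systems.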
2. Everyday Symptom Assessment. While large language models have shown strong diagnostic performance on curated medical benchmarks, these evaluations rarely reflect how ordinary people describe their symptoms in everyday life.
Breda et al. developed SymptomAI, a conversational AI agent built on Gemini and deployed through the Fitbit app from June 2025 to April 2026, enrolling nearly 14,000 participants across the US who reported symptoms in everyday life. The study randomized participants across five agent designs, contrasting approaches where users freely describe their symptoms without guidance against approaches where the agent actively interviews the user through structured or dynamic follow-up questions. Agents that actively elicited information achieved 27% higher diagnostic accuracy on average than the passive baseline, the approach most consumer LLMs currently use by default. Notably, fully dynamic agents that chose their own questions performed comparably to those following canonical clinical interview frameworks, suggesting AI can conduct effective medical history-taking autonomously. Using these question-eliciting agents, SymptomAI's differential diagnoses were significantly more accurate than those of board-certified clinicians reviewing the same conversations (odds ratio of 2.47), and clinicians preferred SymptomAI's output in over half of blinded comparisons. One limitation is worth noting: clinicians diagnosed from AI-led transcripts rather than from their own interviews, and they might have gathered different or better information had they asked the questions themselves. The result should therefore be understood within this specific evaluation setup. Leveraging the validated AI diagnoses as labels, the study further identified distinct biosignal shifts in over 500,000 days of Fitbit wearable data that preceded symptom onset, pointing toward a future where wearables could proactively trigger symptom check-ins.
Read Paper | arXiv
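For readers less familiar with odds ratios like the one quoted in the SymptomAI summary above, the sketch below shows how an odds ratio is computed from a simple 2x2 table of correct and incorrect diagnoses. It only illustrates the arithmetic: the function name and the counts are hypothetical, and the study's own estimate may come from a statistical model rather than raw counts.

```python
def odds_ratio(a_correct: int, a_incorrect: int,
               b_correct: int, b_incorrect: int) -> float:
    """Odds ratio of group A being correct relative to group B.

    Odds of A = a_correct / a_incorrect, odds of B = b_correct / b_incorrect.
    The ratio is written in cross-product form to avoid intermediate rounding.
    """
    if a_incorrect == 0 or b_correct == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (a_correct * b_incorrect) / (a_incorrect * b_correct)


# Hypothetical counts, NOT taken from the paper:
# group A is correct in 60 of 100 cases, group B in 40 of 100 cases.
example = odds_ratio(a_correct=60, a_incorrect=40, b_correct=40, b_incorrect=60)
print(f"odds ratio = {example:.2f}")
```

An odds ratio above 1 means group A has higher odds of a correct diagnosis than group B. It is a ratio of odds, not of accuracies, so an odds ratio of 2.47 does not mean accuracy was 2.47 times higher.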
3. LLM benchmarks. As LLMs saturate existing medical AI benchmarks, the field increasingly demands evaluations grounded in real clinical data and human baselines.
Brodeur et al. evaluate the diagnostic and management reasoning capabilities of OpenAI's o1-series large language model across six experiments ranging from curated educational vignettes to real-world unstructured clinical data, benchmarking model performance against hundreds of physicians including residents, attending physicians, and nurse practitioners. The experiments collectively test distinct but complementary clinical competencies: differential diagnosis generation, diagnostic test selection, structured clinical reasoning documentation, management planning, probabilistic reasoning, and real-world second-opinion generation on emergency department patients. Across all six experiments, o1 matched or exceeded physician-level performance. The emergency department experiment extends beyond prior vignette-based evaluations by demonstrating that o1 consistently outperformed two attending physicians on genuine unstructured patient data, with the largest and most statistically significant performance gap observed at initial triage where clinical information is most limited and decision urgency is highest. These results motivate prospective clinical trials to assess whether such performance gains translate to measurable improvements in patient outcomes.
Read Paper | Science
4. Bioelectronics. A new generation of in-body bioelectronic devices integrates diagnostic sensing and therapeutic actuation within a single system.
In this Perspective, Ceto et al. survey the rapidly evolving landscape of in-body bioelectronic systems, highlighting three major technological frontiers. First, electrically conductive hydrogels address the fundamental mismatch between conventional stiff electronic implants and soft biological tissues by acting as a mechanical and electrochemical bridge, while also enabling controlled drug release and sensitive biomarker detection. Second, bioresorbable implants offer temporary organ monitoring without surgical removal, such as a wireless pH sensor deployed after gastric surgery that dissolves harmlessly once recovery is complete. Third, ingestible and luminal devices bring bioelectronics into the gastrointestinal tract with minimal invasiveness, for example using vibration to activate gastric stretch receptors for satiety signaling, or mapping anorectal pressure to assess bowel dysfunction. Beyond these hardware advances, optogenetics is highlighted as a next-generation neuromodulation strategy. By genetically engineering neurons to express light-sensitive proteins and activating them with implanted micro-LED arrays, this approach achieves cell-type specificity and bidirectional control that conventional electrical stimulation cannot match, with early clinical trials underway for retinal diseases. Nevertheless, significant translational barriers remain, including the absence of regulatory frameworks for dissolving electronic implants, unresolved immunogenicity concerns for optogenetic therapies, and the added complexity of incorporating AI-driven closed-loop control into an already challenging regulatory landscape.
Read Paper | Nature Communications
-- Emma Chen, Pranav Rajpurkar & Eric Topol


