In Week #252 of the Doctor Penguin newsletter, the following papers caught our attention:
1. Pandemic Forecasting. Traditional disease forecasting models depend heavily on structured numerical data, overlooking disease-relevant information that exists only in textual form, such as public health policies and real-time reports on emerging variants. How can we leverage this previously inaccessible textual information to improve pandemic forecasting?
Du et al. developed PandemicLLM, a framework that uses a large language model to forecast COVID-19 hospitalization trends by reformulating pandemic prediction as a text-reasoning problem. The model processes spatial data (e.g., demographics, healthcare systems, political affiliations), policy changes, and genomic surveillance data (textual variant characteristics) as descriptive text, while embedding epidemiological time series (e.g., hospitalizations, vaccination rates, variant prevalence percentages over time) through recurrent neural networks into special tokens within ~300-word prompts. During training, the framework fine-tunes LLaMA2 to classify hospitalization trends into five ordinal classes (substantial decrease to substantial increase), providing interpretable predictions for public health decision-makers. Tested across all 50 US states over 19 months, PandemicLLM consistently outperformed the CDC Ensemble Model baseline, remained robust without retraining, and proved particularly strong at adapting to emerging variants such as Omicron when real-time genomic surveillance data was incorporated. The framework demonstrates a novel way to fold previously inaccessible textual information into pandemic forecasting (a rough illustrative sketch of the setup follows the link below).
Read Paper | Nature Computational Science
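For readers who want a concrete picture of this design, below is a minimal sketch, assuming a PyTorch-style setup: a recurrent encoder compresses an epidemiological time series into one special-token embedding, that token is prepended to the embedded text prompt, and a classification head predicts one of the five ordinal trend classes. The module names, dimensions, and the small Transformer standing in for LLaMA2 are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a PandemicLLM-style setup (not the paper's code):
# an RNN turns an epidemiological time series into a single "special token"
# embedding, which is prepended to the embedded text prompt; a classification
# head then predicts one of five ordinal hospitalization-trend classes.
import torch
import torch.nn as nn

HIDDEN = 256        # stand-in for the LLM hidden size (illustrative)
NUM_CLASSES = 5     # substantial decrease ... substantial increase
VOCAB = 1000        # toy vocabulary for the text-prompt tokens

class TimeSeriesTokenizer(nn.Module):
    """Encodes a (weeks x signals) series into one token-sized embedding."""
    def __init__(self, n_signals: int, hidden: int = HIDDEN):
        super().__init__()
        self.gru = nn.GRU(n_signals, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        _, last_state = self.gru(series)                 # (1, batch, hidden)
        return self.proj(last_state[-1]).unsqueeze(1)    # (batch, 1, hidden)

class TrendClassifier(nn.Module):
    """Prepends the time-series token to the text embeddings, then classifies."""
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, HIDDEN)
        # e.g., hospitalizations, vaccination rate, variant prevalence over time
        self.ts_tokenizer = TimeSeriesTokenizer(n_signals=3)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for LLaMA2
        self.head = nn.Linear(HIDDEN, NUM_CLASSES)

    def forward(self, prompt_ids: torch.Tensor, series: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([self.ts_tokenizer(series), self.text_embed(prompt_ids)], dim=1)
        pooled = self.backbone(tokens).mean(dim=1)
        return self.head(pooled)                         # logits over the 5 classes

# Toy forward pass: a 300-token prompt plus 12 weeks of 3 epidemiological signals.
model = TrendClassifier()
logits = model(torch.randint(0, VOCAB, (2, 300)), torch.randn(2, 12, 3))
print(logits.shape)  # torch.Size([2, 5])
```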
2. Oncology. Clinical decision-making in oncology requires integrating complex multimodal data, yet generalist foundation models struggle to match specialized precision medicine tools. Could equipping large language models with domain-specific tools offer a better approach?
Ferber et al. developed an AI agent that combines GPT-4 with specialized precision oncology tools to support clinical decision-making in cancer care. Rather than pursuing an all-encompassing multimodal foundation model, they equipped GPT-4 with domain-specific tools, including vision transformers for detecting microsatellite instability and KRAS/BRAF mutations from histopathology slides, MedSAM for radiological image segmentation, web search capabilities (Google, PubMed), and access to the OncoKB precision oncology database. The system also draws on a repository of 6,800 medical documents and clinical guidelines from six official sources. When evaluated on 20 realistic multimodal patient cases in gastrointestinal oncology, the AI agent used the appropriate tools correctly 87.5% of the time, reached accurate clinical conclusions in 91% of cases, and provided relevant oncology guidelines 75.5% of the time. Most significantly, the integrated agent dramatically outperformed GPT-4 alone, improving decision-making accuracy from 30.3% to 87.2%. This modular approach offers advantages including individual tool validation, regulatory compliance, superior explainability compared to black-box models, and the ability to cross-reference official guidelines with up-to-date research findings, establishing a foundation for AI-driven personalized oncology support systems (a rough sketch of the tool-calling design follows the link below).
Read Paper | Nature Cancer
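To make the modular design concrete, here is a minimal sketch of a tool-calling loop in the spirit of the paper, assuming a simple name-to-function dispatcher: the LLM planner proposes tool calls, the dispatcher runs the matching domain tools, and the collected evidence goes back to the LLM for synthesis. The tool names, stub outputs, and plan format are invented for illustration and do not reflect the authors' implementation or the real MedSAM/OncoKB APIs.

```python
# Hypothetical sketch of the modular agent: the central LLM plans which
# domain-specific tools to call instead of answering directly; stubs stand in
# for the real models and databases.
from typing import Callable, Dict, List, Tuple

def detect_msi(slide_path: str) -> str:
    """Stand-in for a vision transformer scoring microsatellite instability."""
    return f"MSI prediction computed for {slide_path}"

def segment_lesion(ct_path: str) -> str:
    """Stand-in for MedSAM-style radiological image segmentation."""
    return f"Lesion segmentation produced for {ct_path}"

def query_oncokb(variant: str) -> str:
    """Stand-in for an OncoKB precision-oncology lookup."""
    return f"OncoKB entry retrieved for {variant}"

def search_pubmed(query: str) -> str:
    """Stand-in for a PubMed / web search tool."""
    return f"Top abstracts retrieved for '{query}'"

TOOLS: Dict[str, Callable[[str], str]] = {
    "detect_msi": detect_msi,
    "segment_lesion": segment_lesion,
    "query_oncokb": query_oncokb,
    "search_pubmed": search_pubmed,
}

def run_plan(plan: List[Tuple[str, str]]) -> List[str]:
    """Executes the tool calls proposed by the LLM planner and collects the
    evidence the LLM would then synthesize into a guideline-backed answer."""
    evidence = []
    for tool_name, argument in plan:
        tool = TOOLS.get(tool_name)
        evidence.append(tool(argument) if tool else f"Unknown tool: {tool_name}")
    return evidence

# The planning step (GPT-4 choosing tools) is mocked here with a fixed plan.
plan = [("detect_msi", "slide_017.svs"), ("query_oncokb", "KRAS G12C")]
for finding in run_plan(plan):
    print(finding)
```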
3. Multi-Agent. Do multi-agent clinical AI systems built from the best components actually perform best overall?
Bedi et al. investigated the "Optimization Paradox" in clinical AI multi-agent systems, where individually optimized components fail to produce optimal overall system performance. The study evaluated 2,400 real patient cases from the MIMIC-CDM dataset across four abdominal pathologies, comparing single-agent systems (one model performing all tasks) against multi-agent systems with specialized agents for information gathering, interpretation, and differential diagnosis. Among the multi-agent configurations tested was a "Best of Breed" (BoB) system constructed by selecting the top-performing individual agent for each task independently, resulting in a heterogeneous system that mixes models from different families (GPT, Claude, and Gemini). While multi-agent systems generally outperformed single agents on process metrics, including adherence to clinical protocols (78.4% win rate), correct interpretation of lab results (87.5% win rate), and computational efficiency (76.9% win rate), these improvements did not translate to better diagnostic accuracy. Despite being constructed from the best-performing individual components, the BoB system significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for the best multi-agent system) and showed dramatically higher failure rates, including test result hallucination in 13.87% of cases versus 0.42% for the top-performing system. Error analysis points to coordination issues between components, possibly due to incompatible formatting, communication styles, and downstream adaptability limitations. The findings emphasize the need for end-to-end system validation rather than reliance on component-level optimization alone when deploying healthcare AI (a rough sketch of the best-of-breed composition follows the link below).
Read Paper | arXiv
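The best-of-breed construction can be illustrated with a small hypothetical sketch: each task's candidate agents are scored in isolation, the top scorer per task is selected, and the picks are chained into one pipeline whose end-to-end diagnostic accuracy must then be measured separately. The agent stubs, scores, and case text are invented for illustration and are not the study's models or numbers.

```python
# Hypothetical sketch of "Best of Breed" composition: pick the top-scoring agent
# per task in isolation, chain the picks, and note that end-to-end accuracy is a
# separate measurement from the per-task scores used to assemble the pipeline.
from typing import Callable, Dict, List

Agent = Callable[[str], str]

# Candidate agents per task (stubs standing in for GPT, Claude, and Gemini models).
def gather_gpt(case: str) -> str: return case + " | labs ordered"
def gather_claude(case: str) -> str: return case + " | labs and imaging ordered"
def interpret_gemini(findings: str) -> str: return findings + " | lipase elevated"
def diagnose_gpt(summary: str) -> str: return summary + " | working diagnosis proposed"

CANDIDATES: Dict[str, Dict[str, Agent]] = {
    "information_gathering": {"gpt": gather_gpt, "claude": gather_claude},
    "interpretation": {"gemini": interpret_gemini},
    "diagnosis": {"gpt": diagnose_gpt},
}

# Component-level scores measured on each task in isolation (illustrative numbers).
TASK_SCORES: Dict[str, Dict[str, float]] = {
    "information_gathering": {"gpt": 0.74, "claude": 0.81},
    "interpretation": {"gemini": 0.88},
    "diagnosis": {"gpt": 0.79},
}

def best_of_breed() -> List[Agent]:
    """Picks the highest-scoring agent per task, ignoring cross-agent compatibility."""
    return [CANDIDATES[task][max(scores, key=scores.get)]
            for task, scores in TASK_SCORES.items()]

def run_pipeline(agents: List[Agent], case: str) -> str:
    """Chains the selected agents; each consumes the previous agent's output."""
    state = case
    for agent in agents:
        state = agent(state)
    return state

# End-to-end evaluation (diagnostic accuracy, hallucination rate) must be run on
# the composed pipeline; it is not implied by the per-task selection scores.
print(run_pipeline(best_of_breed(), "case: adult patient with acute abdominal pain"))
```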
4. Mental Health. Generative AI chatbots have surpassed previous rule-based chatbots and mental health apps in synthesizing information and customizing care plans, but clear evaluation frameworks are needed amid AI hype.
In this perspective, Torous and Topol discuss the first randomized trial of a generative AI chatbot (Therabot) for mental health treatment, which showed promising results in reducing symptoms of depression, anxiety, and eating disorders at 4 and 8 weeks compared to waiting-list controls (an untreated comparison group). However, the authors caution against AI hype and propose three key considerations for assessment. First, they emphasize that studies comparing AI interventions to waiting-list controls provide only preliminary evidence, similar to early-phase drug studies that explore feasibility and safety rather than efficacy. Second, they stress the importance of demonstrating long-term rather than just short-term benefits, noting that today's digital therapeutics and health apps have struggled with long-term outcomes and sustained engagement. Third, they argue that without the ability to assume medical and legal liability, AI tools cannot deliver independent care and still require human professional oversight. Overall, AI must demonstrate efficacy, effectiveness, safety, and cost-effectiveness in healthcare settings before clinical integration.
Read Paper | The Lancet
-- Emma Chen, Pranav Rajpurkar & Eric Topol