In Week #251 of the Doctor Penguin newsletter, the following papers caught our attention:
1. Dermatology Foundation Model. While foundation models have shown promise in ophthalmology and radiology, developing effective multimodal AI for dermatology presents unique challenges because of the multiple imaging modalities involved and the complexity of diagnostic workflows.
Yan et al. developed PanDerm, a general-purpose multimodal dermatology foundation model trained on over 2 million skin images spanning four imaging modalities (dermoscopy, clinical photography, total-body photography, and dermatopathology) from 11 clinical institutions. Trained with self-supervised learning that combines masked latent modeling with CLIP feature alignment, PanDerm shows remarkable data efficiency when fine-tuned for downstream tasks, often outperforming existing models across 28 diverse benchmarks while using only 10% of the labeled data. The model also exhibited clear scaling behavior: performance continued to improve significantly as training data grew from 0.8 to 1.8 million images. Reader studies showed PanDerm outperformed clinicians by 10.2% in early melanoma detection, improved clinicians' diagnostic accuracy by 11% for skin cancer diagnosis, and improved non-dermatologist healthcare providers' differential diagnosis by 16.5% across diverse skin conditions. The authors highlight two key technical insights: using CLIP as a teacher model achieved better training-data efficiency than DINOv2, a representative self-supervised method (especially valuable given the limited amount of medical data), and masked feature reconstruction captured subtle diagnostic features more effectively than alternatives such as MAE.
Read Paper | Nature Medicine
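To make the masked feature reconstruction objective concrete, here is a minimal NumPy sketch of the loss: a student network predicts a frozen teacher's (e.g., CLIP's) patch features at masked positions, and the loss is averaged over masked patches only. All shapes and values are illustrative; this is not PanDerm's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_feature_reconstruction_loss(student_pred, teacher_feat, mask):
    """MSE between student predictions and teacher features,
    averaged over the masked patch positions only."""
    diff = (student_pred - teacher_feat) ** 2
    per_token = diff.mean(axis=-1)               # (batch, tokens)
    return (per_token * mask).sum() / mask.sum()

# Toy setup: 2 images, 16 patch tokens, 8-dim features.
batch, tokens, dim = 2, 16, 8
teacher = rng.normal(size=(batch, tokens, dim))  # stands in for frozen CLIP features
student = teacher + 0.1 * rng.normal(size=(batch, tokens, dim))  # imperfect predictions
mask = (rng.random((batch, tokens)) < 0.6).astype(float)  # ~60% of patches masked

loss = masked_feature_reconstruction_loss(student, teacher, mask)
print(round(float(loss), 4))
```

In a real training loop the student would see only the unmasked patches and be optimized to reconstruct the teacher's features at the masked ones; here both tensors are random stand-ins just to show the objective.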
2. Scaling Laws. Electronic health records (EHRs) consist of temporally ordered clinical events and attributes, forming a natural sequence akin to text in language modeling. While scaling laws have advanced large language model development by enabling predictable performance improvements with increased model size and data, these principles remain unexplored for EHR data.
Zhang et al. present the first systematic investigation of scaling laws for EHR foundation models, demonstrating that transformer architectures trained on patient timeline data exhibit predictable scaling behavior similar to large language models. The modeling approach formulates EHR data as variable-length sequences of medical tokens representing diagnoses, medications, lab results, and procedures, with models trained autoregressively to predict the next token. Using the MIMIC-IV database, which contains over 200,000 patient records, they trained Llama architectures of varying sizes (1M to 982M parameters) from scratch and found clear power-law relationships between compute budget, optimal model size, and training tokens, with IsoFLOP curves showing parabolic (U-shaped) loss patterns that identify the optimal model size for a given computational budget. Zero-shot evaluation on ICU mortality and 30-day readmission prediction showed consistent improvements up to 28M parameters before performance saturated, likely due to insufficient training data. These findings suggest that continued gains in clinical task performance from scaling are possible, but only when accompanied by proportional increases in training data.
Read Paper | arXiv
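The power-law fits at the heart of such scaling-law studies reduce to linear regression in log-log space: if loss follows L(C) = a * C^(-b), then log L = log a - b * log C. The sketch below fits synthetic (compute, loss) pairs; the constants are made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic (compute, loss) pairs following L(C) = a * C^(-b), with small log-space noise.
a_true, b_true = 12.0, 0.07
compute = np.logspace(15, 21, 20)                  # FLOPs (illustrative range)
loss = a_true * compute ** (-b_true) * np.exp(0.01 * rng.normal(size=20))

# Fit the power law by linear regression in log-log space:
#   log L = log a - b * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b_fit, a_fit = -slope, np.exp(intercept)
print(f"fitted exponent b = {b_fit:.3f}, coefficient a = {a_fit:.2f}")
```

The same log-log regression, applied per compute budget, is how the U-shaped IsoFLOP curves are summarized into an "optimal model size vs. compute" relationship.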
3. Autism. Autism presents with a wide variety of symptoms and severity levels. As a result, healthcare practitioners today have little choice but to rely on clinical intuition, honed through years of training and first-hand experience, to reach an accurate diagnosis. How can we utilize this accumulated clinical know-how?
Stanley et al. used large language models to deconstruct expert clinician intuition from clinical reports and inform our understanding of autism. They fine-tuned a French language model (RoBERTa-based) on over 4,000 clinical reports from more than 1,000 children with suspected autism, achieving 79.4% accuracy in distinguishing confirmed autism cases from those ruled out. Using an attention mechanism to identify the most diagnostically relevant sentences in each report, the study found that repetitive and stereotyped behaviors, special interests, and sensory-based characteristics were much more predictive of an autism diagnosis than social communication deficits. This challenges the current emphasis in diagnostic criteria such as the DSM-5, which heavily weights social deficits, and suggests that clinical intuition relies more on observable repetitive behaviors and sensory responses when making accurate autism diagnoses. The study demonstrates that large language models can distill clinical intuition from accumulated experience and first-hand observations to reveal what drives expert medical diagnosis.
Read Paper | Cell
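The sentence-level attention idea can be sketched in a few lines: score each sentence embedding against a learned query vector, softmax the scores into attention weights, and rank sentences by weight. Everything below (sentences, embeddings, query) is a hypothetical toy, not the paper's model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sentence embeddings (4 sentences, 5-dim) and a hypothetical diagnostic query.
sentences = [
    "Plays with the same toy for hours, lining objects up in rows.",
    "Enjoys family meals.",
    "Covers ears in response to everyday household sounds.",
    "Attended school regularly last year.",
]
emb = np.array([
    [0.9, 0.1, 0.8, 0.0, 0.2],
    [0.1, 0.7, 0.0, 0.3, 0.1],
    [0.8, 0.0, 0.9, 0.1, 0.3],
    [0.0, 0.6, 0.1, 0.4, 0.0],
])
query = np.array([1.0, -0.5, 1.0, -0.2, 0.3])

weights = softmax(emb @ query)       # attention distribution over sentences
ranked = np.argsort(weights)[::-1]   # most relevant sentences first
for i in ranked:
    print(f"{weights[i]:.2f}  {sentences[i]}")
```

Inspecting which sentences receive high attention weight across many reports is what lets the authors characterize which symptom descriptions drive the model's (and, by proxy, the clinicians') diagnoses.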
4. FDA Postmarket Surveillance. This study provides the first systematic assessment of the FDA's postmarket surveillance of AI/ML medical devices.
Babic et al. analyzed adverse event reports from the MAUDE database, the FDA's system for reporting adverse events associated with medical devices, for approximately 950 AI/ML devices approved between 2010 and 2023. They identified deficiencies in the current reporting system, including extensive missing data (Event Location was missing for 100% of AI/ML reports), inadequate event classifications that often misattribute user errors as device malfunctions, and the concentration of over 98% of adverse events in fewer than five devices (compared to 85% for non-AI/ML devices). Most critically, the existing framework fails to capture AI/ML-specific problems such as concept drift, covariate shift, and algorithmic instability. Additionally, the adverse event reporting system is designed at the patient/case level to identify individual issues, but many problems associated with AI/ML devices, such as miscalibration, present only at a population level. The authors conclude that MAUDE, originally designed for traditional hardware-based devices, cannot adequately assess AI/ML device safety, and they propose two solutions: enhancing the current system with quarterly performance updates that include pairwise patient comparisons and vulnerability assessments, or adopting a "nutrition label" approach with transparent reporting of training data, performance metrics, and deployment conditions within a cooperative reporting regime involving manufacturers, users, and regulators.
Read Paper | npj Digital Medicine
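Why miscalibration is invisible at the case level but obvious at the population level can be shown with a tiny simulation (illustrative numbers only): an overconfident model assigns every patient a 0.9 risk when the true event rate is 0.6. No single prediction is provably wrong on its own, yet the aggregate predicted and observed rates diverge sharply.

```python
import numpy as np

rng = np.random.default_rng(2)

# An overconfident model: predicts 0.9 risk for a cohort whose true event rate is 0.6.
n = 10_000
pred = np.full(n, 0.9)
outcomes = (rng.random(n) < 0.6).astype(int)

# Case level: any individual 0.9 prediction is plausible, whether or not the event occurs.
# Population level: comparing mean predicted risk with the observed event rate
# exposes the miscalibration that case-by-case adverse event reports cannot.
print(f"mean predicted risk: {pred.mean():.2f}")
print(f"observed event rate: {outcomes.mean():.2f}")
```

This is exactly the kind of population-level check that per-case MAUDE reports cannot surface, which motivates the authors' proposal for periodic aggregate performance reporting.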
-- Emma Chen, Pranav Rajpurkar & Eric Topol