Poster Presentation at Spoke 9 Congress - AI-Augmented Delphi
our work "AI-Augmented Delphi: Design and Evaluation of a Human-Aligned LLM Workflow for Accelerating Consensus" was selected for a poster presentation at the Spoke 9 Congress, "The Pharmacology of RNA Drugs: an Unmet Pharmacological Need Tackled by the National Centre of RNA Drugs", held in Milan on November 4-5, 2025 at the beautiful Assolombarda center.
The Work
developed at Helaglobe, the study introduces an AI-assisted workflow for the Delphi consensus process, applied to the field of RNA-based therapeutics, a fast-growing area that brings both exciting opportunities and new regulatory and safety challenges.
what is the Delphi consensus method?
- a panel of experts independently rates a set of statements
- results are aggregated and shared with the group
- statements that don't reach agreement are revised and re-evaluated in successive rounds
- widely used in clinical guidelines, medical research, and policy
- the bottleneck: the revision phase is slow and demanding, since experts must review feedback, verify literature, and rewrite statements by hand
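the aggregation step of a Delphi round can be sketched in a few lines. this is a minimal illustration, not the study's actual scoring code: the 1-5 rating scale, the rule that a rating of 4 or 5 counts as agreement, and the names `agreement_share` and `needs_revision` are all assumptions made for the example. the 0.75 threshold matches the value used in this study.

```python
from typing import Dict, List

AGREE_MIN = 4      # assumption: ratings of 4 or 5 on a 1-5 scale count as agreement
THRESHOLD = 0.75   # share of panelists required for consensus


def agreement_share(ratings: List[int]) -> float:
    """Fraction of panelists whose rating counts as agreement."""
    if not ratings:
        raise ValueError("a statement needs at least one rating")
    return sum(r >= AGREE_MIN for r in ratings) / len(ratings)


def needs_revision(panel: Dict[str, List[int]]) -> List[str]:
    """Return the ids of statements that fall below the consensus threshold."""
    return [sid for sid, ratings in panel.items()
            if agreement_share(ratings) < THRESHOLD]


# hypothetical ratings from a four-person panel
round_1 = {
    "S1": [5, 4, 4, 5],  # 4 of 4 agree
    "S2": [5, 4, 2, 3],  # 2 of 4 agree
    "S3": [4, 4, 5, 2],  # 3 of 4 agree, exactly at the threshold
}
print(needs_revision(round_1))  # ['S2']
```

a statement sitting exactly at the threshold is treated as having reached consensus here; a real protocol would state that boundary rule explicitly.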
the core question we set out to answer: can a multi-agent AI system replicate the quality of human expert revision in a Delphi process, while accelerating consensus formation?
Methods
fifty international panelists (clinicians, researchers, and patient representatives) were split into two parallel groups of 25, each evaluating the same 28 clinical statements in a controlled Delphi process:
- arm A: traditional human-led revision
- arm B: AI-assisted revision under expert supervision
after round 1, statements that failed to reach the 75% agreement threshold were selected for revision. in arm B, the AI workflow, powered by GPT-4.1, ran three sequential agents:
- reference detection agent: identifies missing or relevant citations
- PDF summarization agent: extracts and summarizes supporting literature
- statement revision agent: generates evidence-anchored rewrites with explicit change logs and rationale
to ensure evidence grounding, a hybrid RAG module combined a dense retriever (FAISS, weight 0.7) and a sparse retriever (BM25, weight 0.3). all AI-generated outputs underwent dual expert review before entering round 2, a human-in-the-loop approach to maintain factual accuracy and clinical plausibility.
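one common way to combine a dense and a sparse retriever with fixed weights is a weighted sum of normalized scores. the sketch below shows that idea with the 0.7 / 0.3 weights reported above. it is an assumption that the scores were fused this way (rank-based fusion is another option), and the function names, document ids, and scores are hypothetical. it uses plain dictionaries rather than FAISS or BM25 calls, and it assumes higher scores mean more relevant in both inputs.

```python
from typing import Dict

DENSE_WEIGHT = 0.7   # weight reported for the dense (FAISS) retriever
SPARSE_WEIGHT = 0.3  # weight reported for the sparse (BM25) retriever


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # all candidates tie, so give each the full normalized score
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(dense: Dict[str, float], sparse: Dict[str, float]) -> Dict[str, float]:
    """Weighted sum of normalized scores; a missing document scores 0."""
    d, s = min_max(dense), min_max(sparse)
    return {doc: DENSE_WEIGHT * d.get(doc, 0.0) + SPARSE_WEIGHT * s.get(doc, 0.0)
            for doc in set(d) | set(s)}


# hypothetical scores: similarity for dense, BM25 score for sparse
dense_scores = {"doc_a": 0.9, "doc_b": 0.5, "doc_c": 0.1}
sparse_scores = {"doc_b": 12.0, "doc_c": 3.0}

fused = fuse(dense_scores, sparse_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

if the dense index returns distances instead of similarities (lower is better), the scores would need to be inverted before normalization.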
Results
| arm | round 1 consensus | round 2 consensus | improvement |
|---|---|---|---|
| arm A: human Delphi | 71% (20/28) | 93% (26/28) | +21.4 pp |
| arm B: AI-assisted Delphi | 46% (13/28) | 86% (24/28) | +39.3 pp |
starting from a lower round 1 baseline, the AI-assisted workflow raised the number of statements at consensus from 13 to 24 (+39.3 percentage points), against 20 to 26 in the human-led arm (+21.4 percentage points). its final consensus rate came close to the expert-led arm (86% vs 93%) while substantially speeding up the revision process. only 4 statements in arm B and 2 in arm A remained below threshold after round 2.
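the percentage-point figures follow directly from the statement counts. a short check, using only the counts reported in the table (the helper names are illustrative):

```python
TOTAL_STATEMENTS = 28


def consensus_rate(reached: int, total: int = TOTAL_STATEMENTS) -> float:
    """Share of statements at consensus, in percent."""
    return 100.0 * reached / total


def gain_pp(round_1: int, round_2: int, total: int = TOTAL_STATEMENTS) -> float:
    """Improvement between rounds, in percentage points."""
    return consensus_rate(round_2, total) - consensus_rate(round_1, total)


# (statements at consensus after round 1, after round 2)
arms = {"arm A": (20, 26), "arm B": (13, 24)}

for arm, (r1, r2) in arms.items():
    print(f"{arm}: {consensus_rate(r1):.1f}% -> {consensus_rate(r2):.1f}% "
          f"(+{gain_pp(r1, r2):.1f} pp)")
# arm A: 71.4% -> 92.9% (+21.4 pp)
# arm B: 46.4% -> 85.7% (+39.3 pp)
```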
Key Takeaways
- AI can closely match expert performance in structured consensus workflows when properly supervised
- retrieval-augmented generation is key: grounding revisions in verified evidence helps prevent hallucinations and supports auditability
- human-in-the-loop is not optional: it's what makes the system trustworthy and deployable in clinical settings
- the approach is domain-agnostic and could be extended to any Delphi process beyond RNA therapeutics
- remaining challenges: reference quality dependency, structured data requirements, and continuous expert oversight
Thanks
thanks to everyone who made this work possible: Davide Cafiero, Fabio Tedone, Elena Caproni, and Lucia Politi, and to the Helaglobe team.
this research was supported by the Piano Nazionale di Ripresa e Resilienza (PNRR), within the National Center for Gene Therapy and Drugs based on RNA Technology, in collaboration with the Department of Pharmaceutical Sciences, University of Milan.






