Introduction:
This preliminary report presents a comparative assessment of six large language models (LLMs) on 100 self-assessment multiple-choice questions (MCQs) from the American Academy of Ophthalmology (AAO) in pediatric ophthalmology.
Methods:
We evaluated the performance of the following LLMs: mixtral-8x7b-instruct-v0.1, wizardlm-70b, Gemini-pro-dev-api, pplx-70b-online, claude2.1, and gpt-4-turbo. Our assessment focused on two aspects. First, we measured each model's accuracy, defined as the proportion of the 100 MCQs answered correctly. Second, we examined the incorrect answers across the LLMs to determine whether certain incorrect responses were chosen by all models.
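As an illustration of this analysis, the minimal sketch below computes per-model accuracy and flags questions that every model answered incorrectly, noting whether they agreed on the same distractor. The input file name (responses.csv) and its column layout are hypothetical assumptions for illustration, not part of the study materials.

```python
# Minimal sketch of the scoring analysis, assuming a hypothetical CSV
# (responses.csv) with one row per question and columns: question_id,
# correct_option, plus one column of chosen options per model.
import csv
from collections import defaultdict

MODELS = ["mixtral-8x7b-instruct-v0.1", "wizardlm-70b", "Gemini-pro-dev-api",
          "pplx-70b-online", "claude2.1", "gpt-4-turbo"]

correct = defaultdict(int)   # number of correct answers per model
total = 0
shared_errors = []           # questions that every model answered incorrectly

with open("responses.csv", newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        wrong_choices = set()
        for m in MODELS:
            if row[m] == row["correct_option"]:
                correct[m] += 1
            else:
                wrong_choices.add(row[m])
        # If all models were wrong, record whether they picked the same distractor.
        if all(row[m] != row["correct_option"] for m in MODELS):
            shared_errors.append((row["question_id"], len(wrong_choices) == 1))

for m in MODELS:
    print(f"{m}: {100 * correct[m] / total:.0f}% correct")
print(f"Questions missed by all models: {len(shared_errors)}")
print(f"  ...with the same distractor chosen by all: {sum(same for _, same in shared_errors)}")
```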
Results:
Our findings indicate that GPT-4 Turbo outperformed the other LLMs on the pediatric ophthalmology MCQs, achieving an accuracy of 79%. Claude 2.1 followed closely with 74%, while Mixtral scored 63%, pplx-70b-online 59%, Gemini Pro 57%, and WizardLM 53%.
Furthermore, our analysis revealed variation in the incorrect answers given by the LLMs. Some incorrect responses, however, were shared across all models, suggesting common gaps in their training data or reasoning capabilities.
Conclusions:
This preliminary report highlights the varying performance of large language models on AAO self-assessment MCQs in pediatric ophthalmology. GPT-4 Turbo achieved the highest accuracy, reinforcing its effectiveness at answering these questions and consistent with previous reports. Our findings also suggest that a range of LLMs can offer comparable answers, which could be beneficial for educational purposes, particularly for students studying pediatric ophthalmology.