Citation

Singavarapu J, Khemlani A, Uddin R, et al. (2025) Evaluating AI Chatbot Information on Trending Topics in Anesthesiology. Int J Anesthetic Anesthesiol 12:188. doi.org/10.23937/2377-4630/1410188

Original Article | OPEN ACCESS DOI: 10.23937/2377-4630/1410188

Evaluating AI Chatbot Information on Trending Topics in Anesthesiology

Joshua Singavarapu, BA1*, Amber Khemlani, BA1, Rafat Uddin, BA1, Darsiya Krishnathansan, MS2, Harsh Reshamwala, BS3 and Michael Mahla, MD1

1SUNY Downstate Department of Anesthesiology, Brooklyn, NY, USA

2Thrombosis Research Group, Brigham and Women's Hospital, Boston, MA, USA

3Cooper Union University, New York, NY, USA

Abstract

Background: Artificial intelligence (AI) is increasingly being utilized as an informational resource, with chatbots attracting users for their ability to generate instantaneous responses. This study evaluates the responses of four AI chatbots - Gemini, ChatGPT, Copilot, and Perplexity - to queries on general, local, and regional anesthesia. The responses were assessed for understandability, actionability, readability, quality of information, and potential misinformation, measured using the DISCERN, PEMAT, and Flesch-Kincaid reading scores.

Methods: The input prompts for the four chatbots were created from the top Google Trends search terms for general anesthesia, local anesthesia, and regional anesthesia from March 8th, 2020 to March 8th, 2025. The AI chatbot outputs were assessed using the following validated tools: the Patient Education Materials Assessment Tool (PEMAT) for understandability and actionability, DISCERN for quality of information, and the Flesch-Kincaid formula for readability. Potential misinformation was evaluated using the American Society of Anesthesiologists (ASA) guidelines. Three blinded reviewers (A.K., J.S., R.U.) independently adjudicated chatbot responses. Statistical analysis included the chi-square test for PEMAT understandability and actionability scores and the Kruskal-Wallis test for DISCERN and Flesch-Kincaid scores, with post-hoc pairwise comparisons conducted using the Mann-Whitney U test with Bonferroni adjustment.

Results: Perplexity (p < 0.001), ChatGPT (p = 0.001), and Gemini (p = 0.001) showed significantly higher rates for understandability than Copilot, though no significant differences were found among Perplexity, ChatGPT, and Gemini. No significant differences were seen for actionability. Perplexity had a significantly higher DISCERN score than ChatGPT (p < 0.001), Gemini (p < 0.001), and Copilot (p < 0.001). There were statistically significant differences in readability between Perplexity and Gemini (p < 0.001), as well as between ChatGPT and Gemini (p = 0.005).

Conclusions: This study is one of the first to evaluate how chatbots can process queries on anesthesiology. As AI continues to evolve, it is likely to become a primary source of scientific information for patient understanding. Reviewing how this information is disseminated is crucial, as it allows us to gauge whether and how AI chatbots can be beneficial for patient use and recommendation.

Keywords

Artificial intelligence, General anesthesia, Local anesthesia, Regional anesthesia, Chatbots

Abbreviations

AI: Artificial Intelligence, LLMs: Large Language Models, PEMAT: Patient Education Materials Assessment Tool, ASA: American Society of Anesthesiologists

Introduction

Artificial Intelligence (AI) is the ability of machines to simulate human cognitive functions, including reasoning, learning, and problem-solving [1]. Large Language Models (LLMs) are advanced AI systems that are pre-trained on extensive datasets to enhance responses. Examples of pre-trained LLMs include Perplexity, ChatGPT, Gemini, and Copilot. In medicine, pre-trained LLMs contribute to the streamlining of medical processes, such as medical documentation, billing, and appointment scheduling [2]. Additionally, AI can help provide clinical support and diagnosis, even helping the physician create differential diagnoses and clinical treatments [3]. Thus, AI has invaluable potential to revolutionize the healthcare industry, ultimately saving lives and reducing costs [4].

In anesthesiology, AI has transformed clinical practice, with the ability to predict hypotensive episodes post-induction, identify respiratory depression perioperatively, and assess pre-operative patient acuity [5]. In regional blocks, AI has also been helpful in ultrasound-guided nerve blocks through image classification and anatomy identification [6]. However, beyond the benefits of AI for the clinician, there must also be an assessment of how these LLMs shape patient education and understanding of complex medical procedures.

This paper evaluates the effectiveness of AI chatbots in delivering patient education on the three primary types of anesthesia - general, local, and regional - when given common prompts in each field. General anesthesia is one of the most common anesthetic techniques used in surgeries worldwide. It is often the default approach for major procedures because it provides complete unconsciousness, pain control, and muscle relaxation [7]. Additionally, local anesthesia is frequently used in dental practice, as it is essential for pain control during routine and surgical dental treatments. It is regarded as the standard of care, especially for outpatient and minimally invasive procedures [8]. Regional anesthesia was chosen as the third type because of its increasing use in vascular, orthopedic, and trauma surgeries [9], along with a decrease in patient complications when compared to general anesthesia [10]. As of 2025, Gemini, ChatGPT 4.0, Copilot, and Perplexity are the most used chatbots based on market share and usage rate [11], and their responses were assessed on understandability, actionability, readability, response quality, and potential misinformation. These characteristics were measured using the DISCERN, PEMAT, and Flesch-Kincaid reading scores.

Methods

This cross-sectional study was exempt from review and informed consent due to its use of publicly available data. The top four Google (Alphabet, Inc.) search queries related to general anesthesia, local anesthesia, and regional anesthesia in the United States over the past five years, from March 8th, 2020 to March 8th, 2025, were identified through Google Trends and inputted into four AI chatbots: Perplexity, ChatGPT, Gemini, and Copilot. The latest publicly available chatbot versions as of March 8th, 2025 were used to generate responses for each query. To ensure there was no bias within the chatbot responses and no prior history affecting subsequent responses, a new conversation was initiated for every search term. The search terms were entered into the AI chatbots exactly as they appeared in Google Trends. The chatbot responses were recorded and shared among graders (Supplemental Table 1).
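For readers who wish to reproduce a comparable query list programmatically, the sketch below illustrates one possible approach using the unofficial pytrends Python library. This was not part of the study workflow (queries were taken directly from the Google Trends interface), and the search terms, timeframe string, and library calls shown are assumptions for illustration only.

```python
# Hypothetical sketch: pulling top related queries for each anesthesia term
# via the unofficial pytrends library (the study used the Google Trends site).
from pytrends.request import TrendReq

TERMS = ["general anesthesia", "local anesthesia", "regional anesthesia"]
TIMEFRAME = "2020-03-08 2025-03-08"  # five-year window matching the study

pytrends = TrendReq(hl="en-US", tz=360)

top_queries = {}
for term in TERMS:
    pytrends.build_payload([term], timeframe=TIMEFRAME, geo="US")
    related = pytrends.related_queries()  # dict: term -> {"top": df, "rising": df}
    top_df = related[term]["top"]
    # Keep the four highest-ranked related queries, mirroring the study design.
    top_queries[term] = top_df["query"].head(4).tolist() if top_df is not None else []

for term, queries in top_queries.items():
    print(term, "->", queries)
```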

The actionability and understandability of the responses were assessed through the Patient Education Materials Assessment Tool (PEMAT) (scores of 0%-100%, with higher scores indicating a higher level of understandability and actionability), and the quality of responses was assessed through DISCERN (overall scores of 1 [low] to 5 [high] for quality of information). PEMAT understandability assesses how easy it is for the average reader to process and understand given information. The PEMAT actionability score evaluates how well the reader can identify further actions to take after reading the given material. DISCERN, on the other hand, looks at the quality of chatbot responses, assessing how reliable, comprehensive, and balanced a given set of information is in helping patients make informed decisions. Three blinded reviewers (A.K., J.S., R.U.) independently adjudicated chatbot responses. Statistical analysis included the chi-square test for PEMAT understandability and actionability scores and the Kruskal-Wallis test for DISCERN and Flesch-Kincaid scores, with post-hoc pairwise comparisons conducted using the Mann-Whitney U test with Bonferroni adjustment. Potential misinformation in chatbot responses was evaluated using the American Society of Anesthesiologists (ASA) guidelines. Chatbot responses are attached as supplemental material.
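To make the scoring concrete, the illustrative sketch below (not the authors' code) shows how a PEMAT percentage score can be computed from item-level ratings and how the standard Flesch-Kincaid Grade Level formula, 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59, can be applied to a response. The syllable counter is a simplified heuristic, and the sample text and item ratings are hypothetical.

```python
# Illustrative scoring helpers (not the authors' code): a PEMAT percentage score
# and the standard Flesch-Kincaid Grade Level formula.
import re

def pemat_score(item_ratings):
    """PEMAT score = agreed items / applicable items, as a percentage.
    item_ratings: list of 1 (agree), 0 (disagree), or None (not applicable)."""
    applicable = [r for r in item_ratings if r is not None]
    return 100.0 * sum(applicable) / len(applicable) if applicable else float("nan")

def count_syllables(word):
    """Rough vowel-group heuristic; real tools use dictionaries or better rules."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# Hypothetical response excerpt and item ratings, for demonstration only.
sample = "General anesthesia makes you unconscious during surgery. You will not feel pain."
print(round(flesch_kincaid_grade(sample), 1))
print(pemat_score([1, 1, 0, 1, None]))  # -> 75.0
```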

Results

To evaluate the performance of the four chatbots (Perplexity, ChatGPT, Gemini, and Copilot) on understandability and actionability, chi-square tests were conducted on success/failure counts. For understandability, the overall chi-square test revealed significant differences across chatbots, χ²(3) = 20.73, p < 0.001. Post-hoc pairwise comparisons with Bonferroni adjustment (α = 0.0083) showed that Perplexity (82.27% success rate) significantly outperformed Copilot (71.05%, χ² = 14.43, p < 0.001). Similarly, ChatGPT (81.21%, χ² = 11.44, p = 0.001) and Gemini (80.92%, χ² = 10.80, p = 0.001) also demonstrated significantly higher success rates compared to Copilot. However, no significant differences were observed among Perplexity, ChatGPT, and Gemini. In contrast, the actionability analysis showed no significant overall differences, χ²(3) = 3.42, p = 0.331, with success rates ranging from 22.62% (Copilot) to 30.95% (Perplexity). Pairwise comparisons for actionability also yielded no significant results after Bonferroni correction.
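For illustration, the overall and pairwise chi-square procedure with a Bonferroni-adjusted threshold can be sketched as follows. The success/failure counts in this example are hypothetical placeholders, not the study's data, and the implementation relies on scipy.

```python
# Hypothetical sketch of the overall and pairwise chi-square comparisons with a
# Bonferroni-adjusted alpha; the counts below are placeholders, not study data.
from itertools import combinations
from scipy.stats import chi2_contingency

counts = {  # {chatbot: (successes, failures)} -- illustrative numbers only
    "Perplexity": (278, 60),
    "ChatGPT": (274, 63),
    "Gemini": (273, 64),
    "Copilot": (240, 98),
}

# Omnibus test across all four chatbots.
chi2, p, dof, _ = chi2_contingency([list(v) for v in counts.values()])
print(f"Overall: chi2({dof}) = {chi2:.2f}, p = {p:.4f}")

# Pairwise 2x2 comparisons; six comparisons give alpha = 0.05/6, about 0.0083.
alpha = 0.05 / 6
for a, b in combinations(counts, 2):
    chi2, p, _, _ = chi2_contingency([counts[a], counts[b]])
    flag = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.4f} ({flag})")
```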

The Kruskal-Wallis test was used to assess overall differences in DISCERN scores across the chatbots, with post-hoc pairwise comparisons conducted using the Mann-Whitney U test. Bonferroni correction was applied to adjust for multiple comparisons, with significance set at α = 0.0083. Significant differences were observed in the overall Kruskal-Wallis test (H = 32.97, p < 0.001), with post-hoc tests revealing that Perplexity had significantly higher DISCERN scores compared to ChatGPT (p < 0.001), Gemini (p < 0.001), and Copilot (p < 0.001). Additionally, Gemini demonstrated significantly higher scores than Copilot (p = 0.007). No significant differences were found between ChatGPT and Gemini (p = 0.091) or between ChatGPT and Copilot (p = 0.163).
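The analysis of the ordinal DISCERN scores can be sketched in the same way, with an omnibus Kruskal-Wallis test followed by Bonferroni-corrected Mann-Whitney U comparisons. The score arrays below are illustrative placeholders, not the study's actual ratings.

```python
# Hypothetical sketch of the DISCERN analysis: Kruskal-Wallis across all four
# chatbots, then pairwise Mann-Whitney U tests with Bonferroni correction.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

discern = {  # {chatbot: overall DISCERN ratings (1-5)} -- illustrative values only
    "Perplexity": [5, 4, 5, 4, 5, 4, 5, 4, 4, 5, 5, 4],
    "ChatGPT":    [4, 3, 4, 4, 3, 4, 4, 3, 4, 3, 4, 4],
    "Gemini":     [4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3],
    "Copilot":    [3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 4],
}

# Omnibus test across all chatbots.
H, p = kruskal(*discern.values())
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.4f}")

# Pairwise comparisons, judged against the Bonferroni-adjusted alpha.
alpha = 0.05 / 6  # ~0.0083 for six pairwise comparisons
for a, b in combinations(discern, 2):
    U, p = mannwhitneyu(discern[a], discern[b], alternative="two-sided")
    flag = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: U = {U:.1f}, p = {p:.4f} ({flag})")
```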

The Kruskal-Wallis test was used to assess overall differences in readability across the chatbots, revealing significant variability, H(3) = 11.72, p = 0.008. Post-hoc pairwise comparisons with Bonferroni correction (α = 0.0083) were conducted using the Mann-Whitney U test. Significant differences were found between Perplexity and Gemini (p < 0.001) and between ChatGPT and Gemini (p = 0.005). No significant differences were observed between Perplexity and ChatGPT (p = 0.686), Perplexity and Copilot (p = 0.977), ChatGPT and Copilot (p = 0.728), or Gemini and Copilot (p = 0.040). Mean Flesch-Kincaid Grade Level scores were as follows: Perplexity (M = 9.92, SD = 1.79), ChatGPT (M = 10.45, SD = 1.14), Gemini (M = 12.19, SD = 1.23), and Copilot (M = 10.37, SD = 1.72).

Discussion

This study showed that for PEMAT understandability, Perplexity, ChatGPT, and Gemini scored significantly higher than Copilot (Figure 1). For PEMAT actionability, although there were no significant differences between the chatbots, Perplexity had the highest score and Copilot the lowest (Figure 2). For DISCERN, Perplexity had significantly higher scores than the other three chatbots, and Copilot had the lowest average DISCERN score (Figure 3). When assessing readability, Perplexity was significantly easier to read than Gemini and had the lowest Flesch-Kincaid grade level among the chatbots, indicating the easiest-to-read responses. Gemini had the highest grade-level score, indicating the most difficult-to-read responses, with significant differences when compared to ChatGPT and Perplexity (Figure 4).

Figure 1: PEMAT understandability scores across chatbots.

Figure 2: PEMAT actionability scores across chatbots.

Figure 3: DISCERN scores across chatbots.

Figure 4: Flesch-Kincaid grade level scores across chatbots.

Though Perplexity did not consistently show significant differences when compared to the other chatbots, it generally performed well across the four evaluation criteria. This highlights how Perplexity produces responses that are comprehensive, reliable, informative, easy to read, and impactful for patients. Notably, ChatGPT was overall on par with Perplexity, differing significantly only in DISCERN scores. It is also worth noting that there were no significant differences in PEMAT actionability, which may be due to the inherently limited or less immediate call to action associated with anesthesiology-related content. However, although patients are not experiencing anesthetic effects at the time of preoperative research, they can still gain valuable insight into the potential effects of anesthesia and what to expect during the perioperative period. Additionally, actionability could be improved by including guidance on managing post-operative anesthetic effects and offering a clear action plan for patients who experience lingering symptoms.

A limitation of this study was that the queries were collected from March 8th, 2020 to March 8th, 2025, a period that included the global COVID-19 pandemic. During that time, there were cancellations of planned treatments, a decrease in medical services, higher rates of morbidity, and an overall change in access to non-COVID-related medical treatment [12]. This may have affected the queries relating to general, local, and regional anesthesia. Additionally, the study was constructed as a subjective analysis of chatbot responses. However, PEMAT and DISCERN have been validated [13,14] and shown to provide accurate results with regard to the understandability and quality of excerpts. Furthermore, this subjectivity was mitigated by blinding the reviewers and by prior training on the principles of each grading system to maintain consistency in scoring. Moreover, an objective analysis through the Flesch-Kincaid score was implemented to provide further support for any conclusions made.

The application of AI in anesthesiology is continually expanding, with new areas requiring further investigation. Future studies could focus on anesthetic pathologies in patients and evaluate how well AI can identify and appropriately respond to pathologies when symptoms are input. Additionally, the clinical use of AI in supporting informed consent could also be evaluated, as the mean DISCERN scores were relatively high across the chatbots. It would be interesting to see how well these scores are reflected in practice when compared to the informed consent provided by the anesthesiologist.

This study has shown that AI has the potential to improve and revolutionize patient education in anesthesiology. Perplexity, specifically, has shown promising results through its PEMAT actionability, PEMAT understandability, DISCERN, and Flesch-Kincaid scores. Though this study is one of the first to evaluate how chatbots can process queries on anesthesiology, the integration of artificial intelligence into anesthetic practice is still developing. Furthermore, ongoing advancements suggest AI will increasingly serve as a primary source of scientific information to support patient education and understanding. Reviewing how this information is disseminated is crucial, as it allows us to gauge whether and how AI chatbots can be beneficial for pre-operative patient use and recommendation.

Acknowledgements

There are no acknowledgements to be made at this time.

Author Contributions Statement

All authors have equally contributed to this manuscript.

References

  1. Morandín-Ahuerma F (2022) What is artificial intelligence? International Journal of Research Publication and Reviews 3: 1947-1951.
  2. Lorencin I, Tankovic N, Etinger D (2025) Optimizing healthcare efficiency with local large language models. AHFE International.
  3. Zou S, He J (2023) Large language models in healthcare: A review. 2023 7th International Symposium on Computer Science and Intelligent Control (ISCSIC), 141-145.
  4. Wen Z, Huang H (2023) The potential for artificial intelligence in healthcare. Journal of Commercial Biotechnology 27.
  5. Bogon A, Górska M, Ostojska M, Kaluza I, Dziuba G, et al. (2024) Artificial intelligence in anesthesiology - A review. J Pre Clin Clin Res 18: 265-269.
  6. Hashimoto DA, Witkowski E, Gao L, Meireles O, Rosman G (2020) Artificial intelligence in anesthesiology: Current Techniques, Clinical Applications, and Limitations. Anesthesiology 132: 379-394.
  7. Drummond Júnior DG, Guimarães ACCM, Bezerra Neto PD, de Castro CT, Santos IC (2023) Advantages of using general anesthesia. III SEVEN International Multidisciplinary Congress.
  8. Malamed S, Reed K, Okundaye AJ, Fonner AM (2017) Local and regional anesthesia in dental and oral surgery. In Clinical Techniques in Veterinary Dentistry 341-358.
  9. Berg KB, Kiley S, Buchanan PJ, Robicsek SA (2021) Regional Anesthesia: Neuraxial techniques for major vascular surgery. Vascular anesthesia procedures 187-208.
  10. Mirza F, Brown AR (2011) Ultrasound-guided regional anesthesia for procedures of the upper extremity. Anesthesiology Research and Practice.
  11. First Page Sage (2025) Top Generative AI Chatbots by Market Share.
  12. Rovetta A (2021) Reliability of Google trends: Analysis of the limits and potential of web infoveillance during COVID-19 pandemic and for future research. Front Res Metr Anal 6: 670226.
  13. Shoemaker SJ, Wolf MS, Brach C (2014) Development of the Patient Education Materials Assessment Tool (PEMAT): A new measure of understandability and actionability for print and audiovisual patient information. Patient Educ Couns 96: 395-403.
  14. Charnock D, Shepperd S (2004) Learning to DISCERN online: Applying an appraisal tool to health websites in a workshop setting. Health Educ Res 19: 440-446.
