Research findings: Nearly half of the medical advice given by AI has issues, Grok is the worst, and OpenAI is still expanding its medical ambitions.

According to the latest study published in BMJ Open, when five major AI chatbots answer medical questions, about 50% of the answers contain problems, with nearly 20% rated as “highly problematic.” Bloomberg noted that this study reveals systemic risks in AI medical applications—especially in a particularly ironic moment as OpenAI and Anthropic simultaneously expand their healthcare footprints.
(Background: Don’t hand your medical records to chatbots? The privacy gamble behind ChatGPT Health’s medical ambitions)
(Additional background: University of California research on the “AI brain fog” phenomenon: 14% of office workers feel overwhelmed by agents and automation, and report 40% higher intent to quit)

Table of Contents


  • Grok performs the worst, and ChatGPT is not far behind
  • The more confidently AI speaks, the higher the risk
  • OpenAI and Anthropic: researchers apply the brakes, but business presses the accelerator
  • Trust AI, but only with conditions

More than 230 million people ask ChatGPT health and medical questions every week, but nearly half of the answers they get may be problematic. According to a study published this week in the medical journal BMJ Open, researchers from the United States, Canada, and the United Kingdom systematically evaluated five major platforms (ChatGPT, Gemini, Meta AI, Grok, and DeepSeek) by posing 10 questions spanning five medical categories to each platform and assessing the answers.

The results are not encouraging: about 50% of the responses were deemed problematic, with nearly 20% rated as “highly problematic.”

Grok performs the worst, and ChatGPT is not far behind

Bloomberg reports that performance differences among the platforms are substantial, but none managed to pass the test. By problem rate, Grok fared worst at 58%; ChatGPT followed closely at 52%; Meta AI stood at 50%.

Researchers observed that the chatbots performed relatively better on closed-ended questions and topics related to vaccines and cancer; however, their performance declined noticeably in areas involving open-ended questions and topics such as stem cells and nutrition. In addition, there were only two instances of refusal to answer in the study, and both came from Meta AI (somewhat ironically—knowing when not to answer has become a rare advantage).

More concerning, these AIs often deliver their answers with confidence: an affirmative tone, with no reservations. The researchers specifically emphasized that no chatbot, under any prompt, provided a complete and accurate list of references. Even when the AI appears “well-grounded,” the sources it cites are often unverifiable, or may not exist at all.

The more confidently AI speaks, the higher the risk

The researchers wrote in the paper that these systems can generate responses that “sound authoritative but may actually have flaws,” highlighting the “significant behavioral limitations” of AI chatbots in public-facing health and medical communication, as well as “the need to reassess deployment approaches.”

Bloomberg also quoted the research team’s warning: in the absence of public education and regulatory mechanisms, the biggest risk of large-scale deployment of chatbots is that it will facilitate the spread and diffusion of incorrect medical information.

This is not an isolated finding: a JAMA study indicates that AI’s failure rate in preliminary diagnosis cases exceeds 80%, and Oxford University issued a warning in February 2026 urging the public to take seriously the systemic risks of AI chatbots providing medical advice.

OpenAI and Anthropic: researchers apply the brakes, but business presses the accelerator

The timing of this study’s release is rather dramatic. Just a few months earlier, in January 2026, OpenAI rolled out ChatGPT Health with great fanfare. The feature lets users connect electronic medical records, wearable devices, and health apps, and OpenAI also launched a professional version of the tool for clinicians. OpenAI has publicly stated that 40 million people use ChatGPT to look up health information every day.

Almost simultaneously, Anthropic announced the launch of Claude for Healthcare, officially entering the healthcare market with HIPAA-compliant certification.

These platforms neither have medical licenses nor clinical judgment capabilities, yet they are expanding into healthcare at an astonishing pace. The tension between the direction of commercial expansion and the research findings reveals a regulatory vacuum: at present, there is no clear safeguard between marketing AI medical tools and actual medical safety.

Trust AI, but only with conditions

This is not the first time AI medical applications have been singled out, but each study’s conclusion keeps reminding us of the same thing: AI chatbots are fundamentally language models. What they excel at is “sounding correct,” not “ensuring correct answers.” The problem is that when users turn to them with genuine health anxieties, the appearance of correctness is often already enough to influence decisions.

As companies such as OpenAI and Anthropic continue to deepen their involvement in medical scenarios, the pace of regulation and public education is clearly not keeping up with the pace of technological expansion. Until clear guardrails are established, this study may serve as a reminder: AI can be a gateway to health information, but it should not be the endpoint.
