Clinical knowledge in LLMs does not translate to human interactions

arxiv.org

93 points by insistent 15 hours ago


See also https://venturebeat.com/ai/just-add-humans-oxford-medical-st...

Fripplebubby - 14 hours ago

Interesting quote from the VentureBeat article linked above:

> “There is also a reason why clinicians who deal with patients on the front line are trained to ask questions in a certain way and a certain repetitiveness,” Volkheimer goes on. Patients omit information because they don’t know what’s relevant, or at worst, lie because they’re embarrassed or ashamed.

For an LLM to really do this task the right way (comparably to a physician), it needs to not only use what the human gives it but also be effective at extracting the right information from them; the human might not know what is important, or they might be disinclined to share, and physicians can learn to overcome this. However, that isn't actually what happened in this study. The participants were trying to diagnose a made-up scenario in which the symptoms were clearly presented to them, and they had no incentive to lie or withhold embarrassing symptoms since none of it was actually happening to them. And yet the failure still appeared: the participants did not effectively communicate all the necessary information.

hackitup7 - 11 hours ago

This is just a random anecdote, but ChatGPT (when given many, many details with 100% honesty) has essentially matched what doctors told me in every case where I've tested it. This was across several non-serious situations (what's this rash?) and one quite serious situation, although the latter is a decently common condition.

The two times ChatGPT got a situation even somewhat wrong were:

- My kid had a rash and ChatGPT thought it was one thing. His symptoms changed slightly the next day, I typed in the new symptoms, and it got it immediately. We had to go to urgent care to get confirmation, but in hindsight ChatGPT had already solved it.

- In another situation my kid had a rash with somewhat random symptoms and the AI essentially said "I don't know what this is but it's not a big deal as far as the data shows." It disappeared the next day.

It has never gotten anything wrong other than these rashes, including issues related to ENT, ophthalmology, head trauma, skincare, and more. As far as I can tell, it is basically really good at matching symptoms to known conditions and then describing the standard of care (and variations).

I now use it as my frontline triage tool for assessing risk. Specifically, if ChatGPT says "see a doctor soon/ASAP" I do it; if it doesn't say to see a doctor, I use my own judgment, i.e. I won't skip a doctor trip if I'm nervous just because the AI said so. This is all 100% anecdote and I'm not disagreeing with the study, but I've been incredibly impressed by its ability to rapidly distill the medical standard of care.

zora_goron - 13 hours ago

This gap between medical board examinations and real-world practice mirrors my own experience, having finished med school and started residency a year ago.

I've heard others say that real clinical education only starts after medical school, once residency begins.

bryant - 14 hours ago

For anyone keen on dissecting this further, the authors uploaded enough to GitHub for people to dive into their approach in depth.

https://github.com/am-bean/HELPMed (also linked in the paper)

dosinga - 14 hours ago

If I read this correctly, what it really seems to say is that LLMs are pretty good at identifying underlying causes and recommending medical actions, but if you let humans use LLMs to self-diagnose, the whole thing falls apart.

twotwotwo - 13 hours ago

At work, one of the prompt nudges that didn't work was asking it to ask for clarifications or missing info rather than charging forward with a guess. "Sometimes do X" instructions generally don't do well when the trigger conditions are fuzzy (or complex but stated in few words, like "ask for missing info"). I can believe part of the miss here would be not asking the right questions--that seems to come up in some of their sample transcripts.

In general, at work, nudging them toward finding the information they need--first search for the library to be called, etc.--has been spotty. I think tool makers are putting effort into this from their end: newer versions of IDEs seem to do better than older ones, and model makers have added things like mid-reasoning tool use that could help. The raw Internet is not full of folks transparently walking through info-gathering or introspecting about what they know or don't, so it probably falls on post-training to explicitly focus on these kinds of capabilities.

I don't know what you really do about it. You can lean on instruction-following and give a lot of examples and descriptions of specific times to ask specific kinds of questions. You could use prompt distillation to try to turn that into better model tendencies. You could train on lots of transcripts (these days they'd probably include synthetic ones). You could do some kind of RL for skill at navigating situations where more info may be needed. You could treat "what info is needed and what behavior gets it?" as a type of problem to train on, like math problems.
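As a rough sketch of the instruction-following option: instead of a bare "ask for missing info", you can spell out a concrete checklist, an explicit "stop", and a worked example of when to ask. Everything below is hypothetical (the prompt text, the checklist, and the `chat` callable are stand-ins for whatever client you actually use); it illustrates the idea, it's not anything from the paper.

```python
# Minimal sketch: give the model explicit triggers and an example for when to
# ask a clarifying question instead of guessing. The `chat` callable is a
# hypothetical stand-in for whatever chat-completion client is in use.

TRIAGE_SYSTEM_PROMPT = """\
You are a medical triage assistant. Before suggesting any condition or action:

1. Check whether you know: symptom duration, severity, the patient's age,
   current medications, and any red-flag symptoms (chest pain, confusion,
   trouble breathing).
2. If any of these are missing AND would change your recommendation, ask ONE
   focused question and stop. Do not guess.
3. Only once the checklist is covered, give a short differential and a clear
   disposition (self-care / see a doctor soon / seek urgent care now).

Example:
  User: "I have a rash on my arm."
  Assistant: "How long have you had it, and is it spreading, painful, or
  accompanied by a fever?"
"""

def triage_turn(chat, history: list[dict], user_message: str) -> str:
    """Run one conversation turn with the clarify-first system prompt."""
    messages = [{"role": "system", "content": TRIAGE_SYSTEM_PROMPT}]
    messages += history
    messages.append({"role": "user", "content": user_message})
    return chat(messages)
```

The concrete example and the explicit "stop" are doing most of the work here; a vague "ask if unsure" is exactly the fuzzy-trigger failure mode described above. Whether it raises the question-asking rate enough is an empirical question.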

keeptrying - 11 hours ago

I've seen that LLMs hallucinate in very subtle ways when guiding you through a course of treatment.

Once, when I had to administer eyedrops to a parent and saw redness and was being conservative, it told me to stop the wrong drop. The doctor saw my parent the next day so it was all fixed, but it did lead to me freaking out.

Doctors behave very differently from how we normal humans behave. They go through testing that not many of us would be able to sit through, let alone pass. And they are taught a multitude of subjects that are so far removed from the subjects everyone else learns that we have no way to truly communicate with them.

And this massive chasm is the problem, not that the LLM is the wrong tool.

Thinking probabilistically (mainly Bayesian reasoning) and understanding the first two years of med school will help you use an LLM much more effectively for your health.
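As a purely illustrative example of that Bayesian framing (all numbers below are made up): a positive result from a reasonably good test for a rare condition still leaves the post-test probability low, and that base-rate effect is easy to misread in an LLM's output if you aren't thinking this way.

```python
# Illustrative Bayes' rule calculation with made-up numbers: a positive result
# from a decent test for a rare condition still leaves the probability low.

def posterior(prevalence: float, sensitivity: float, specificity: float) -> float:
    """P(condition | positive test) via Bayes' theorem."""
    p_pos_given_sick = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    p_pos = prevalence * p_pos_given_sick + (1.0 - prevalence) * p_pos_given_healthy
    return prevalence * p_pos_given_sick / p_pos

# A condition affecting 1 in 1,000 people, tested with 90% sensitivity and
# 95% specificity:
print(posterior(0.001, 0.90, 0.95))  # ~0.018, i.e. still under 2%
```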

callc - 9 hours ago

My immediate reaction is "absolutely not", unless the healthcare provider is willing to accept liability for the output and recommendations of their LLM. Are they willing to put their money where their mouth is? Or are they just trying to reduce costs and increase profit?

Then I think: if you don't have access to good healthcare, need to wait weeks or months to get anywhere, or healthcare is extremely expensive, then an LLM may be a good option, even with the chance of bad (possibly deadly) advice.

If there are any doctors here, would love to hear your opinion.

wongarsu - 14 hours ago

That's an interesting result. I would love to see a follow-up with three arms: humans with assistance from an LLM, humans with assistance from a doctor, and humans with no assistance.

This study tells us that LLM assistance is as good as no assistance, but any investigation of the cause feels tainted by the fact that we don't know how much a human would have helped.

If we believe the assertion that LLMs are on a similar level to doctors at finding the conditions on their own, does the issue lie in the description the humans give the LLM, the way the LLM talks to the human, or the way the human receives the LLM's suggestions? Looking at the chat transcripts, they seem to identify issues with all three, but there isn't really a baseline for what we would consider "good" performance.

pyman - 11 hours ago

Interesting paper. LLMs have the knowledge but lack the social skills; they fail when interacting with real patients. So maybe the real bottleneck isn't knowledge after all?

dhash - 14 hours ago

I love this kind of research, since it correctly identifies some issues with the way the public interacts with LLMs. Thank you for the evening reading!

I'd love to see future work investigating:

- how this compares to expert users (doctors / LLM magicians using LLMs to self-diagnose)

- LLMs often provide answers faster than doctors, and often with less hassle (what's your insurance?); to what extent does latency impact healthcare outcomes

- do study participants exhibit similar follow-on behavior (upcoding, seeking a second opinion, doctors) to others in the same professional discipline

ekianjo - 14 hours ago

> perform no better than the control group

This is still impressive. Does it mean it can replace humans in the loop with no loss?
