There has been an enormous amount written about the use of large language models (LLMs) in healthcare over the last few months, particularly since the release of GPT-4 (1). I thought I would note down some of my reflections on their use within clinical and clinical-administrative pathways, based on some recent work.

What are LLMs good for?

The majority of LLMs have been trained to have general cognitive skills, with the goal of helping a user accomplish many different tasks (2). They are designed to take a natural language query (how we normally speak or write) and come up with an answer based on the model’s interpretation of that query. Privately restricted data, such as that held in an electronic health record, is not included in their training (2, 3). This broad training means they are good in areas where there is extensive use of language and many examples of the expected response on the internet. For example, I have seen use cases around providing patients with information about their treatment, writing discharge summaries, medical note taking, answering exam questions, simplification of radiology reports, extraction of drug names or other specific details from a note, and triage of patient concerns (2, 3, 4, 5). In the example of medical note taking, an LLM can be prompted to provide its response in a selected format, for example SOAP, and even include appropriate (billing) codes (2).
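
To make this concrete, here is a rough sketch of the kind of prompting I mean, assuming the OpenAI Python client and an invented consultation transcript; it is illustrative only, not production code, and certainly not something to run against real patient data.

```python
# A minimal sketch of prompting an LLM to structure a consultation note as SOAP.
# Assumes the OpenAI Python client (openai>=1.0) and an API key in the environment;
# the transcript below is an invented placeholder, not real patient data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = (
    "Patient reports three days of productive cough and mild fever. "
    "On examination: temperature 37.9C, chest clear. "
    "Plan discussed: rest, fluids, review if symptoms worsen."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a clinical documentation assistant. "
                "Rewrite the consultation transcript as a SOAP note "
                "(Subjective, Objective, Assessment, Plan) and suggest "
                "possible billing codes, clearly marked as suggestions "
                "for clinician review."
            ),
        },
        {"role": "user", "content": transcript},
    ],
)

print(response.choices[0].message.content)
```

Even with a sketch like this, the drafted note and any suggested codes would still go to the clinician for review before anything reaches the record.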

Where might they be less useful?

LLMs are less likely to be useful in categorisation-type tasks, for example selecting patients who should sit in the same group or cohort. In this case, the natural language used to describe the group is likely to vary more than a predetermined set of characteristics, which could be selected quickly by a human user. The assessment of a patient, whether for diagnosis or for selection of the best treatment option, may also be problematic. In these cases we would be relying on assumptions the model may be making which may not be accurate, and it is likely that general-purpose LLMs do not have enough of the specific training material to make these sorts of assessments (3, 6).
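
To make the contrast concrete, this is the sort of deterministic cohort selection I have in mind, sketched with pandas over an invented table of structured fields; the column names and thresholds are purely illustrative.

```python
# A minimal sketch of rule-based cohort selection over structured EHR-style data.
# The table, column names and criteria are invented for illustration only.
import pandas as pd

patients = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "age": [67, 45, 72, 58],
        "hba1c": [8.2, 6.1, 9.0, 7.4],
        "on_metformin": [True, False, True, True],
    }
)

# A predetermined set of characteristics selects the cohort deterministically:
# the same criteria return the same cohort every time, with no interpretation step.
cohort = patients[
    (patients["age"] >= 65)
    & (patients["hba1c"] > 7.5)
    & (patients["on_metformin"])
]

print(cohort["patient_id"].tolist())  # [1, 3]
```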

Support to clinical pathways vs replacing clinicians?

Another area of public interest when it comes to AI in general is the concept of replacement of the human equivalent. My belief is that, in the short to medium term, it is much more likely that AI models, including LLMs, will be used to support clinicians and other healthcare professionals rather than to replace them. This is really important when we consider the current climate of workforce shortages. Any consideration of implementing a workflow involving an LLM needs to be assessed from the perspective of how a user will interact with it, how it will benefit the user, and how it will impact the workflow and overall workload.

In completely administrative pathways I think the journey to full AI automation is shorter; however, these still require robust and ongoing assessment of utility and safety so as not to have an adverse impact on patient outcomes.

The technique of fine-tuning a model can be deployed to provide more specific training, allowing the model to build on its general language understanding whilst adapting and becoming more specialised for the requirements of a specific task (3). This ability to adapt will make general-purpose LLMs more suited to a specific healthcare task, but the requirement for a human checkpoint, or validation of the output, remains essential.
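
For those curious what this looks like in practice, below is a heavily simplified sketch of fine-tuning a small, openly available language model on domain-style text using the Hugging Face libraries; the model choice, example texts and settings are assumptions made purely for illustration, and any real clinical use would need governed, de-identified data and proper evaluation.

```python
# A heavily simplified sketch of fine-tuning a small causal language model on
# domain-specific text with Hugging Face transformers. The model, example texts
# and hyperparameters are illustrative assumptions, not a recommended setup.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilgpt2"  # small base model, chosen only to keep the sketch runnable
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Invented placeholder texts standing in for a governed, de-identified corpus.
texts = [
    "Discharge summary: admitted with community acquired pneumonia, treated with ...",
    "Clinic letter: reviewed in the diabetes clinic, HbA1c improving on current ...",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="clinical-tuned-model",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```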

Particular considerations for safe use

In assessing the safe use of an LLM in a healthcare context, we need to take account of the fallibilities inherent in the technology. One often-cited example is the performance of LLMs on medical school exams; however, does the fact that an LLM can score well in a structured exam mean it could replace a doctor? I would argue that this is absolutely not the case. The assumption that the performance of an LLM can be judged using human assessment approaches is incorrect, and does not allow for the potential for false assumptions (6). Current evaluations of LLMs in the healthcare context often don’t capture the benefits of collaboration between humans and the AI model (3).

Some emerging variables for assessment include the following (a rough sketch of how a reviewer might record these is shown after the list):

– The agreement with scientific or clinical consensus

– Comprehension, retrieval and reasoning capabilities

– Incorrect or missing content

– The extent and likelihood of harm

– Bias (5).
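
To make these axes a little more tangible, here is a minimal sketch of how a human reviewer’s assessment of a single LLM answer might be recorded; the field names and scales are my own working assumptions rather than a published instrument.

```python
# A minimal sketch of recording a human review of one LLM answer against the
# assessment axes above. Field names and scales are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class AnswerReview:
    answer_id: str
    agrees_with_consensus: bool          # agreement with scientific/clinical consensus
    comprehension_ok: bool               # did the model understand the question?
    reasoning_ok: bool                   # retrieval and reasoning judged sound?
    incorrect_content: list[str] = field(default_factory=list)
    missing_content: list[str] = field(default_factory=list)
    harm_likelihood: int = 0             # 0 = none .. 3 = likely severe harm (assumed scale)
    bias_noted: bool = False


review = AnswerReview(
    answer_id="triage-0042",
    agrees_with_consensus=True,
    comprehension_ok=True,
    reasoning_ok=False,
    missing_content=["no safety-netting advice included"],
    harm_likelihood=1,
)
print(review)
```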

I won’t discuss all of these here; however, one point worth making is that testing against clinical consensus could push us into the trap of ‘we have always done it this way’, as opposed to learning from what the data is telling us. That said, this is probably safer than placing over-reliance on a technology which is known to ‘hallucinate’. In addition, the assessment of harm raises complex ethical considerations when comparing a pathway using human clinicians to a new pathway reliant on an LLM: what happens if the existing pathway is discovered to contain output considered harmful when evaluated against clinical consensus?

In a validation approach I have been working on, which attempts to address the safety aspects of early LLM implementations across a range of clinical-administrative pathways, we have considered a range of factors which we believe should be assessed and validated against before use (a minimal sketch of how these might be recorded follows the list). These include:

  • Assessment of user behaviour, and any alterations in behaviour, when interacting with the workflow
  • The suitability of the LLM-supported workflow, for example the output’s ability to assist in improving clinical efficiency, whether it introduces bias or inconsistencies into clinical management, and whether its performance degrades over time
  • The availability of sufficient information to assess the output or performance
  • That the output can be assessed for hallucinations, incorrect matching, perceived realism and inconsistency
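
As a rough indication of how these factors might be captured for a given workflow before go-live, here is a minimal sketch of a validation record with a simple pass/fail gate; the structure, field names and thresholds are working assumptions rather than a finished framework.

```python
# A minimal sketch of a pre-deployment validation record for one LLM-supported
# workflow, mirroring the factors listed above. Field names are working assumptions.
from dataclasses import dataclass


@dataclass
class WorkflowValidation:
    workflow_name: str
    user_behaviour_reviewed: bool      # behaviour (and changes in behaviour) observed with users
    improves_efficiency: bool          # output assists clinical efficiency
    bias_or_inconsistency_found: bool  # bias or inconsistency in clinical management
    sufficient_info_to_assess: bool    # enough information available to judge the output
    hallucination_rate: float          # proportion of sampled outputs with hallucinations
    inconsistency_rate: float          # proportion of sampled outputs judged inconsistent

    def fit_for_use(self, max_error_rate: float = 0.02) -> bool:
        """Simple gate: every qualitative check passes and sampled error rates are low."""
        return (
            self.user_behaviour_reviewed
            and self.improves_efficiency
            and not self.bias_or_inconsistency_found
            and self.sufficient_info_to_assess
            and self.hallucination_rate <= max_error_rate
            and self.inconsistency_rate <= max_error_rate
        )


check = WorkflowValidation(
    workflow_name="clinic letter drafting",
    user_behaviour_reviewed=True,
    improves_efficiency=True,
    bias_or_inconsistency_found=False,
    sufficient_info_to_assess=True,
    hallucination_rate=0.01,
    inconsistency_rate=0.00,
)
print(check.fit_for_use())  # True under these invented figures
```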

Depending on the specific workflow, the ways in which these are assessed will vary, and as we develop the workflows and validation approaches I am sure this will iterate. It is also important to recognise that validation is not a one-off process at the implementation of an LLM workflow; regular review to ensure the model and its output are not drifting is essential.
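
On the drift point specifically, a periodic review might compute something as simple as the sketch below: compare an error metric from the current review window against the baseline measured at validation and flag any movement beyond a tolerance. The metric, figures and threshold are all invented for illustration.

```python
# A minimal sketch of a periodic drift check: compare the hallucination rate
# observed in the latest review window against a baseline window and flag any
# movement beyond a tolerance. Figures and threshold are invented for illustration.

def drift_flagged(baseline_rate: float, current_rate: float, tolerance: float = 0.01) -> bool:
    """Flag when the current error rate exceeds the baseline by more than the tolerance."""
    return (current_rate - baseline_rate) > tolerance


baseline_hallucination_rate = 0.010   # measured at go-live validation (invented)
current_hallucination_rate = 0.028    # measured at this quarter's review (invented)

if drift_flagged(baseline_hallucination_rate, current_hallucination_rate):
    print("Output quality has drifted beyond tolerance - trigger a full re-validation.")
else:
    print("No material drift detected this review cycle.")
```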

I am aware that this is a very swiftly evolving area, and I am not a technical expert in computer science. I would be really interested in the thoughts of others working in this space.

References:

1.     https://openai.com/gpt-4

2.     Benefits, Limits and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 2023; 388:1233-1239 DOI: 10.1056/NEJMsr2214184

3.     Creation and adoption of Large Language Models in Medicine. JAMA. 2023;330(9):866-869. doi:10.1001/jama.2023.14217

4.     ChatGPT: the future of discharge summaries? Lancet Digit Health 2023. https://doi.org/10.1016/S2589-7500(23)00021-3

5.     Large Language Models Encode Clinical Knowledge. https://arxiv.org/pdf/2212.13138.pdf

6.     Can the NHS Manage without AI? https://www.kingsfund.org.uk/blog/2023/09/can-nhs-manage-without-ai