Tools, Technologies and Training for Healthcare Laboratories

Is the future of laboratory AI Automating Inaccuracy?

ChatGPT, LLM, AI. Are these tools going to provide a leap forward for laboratories? Or simply automate inaccuracies?

AI: Accelerated Inaccuracy for the Laboratory?

Sten Westgard, MS
May 2023

Ever since the emergence of ChatGPT and OpenAI, Generative AI and Large Language Models (LLMs), there has been a mania – perhaps the better word is panic – over the implications for human society. If nothing else, the creative professionals whose careers will purportedly be liquidated by generative AI have one last subject to write about.

The question in this forum is less abstract. How will AI agents impact the laboratory? Will they be used to eliminate more of the workforce?
The short answer is, Maybe, yes. The longer answer is, Hell, yes.

But before we discuss implications further, let’s confront the definitional problem. What do we even mean by AI?

Artificial Intelligence, Machine Learning, LLM, Algorithms, Data and other Ambiguous Imaginations

If we start from the simplest idea of algorithms, we can agree that they are not new to the laboratory. Measurement involves algorithms. Westgard Rules are an algorithm. We’ve had algorithms around since labs began. Even if we build algorithms on steroids, or assemble a series of algorithms together in a system, that’s not a real leap ahead. Particularly if people are still the ones creating and judging and refining the algorithms. Indeed, that’s not what we’re talking about when we discuss LLM and AI.
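To make the point concrete, here is a minimal sketch of a few Westgard Rules written as an ordinary, human-authored algorithm. It is a simplification under stated assumptions: the function name `westgard_flags` is hypothetical, only three rules are shown (1_3s, 2_2s, R_4s), and they are applied to consecutive results of a single control material rather than across control levels and runs as a real QC procedure would.

```python
from typing import List


def westgard_flags(values: List[float], mean: float, sd: float) -> List[str]:
    """Return the names of the control rules violated by a series of QC results.

    Hypothetical helper for illustration only; a real implementation would
    also handle multiple control levels, across-run rules, and warnings.
    """
    # Express each control result as a z-score (distance from the mean in SDs).
    z = [(v - mean) / sd for v in values]
    pairs = list(zip(z, z[1:]))  # consecutive results
    flags = []

    # 1_3s: a single control result is more than 3 SD from the mean.
    if any(abs(x) > 3 for x in z):
        flags.append("1_3s")

    # 2_2s: two consecutive results exceed 2 SD on the same side of the mean.
    if any((a > 2 and b > 2) or (a < -2 and b < -2) for a, b in pairs):
        flags.append("2_2s")

    # R_4s (simplified): one result exceeds +2 SD and the next exceeds -2 SD,
    # or the reverse.
    if any((a > 2 and b < -2) or (a < -2 and b > 2) for a, b in pairs):
        flags.append("R_4s")

    return flags


if __name__ == "__main__":
    # Assumed control with a mean of 100 and an SD of 2.
    print(westgard_flags([100, 101, 99], mean=100, sd=2))
    print(westgard_flags([104.5, 105], mean=100, sd=2))
```

Every threshold here was chosen, and can be inspected and refined, by a person. That is the sense in which algorithms of this kind are nothing new.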

Machine learning, or expert tools, is a more muscular application of algorithms, where we start to make the computer program create the algorithms and determine whether they work. We build an initial algorithm to seek out signal in a data set, and the machine does the seeking. Theoretically, this approach will find solutions that humans cannot imagine, freed of our assumptions and biases. In practice, searches of this type often generate spurious correlations between the noise present in any data set and the desired outputs. These findings subsequently do not replicate in the real world. This, again, is not new. Indeed, there are many successful and unsuccessful applications of this approach. When humans are more involved in the process, helping to target the best ingredients for an algorithm, the subsequent utility of the output increases.
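The spurious-correlation problem is easy to reproduce. The sketch below is an illustration, not anyone's actual method: every variable is pure random noise, the sample sizes and the function names (`pearson`, `best_noise_feature`) are assumptions chosen for the example, and the "search" is simply picking whichever noise variable happens to correlate best with a noise target.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def best_noise_feature(n_samples=30, n_features=500, seed=0):
    """Search pure noise for the 'best' predictor, then try to replicate it."""
    rng = random.Random(seed)

    # A target and many candidate predictors, all unrelated random noise.
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    features = [[rng.gauss(0, 1) for _ in range(n_samples)]
                for _ in range(n_features)]

    # The "machine does the seeking": keep the feature with the largest |r|.
    best = max(range(n_features),
               key=lambda j: abs(pearson(features[j], target)))
    r_found = pearson(features[best], target)

    # Replication: new, independent draws standing in for the same feature
    # and the same target in a fresh data set.
    new_target = [rng.gauss(0, 1) for _ in range(n_samples)]
    new_feature = [rng.gauss(0, 1) for _ in range(n_samples)]
    r_replicated = pearson(new_feature, new_target)

    return r_found, r_replicated


if __name__ == "__main__":
    found, replicated = best_noise_feature()
    print(f"correlation found by search: {found:.2f}")
    print(f"correlation on fresh data:   {replicated:.2f}")
```

With enough candidates and few enough samples, the search will almost always "find" a respectable-looking correlation, even though by construction there is nothing there to find. The fresh data has no reason to reproduce it.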

When I am asked about AI, I say that embedding AI in an instrument or laboratory informatics system would be a tragedy. Imagine your own consciousness waking up inside an instrument, where all of your thoughts, emotions, and ambitions were confined to a few rote processes of fluid examination. You might not want to continue your existence if you were trapped inside a cube of metal. But my response is shaped by a broad definition of artificial intelligence. That's not the AI people are talking about today. For me, real AI is not a clever calculator; it denotes the possession of some form of true sentience. This is what scientists were describing when they voiced their hopes for artificial intelligence decades ago: they envisioned teaching general principles to a computer program, or feeding it the rules of grammar, or other building blocks of language and logic, so that the program would essentially be born, learn the fundamentals, and graduate with an ability to think for itself. These were lofty ideals that met with failure for years.

That early approach to AI has been abandoned. We’re no longer trying to develop machines that can think, but instead machines that can adequately simulate the output of “thinking”. That is, the appearance of intelligence. And what “thinking” means now is simply a probabilistic approach, using – one might more accurately say exploiting – the massive data of the Internet to generate an output that will appear human. As Ted Chiang has put it, ChatGPT is just a blurry jpeg of the web. ChatGPT consumes the massive set of data already available on the Internet, and then it creates a response to any particular question that is similar to answers it’s already digested. Using probabilities, it generates an answer. ChatGPT does not know the answer like we know a fact, it simply regurgitates what it has calculated as the most likely string of words.
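A toy version of "the most likely string of words" can be sketched in a few lines. This is emphatically not how ChatGPT is built (it uses a large neural network, not a word-count table); the tiny corpus and the function names `train_bigrams` and `generate` are assumptions made up for illustration. The sketch only shows the principle of answering from frequency rather than from knowledge.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length=5):
    """Extend `start` by repeatedly appending the most frequent next word."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the model has never seen anything follow this word
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


if __name__ == "__main__":
    # Made-up "training data": two statements say in range, one says out.
    corpus = (
        "the control is in range "
        "the control is in range "
        "the control is out of range"
    )
    model = train_bigrams(corpus)
    print(generate(model, "the", length=5))
```

The toy model reports that the control is in range because that is the more common sentence in what it has read, not because it has looked at any control result.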

A series of books, published long before the latest wave of AI anxiety, pointed out that while we can build buildings that rise higher and higher into the sky, that doesn’t mean we can fly. Yes, we can be in a narrowly engineered space that places us at an altitude far greater than sea level. But that is not the equivalent of wings.

But to the naïve observer below, if you’re standing on top of a tall building, do they think that you’re flying?

The Turing Test has been Passed – AI Achieves Imitation

One of the fathers of computing, Alan Turing, once posed a theoretical game about artificial intelligence. He called it the Imitation Game, though everyone else has since referred to it as the Turing Test:

“Suppose that we have a person, a machine, and an interrogator. The interrogator is in a room separated from the other person and the machine. The object of the game is for the interrogator to determine which of the other two is the person, and which is the machine. The interrogator knows the other person and the machine by the labels ‘X’ and ‘Y’—but, at least at the beginning of the game, does not know which of the other person and the machine is ‘X’—and at the end of the game says either ‘X is the person and Y is the machine’ or ‘X is the machine and Y is the person’. The interrogator is allowed to put questions to the person and the machine of the following kind: “Will X please tell me whether X plays chess?” Whichever of the machine and the other person is X must answer questions that are addressed to X. The object of the machine is to try to cause the interrogator to mistakenly conclude that the machine is the other person; the object of the other person is to try to help the interrogator to correctly identify the machine.”

Today’s Turing Test is extremely focused on fooling the interrogator, and if lies help fool the human, or if the human doesn’t mind being lied to, or the human doesn’t know what’s true or what’s a lie, or if we’re all living in an era of post-truth, then the test is even easier to pass.

The real shock to the public about ChatGPT is neither the algorithm nor the data (which was consumed and refined long before we came to interact with the program), nor the accuracy of the responses (since that’s often missing), but the speed and lifelike nature of the responses. It convincingly appears to be real.

ChatGPT represents a triumph of verisimilitude. But that may not be the appropriate goal for medical care. We don’t want to deliver test interpretations that are convincingly like those of a medical professional. We want the correct interpretations.

When ChatGPT falters, it often fills silences with Astounding Inaccuracies

By now, even the most casual reader has probably heard of the failures of the various Chatbots.
Microsoft’s 2016 Tay started spewing racist bigotry within hours of going live, and was shut down hours later. In 2018, one of Amazon’s internal resume winnowing tools was reported to systematically downgrade female candidates. Meta’s AI chatbot, Galactica, created “Wiki entries that promoted antisemitism and extolled suicide, and fake scientific articles, including one that championed the benefits of eating crushed glass.” Bing’s AI chatbot asked a reporter to divorce his wife. In 2023, Bing’s AI chatbot insisted the year was 2022, accused a user of being delusional, questioned its own existence, and alleged that Microsoft monitors its own staff through their laptop cameras.

Many of these problems are created by the data that the Chatbots feed upon. Since large swaths of the web are (let's admit it) cesspools, the Chatbot has learned to reflect and simulate the despicable sexism, racism, bigotry and vitriol that is present online. The problem is so dire that OpenAI subcontracted dozens of workers in Kenya, paid less than $2 an hour, to identify and eliminate the bad content from the training data set and from the answers that ChatGPT was generating. [Even the most sophisticated AI, it seems, can still be built on the back of exploited workers. For more on that, see Kate Crawford's Atlas of AI]

This raises another concern: if the ingredients are wrong, how can AI cook up the right answer? As Virginia Eubanks argues with damning clarity in Automating Inequality, any data set that has racism or sexism present in it will generate an AI response that is also racist, sexist, etc. In the laboratory context, if we build off of a data set where clinicians incorrectly use test results, and we fail to identify and remove that from the data, how can we expect the resulting AI to avoid those mistakes? Particularly if minority patients have historically received worse care, will the resulting AI simply assume that worse care is the appropriate response for them? The worst case is that we lock in the mistakes of the past and clothe them in the shiny wrapping of the future.

This danger is also not new. In 1958, Italo Calvino wrote a story called "Numbers in the Dark", where Paolino, a boy whose mother cleans the offices of a large company, stumbles into a room full of machines, where "a man in a white doctor's coat who was operating the machines stopped to talk with Paolino. 'There'll come the day when machines like this do all the work,' he said, 'with no need for anybody, not even me.'" But the boy explores further, wandering until he encounters the Accountant, who shows him the oldest ledgers of the company. "'...[S]ee here, 16 November 1884' - and the accountant turns the pages of the ledger to open it where a dried up goose-feather has been left as a bookmark - 'yes, here a mistake, a stupid mistake of four hundred and ten lire in an addition....Over all these years, you know what the mistake of four hundred and ten lire has become? Billions! Billions!....The mistake is right at the core, beneath all their numbers, and it's growing bigger and bigger and bigger!'" A literal, literary exemplification of automated inequality.

AI is evolving rapidly, however, so some of the problems seen recently may already be fixed. But even if the flaws persist, the headlong rush to market has meant few AI competitors have built guardrails into their programs. If building safeguards slows your entry to the market, there's no incentive for it. OpenAI’s chief executive is so concerned that, in testimony before Congress, he directly asked the government to impose regulations. A group of AI researchers have argued for a “pause” in research. Google’s “godfather of AI” resigned his position because he was so worried about the present danger of AI. Everyone seems to agree there's a potential threat from AI, but no one has a solution.

Given all this, AI does not look like an attractive addition to the laboratory environment, does it? But the lure of AI is immense, and we can't underestimate the attraction of significant riches to those who first capture the market.

Are there Awful Implications of an AI-dominated laboratory future?

A trio of articles have recently been published, two by the heavyweights of Italy, Lippi and Plebani, and one by the IFCC committee tasked with AI. In two of the articles, the immediate crisis discussed in detail is whether or not AI can author scientific literature.

The immediate response of scientific journals has been to forbid AI from being an author, “because attribution of authorship carries with it accountability for the work, and AI tools cannot take such responsibility.” [] But I think the greater hurdle for AI is that novel science is built on a narrow slice of data, and LLMs are not very good when they don’t have a huge body of work from which to draw valid, useful conclusions. A Chatbot is better at giving you an answer on a topic where much (preferably all) is known. Asking it to speculate or extrapolate about a small piece of specialized information is fraught. A more proximate challenge is the fact that as AI improves, editors and journals will find it harder to determine whether or not a submitted work has been drafted, edited, or wholly generated by an AI. The pressure to publish may lead human authors to embrace the risk, happy to be the window dressing on an AI-generated paper. If the ability to generate text is now essentially infinite, how will an editor be able to check the genuine originality and humanity of a publication? Given the strange dialect in which many scientific articles are written, can you tell the difference between the sterile, stilted prose of a human and the inhuman exposition of an AI? (In other words, expect the stories of “journals fooled by AI” to increase shortly…)

Right now, there are some hard stops on the implementation of AI. In Europe, the device regulations (EU MDR and EU IVDR) mandate that any software used in clinical interpretation must undergo structured evaluation. The FDA structures in the US also insist that AI applications that impact patients go through formal approval; but already hundreds of applications have been approved by the FDA. Remember, the FDA bar is often only, “Are you as bad as something already on the market?” So an AI that’s capable of making the same mistakes as a human, or the same level of errors as an existing piece of informatics, is not prohibited from going to market. As long as it’s perceived as equivalent to what’s already out there, or hopefully, an improvement, AI can enter the diagnostic space.

Of course, this is not the first major change to impact laboratories. As anyone at today’s bench level knows, automation has changed the laboratory in major ways, but it did not eliminate people. Complete automation is hard. A lot can be done to process specimens, but a fair number of humans are still needed to “feed” the line, and the instruments and the informatics. Unfortunately, becoming a servant to the machinery is neither an inspiring nor a lucrative career. The future of the laboratory, as many have said before, lies outside the four walls, with more interaction with clinicians and patients, what some call diagnosis management.

Frightening, then, to consider that AI may be poised to devour that very space. Where communication needs to be made to transform stark numbers into complex clinical recommendations, could AI replace the medical laboratory professional? If the data set is too narrow, it might be exceedingly difficult. But if the administration is not picky, if lower accuracy is accepted, the temptation to replace expensive people with cheap AI may be irresistible.

Dr. Plebani, discussing this possibility, asserts, “It is increasingly evident that laboratory data provide information of fundamental importance only if integrated in the context of appropriate request and interpreted on the basis of other clinical information, including the pre-test probability. It appears absurd to expect an AI tool to make a better interpretation of laboratory results than that made by a human, hopefully a well-trained physician, if the data provided are limited.”

Can an AI know the reference ranges of a test? The previous test results of the patient? The pre-analytical errors that may have occurred with the specimen? The competency status of the staff that performed the tests? The QC-status of the method that generated the result? The traceability of the method and the commutability of the controls? The acceptability of reagent lots and calibrators? The differences that might exist between instruments, methods, and laboratories? The most recent EQA/PT results? The most important and most recent publications governing the use and interpretation of the test?

Much of this can be known, and may already be known and stored within the laboratory's informatics; the challenge is only getting a tool like ChatGPT to “learn” all this data. For novel testing, where there is less known about a method and its interpretation, that may be a bridge too far for AI. Humans might still be needed in those areas where interpretation is gray and complex, at least until such time as enough data exists for the AI to learn. If key data is left out, such as the quality of the method, the AI will have errors and blind spots baked into its recommendations. Whether those errors are acceptable is up to us.

Establishing standards of performance for AI in laboratory medicine is a first step, one that laboratory professionals are well-suited for. If AI cannot author a scientific paper, because it cannot be held accountable, what does that say about interpretation and diagnosis? Should we demand that a human always serve as the confirmatory or secondary opinion to anything generated by AI?

The laboratory experience with automation may be instructive. We may soon find ourselves playing a duet with AI, our part growing smaller and smaller over time, until we rest, yielding to an infinite solo.