Hundreds of algorithms have been created to detect the coronavirus. None of them were accurate enough, yet some were still used in hospitals and may even have harmed patients. Experts note that the pandemic has once again drawn attention to problems that have long needed solving.

In March 2020, hospitals faced a crisis they barely understood. “The doctors just didn’t understand how to deal with these patients,” says Laure Wynants, an epidemiologist at Maastricht University in the Netherlands who studies predictive tools.

But they had data from China, which had been battling the pandemic for four months. If machine-learning algorithms could be trained on that data to help doctors understand what they were seeing and make decisions, it might save lives.

“I thought, ‘If ever AI can prove its worth, it will be now.’ I was full of expectations,” says Wynants.

But that didn’t happen, and not for lack of trying. Research teams around the world came to the aid of doctors. In particular, the AI community was quick to develop software that was believed to allow faster diagnosis and triage of patients. In theory, this would ease the burden on the front line.

As a result, many predictive tools were developed. None of them brought significant benefits, and some were even potentially harmful.

This is the conclusion reached by several studies published in the past few months. In June, the Alan Turing Institute, the UK’s national center for data science and AI, released a report summarizing workshop discussions held at the end of 2020. Participants agreed that if AI tools had any impact on the fight against the coronavirus, it was insignificant.

Not for clinical use

This echoes the findings of two of the largest studies to evaluate the hundreds of predictive tools developed over the past year. The lead author of one of them is Wynants. Her review in the British Medical Journal is still being updated as new tools are released and existing ones are tested. Together with colleagues, she studied 232 algorithms that diagnose patients or estimate how severe their disease will become.

They found that none of them are suitable for clinical use. Only two were highlighted as promising enough for future testing.

“It’s shocking,” says Wynants. “I was worried when I started this work, but it exceeded my fears.”

Wynants’ findings are backed by another major review, conducted by machine-learning researcher Derek Driggs of the University of Cambridge and colleagues and published in Nature Machine Intelligence. The group studied deep-learning models for diagnosing the coronavirus and predicting patient risk from medical images such as X-rays or chest CT scans.

A total of 415 tools were examined. Like Wynants and her colleagues, the researchers concluded that none of the algorithms is suitable for clinical use.

“This pandemic has proved to be a huge test for AI and medicine,” says Driggs. “It would have helped to bring the public onto our side. But I don’t think we passed that test.”

Both teams found that researchers repeated the same basic mistakes when training their tools. Incorrect assumptions about the data often meant that the trained models did not perform as advertised.

That said, Wynants and Driggs still believe that AI can help. However, the wrong algorithm can be harmful if it misses a diagnosis or underestimates the risk for vulnerable patients. “There is a lot of hype right now about machine-learning models and what they can do,” says Driggs.

Unrealistic expectations encourage the use of these tools before they are ready. Wynants and Driggs argue that some of the algorithms they studied are already in use in hospitals, and some are being sold by private developers. “I’m afraid they may have harmed patients,” says Wynants.

So what went wrong? And how can this gap be overcome? There is good news, too: thanks to the pandemic, many researchers realized that it was time to change how AI tools are built. “The pandemic has brought issues into focus that we’ve been putting off for some time,” says Wynants.

Something went wrong

Many of the problems identified are related to poor data quality. Information about coronavirus patients, including medical images, was collected and disseminated in the midst of the pandemic. This was often done by the attending physicians themselves. The researchers wanted to help quickly, and these were the only publicly available datasets. But this meant that many tools were trained on mislabeled data or information from unknown sources.

Driggs explains that these “Frankenstein datasets,” as he calls them, were stitched together from multiple sources and can contain duplicates. As a result, some tools ended up being tested on the same data they were trained on, which made them look more accurate than they actually are.
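One practical consequence is that merged data needs to be deduplicated, and split at the patient level rather than the image level, before any accuracy figures are trusted. The sketch below is a minimal illustration of that idea; it is not taken from any of the reviewed tools, and the file name metadata.csv and the columns filepath, patient_id, and label are hypothetical placeholders.

```python
# Minimal sketch: guard against train/test leakage when merging public
# covid image datasets. Exact-byte hashing only catches verbatim
# duplicates; re-encoded or cropped copies would need perceptual hashing.
import hashlib
from pathlib import Path

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def file_hash(path: Path) -> str:
    """Hash raw file bytes so identical images pulled in from
    overlapping source datasets collapse to the same key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical metadata file with columns: filepath, patient_id, label.
meta = pd.read_csv("metadata.csv")
meta["hash"] = meta["filepath"].map(lambda p: file_hash(Path(p)))

# 1) Drop exact duplicates introduced by merging overlapping sources.
before = len(meta)
meta = meta.drop_duplicates(subset="hash")
print(f"Removed {before - len(meta)} duplicate images")

# 2) Split by patient, not by image, so no patient's scans appear in
#    both the training and the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(meta, groups=meta["patient_id"]))
train, test = meta.iloc[train_idx], meta.iloc[test_idx]

assert set(train["patient_id"]).isdisjoint(set(test["patient_id"]))
```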

The origin of certain datasets also becomes obscured. As a result, researchers miss important quirks in the data that skew how their models learn.

  • For example, many teams used chest scans of children who did not have the coronavirus as their examples of what non-covid cases look like. As a result, the AI learned to recognize children, not COVID-19.
  • Driggs’ group trained its own model on a dataset that mixed scans taken while patients were lying down with scans taken while they were standing up. Because patients scanned lying down were more likely to be seriously ill, the AI learned to predict serious coronavirus risk from a person’s position rather than from the disease itself.
  • In other cases, algorithms picked up on the text font certain hospitals used to label their scans. As a result, fonts from hospitals with heavier caseloads became predictors of coronavirus risk. (A cheap way to check for shortcuts like these is sketched after this list.)
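Shortcuts like these can often be caught with inexpensive sanity checks long before a model reaches a clinic. The sketch below shows two such checks; it is an illustration under assumed column names (label as 0/1, model_score as the model’s predicted probability, position and source_hospital as acquisition metadata), not a procedure taken from the reviews discussed here.

```python
# Minimal sketch: two cheap sanity checks for shortcut learning on a
# held-out test set. All file and column names are hypothetical;
# "label" is 0/1 and "model_score" is the image model's predicted probability.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("test_set_with_metadata.csv")

# Check 1: can acquisition metadata alone "predict" the diagnosis?
# A high AUC means the label is entangled with details such as patient
# position or source hospital, which an image model can exploit.
# (Fitting and scoring on the same rows is only a rough signal.)
X_meta = pd.get_dummies(df[["position", "source_hospital"]])
clf = LogisticRegression(max_iter=1000).fit(X_meta, df["label"])
meta_auc = roc_auc_score(df["label"], clf.predict_proba(X_meta)[:, 1])
print(f"metadata-only AUC: {meta_auc:.2f}")

# Check 2: does the image model's performance hold up *within* each
# stratum of a suspected confounder? A large drop suggests the model
# was separating strata (lying vs. standing), not disease.
for position, group in df.groupby("position"):
    if group["label"].nunique() == 2:  # need both classes to compute AUC
        auc = roc_auc_score(group["label"], group["model_score"])
        print(f"AUC for position={position}: {auc:.2f}")
```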

From the outside, such errors seem obvious. They can be corrected by adjusting the models, if the researchers are aware of them. The flaws can be acknowledged and a less accurate, but less misleading, algorithm released. But many tools were built either by AI researchers who lacked the medical expertise to spot flaws in the data, or by medical researchers who lacked the mathematical skills to compensate for them.

A subtler issue that Driggs points out is incorporation bias, which creeps in at the point a dataset is labeled. For example, many scans were labeled according to the conclusion of the radiologist who read them. But that bakes that particular doctor’s biases into the dataset. Driggs argues it would be much better to label a scan with the result of a PCR test. However, busy hospitals do not always have time for statistical niceties.
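Where both kinds of label exist, it is at least possible to measure how far radiologist-derived labels drift from PCR results before training on them. A minimal sketch, assuming a hypothetical labels.csv with radiologist_label and pcr_result columns:

```python
# Minimal sketch: quantify disagreement between radiologist-derived
# labels and PCR results on the subset where both are available.
# File and column names are hypothetical.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix

df = pd.read_csv("labels.csv").dropna(subset=["radiologist_label", "pcr_result"])

print(confusion_matrix(df["pcr_result"], df["radiologist_label"]))
kappa = cohen_kappa_score(df["pcr_result"], df["radiologist_label"])
print(f"agreement (Cohen's kappa): {kappa:.2f}")

# If agreement is poor, a model trained on radiologist labels inherits
# that reader's judgement and its biases; prefer the PCR-based label
# wherever it exists.
df["label"] = df["pcr_result"]
```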

This did not prevent some of the tools from being rushed into clinical practice. Wynants says it is unclear which ones are being used, or how. Hospitals sometimes claim that the AI is used only for research purposes, which makes it difficult to estimate how much doctors rely on it.

Wynants asked one company that sells deep-learning algorithms to share information about its approach, but received no response. She later found several published models from researchers affiliated with the company, all with a high risk of bias. “We don’t really know what exactly the company was using,” she says.

Some hospitals are even signing nondisclosure agreements with providers of medical AI tools, Wynants says. When she asked doctors what algorithms or software they use, some replied that they were not allowed to say.

How to fix it

Better data would help solve the problem, but in times of crisis that is a lot to ask. It is more important to make the most of the data that already exists. Driggs notes that the simplest fix is for AI teams to collaborate with the clinicians treating patients. Researchers also need to share their models and disclose how they were trained so that others can test and build on them.

“These are two things we could do today,” he says. “And they will solve perhaps 50% of the problems that we identified.”

Acquiring data would also be easier if formats were standardized, says Bilal Mateen, a physician who leads the clinical technology team at the Wellcome Trust, an international health research charity headquartered in London.

Another problem that Wynants, Driggs, and Mateen all point to is that most researchers rushed to develop their own models rather than collaborating on or improving existing ones. As a result, hundreds of mediocre tools emerged around the world, instead of a handful of properly trained and tested ones.

“The models are so similar — almost all of them use the same methods with minor modifications and the same inputs — and they all make the same mistakes,” Wynants says.

She adds: “If all these people were testing models that were already available, we might already have something that would really help the medical profession.”

In a sense, this is a perennial research problem. Academic researchers have few career incentives to share their work or validate existing results. No one gets rewarded for traveling the “last mile” – the distance “from the lab bench to the patient’s bed,” Mateen says.

To address this problem, the WHO is considering an emergency data-sharing agreement that would take effect during international health crises. Mateen believes it would make it easier for researchers to move data across borders. Ahead of the G7 summit in the UK, scientists from participating countries also called for “data readiness” in preparation for future health emergencies.

Such initiatives sound a little vague, and calls for change always carry a whiff of wishful thinking. But Mateen has his own “naively optimistic” view of the situation. Before the pandemic, momentum for such initiatives had stalled because the task seemed too daunting.

“The coronavirus has put a lot back on the agenda. Until we accept that we need to solve unattractive problems before attractive ones, we are doomed to repeat the same mistakes,” says Mateen. “It will be unacceptable if that does not happen. Forgetting the lessons of this pandemic is disrespectful to those who have passed away.”