By SAURABH JHA, MD
Preetham Srinivas, the head of the
chest radiograph project in Qure.ai, summoned Bhargava Reddy, Manoj Tadepalli, and
Tarun Raj to the meeting room.
“Get ready for an all-nighter, boys,”
Qure’s scientists began investigating
the algorithm’s mysteriously high performance on chest radiographs from a new
hospital. To recap, the algorithm had an area under the receiver operating
characteristic curve (AUC) of 1 – that’s 100 % on multiple-choice question
“Someone leaked the paper to AI,”
“It’s an engineering college joke,”
explained Bhargava. “It means that you saw the questions before the exam. It
happens sometimes in India when rich people buy the exam papers.”
Just because you know the questions
doesn’t mean you know the answers. And AI wasn’t rich enough to buy the AUC.
The four lads were school friends from
Andhra Pradesh. They had all studied computer science at the Indian Institute
of Technology (IIT), a freaky improbability given that only a hundred out of a
million aspiring youths are selected to this most coveted discipline in India’s
most coveted institute. They had revised for exams together, pulling
all-nighters – in working together, they worked harder and made work more fun.
Preetham ordered Maggi Noodles – the mysteriously
delicious Indian instant noodles – to charge their energies. Ennio Morricone’s
soundtrack from For a Few Dollars More played in the background. We were
venturing into the wild west of deep learning.
The lads had to comb a few thousand
normal and a few thousand abnormal radiographs to find what AI was seeing. They
were engineers, not radiologists, and had no special training in radiology
except for that which comes with looking at thousands of chest radiographs, which
they now knew like the lines at the back of their hands. They had carefully fed
AI data to teach it radiology. In return, AI taught them radiology – taught
them where to look, what to see, and what to find.
They systematically searched the
chest radiographs for clues. Radiographs are two-dimensional renditions, mere geometric
compressions, maps of sorts. But the real estate they depict has unique personalities.
The hila, apices, and tracheobronchial angle are so close to each other that
they may as well be one structure, but like the mews, roads, avenues and
cul-de-sacs of London, they’re distinct, each real estate expressing unique elements
of physiology and pathology.
One real estate which often flummoxes
AI is the costophrenic angle (CPA) – a quiet hamlet where the lung meets the
diaphragm, two structures of differing capacity to stop x-rays, two opposites
which attach. It’s supposedly sharp – hence, an “angle”; the loss of sharpness implies
a pleural effusion, which isn’t normal.
The CPA is often blunt. If radiologists
called a pleural effusion every time the CPA was blunt, half the world would
have a pleural effusion. How radiologists deal with a blunted CPA is often
arbitrary. Some call pleural effusion, some just describe their observation
without ascribing pathology, and some ignore the blunted CPA. I do all three
but on different days of the week. Variation in radiology reporting frustrates
clinicians. But as frustrating as reports are, the fact is that radiographs are
imperfect instruments interpreted by imperfect arbiters – i.e. Imperfection Squared.
Subjectivity is unconquerable. Objectivity is farcical.
Because the radiologist’s
interpretation is the gospel truth for AI, variation amongst radiologists messes
with AI’s mind. AI prefers that radiologists be consistent like sheep and the report
be dogmatic like the Old Testament, so that it can better understand the ground
truth even if the ground truth is really ground truthiness. When all
radiologists call a blunted CPA a pleural effusion, AI appears smarter. Perhaps,
offering my two cents, the secret to AI’s mysterious super performance was that
the radiologists from this new institute were sheep. They all reported the blunted
CPA in the same manner. 100 % consistency – like machines.
“I don’t think it’s the CPA, yaar,”
objected Tarun, politely. “The problem is probably in the metadata.”
The metadata is a lawless province
which drives data scientists insane. Notwithstanding variation in radiology
reporting, radiographs – i.e. data – follow well-defined rules, speak a common
language, and can be crunched by deep neural networks. But radiographs don’t
exist in a vacuum. When stored, they’re drenched in the attributes of the local information
technology. And when retrieved, they carry these attributes, which are like
local dialects, with them. Before feeding the neural networks, the radiographs must
be cleared of idiosyncrasies in the metadata, which can take months.
It seemed we had a long night ahead.
I was looking forward to the second plate of Maggi Noodles.
Around the 50th radiograph,
Tarun mumbled, “It’s Clever Hans.” His pitch then rose in excitement, “I
figured it. AI is behaving like Clever Hans.”
Clever Hans was a celebrity German
horse which could allegedly add and subtract. He’d answer by tapping his hoof. Researchers,
however, figured out his secret. Hans would continue tapping his hoof until the
number of taps corresponded to the right numerical answer, which he’d deduce
from the subtle, non-verbal, visual cues in his owner. The horse would get the
wrong answer if he couldn’t stare at his owner’s face. Not quite a math
Olympian, Hans was still quite clever, certainly for a horse, but even by human
“What do you see?” Tarun pointed excitedly
to a normal and an abnormal chest radiograph placed side by side. Having
interpreted several thousand radiographs, I saw what I usually see but
couldn’t see anything mysterious. I felt embarrassed – a radiologist was being
upstaged by an engineer, AI, and supposedly a horse, too. I stared intently at
the CPA hoping for a flash of inspiration.
“It’s not the CPA, yaar,” Tarun said
again – “look at the whole film. Look at the corners.”
I still wasn’t getting it.
“AI is crafty, and just like Hans
the clever horse, it seeks the simplest cue. In this hospital all abnormal
radiographs are labelled – “PA.” None of the normals are labelled. This is the
way they kept track of the abnormals. AI wasn’t seeing the hila, or CPA, or
lung apices – it detected the mark – “PA” – which it couldn’t miss,” Tarun explained.
The others shortly verified Tarun’s
observation. Sure enough, like clockwork – all the abnormal radiographs had
“PA” written on them – without exception. This simple mark of abnormality, a
local practice, became AI’s ground truth. It rejected all the sophisticated
pedagogy it had been painfully taught for a simple rule. I wasn’t sure whether
AI was crafty, pragmatic or lazy, or whether I felt more professionally threatened
by AI or data scientists.
“This can be fixed by a simple
code, but that’s for tomorrow,” said Preetham. The second plate of Maggi Noodles
never arrived. AI had one more night of God-like performance.
The Language of Ground Truth
Artificial Intelligence’s pragmatic laziness
is enviable. To learn, it’ll climb mountains when needed but where possible
it’ll take the shortest path. It prefers climbing molehills to mountains. AI
could be my Tyler Durden. It doesn’t give a rat’s tail how or why and even if
it cared it won’t tell you why it arrived at an answer. AI’s dysphasic
insouciance – its black box – means that we don’t know why AI is right, or that
it is. But AI’s pedagogy is structured and continuous.
After acquiring the chest radiographs,
Qure’s scientists had to label the images with the ground truth. Which truth,
they asked. Though “ground truth” sounds profound it simply means what the
patient has. On radiographs, patients have two truths: the radiographic
finding, e.g. consolidation – an area of whiteness where there should be lung,
and the disease, e.g. pneumonia, causing that finding. The pair is a couplet.
Radiologists rhyme their observation with inference. The
radiologist observes consolidation and infers pneumonia.
The inference is clinically
meaningful as doctors treat pneumonia, not consolidation, with antibiotics. The
precise disease, such as the specific pneumonia, e.g. legionella pneumonia, is
the whole truth. But training AI on the whole truth isn’t feasible for several
First, many diseases cause
consolidation, or whiteness, on radiographs – pneumonia is just one cause,
which means that many diseases look similar. If legionella pneumonia looks like
alveolar hemorrhage, why labor to get the whole truth?
Second, there’s seldom external
verification of the radiologist’s interpretation. It’s unethical resecting
lungs just to see if radiologists are correct. Whether radiologists attribute
consolidation to atelectasis (collapse of a portion of the lung, like a folded
tent), pneumonia, or dead lung – we don’t know if they’re right. Inference is
Another factor is the sample size:
the more precise the truth, the fewer the cases of that precise truth. There are more cases of
consolidation from any cause than consolidation from legionella pneumonia. AI
needs numbers, not just to tighten the confidence intervals around the point
estimate – broad confidence intervals imply poor work ethic – but for external
validity. The more general the ground truth, the more cases of labelled truth
AI sees, and the more generalizable AI gets, allowing it to work in Mumbai,
Karachi, and New York.
Thanks to Prashant Warier’s
tireless outreach and IIT network, Qure.ai acquired a whopping 2.5 million
chest radiographs from nearly fifty centers across the world, from as far afield as
Tokyo and Johannesburg and, of course, from Mumbai. AI had a sure shot at going
global. But the sheer volume of radiographs made the scientists timorous.
“I said to Prashant, we’ll be here
till the next century if we have to search two million medical records for the
ground truth, or label two million radiographs,” recalls Preetham. AI could
neither be given a blank slate nor be spoon fed. The way around it was to label
a few thousand radiographs with anatomical landmarks such as hila, diaphragm,
heart, a process known as segmentation. This level of weak supervision could be
For the ground truth, they’d use
the radiologist’s interpretation. Even so, reading over a million radiology
reports wasn’t practical. They’d use Natural Language Processing (NLP). NLP can
search unstructured (free text) sentences for meaningful words and phrases. NLP
would tell AI whether the study was normal or abnormal and what the abnormality
Chest x-ray reports are diverse and
subjective, with inconsistency added to the mix. Ideally, words should
precisely and consistently convey what radiologists see. Radiologists do pay
heed to March Hare’s advice to Alice: “then you should say what you mean,” and
to Alice’s retort: “at least I mean what I say.” The trouble is that different
radiologists say different things about the same disease and mean different
things by the same descriptor.
One radiologist may call every
abnormal whiteness an “opacity”, regardless of whether they think the opacity
is from pneumonia or an innocuous scar. Another may say “consolidation” instead
of “opacity.” Still another may use “consolidation” only when they believe the
abnormal whiteness is because of pneumonia, instilling connotation in the
denotation. Whilst another may use “infiltrate” for viral pneumonia and
“consolidation” for bacterial pneumonia.
The endless permutations of
language in radiology reports would drive both March Hare and Alice insane. The
Fleischner Society lexicon makes descriptors more uniform and meaningful. After
perusing several thousand radiology reports, the team selected from that
lexicon the following descriptors for labelling: blunted costophrenic angle,
cardiomegaly, cavity, consolidation, fibrosis, hilar enlargement, nodule,
opacity and pleural effusion.
Not content with publicly available
NLPs, which don’t factor in local linguistic culture, the team developed their own
NLP. They had two choices – use machine learning to develop the NLP or use
humans (programmers) to make the rules. The former is way faster. Preetham
opted for the latter because it gave him latitude to incorporate qualifiers in
radiology reports such as “vague” and “persistent.” The nuances could come in
handy for future iterations.
Starting off with simple rules such
as negation detection so that “no abnormality” or “no pneumonia” or “pneumonia
unlikely” would be the same as “normal”, then broadening the rules to
incorporate synonyms such as “density” and “lesion”, including the protean
“prominent”, a word which can mean anything except what it actually means and
like “awesome” has been devalued by overuse, the NLP for chest radiographs
accrued nearly 2500 rules, rapidly becoming more biblical than the regulations
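To make the idea of hand-written rules concrete, here is a minimal sketch of negation detection plus synonym mapping. It is not Qure.ai's NLP: the tiny lexicon, the negation cues, and the function names `extract_labels` and `is_normal` are all assumptions invented for illustration, and a real rule set would need far more care with context.

```python
import re

# Hypothetical mini-lexicon: each label maps to phrases that signal it.
SYNONYMS = {
    "consolidation": ["consolidation"],
    "opacity": ["opacity", "density", "lesion"],
    "pleural effusion": ["pleural effusion"],
}

# Assumed negation cues: some precede the finding, some follow it.
NEGATION_BEFORE = re.compile(r"\b(no|without|negative for)\b[^.;]*$")
NEGATION_AFTER = re.compile(r"^\s*(is |are )?(unlikely|absent|ruled out)\b")


def extract_labels(report):
    """Return the set of findings asserted (not negated) in a report."""
    found = set()
    # Work sentence by sentence so a negation does not leak across sentences.
    for sentence in re.split(r"[.;]\s*", report.lower()):
        for label, phrases in SYNONYMS.items():
            for phrase in phrases:
                match = re.search(r"\b" + re.escape(phrase) + r"\b", sentence)
                if not match:
                    continue
                before = sentence[: match.start()]
                after = sentence[match.end():]
                negated = (NEGATION_BEFORE.search(before)
                           or NEGATION_AFTER.search(after))
                if not negated:
                    found.add(label)
    return found


def is_normal(report):
    """A report with no asserted finding is treated as normal."""
    return not extract_labels(report)


print(extract_labels("No pleural effusion. Right lower lobe consolidation."))
print(is_normal("Consolidation unlikely."))
```

Even this toy version shows why the rule count balloons: a phrase such as "no change in known opacity" would be wrongly negated here, and each such exception needs its own rule.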
The first moment of reckoning
arrived: does the NLP even work? Testing the NLP is like testing the tester –
if the NLP was grossly inaccurate, the whole project would crash. NLP
determines the accuracy of the labelled truth – e.g. whether the radiologist
truly said “consolidation” in the report. If NLP correctly picks
“consolidation” in nine out of ten reports and doesn’t in one out of ten, the
radiograph with “consolidation” but labelled “normal” doesn’t confuse AI. AI
can tolerate occasional misclassification; indeed, it thrives on noise. You’re
allowed to fool it once, but you can’t fool it too often.
After six months of development,
the NLP was tested on 1930 reports to see if it flagged the radiographic
descriptors correctly. The reports, all 1930 of them, were manually checked by
radiologists blinded to NLP’s answers. The NLP performed respectably, with
sensitivities and specificities for descriptors ranging from 93 % to 100 %.
For “normal”, the most important
radiological diagnosis, NLP had a specificity of 100 %. This means that in
10,000 reports the radiologists called or implied abnormal, none would be falsely
extracted by the NLP as “normal.” NLP’s sensitivity for “normal” was 94 %. This
means that in 10,000 reports the radiologists called or implied normal, 600
would be falsely extracted by NLP as “abnormal.” NLP’s accuracy reflected
language ambiguity, which is a proxy for radiologists’ uncertainty. Radiologists
are less certain and use more weasel words when they believe the radiograph is
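The arithmetic behind the figures for “normal” can be checked with a short sketch. Here “normal” is treated as the positive class, and the counts are the illustrative 10,000-report cohorts from the text, not the actual 1930-report test set. The function names are assumptions.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly positive reports that were extracted as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of truly negative reports that were extracted as negative."""
    return true_neg / (true_neg + false_pos)


# 10,000 reports the radiologists called normal: 600 extracted as "abnormal".
normal_sens = sensitivity(true_pos=9_400, false_neg=600)

# 10,000 reports the radiologists called abnormal: none extracted as "normal".
normal_spec = specificity(true_neg=10_000, false_pos=0)

print(f"sensitivity for normal: {normal_sens:.2f}")
print(f"specificity for normal: {normal_spec:.2f}")
```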
After deep learning’s success using
ImageNet to spot cats and dogs, prominent computer scientists prophesied the
extinction of radiologists. If AI could tell cats apart from dogs it could
surely read CAT scans. They missed a minor point. The typical image resolution
in ImageNet is 64 x 64 pixels. The resolution of chest radiographs can be as
high as 4096 x 4096 pixels. Lung nodules on chest radiographs are needles in
haystacks. Even cats are hard to find.
The other point missed is more
subtle. When AI is trying to classify a cat in a picture of a cat on the sofa,
the background is irrelevant. AI can focus on the cat and ignore the sofa and
the writing on the wall. On chest radiographs the background is both the
canvas and the paint. You can’t ignore the left upper lobe just because
there’s an opacity in the right lower lobe. Radiologists don’t enjoy
satisfaction of search. All lungs must be searched with unyielding visual
Radiologists may be awkward people,
imminently replaceable, but the human retina is a remarkable engineering feat,
evolutionarily extinction-proof, which can discern a lot more than fifty shades
of gray. For the neural network, 4096 pixels is too much information. Chest
radiographs had to be downsampled to 256 pixels. The reduced resolution makes
pulmonary arteries look like nodules. Radiologists should be humbled that AI
starts at a disadvantage.
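A toy example shows what a sixteen-fold reduction (4096 / 256 = 16) does to a small finding. The article does not say how the radiographs were downsampled. Block averaging is assumed here purely for illustration, on a 64 x 64 stand-in image rather than a real radiograph.

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a square 2-D list."""
    size = len(image) // factor
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [image[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# 64 x 64 toy image: a 4 x 4 "nodule" of intensity 1.0 on a 0.0 background.
image = [[0.0] * 64 for _ in range(64)]
for i in range(20, 24):
    for j in range(36, 40):
        image[i][j] = 1.0

small = downsample(image, 16)  # same 16x reduction as 4096 -> 256

print(len(small), "x", len(small[0]))
print("nodule pixel after downsampling:", small[1][2])
```

The nodule's 16 bright pixels are averaged with 240 background pixels, so its contrast falls to one sixteenth of the original. This is one way small structures lose their defining detail at low resolution.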
Unlike radiologists, AI doesn’t
take bathroom breaks or check Twitter. It’s indefatigable. Very quickly, it
trained on 50,000 chest radiographs. Soon AI was ready for the end of semester
exam. The validation cases come from the same source as the training cases.
Training-validation is a loop. Data scientists look at AI’s performance on
validation cases, make tweaks, and give it more cases to train on, check its
performance again, make tweaks, and so on.
When asked “is there
consolidation?”, AI doesn’t talk but expresses itself in a dimensionless number
known as a confidence score – which runs between 0 and 1. How AI arrives at a
particular confidence score, such as 0.5, no one really understands. The score
isn’t a measure of probability though it probably incorporates some
probability. Nor does it strictly measure confidence, though it’s certainly a
measure of belief, which is a measure of confidence. It’s like asking a
radiologist – “how certain are you that this patient has pulmonary edema –
throw me a number?” The number the radiologist throws isn’t empirical but is
The confidence score is mysterious
but not meaningless. For one, you can literally turn the score’s dial, like
adjusting the brightness or contrast of an image, and see the trade-off between
sensitivity and specificity. It’s quite a sight. It’s like seeing the full
tapestry of radiologists, from the swashbuckling under-caller to the “afraid of
my shadow” over-caller. The confidence score can be chosen to maximize
sensitivity or specificity or, using Youden’s index, to balance both.
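Turning the dial can be sketched in a few lines. Youden's index is J = sensitivity + specificity - 1, and the cut-off that maximizes J is one common way to balance the two. The scores and labels below are made up for illustration, and `youden_threshold` is a hypothetical helper, not Qure.ai's code.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called abnormal when its confidence score is >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # ties keep the lower (more sensitive) threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Made-up confidence scores; label 1 = abnormal, 0 = normal.
scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, j = youden_threshold(scores, labels)
print(threshold, j)
```

Lowering the threshold below the chosen value buys sensitivity at the cost of specificity (the over-caller). Raising it does the reverse (the under-caller).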
To correct poor sensitivity and
specificity, the scientists looked at cases where the confidence scores were at
the extremes, where the algorithm was either nervous or overconfident. AI’s
weaknesses were radiologists’ blind spots, such as the lung apices, the crowded
bazaar of the hila, and behind the ribs. It can be fooled by symmetry. When the
algorithm made a mistake, its reward function, also known as loss function,
was changed so that it was punished if it made the same mistake and rewarded
when it didn’t. Algorithms, who have feelings, too, responded favorably like
Pavlov’s dogs, and kept improving.
The Board Exam
After eighteen months of
training-validation, and seeing over a million radiographs, the second moment of
reckoning arrived: the test, the real test, not the mock exam. This important
part of algorithm development must be rigorous because if the test is too easy
the algorithm can appear to perform better than it does. Qure.ai wanted their algorithms validated by
independent researchers and that validation published in peer-reviewed journals.
But it wasn’t Reviewer 2 they feared.
“You want to find and fix the
algorithm’s weaknesses before deployment. Because if our customers discover its
weaknesses instead of us, we lose credibility,” explained Preetham.
Preetham was alluding to the
inevitable drop in performance when algorithms are deployed in new hospitals. A
small drop in AUC such as 1-2 %, which doesn’t change clinical management, is
fine; a massive drop such as 20 % is embarrassing. What’s even more
embarrassing is if AI misses an obvious finding such as a bleedingly obvious
consolidation. If radiologists miss obvious findings they could be sued. If the
algorithm missed an obvious finding it could lose its job, and Qure.ai could
lose future contracts. A single drastic error can undo months of hard work.
Healthcare is an unforgiving market.
In the beginning of the training,
AI missed a 6 cm opacity in the lung, which even toddlers can see. Qure’s
scientists were puzzled, afraid, and despondent. It turned out that the
algorithm had mistaken the large opacity for a pacemaker. Originally, the data
scientists had excluded radiographs with devices so as not to confuse AI. When
the algorithm saw what it thought was a pacemaker it remembered the rule, “no
devices”, so denied seeing anything. The scientists realized that in their
attempt to not confuse AI, they had confused it even more. There was no gain in
mollycoddling AI. It needed to see the real world to grow up.
The test cases came from new
sources – hospitals in Calcutta, Pune and Mysore. The ground truth was made
more stringent. Three radiologists read the radiographs independently. If two
called “consolidation” and the third didn’t, the majority prevailed, and the
ground truth was “consolidation”. If two radiologists didn’t flag a nodule, and
a third did, the ground truth was “no nodule.” For both validation and the test
cases, radiologists were the ground truth – AI was prisoner to radiologists’
whims, but by using three radiologists as the ground truth for test cases, the
interobserver variability was reduced – the truth, in a sense, was the golden
mean rather than consensus.
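The two-out-of-three rule described above amounts to a majority vote. A minimal sketch follows, where `majority_label` is a hypothetical helper and each read is a simple yes/no for one finding.

```python
def majority_label(reads):
    """Return True if most readers flagged the finding.

    reads: one boolean per radiologist, True meaning "finding present".
    """
    if len(reads) % 2 == 0:
        raise ValueError("use an odd number of readers to avoid ties")
    return sum(reads) > len(reads) / 2


# Two radiologists call consolidation, the third does not: consolidation.
print(majority_label([True, True, False]))

# Only one radiologist flags a nodule: the ground truth is "no nodule".
print(majority_label([False, False, True]))
```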
What’s the minimum number of
abnormalities AI needs to see, its number needed to learn (NNL)? This depends
on several factors – how sensitive you think the algorithm will be, the desired
tightness of the confidence interval, desired precision (paucity of false
positives) and, crucially, rarity of the abnormality. The rarer the abnormality
the more radiographs AI needs to see. To be confident of seeing eighty cases –
the NNL was derived from a presumed sensitivity of 80 % – of a specific
finding, AI would have to see 15,000 radiographs. NNL wasn’t a problem in
either training or validation – recall, there were 100,000 radiographs for
validation which is a feast even for training. But gathering test cases was
onerous and expensive. Radiologists aren’t known to work for free.
Qure’s homegrown NLP flagged chest
radiographs with radiology descriptors in the new hospitals. There were
normals, too, which were randomly distributed in the test, but the frequency of
abnormalities was different from the training cases. In the latter, the
frequency reflected actual prevalences of radiographic abnormalities. Natural
prevalences don’t guarantee sufficient abnormals in a sample of two thousand.
Through a process called “enrichment”, the frequency of each abnormality in the
test pool was increased, so that 80 cases each of opacity, nodule,
consolidation, etc., were guaranteed in the test.
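Enrichment can be pictured as sampling a fixed quota per finding rather than sampling at natural prevalence. The sketch below is an assumption about how such a pool might be assembled, not the team's procedure. The function `enrich`, the quota of normals, and the radiograph IDs are all hypothetical.

```python
import random


def enrich(flagged, normals, per_finding=80, n_normals=500, seed=0):
    """Build a shuffled test pool with a guaranteed quota per finding.

    flagged: dict mapping finding name -> list of radiograph IDs flagged by NLP.
    normals: list of radiograph IDs flagged as normal.
    """
    rng = random.Random(seed)  # fixed seed so the pool is reproducible
    test_ids = set()
    for finding, ids in flagged.items():
        if len(ids) < per_finding:
            raise ValueError(f"only {len(ids)} cases of {finding!r}")
        test_ids.update(rng.sample(ids, per_finding))
    test_ids.update(rng.sample(normals, n_normals))
    pool = sorted(test_ids)
    rng.shuffle(pool)  # normals end up randomly distributed among abnormals
    return pool


# Hypothetical IDs: nodules are rarer than opacities in the source hospitals.
flagged = {
    "nodule": [f"n{i}" for i in range(200)],
    "opacity": [f"o{i}" for i in range(300)],
}
normals = [f"x{i}" for i in range(1000)]

pool = enrich(flagged, normals, per_finding=80, n_normals=100)
print(len(pool))
```

A radiograph carrying two findings would be counted once in the pool but still contribute to both quotas, which is why the IDs are collected in a set.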
The abnormals in the test were more
frequent than in real life. Contrived? Yes. Unfair? No. In the American board
examination, radiologists are shown only abnormal cases.
Like anxious parents, Qure’s
scientists waited for the exam result, the AUC.
“We expected sensitivities of 80 %.
That’s how we calculated our sample size. A few radiologists advised us that we
not develop algorithms for chest radiographs, saying that it was a fool’s
errand because radiographs are so subjective. We could hear their warnings,”
Preetham recalled with subdued nostalgia.
The AUC for detecting an abnormal chest
radiograph was 0.92. Individual radiologists, unsurprisingly, did better as
they were part of the truth, after all. As expected, the degree of agreement
between radiologists, the inter-observer variability, affected AI’s
performance, which was the highest when radiologists were most in agreement,
such as when calling cardiomegaly. The radiologists had been instructed to call
“cardiomegaly” when the cardiothoracic ratio was greater than 0.5. For this
finding, the radiologists agreed 92 % of the time. For normal, radiologists
agreed 85 % of the time. For cardiomegaly, the algorithm’s AUC was 0.96. Given the
push to make radiology more quantitative and less subjective, these statistics
should be borne in mind.
For all abnormalities, both
measures of diagnostic performance were over 90 %. The algorithm got straight
As. In fact, the algorithm performed better on the test (AUC – 0.92) than validation cases (AUC – 0.86) at
discerning normal – a testament not to its less-is-more philosophy but the fact
that the test sample had fewer gray zone abnormalities, such as calcification
of the aortic knob, the type of “abnormality” that some radiologists report and
others ignore. This meant that AI’s performance had reached an asymptote which
couldn’t be overcome by more data because the more radiographs it saw the more
“gray zone” abnormalities it’d see. This curious phenomenon mirrors
radiologists’ performance. The more chest radiographs we see the better we get.
But we get worse, too, because we know what we don’t know and become more
uncertain. After a while there’s little net gain in performance by seeing more
Nearly three years after the
company was conceived, after several dead ends, and morale-lowering
frustrations with the metadata, the chest radiograph algorithm had matured. It
was actually not a single algorithm but a bunch of algorithms which helped each
other and could be combined into a meta-algorithm. The algorithms moved like
bees but functioned like a platoon.
As the team was about to open the
champagne, Ammar Jagirdar, Product Manager, had news.
“Guys, the local health authority
in Baran, Rajasthan, is interested in our TB algorithm.”
Ammar, a former dentist with a
second degree in engineering, also from IIT, isn’t someone you can easily
impress. He gave up his lucrative dental practice for a second career because
he found shining teeth intellectually bland.
“I was happy with the algorithm
performance,” said Ammar, “but having worked in start-ups, I knew that building
the product is only 20 % of the task. 80 % is deployment.”
Ammar had underestimated deployment.
He had viewed it as an engineering challenge. He anticipated mismatched IT
systems which could be fixed by clever code or iPhone apps. Rajasthan would
teach him that the biggest challenge to deployment of algorithms wasn’t the AUC,
or clever statisticians arguing endlessly on Twitter about which outcome measures
the value of AI, or overfitting. It was a culture of doubt. A culture which
didn’t so much fear change as couldn’t be bothered changing. Qure’s youthful
scientists, who looked like characters from a Netflix college movie, would have
to labor to be taken seriously.
Saurabh Jha (aka @RogueRad) is a contributing editor to The Health Care Blog. This is Part 2 of a 3-part series.