
How Good is that Test? II

Bandolier 26 reported on quality standards that should be met by reports of diagnostic procedures, and how few of those standards were met by reports in our top medical journals. Users of tests will want to know not only that tests work, but how well they work - just like NNTs for treatments.

This issue of Bandolier investigates diagnostic test qualities a little further. The problem of spectrum bias means focusing on the sensitivity and specificity of tests. These, however, are not the most user-friendly of measures, so Bandolier, ever seeking simplicity, has invented a new measure - the NND, or number-needed-to-diagnose. Comments are invited.

Spectrum bias

An unrecognised (but probably very real) problem is that of spectrum bias [1]. This is the phenomenon of the sensitivity and/or specificity of a test varying with different populations tested - populations which might vary in sex ratios, age, or severity of disease as three simple examples.
Spectrum bias at its simplest means that the sensitivity and specificity of the test have to be known in a range of different patient populations.

This was tested in the paper by looking at men and women tested for urinary tract infections with urine dipsticks [1].

  • Overall the sensitivity was 0.83 (95%CI 0.73 - 0.91) and specificity 0.71 (0.66 - 0.77).
  • When the clinical prior probability of UTI being present was high the sensitivity of the test was high - 0.92 (0.82 - 0.98).
  • When clinical prior probability was low, the test performed less well - sensitivity 0.56 (0.31 - 0.79).

Actually, this is very good, showing that using the urine dipstick test where there were some clinical indications of UTI picked up the infection nearly every time. Note though, that this only addresses those patients with the disease - not those without it.

The authors examined a number of other tests, and found examples of spectrum bias with tumour markers (varying with severity of disease), exercise ECG for coronary ischaemia (varying with age, sex and severity) and various other physical tests.


The problem is handling tables of sensitivity and specificity - two sets of numbers that can go up or down independently in different populations. It is just too much for simple or busy brains. It is hard enough remembering just how sensitivity and specificity are defined. If the evidence is too complicated to be used, then we have a problem.


Is it possible to simplify these measures? Well, a whole raft of calculations can be done knowing the true and false positive and negative rates, none of which condenses the information down to a single useful figure. Using positive and negative predictive values (as one example) still means carrying too much baggage.

Given Bandolier's predilection for the number-needed-to-treat, we wondered whether it was possible to generate an analogous "number-needed-to-diagnose". The arguments go something like this (and forgive a little jargon):

For any chosen clinical endpoint the NNT is the reciprocal of the fractional improvement in a treated group minus the fractional improvement in an untreated group:

NNT = 1/(fraction improved with active - fraction improved with control)

For a diagnostic test the analogous calculation of a NND would be the reciprocal of the fraction of positive tests in the group with the disease minus the fraction of positive tests in the group without the disease.
The first term, the fraction of positive tests in the group with the disease, is the sensitivity (true positives divided by true positives plus false negatives).

Specificity is defined as the proportion of people without the disease who have a negative test. So the second term, the fraction of positive tests in the group without the disease, is 1 - specificity.
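These two definitions can be made concrete in a few lines of code. The sketch below (a minimal illustration; the function name and the counts are ours, not from the paper) computes sensitivity and specificity from the four cells of a test-versus-disease table:

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity and specificity from the cells of a 2x2 test-vs-disease table."""
    sensitivity = tp / (tp + fn)   # positive tests among those WITH the disease
    specificity = tn / (tn + fp)   # negative tests among those WITHOUT the disease
    return sensitivity, specificity

# Invented counts chosen to reproduce the overall urine dipstick figures:
# 83 true positives, 17 false negatives, 71 true negatives, 29 false positives
print(sens_spec(83, 17, 71, 29))  # → (0.83, 0.71)
```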


The number-needed-to-diagnose is therefore:

NND = 1/[Sensitivity - (1 - Specificity)]

How does this work in practice?

Take Helicobacter pylori infections as an example. Serology tests for the presence of anti-H pylori immunoglobulins and urea breath tests have sensitivities and specificities each of about 95%. So the NND calculation using fractions would be:

NND = 1/[0.95 - (1 - 0.95)] = 1/[0.9] = 1.1
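The calculation is simple enough to express as a one-line function; a minimal sketch (the function name is ours):

```python
def nnd(sensitivity: float, specificity: float) -> float:
    """Number-needed-to-diagnose: 1 / [sensitivity - (1 - specificity)]."""
    return 1.0 / (sensitivity - (1.0 - specificity))

# Serology or urea breath test for H pylori, both about 95% sensitive and specific:
print(round(nnd(0.95, 0.95), 1))  # → 1.1
```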

Using examples from the paper on spectrum bias gives a series of results with NND values up to about 4. Thus using CEA as a diagnostic screening test for colon cancer in patients with the disease would yield an NND of 4.3 in early cancers, but as low as 1.8 in late cancers - a clear case of spectrum bias. Similar differences exist for other examples.

The effect of NND calculations on the authors' own data on urine testing is interesting. Because sensitivity goes down but specificity goes up in patients with few symptoms of UTI, the NND of 2.9 is the same whether the clinical suspicion is high or low. Their best result was the overall NND of 1.8, because of a combination of relatively high sensitivity and relatively high specificity. Perhaps this emphasises the need to combine sensitivity and specificity in a single term.

Choosing which test

There are occasions where different tests can be used to make the same diagnosis. NNDs may help to choose between them when faced with an array of sensitivity and specificity figures.

NNDs calculated for diagnostic tests

Test                                  Subgroup                   Sensitivity  Specificity  NND
Urine dipstick for UTI                overall                    0.83         0.71         1.8
                                      high prior probability     0.92         0.42         2.9
                                      low prior probability      0.56         0.78         2.9
Serology for H pylori infection       all patients               0.95         0.95         1.1
CEA screening for colon cancer        Duke stage A or B          0.36         0.87         4.3
                                      Duke stage C or D          0.74         0.83         1.8
Exercise ECG for coronary ischaemia   men                        0.73         0.83         1.9
                                      women                      0.57         0.86         2.3
                                      age <40 years              0.56         0.84         2.5
                                      age >60 years              0.84         0.70         1.9
Biochemical tests of smoking status   breath carbon monoxide     0.98         0.92         1.12
                                      serum thiocyanate          0.82         0.91         1.37
                                      urine nicotine metabolite  0.98         0.94         1.09

The Table shows three tests of smoking status from a Northern Ireland study [2] measured against self-reporting. They are all good, but urine nicotine metabolite or breath carbon monoxide are much better than serum thiocyanate. Even small improvements are important if considering routine or screening use of such tests.
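The NND column of the Table can be regenerated from the sensitivity and specificity columns alone. A sketch (the values are transcribed from the Table above; small differences from the published NNDs most likely reflect rounding of the input figures):

```python
def nnd(sensitivity: float, specificity: float) -> float:
    """Number-needed-to-diagnose: 1 / [sensitivity - (1 - specificity)]."""
    return 1.0 / (sensitivity - (1.0 - specificity))

tests = [
    ("Urine dipstick, overall",     0.83, 0.71),
    ("Urine dipstick, high prior",  0.92, 0.42),
    ("Urine dipstick, low prior",   0.56, 0.78),
    ("H pylori serology",           0.95, 0.95),
    ("CEA, Duke stage A or B",      0.36, 0.87),
    ("CEA, Duke stage C or D",      0.74, 0.83),
    ("Breath carbon monoxide",      0.98, 0.92),
    ("Serum thiocyanate",           0.82, 0.91),
    ("Urine nicotine metabolite",   0.98, 0.94),
]
for name, sens, spec in tests:
    print(f"{name}: NND = {nnd(sens, spec):.1f}")
```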


  1. No implications should be drawn until these calculations have been verified. Bandolier would welcome comments from statisticians, and from those performing and using diagnostic tests, on whether these NND calculations are valid.
  2. Remember the confidence interval issue. It is not immediately clear how confidence intervals should be calculated for NND, and even a cursory glance at the calculations shows that NNDs would in some circumstances be quite sensitive to small changes in sensitivity or specificity.
  3. Interpretation of any test, and its quality, cannot be made without looking at what the consequences of a positive or negative test might be. Where the consequences are significant we need the best tests, but can use tests with higher NNDs where the consequences are minimal.
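On the confidence interval point above, one pragmatic approach - offered purely as a sketch for comment, not a recommendation - is a percentile bootstrap that resamples the diseased and non-diseased groups as independent binomials. The function name and the counts used in the example are invented for illustration:

```python
import random

def nnd(sensitivity: float, specificity: float) -> float:
    """Number-needed-to-diagnose: 1 / [sensitivity - (1 - specificity)]."""
    return 1.0 / (sensitivity - (1.0 - specificity))

def bootstrap_nnd_ci(tp, fn, tn, fp, n_boot=2000, seed=1):
    """Approximate 95% percentile bootstrap CI for the NND, resampling the
    diseased group (tp + fn) and the non-diseased group (tn + fp) separately."""
    random.seed(seed)
    n_dis, n_well = tp + fn, tn + fp
    sens, spec = tp / n_dis, tn / n_well
    draws = []
    for _ in range(n_boot):
        s = sum(random.random() < sens for _ in range(n_dis)) / n_dis
        c = sum(random.random() < spec for _ in range(n_well)) / n_well
        if s - (1 - c) > 0:   # NND is undefined when the test does no better than chance
            draws.append(nnd(s, c))
    draws.sort()
    return draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]

# Invented counts reproducing the overall urine dipstick figures (0.83 / 0.71):
lo, hi = bootstrap_nnd_ci(tp=83, fn=17, tn=71, fp=29)
print(f"NND 95% CI roughly {lo:.1f} to {hi:.1f}")
```

Even this crude simulation makes the second caution visible: modest sampling variation in sensitivity or specificity can move the NND appreciably.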


  1. MS Lachs, I Nachamkin, PH Edelstein et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Annals of Internal Medicine 1992 117: 135-40.
  2. GPR Archbold, ME Cupples, A McKnight, T Linton. Measurement of markers of tobacco smoking in patients with coronary heart disease. Annals of Clinical Biochemistry 1995 32: 201-7.
