
Diagnostic testing emerging from the gloom?


Bandolier has long sought good evidence about diagnostic tests. We want to know how well a particular test or diagnostic algorithm works in a particular setting. We want to have confidence that we can reliably predict that a patient has a high chance of having or not having a disease. Our subsequent decisions about treating, not treating, or referring depend on the adequacy of our diagnosis.

The problem is that there is little evidence to be found at all, and little of that is good news. A succession of stories saying that tests are useless loses impact. Without empirical evidence of bias in study architecture, we are rudderless in the midst of a tidal surge.

Nil desperandum. Help is at hand. Two recent publications have begun to lay a little more foundation and to provide a sea-anchor in this turbulent area.

CARE essay

In Bandolier 66 we featured the CARE project (Clinical Assessment of the Reliability of the Examination), a collaborative study of the accuracy and precision of the clinical examination. The Internet address is .

The main plotters behind CARE, Finlay McAlister, Sharon Straus and David Sackett, have written a terrific essay on the need for large prospective studies of the clinical examination [1]. This is an important, perhaps seminal, paper. More than any other paper Bandolier has read, it explains why new research, indeed new thinking, is required. It's beautifully written and easy to follow, and is essential reading.

Their prime example is chronic obstructive airways disease (COAD). A systematic review sought physical signs for differentiating patients with COAD from those with normal pulmonary function. There were many, but no one sign was examined in more than a third of studies.

For each of the four most commonly used physical signs, the range of diagnostic accuracy reported in the literature was huge. Positive likelihood ratios spanned the range from about 1 to over 10: from useless to highly predictive.
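To make that range concrete, a positive likelihood ratio can be calculated directly from a sign's sensitivity and specificity. A minimal sketch in Python; the sensitivity and specificity figures below are purely illustrative, not taken from the review:

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity): how much more often the
    sign is found in those with the disease than in those without."""
    return sensitivity / (1.0 - specificity)

# Hypothetical figures only, chosen to bracket the range in the text.
# A sign present in 50% of patients and 40% of normals is near useless:
weak = positive_likelihood_ratio(0.50, 0.60)    # LR+ = 1.25
# A sign present in 80% of patients and only 7% of normals is highly predictive:
strong = positive_likelihood_ratio(0.80, 0.93)  # LR+ about 11.4
```

The same sign, studied in different populations with different methods, can land anywhere between these two extremes, which is exactly the problem the essay describes.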

They also examined the quantity and quality of evidence from systematic reviews for a variety of signs for different conditions. There were few high-quality studies, and those that existed were small.

The bottom line is that at best we have hand-me-down evidence, and experience. We have little or no objective proof of the quality of diagnostic accuracy of clinical examinations.

Levels of evidence

One commonly used description of levels of evidence is shown below. The keys to good quality are independence, masked comparison with a reference standard, and consecutive patients from an appropriate population. Lower quality comes from inappropriate populations and comparisons that are not masked or that use different reference standards. Other standards have been applied to diagnostic tests, as reported in Bandolier 26.

Levels of evidence for studies of diagnostic methods

Level 1: An independent, masked comparison with a reference standard among an appropriate population of consecutive patients.
Level 2: An independent, masked comparison with a reference standard among non-consecutive patients, or confined to a narrow population of study patients.
Level 3: An independent, masked comparison among an appropriate population of patients, but with the reference standard not applied to all study patients.
Level 4: Reference standard not applied independently or masked.
Level 5: Expert opinion with no explicit critical appraisal, based on physiology, bench research, or first principles.

Bias in diagnostic test studies

What we have lacked up to now is proof that poor study design is associated with bias. A new contribution from Holland [2] provides the missing link.

The authors searched for and found 26 systematic reviews of diagnostic tests with at least five included studies. Only 11 could be used in the analysis: the other 15 were either not systematic in their searching or did not report sensitivity or specificity. Data from the 11 remaining reviews were subjected to mathematical analysis to investigate whether the presence or absence of proposed markers of study quality made a difference to the apparent value of a test.

These reviews contained 218 studies, only 15 of which satisfied all eight quality criteria used in the analysis; thirty percent fulfilled at least six of the eight. Over-estimation of the effectiveness of a diagnostic test (positive bias) was shown when the lower bound of the 95% confidence interval for the relative diagnostic odds ratio was above 1.

Study characteristic: relative diagnostic odds ratio (95% CI), with description

Case-control design: 3.0 (2.0 to 4.5). A group of patients already known to have the disease compared with a separate group of normal patients.
Different reference tests: 2.2 (1.5 to 3.3). Different reference tests used for patients with and without the disease.
Not blinded: 1.3 (1.0 to 1.9). Interpretation of the test and the reference standard not blinded to each other's results.
No description of test: 1.7 (1.1 to 1.7). Test not properly described.
No description of population: 1.4 (1.1 to 1.7). Population under investigation not properly described.
No description of reference: 0.7 (0.6 to 0.9). Reference standard not properly described.

The relative diagnostic odds ratio indicates the diagnostic performance of a test in studies failing to satisfy the methodological criterion relative to its performance in studies with the corresponding feature.

The results are shown in the Table. Use of different reference tests, lack of blinding and lack of a description of either the test or the population in which it was studied led to positive bias. But the largest factor leading to positive bias was evaluating a test in a group of patients already known to have the disease and a separate group of normal patients - called a case-control study here.
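The quantities in the Table can be sketched in code. A diagnostic odds ratio (DOR) summarises a test's performance in a single number from a 2x2 table, and a relative DOR compares performance in flawed studies with performance in sound ones. The counts below are hypothetical, invented for illustration, and are not data from Lijmer et al.:

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR = (TP/FN) / (FP/TN): the odds of a positive test in the
    diseased divided by the odds of a positive test in the non-diseased."""
    return (tp / fn) / (fp / tn)

# Hypothetical 2x2 counts: a case-control study with clear-cut patients
# and healthy controls makes the test look much better...
dor_flawed = diagnostic_odds_ratio(tp=90, fp=10, fn=10, tn=90)  # DOR = 81
# ...than a consecutive series from the clinic where it will be used.
dor_sound = diagnostic_odds_ratio(tp=80, fp=25, fn=20, tn=75)   # DOR = 12

# Relative DOR: performance in studies failing a criterion relative to
# studies meeting it. Values above 1 indicate positive bias.
rdor = dor_flawed / dor_sound  # 6.75 in this invented example
```

A relative DOR of 3.0 for case-control designs, as in the Table, means such studies made tests look three times better (in odds ratio terms) than studies with clinically realistic cohorts.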


The amount of positive bias in poorly conducted studies of diagnostic tests is extremely worrying. Most information for most laboratory tests is only available in the form of case-control studies - those with the highest bias.

Take one example, that of the fashionable free-PSA test [3]. The likelihood ratios from the early studies were 2 to 7. This might be useful in a population of men referred to a urology clinic with prostate cancer or BPH, but most of the studies were case-control studies. If the likelihood ratios were biased, and in truth were lower, the test may be of no use even in a high prevalence setting.
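What such bias would mean at the bedside can be sketched with Bayes' rule: pre-test odds multiplied by the likelihood ratio give post-test odds. The 30% pre-test probability below is a hypothetical figure for a high-prevalence referral setting, chosen only to illustrate the gap between a reported likelihood ratio of 7 and a true value nearer 2:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert pre-test probability to post-test probability via odds."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# Hypothetical 30% pre-test probability in a referral clinic:
p_reported = post_test_probability(0.30, 7.0)  # 0.75 if the reported LR holds
p_biased = post_test_probability(0.30, 2.0)    # about 0.46 if the true LR is 2
```

If the true likelihood ratio is 2 rather than 7, a positive result moves the probability of disease from 30% to only about 46% rather than 75%, too small a shift to change management.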

It is all very worrying. It is time someone in academe, or the NHS, or industry sat up and took notice. The problem is not just, or even mainly, with treatment. The problem is knowing who is to be treated. The message is that we need to get back to first principles and do some large, high-quality, real-life studies. CARE has started that for the clinical examination, but there's absolutely no reason why similar studies could not be performed in other settings for laboratory tests and clinical examinations combined.


  1. FA McAlister, SE Straus, DL Sackett. Why we need large, simple studies of the clinical examination: the problem and a proposed solution. Lancet 1999; 354: 1721-1724.
  2. JG Lijmer et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999; 282: 1061-1066.
  3. RA Moore. Free PSA as a percentage of the total: where do we go from here? Clinical Chemistry 1997; 43: 1561-1562.