
How good is that test?

More than ever the practice of medicine and delivery of healthcare depends upon making an accurate diagnosis based on the use of diagnostic tests. These may be radiological (including ultrasonics or magnetic resonance imaging), laboratory based (including biochemistry, haematology, bacteriology, virology, immunology or genetics) or physiological (thermometer, dipstick, exercise stress tests).

How do you know how good a test is in giving you the answer you seek? What are the rules of evidence against which new (or existing) tests should be judged? We can have rules of evidence for treatments (Bandolier 12), so rules of evidence for tests shouldn't be too much to ask for.

Methodological standards

Bandolier has found a terrific paper [1] which sets out seven methodological standards for diagnostic tests. It then looks at papers published in Lancet, British Medical Journal, New England Journal of Medicine and Journal of the American Medical Association from 1978 through 1993 to see how many reports of diagnostic tests meet these standards (and for those who can't stand the suspense, the answer is not many!).

This paper is neither an easy nor a comfortable read. For those of us engaged in providing diagnostic tests this is a rude reminder of how little time we spend thinking and how much time "getting and spending". For those of us who use diagnostic tests, it is also a rude reminder of how much faith we often put into a number or opinion, perhaps without thinking of the weight that should be placed on that number.

For these reasons, and many others, this is a paper to get from the library and read in full, and then to keep handy for later use. Bandolier summarises the paper here to make sure that we understand and appreciate it.

Seven standards

Standard 1: Spectrum composition

The sensitivity or specificity of a test depends on the characteristics of the population studied (see Bandolier 3). Change the population and you change these indices. Since most diagnostic tests are evaluated on populations with significant disease, the reported values for sensitivity and specificity may not be applicable to the other populations in which the test is to be used.

For this standard to be met the report had to contain information on three of these four criteria: age distribution, sex distribution, summary of presenting clinical symptoms and/or disease stage, and eligibility criteria for study subjects.
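
To see why spectrum matters, here is a minimal sketch in Python. It assumes, purely for illustration, that a test picks up 50% of early disease and 90% of advanced disease; those figures, the stage names and the two case mixes are hypothetical and do not come from the paper.

```python
def sensitivity(tp, fn):
    """Proportion of diseased subjects whom the test calls positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of disease-free subjects whom the test calls negative."""
    return tn / (tn + fp)


def pooled_sensitivity(case_mix, sens_by_stage):
    """Overall sensitivity: stage-specific sensitivities weighted by the
    share of diseased patients found in each stage."""
    total = sum(case_mix.values())
    return sum(case_mix[stage] / total * sens_by_stage[stage] for stage in case_mix)


# Hypothetical stage-specific sensitivities (assumed for this example only).
SENS_BY_STAGE = {"early": 0.50, "advanced": 0.90}

# Hypothetical case mixes: a referral hospital sees mostly advanced disease,
# a screening clinic sees mostly early disease.
hospital_mix = {"early": 20, "advanced": 80}
screening_mix = {"early": 80, "advanced": 20}

hospital = pooled_sensitivity(hospital_mix, SENS_BY_STAGE)
screening = pooled_sensitivity(screening_mix, SENS_BY_STAGE)
print(f"hospital sensitivity:  {hospital:.2f}")
print(f"screening sensitivity: {screening:.2f}")
```

Under these assumptions the same test has a sensitivity of 0.82 in the hospital population and 0.58 in the screening population, which is why a report needs to say who was studied.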

Standard 2: Pertinent subgroups

Sensitivity and specificity may represent average values for a population. Unless the condition for which a test is to be used is narrowly defined, the indices may vary in different medical subgroups. For successful use of the test, separate indices of accuracy are needed for pertinent individual subgroups within the spectrum of tested patients.

This standard was met when indices of accuracy were reported for any pertinent demographic or clinical subgroup (for example, symptomatic versus asymptomatic patients).
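
Reporting by subgroup amounts to keeping a separate two-by-two table for each group. The sketch below shows one way this might be done; the record layout, the helper names and all the counts are assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_by_subgroup(records):
    """records: iterable of (subgroup, test_positive, disease_present).
    Returns sensitivity and specificity for each subgroup separately."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for subgroup, test_positive, diseased in records:
        if diseased:
            key = "tp" if test_positive else "fn"
        else:
            key = "fp" if test_positive else "tn"
        counts[subgroup][key] += 1

    result = {}
    for subgroup, c in counts.items():
        n_diseased = c["tp"] + c["fn"]
        n_healthy = c["tn"] + c["fp"]
        result[subgroup] = {
            # None marks an index that cannot be estimated in this subgroup.
            "sensitivity": c["tp"] / n_diseased if n_diseased else None,
            "specificity": c["tn"] / n_healthy if n_healthy else None,
        }
    return result


def make_records(subgroup, tp, fn, tn, fp):
    """Expand two-by-two counts into individual records (illustrative helper)."""
    return (
        [(subgroup, True, True)] * tp
        + [(subgroup, False, True)] * fn
        + [(subgroup, False, False)] * tn
        + [(subgroup, True, False)] * fp
    )


# Hypothetical counts for two clinical subgroups.
records = make_records("symptomatic", tp=45, fn=5, tn=40, fp=10)
records += make_records("asymptomatic", tp=12, fn=8, tn=76, fp=4)

by_group = accuracy_by_subgroup(records)
for name, indices in by_group.items():
    print(name, indices)
```

With these invented counts, sensitivity is 0.90 in symptomatic patients but 0.60 in asymptomatic ones; a single pooled figure would hide that difference.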

Standard 3: Avoidance of workup bias

This form of bias can occur when patients with positive or negative diagnostic test results are preferentially referred to receive verification of diagnosis by the gold standard procedure.

The authors of the paper discuss this at length because of an early lack of agreement in applying the criteria for it. They give many examples. One was of a new DNA diagnostic test to detect the breast cancer gene, administered to biopsy-positive breast cancer patients and to cancer-free controls. Since biopsy may be ordered preferentially in women with a family history of breast cancer, the group of "cases" will be enriched by a clinical factor which itself may be associated with the new DNA test.

For this standard to be met in cohort studies, all subjects had to be assigned to receive both the diagnostic test and the gold standard verification either by direct procedure or by clinical follow up. In case-control studies credit depended on whether the diagnostic test preceded or followed the gold standard procedure. If it preceded, credit was given if disease verification was obtained for a consecutive series of study subjects regardless of their diagnostic test result. If the diagnostic test followed, credit was given if test results were stratified according to the clinical factors which evoked the gold standard procedure.
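
A small worked calculation shows how much damage workup bias can do. This is a sketch only: the true sensitivity and specificity, the population sizes and the verification fractions are all invented, and the calculation uses expected counts rather than real data.

```python
def observed_indices(n_diseased, n_healthy, sens, spec, verify_pos, verify_neg):
    """Apparent sensitivity and specificity when only a fraction of
    test-positive (verify_pos) and test-negative (verify_neg) subjects
    go on to receive the gold standard. Works on expected counts."""
    tp = n_diseased * sens
    fn = n_diseased * (1 - sens)
    tn = n_healthy * spec
    fp = n_healthy * (1 - spec)

    # Only verified subjects appear in the published two-by-two table.
    v_tp, v_fp = tp * verify_pos, fp * verify_pos
    v_fn, v_tn = fn * verify_neg, tn * verify_neg

    return v_tp / (v_tp + v_fn), v_tn / (v_tn + v_fp)


# Hypothetical test with true sensitivity 0.80 and specificity 0.90.
full = observed_indices(100, 900, 0.80, 0.90, verify_pos=1.0, verify_neg=1.0)

# Same test, but only 10% of test-negative subjects are verified.
biased = observed_indices(100, 900, 0.80, 0.90, verify_pos=1.0, verify_neg=0.1)

print(f"everyone verified:      sens={full[0]:.2f} spec={full[1]:.2f}")
print(f"negatives rarely tested: sens={biased[0]:.2f} spec={biased[1]:.2f}")
```

In this made-up case, verifying every positive but only one in ten negatives pushes the apparent sensitivity from 0.80 to about 0.98 and drags the apparent specificity from 0.90 down to about 0.47.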

Standard 4: Avoidance of review bias

This form of bias can be introduced if the diagnostic test or the gold standard is appraised without precautions to achieve objectivity in their sequential interpretation - like blinding in clinical trials of a treatment. It can be avoided if the test and gold standard are interpreted separately by persons unaware of the results of the other.

For this standard to be met in either prospective cohort studies or case-control studies, a statement was required regarding the independent evaluation of the two tests.

Standard 5: Precision of results for test accuracy

The stability of sensitivity and specificity depends on how many patients have been evaluated. Like many other measures, the point estimate should have confidence intervals around it, which are easily calculated.

For this standard to be met, confidence intervals or standard errors must be quoted, regardless of magnitude.
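
As an illustration of how little work is involved, the sketch below computes a 95% confidence interval for a proportion such as sensitivity. It uses the Wilson score interval, which is one common choice and not necessarily the method the paper's authors had in mind; the patient counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical studies: the same point estimate of sensitivity (0.90)
# from 20 diseased patients and from 200 diseased patients.
small = wilson_interval(18, 20)
large = wilson_interval(180, 200)

print(f"18/20:   {small[0]:.3f} to {small[1]:.3f}")
print(f"180/200: {large[0]:.3f} to {large[1]:.3f}")
```

A sensitivity of 90% based on 20 patients comes with an interval of roughly 0.70 to 0.97, so the bare point estimate says much less than it appears to.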

Standard 6: Presentation of indeterminate test results

Not all tests come out with a yes or no answer. Sometimes they are equivocal, or indeterminate. The frequency of indeterminate results will limit a test's applicability, or make it cost more because further diagnostic procedures are needed. The frequency of indeterminate results and how they are used in calculations of test performance represent critically important information about the test's clinical effectiveness.

For this standard to be met a study had to report all of the appropriate positive, negative or indeterminate results generated during the evaluation and whether indeterminate results had been included or excluded when indices of accuracy were calculated.
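
The effect of the handling rule is easy to demonstrate. In this sketch the counts are hypothetical, and the two policies shown (dropping indeterminate results, or counting each one as a wrong answer) are just two possible conventions chosen for illustration.

```python
def indices(diseased, healthy, indeterminate="exclude"):
    """diseased, healthy: dicts of counts with keys 'pos', 'neg', 'ind'.
    indeterminate='exclude'  drops indeterminate results from the denominators;
    indeterminate='as_error' counts each one as a wrong answer."""
    if indeterminate == "exclude":
        sens = diseased["pos"] / (diseased["pos"] + diseased["neg"])
        spec = healthy["neg"] / (healthy["pos"] + healthy["neg"])
    elif indeterminate == "as_error":
        sens = diseased["pos"] / sum(diseased.values())
        spec = healthy["neg"] / sum(healthy.values())
    else:
        raise ValueError("indeterminate must be 'exclude' or 'as_error'")
    return sens, spec


# Hypothetical evaluation in which 20% of results are indeterminate.
diseased = {"pos": 70, "neg": 10, "ind": 20}
healthy = {"pos": 5, "neg": 75, "ind": 20}

excluded = indices(diseased, healthy, "exclude")
as_error = indices(diseased, healthy, "as_error")
ind_rate = (diseased["ind"] + healthy["ind"]) / (
    sum(diseased.values()) + sum(healthy.values())
)

print(f"indeterminate rate: {ind_rate:.0%}")
print(f"excluded:  sens={excluded[0]:.3f} spec={excluded[1]:.3f}")
print(f"as errors: sens={as_error[0]:.3f} spec={as_error[1]:.3f}")
```

With these numbers sensitivity and specificity are 0.875 and 0.9375 when indeterminate results are excluded, but 0.70 and 0.75 when they are counted against the test, so a reader needs to know which convention was used.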

Standard 7: Test reproducibility

Tests may not always give the same result - for a whole variety of reasons of test variability or observer interpretation. The reasons for this, and its extent, should be investigated.

For this standard to be met in tests requiring observer interpretation, at least some of the tests should have been evaluated for a summary measure of observer variability. For tests without observer interpretation, credit was given for a summary measure of instrument variability.
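
The standard does not prescribe a particular summary measure. One widely used option for two observers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance; the sketch below assumes that choice, and the readings are invented.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two observers rating the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Proportion of subjects on which the observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each observer's marginal rates.
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )

    if expected == 1:
        # Both observers used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 100 films by two observers:
# 40 both positive, 40 both negative, 20 on which they disagree.
observer_a = ["pos"] * 40 + ["neg"] * 40 + ["pos"] * 10 + ["neg"] * 10
observer_b = ["pos"] * 40 + ["neg"] * 40 + ["neg"] * 10 + ["pos"] * 10

kappa = cohen_kappa(observer_a, observer_b)
print(f"kappa = {kappa:.2f}")
```

Here the observers agree on 80% of films, but half of that agreement would be expected by chance, giving a kappa of 0.60.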

Do reports meet the standards?

Between 1978 and 1993 the authors found 112 articles, predominantly on radiological tests and immunoassays. Few of the standards were met consistently - ranging from 46% avoiding workup bias down to 9% reporting accuracy in subgroups.

While there was an overall improvement over time for reports to score well on more standards, even in the most recent period studied only 24% met up to four standards, and only 6% up to six.


The authors suggest that, because these reports were published in arguably the four most important medical journals in the world, their findings probably overestimate the true use of methodological standards in the evaluation and reporting of diagnostic tests. That may be, but even so, the findings give real cause for concern about the technological creep of diagnostic tests of unproven worth.

Systematic evaluation of diagnostic tests before their widespread use could be expected to provide benefits in several areas:-
  1. Elimination of poor or useless tests before they become widely available.
  2. Improved quality of diagnostic test information.
  3. Reduced health care costs.
  4. Improved patient care.

Given the reliance put on diagnostic tests in modern medical practice, it might be appropriate for diagnostic tests to be subjected to standardised evaluation before being released for widespread use [2]. This might seem draconian, but is there another way?


  1. MC Reid, MS Lachs, AR Feinstein. Use of methodological standards in diagnostic test research: getting better but still not good. Journal of the American Medical Association 1995 274:645-51.
  2. GH Guyatt, PX Tugwell, DH Feeny, RB Haynes, M Drummond. A framework for clinical evaluation of diagnostic technologies. Canadian Medical Association Journal 1986 134:587-94.
