
Evidence and diagnostics



This essay was developed from thoughts on the evidence-base of diagnostic testing arising from writing Bandolier. We can do great things in understanding treatments, both those done in the past, and planning trials to be done in the future. Did diagnostic testing match up?

"Evidence-based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients." [1]

This quotation from Dave Sackett and his colleagues is as good a place as any to start thinking about evidence and migraine trials and treatments. The full article goes beyond this definition and includes patient and societal values. The main issue, though, is about where the practitioner goes to find "current best evidence". It could be using local or national guidelines, as for instance produced by organisations like the National Institute of Clinical Excellence, or those produced by eminent bodies. Some people will remain sceptical, though, and will (and should) satisfy themselves that the evidence on which guidelines are based is sound.

Information, knowledge and wisdom

In the past that task was difficult. With millions of papers being published each year (there are said to be about 30,000 medical journals), trying to find information, especially all the information, was a heroic task. Now it is much easier. We can search PubMed online or visit electronic journals like BioMed, or electronic versions of paper journals like the BMJ. The Cochrane Library, available online or on CD for a small subscription, has not only many good reviews, but also over 250,000 controlled trials found by hand-searching the literature.

Good systematic reviews are increasingly available, where someone has asked a clinical question, and then summarised all the known information into a solid piece of knowledge. In doing so they will distill the information, perhaps integrate different types of information, and use quality filters so that only the most reliable information is used and that unreliable information is discarded.

How that knowledge is used depends on the practitioner making the conscientious, explicit and judicious use of the knowledge, in terms of the unique biology of the patient, incorporating patient concerns, their own experience and local knowledge, the values of society and the conditions in which they are working. The same piece of knowledge will play differently in Cardiff or Calcutta. That's the wisdom bit. That is why evidence-based approaches have nothing to do with rules, but should be seen as tools to allow practitioners to be better, and patients to be better informed.

Bias in clinical trials

One of the things we have learned through doing systematic reviews (also called meta-analysis when we pool data and do some sums) has been that certain types of study architectures are likely to produce results that are more favourable to a new treatment than they should be [2]. This is called bias, and many forms of bias have been discovered. We know that trials that are not randomised over-estimate the size of a treatment effect, as do trials that are not blind, or where information from patients is duplicated [3], or where trials are small [4], or where they have poor reporting quality [5,6].

We can be much more specific. For instance, in a study of transcutaneous electrical nerve stimulation in postoperative pain, 17 of 19 trials that were not randomised came up with a positive result, while 15 of 17 randomised trials came up with the completely opposite result, that it did not work [7].

In a review of acupuncture in back pain, lumping together all randomised trials, whether blinded or open, came up with the result that acupuncture worked for back pain. When you look at the open studies, where people making the assessments knew who had true acupuncture and who did not, there was a striking difference. But when you look at only the blinded studies, where people making the assessments did not know the treatment used, there was no difference at all. Acupuncture does not work.

So attending to bias is an important issue in systematic reviews or meta-analysis of treatments. Where bias is known or likely to exist, then we may come up with the wrong overall result. To be sure of what we conclude in terms of best evidence, we have to use knowledge that is the very best. If we use poor quality knowledge, we may end up doing the wrong thing.

There are also some important issues around trial validity [8], which have been summarised for acupuncture.

Size (bigness, magnitude)

We also have to be sure that we have sufficient information on which to base a conclusion. The figure below looks at all the literature available on properly randomised, double-blind trials comparing ibuprofen with placebo in acute postoperative pain. They were impeccable trials, all using the same patients with the same initial degree of pain, and used the same outcomes over the same period of time.

Each point represents a trial, and we plot the percentage with at least 50% pain relief with placebo on the X axis, and the percentage with at least 50% pain relief with ibuprofen 400 mg on the Y axis. All are above the line of equality, showing that ibuprofen is a better analgesic than placebo, which is encouraging. We can even see that the NNT of 2.7 means that ibuprofen is an effective analgesic.
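For readers who want the arithmetic behind an NNT, here is a minimal sketch. The number needed to treat is the reciprocal of the absolute difference in response rates. The two rates used below are hypothetical, chosen only because their difference gives an NNT close to 2.7; they are not the rates from the ibuprofen trials.

```python
def nnt(rate_active: float, rate_control: float) -> float:
    """Number needed to treat: 1 / (absolute difference in response rates)."""
    difference = rate_active - rate_control
    if difference <= 0:
        raise ValueError("active treatment must have the higher response rate")
    return 1.0 / difference


# Hypothetical response rates (proportions with at least 50% pain relief).
# These are illustrative assumptions, not data from the trials.
assumed_ibuprofen_rate = 0.55
assumed_placebo_rate = 0.18

print(round(nnt(assumed_ibuprofen_rate, assumed_placebo_rate), 1))
```

A lower NNT means fewer patients have to be treated for one of them to benefit who would not have done so with placebo.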

But why do we have such a scatter of points if all these trials are supposed to be the same? Is it because some were conducted in Welsh wimps and others in Scottish stoics, perhaps? Actually, no. These trials were all done to show that ibuprofen is better than placebo. They had about 40 patients per treatment group to do this. They were not done to show how much better ibuprofen is than placebo, a subtly different question, and one that needs far more patients to answer accurately.

Because we know how over 5,000 individual patients perform in these trials, we can mathematically model the effects of the random play of chance on these trials. In the representation below [11], anywhere in the grey area is where a trial comparing ibuprofen 400 mg with placebo could fall just by chance. It is more likely to be in the redder areas, but the spread we see because of chance is at least as big as that we saw in practice with all the randomised comparisons of ibuprofen with placebo. So we don't need to seek abstruse reasons for differences between single trials until the effects of random chance have been eliminated. Only numbers will do that.
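The random play of chance is easy to explore for yourself. The sketch below is not the model used in [11]; it is a simple simulation under stated assumptions. It assumes a true placebo response rate of 18% and a true ibuprofen response rate of 55% (both hypothetical, picked so that the difference corresponds to an NNT of about 2.7) and asks how widely the observed ibuprofen result scatters when each group has 40 patients, compared with 400.

```python
import random


def simulate_trial(n_per_group, p_placebo, p_active, rng):
    """Return (% responding on placebo, % responding on active) for one simulated trial."""
    placebo = sum(rng.random() < p_placebo for _ in range(n_per_group))
    active = sum(rng.random() < p_active for _ in range(n_per_group))
    return 100.0 * placebo / n_per_group, 100.0 * active / n_per_group


def central_range(values, coverage=0.95):
    """Range containing the central `coverage` fraction of the values."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    low_index = int(tail * len(ordered))
    high_index = len(ordered) - 1 - low_index
    return ordered[low_index], ordered[high_index]


def spread_by_chance(n_per_group, n_trials=1000,
                     p_placebo=0.18, p_active=0.55, seed=1):
    """Central 95% range of the active-treatment response across simulated trials.

    The default rates are assumptions for illustration, not trial data.
    """
    rng = random.Random(seed)
    trials = [simulate_trial(n_per_group, p_placebo, p_active, rng)
              for _ in range(n_trials)]
    return central_range([active for _, active in trials])


if __name__ == "__main__":
    for n in (40, 400):
        low, high = spread_by_chance(n)
        print(f"{n} per group: 95% of simulated trials gave "
              f"{low:.0f}% to {high:.0f}% responding on ibuprofen")
```

Even though every simulated trial has exactly the same underlying truth, the small trials scatter much more widely than the large ones, which is the point being made about single trials.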

Just to finish off the business of size, and to emphasise again how important it is, the slide below is probably unique in that it draws together information from 50 meta-analyses. Each blob represents the response rate found with placebo. We are plotting the rate of people achieving half pain relief with placebo against the number of patients given placebo. In total there are 12,000 such patients, and the blue vertical line represents the overall response rate of 18%. Only when the number of patients with placebo in the meta-analysis is large (of the order of 1000), is the overall rate accurately measured. This emphasises that size is everything.
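A rough feel for why about 1000 patients are needed comes from the usual normal approximation for the uncertainty around a proportion. This is a minimal sketch, assuming a true rate of 18% as quoted above and simple random sampling; it is not how the plot itself was produced.

```python
import math


def ci_half_width(proportion: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion
    (normal approximation: z * sqrt(p * (1 - p) / n))."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(proportion * (1.0 - proportion) / n)


if __name__ == "__main__":
    placebo_rate = 0.18  # overall placebo response rate quoted in the text
    for n in (40, 200, 1000, 12000):
        half_width = 100.0 * ci_half_width(placebo_rate, n)
        print(f"n = {n:>5}: 18% plus or minus about {half_width:.1f} percentage points")
```

With 40 patients the uncertainty is around twelve percentage points either way; with 1000 it shrinks to a little over two.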

Evidence and bias in diagnostic testing

For treatments, people have devised various levels of evidence, and this has been done in a number of other areas. The aim is to try to help us to use the best available evidence in making our decisions. One of the best places to see some thoughtful stuff is at the Centre for Evidence-Based Medicine. Usually at the top level is a systematic review of qualitatively good studies. But there are problems with this, because we may not always be able to recognise what constitutes goodness. Another set of levels of evidence in diagnostic testing uses criteria set out for individual studies of diagnostic tests:

Levels of evidence for studies of diagnostic tests

Level 1: An independent, masked comparison with reference standard among an appropriate population of consecutive patients.

Level 2: An independent, masked comparison with reference standard among non-consecutive patients or confined to a narrow population of study patients.

Level 3: An independent, masked comparison with an appropriate population of patients, but reference standard not applied to all study patients.

Level 4: Reference standard not applied independently or masked.

Level 5: Expert opinion with no explicit critical appraisal, based on physiology, bench research, or first principles.

The top level is taken up by studies which have independent, blinded comparisons of the test with a reference standard, using consecutive patients. Other study architectures are given a lower level of evidence. Level 2 is the same as level 1, but using non-consecutive patients, for instance testing the test on a group of people with the disease and a group of people without the disease, the most common study architecture. The problem for us is that we do not always recognise how big this difference is, and whether lower levels of evidence are so low as to mean that we can ignore them.

A review from Holland gives us a real insight into the size of the gap between level 1 and level 2 studies. It searched for and found 26 systematic reviews of diagnostic tests with at least five included studies. Only 11 could be used in their analysis, because 15 were either not systematic in their searching or did not report any sensitivity or specificity. Data from the remainder were subjected to mathematical analysis, to investigate whether the presence or absence of some item of proposed study quality made a difference to the perceived value of the test.

There were 218 studies, only 15 of which satisfied all eight criteria of quality for the analysis. Thirty percent fulfilled at least six of eight criteria. The relative diagnostic odds ratio used indicated the diagnostic performance of a test in studies failing to satisfy the methodological criterion relative to its performance in studies with the corresponding feature. Over-estimation of effectiveness (positive bias) of a diagnostic test was shown when the lower limit of the 95% confidence interval for the relative diagnostic odds ratio was more than 1.

Study characteristic: relative diagnostic odds ratio (95% CI). Description.

Case-control design: 3.0 (2.0 to 4.5). A group of patients already known to have the disease compared with a separate group of normal patients.
Different reference tests: 2.2 (1.5 to 3.3). Different reference tests used for patients with and without the disease.
Not blinded: 1.3 (1.0 to 1.9). Interpretation of test and reference is not blinded to outcomes.
No description of test: 1.7 (1.1 to 1.7). Test not properly described.
No description of population: 1.4 (1.1 to 1.7). Population under investigation not properly described.
No description of reference: 0.7 (0.6 to 0.9). Reference standard not properly described.

The relative diagnostic odds ratio indicates the diagnostic performance of a test in studies failing to satisfy the methodological criterion relative to its performance in studies with the corresponding feature.
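To make the measure concrete, here is a minimal sketch of what a diagnostic odds ratio is and how a relative diagnostic odds ratio compares two of them. The 2x2 counts are hypothetical, chosen so that the ratio comes out at 3; the review will have estimated its ratios across many studies, and this sketch does not attempt to reproduce that method.

```python
def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Odds of a positive test in people with the disease divided by the
    odds of a positive test in people without it: (tp / fn) / (fp / tn)."""
    if fp == 0 or fn == 0:
        raise ValueError("fp and fn must be non-zero for a finite odds ratio")
    return (tp * tn) / (fp * fn)


def relative_dor(dor_lacking_feature: float, dor_with_feature: float) -> float:
    """How much larger the diagnostic odds ratio looks in studies lacking a
    quality feature than in studies that have it."""
    return dor_lacking_feature / dor_with_feature


# Hypothetical counts: the same test evaluated with two study designs.
# Design lacking the feature (known cases versus healthy controls).
dor_case_control = diagnostic_odds_ratio(tp=90, fp=10, fn=10, tn=90)
# Design with the feature (consecutive patients in whom disease is suspected).
dor_consecutive = diagnostic_odds_ratio(tp=75, fp=10, fn=25, tn=90)

print(dor_case_control, dor_consecutive,
      relative_dor(dor_case_control, dor_consecutive))
```

A relative diagnostic odds ratio above 1 means the weaker design makes the test look better than the stronger design does.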

The size of the bias is rather large, and tells us that if we use studies that compare people with the disease with those who do not have it, the results we get will be wrong. They will massively over-estimate the effectiveness of the test. That effectiveness will also be over-estimated by a range of other architectural problems.

Our problems with the quality of data from diagnostic test papers are compounded by how poorly they are reported. A study by Read and colleagues in 1995 examined issues of quality of reporting of diagnostic tests (Bandolier 26). It described seven quality criteria, and then explored how those criteria were met in papers on diagnostic testing published by the four major English-language medical journals. The results were not encouraging: few told us anything useful about the patients being tested, and only a quarter told us how reliable and reproducible the test was.

Table 3: Standards of reporting quality for studies

Reporting standard



Percent meeting standard

Spectrum composition: The sensitivity and specificity of a test depend on the characteristics of the population studied. Change the population and you change these indices. Since most diagnostic tests are evaluated on populations with more severe disease, the reported values for sensitivity and specificity may not be applicable to other populations with less severe disease in which the test will be used. For this standard to be met the report had to contain information on any three of these four criteria: age distribution, sex distribution, summary of presenting clinical symptoms and/or disease stage, and eligibility criteria for study subjects.

Pertinent subgroups: Sensitivity and specificity may represent average values for a population. Unless the condition for which a test is to be used is narrowly defined, then the indices may vary in different medical subgroups. For successful use of the test, separate indices of accuracy are needed for pertinent individual subgroups within the spectrum of tested patients. This standard is met when results for indices of accuracy were reported for any pertinent demographic or clinical subgroup (for example symptomatic versus asymptomatic patients).

Avoidance of workup bias: This form of bias can occur when patients with positive or negative diagnostic test results are preferentially referred to receive verification of diagnosis by the gold standard procedure. For this standard to be met in cohort studies, all subjects had to be assigned to receive both the diagnostic test and the gold standard verification either by direct procedure or by clinical follow up. In case-control studies credit depended on whether the diagnostic test preceded or followed the gold standard procedure. If it preceded, credit was given if disease verification was obtained for a consecutive series of study subjects regardless of their diagnostic test result. If the diagnostic test followed, credit was given if test results were stratified according to the clinical factors which evoked the gold standard procedure.

Avoidance of review bias: This form of bias can be introduced if the diagnostic test or the gold standard is appraised without precautions to achieve objectivity in their sequential interpretation - like blinding in clinical trials of a treatment. It can be avoided if the test and gold standard are interpreted separately by persons unaware of the results of the other. For this standard to be met in either prospective cohort studies or case-control studies, a statement was required regarding the independent evaluation of the two tests.

Precision of results for test accuracy: The reliability of sensitivity and specificity depends on how many patients have been evaluated. Like many other measures, the point estimate should have confidence intervals around it, which are easily calculated. For this standard to be met, confidence intervals or standard errors must be quoted, regardless of magnitude.

Presentation of indeterminate test results: Not all tests come out with a black or white, yes/no, answer. Sometimes they are equivocal, or indeterminate. The frequency of indeterminate results will limit a test's applicability, or make it cost more because further diagnostic procedures are needed. The frequency of indeterminate results and how they are used in calculations of test performance represent critically important information about the test's clinical effectiveness. For this standard to be met a study had to report all of the appropriate positive, negative or indeterminate results generated during the evaluation and whether indeterminate results had been included or excluded when indices of accuracy were calculated.

Test reproducibility: Tests may not always give the same result - for a whole variety of reasons of test variability or observer interpretation. The reasons for this, and its extent, should be investigated. For this standard to be met in tests requiring observer interpretation, at least some of the tests should have been evaluated for a summary measure of observer variability. For tests without observer interpretation, credit was given for a summary measure of instrument variability.


From Read et al JAMA 1995 274:645-651
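The precision standard says confidence intervals are easily calculated, so here is one way it might be done. This sketch uses the Wilson score interval, which is one common choice and not necessarily the method the authors of the standard had in mind. The 2x2 counts are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval (approximately 95% when z = 1.96) for a proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denominator = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denominator
    half_width = z * math.sqrt(p * (1.0 - p) / total
                               + z * z / (4.0 * total * total)) / denominator
    return centre - half_width, centre + half_width


# Hypothetical study: 50 people with the disease, 100 without.
true_positives, false_negatives = 45, 5
true_negatives, false_positives = 80, 20

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

sens_low, sens_high = wilson_interval(true_positives, true_positives + false_negatives)
spec_low, spec_high = wilson_interval(true_negatives, true_negatives + false_positives)

print(f"sensitivity {sensitivity:.2f} ({sens_low:.2f} to {sens_high:.2f})")
print(f"specificity {specificity:.2f} ({spec_low:.2f} to {spec_high:.2f})")
```

The interval around the sensitivity is the wider of the two here simply because fewer people with the disease were studied.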

It is immediately clear, then, that for diagnostic testing the strategy of performing systematic reviews may just not be helpful. We would hesitate to base major decisions on trials of treatment that were known to have massively biased results, and yet for diagnostic testing that's usually all we have. For some major areas of medicine one can start with several thousand papers on diagnostic tests, and end up with fewer than a handful that might be included in a review. We really don't know very much that's any use about almost any test.

Systematic review should be about picking the nuggets of gold out of the dross. It is not about heaping small piles of dross into one big pile of dross.

Size and diagnostic tests

This is an issue that has probably not been addressed sufficiently. To explain how important size is, let's take the example of sperm counts. Everyone knows that sperm counts are falling, and the reasons might include tight underpants, or oestrogens in the water supply, or even feminism. The evidence comes from a review of sperm counts (see Bandolier 56). This showed that sperm counts earlier in the century were higher than sperm counts later in the century.

The problem was that the early data came from a few studies with small numbers of men. If we replot the data with the symbols properly related to the size of the individual study, we get a very different picture.

The simple fact is that the overall sperm count in the review, weighted by study size, was 77 million per mL. Only large studies with at least 1000 men came close to measuring it accurately, and small studies had values with averages from 30 to 140 million per mL.
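Weighting by study size is simple to do. In this minimal sketch the five studies are invented for illustration (they are not the studies in the review), but they show how a handful of small studies with extreme averages can pull a simple mean away from the value the large studies agree on.

```python
def weighted_mean(values, weights):
    """Mean of `values` in which each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical studies: (mean sperm count in million per mL, number of men).
studies = [(140, 20), (110, 50), (30, 40), (78, 1500), (76, 2000)]

counts = [count for count, _ in studies]
sizes = [size for _, size in studies]

simple_mean = sum(counts) / len(counts)
size_weighted_mean = weighted_mean(counts, sizes)

print(f"simple mean of study averages: {simple_mean:.1f}")
print(f"mean weighted by study size:   {size_weighted_mean:.1f}")
```

The simple mean treats a study of 20 men as being worth as much as a study of 2000, which is rarely what we want.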

And this is before we get to the point about how to measure sperm, what is a sperm, and what quality control between laboratories looks like. There is some suggestion in the literature that individual laboratories vary widely in the results they give for the same sample.

The plain fact is that there is no evidence that sperm counts are falling. All the large (and good) studies give the same result. In the meantime, your tax is being used to finance research to find out if sperm counts are falling, how fast they are falling, and why they are falling. What a waste!

Good tests can make a difference

There are examples where a test and a treatment come together to make a difference. Examples include:

There are probably many more, but one problem that Bandolier has had is finding them. The evidence-base for effective diagnostics or diagnosis is rather thin - some would say pitifully thin. Think for a moment that effective treatment depends on effective diagnosis, and it makes one a bit concerned about the efficiency of our health services.

Not all tests are good, though

We must not delude ourselves that all tests are helpful. In pathology, the agreement between individual pathologists is not good, and even experts on the same disease can disagree frequently when looking at the same slides down a microscope (Bandolier 37). In reviewing 37 cases of possible melanoma (albeit not the easiest), eight benign cases and five malignant cases were agreed unanimously. Lack of unanimous agreement occurred in 24 cases (62%). Two or more discordant diagnoses were made in 14 cases (38%) and discordance was three or more in 8 cases (22%). The kappa was 0.5, indicating only moderate agreement.

It was illuminating to look at the extremes. One expert (and these were all experts in melanoma, don't forget) thought 21 cases were malignant and 16 were benign. Another thought 10 were malignant, 26 benign, and one indeterminate. Between them, these two pathologists disagreed on 12 out of 37 cases, and in 11 cases one pathologist identified a case as malignant while the other identified the same case as benign.
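Kappa measures agreement beyond what chance alone would produce. As a minimal sketch, here is Cohen's kappa for just these two pathologists. The cross-tabulation is reconstructed from the totals quoted above (21 and 16 for one, 10, 26 and 1 for the other, 12 disagreements of which 11 were malignant versus benign); those totals appear to fit only this one table, but treat it as a reconstruction and not as published data. The kappa of 0.5 quoted earlier relates to the whole panel, so this two-rater figure is not expected to match it.

```python
def cohens_kappa(table):
    """Cohen's kappa for two raters.

    table[i][j] is the number of cases the first rater placed in category i
    and the second rater placed in category j.
    """
    n_categories = len(table)
    total = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(n_categories)) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_categories))
                  for j in range(n_categories)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Rows: first pathologist; columns: second pathologist.
# Category order: malignant, benign, indeterminate.
# Reconstructed from the quoted totals (an assumption, see the text above).
two_pathologists = [
    [10, 11, 0],  # first said malignant
    [0, 15, 1],   # first said benign
    [0, 0, 0],    # first said indeterminate (never used)
]

print(round(cohens_kappa(two_pathologists), 2))
```

The two agreed on 25 of the 37 cases, but a good part of that agreement would be expected by chance, which is why kappa comes out well below the raw agreement rate.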

This is not picking on pathologists. We could make comments about other laboratory tests or imaging and its use in certain circumstances, like PSA for screening for prostate cancer, or imaging the back. The point is that we need to know how good a particular test is for a particular patient at a particular level of suspicion. The book by Dave Sackett and colleagues, Evidence-based Medicine: How to Practice and Teach EBM, is a must for better understanding of testing.

How doctors use tests

Another cracking study was dealt with in Bandolier 61. It asked groups of about 50 physicians and surgeons how they used diagnostic tests. The results were that very few knew or used Bayesian methods, or ROC curves, or likelihood ratios. So the formal ways we have of explaining diagnostic test results, including sensitivity and specificity, are just not understood or used by the people who use the tests.

If asked, what most doctors want is not likelihood ratios or sensitivity, specificity or positive predictive value. They want simple algorithms, ideally on their PC or Palm Pilot, that can be used to help make decisions. Even simple ways of looking at likelihood ratios assume you know where you are starting from, that is, the probability of disease before the test is done.
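The dependence on the starting point is easy to see in the standard arithmetic: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch follows, in which the sensitivity, specificity and pre-test probabilities are all hypothetical.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1.0 - specificity)


def negative_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR- = (1 - sensitivity) / specificity."""
    return (1.0 - sensitivity) / specificity


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Post-test odds = pre-test odds x likelihood ratio, converted back to a probability."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre-test probability must be strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


if __name__ == "__main__":
    # Hypothetical test with 90% sensitivity and 90% specificity.
    lr_positive = positive_likelihood_ratio(0.90, 0.90)
    for pre_test in (0.01, 0.10, 0.50):
        post_test = post_test_probability(pre_test, lr_positive)
        print(f"pre-test {pre_test:.0%} -> after a positive result {post_test:.0%}")
```

The same positive result means something quite different at a 1% pre-test probability than at 50%, which is why a likelihood ratio is of little use without a sensible estimate of where you are starting from.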

Clinical scoring systems are mixed blessings. A number of different scoring systems for dementia were examined by a multi-disciplinary team using detailed notes on just under 1900 patients. The rate of dementia varied from 3% using one system to 30% using another.

If we can't diagnose dementia accurately, how can we do clinical trials, how can we measure success, or decide which patients benefit, or advise clinicians, or explain all this to patients or their families?

For thyroid function tests, a small survey in The Lancet almost 25 years ago indicated that clinical scoring systems could be excellent predictors of when laboratory tests need to be done to confirm a diagnosis. If there are fewer than three signs or symptoms, the chances of thyroid disease are lower than in the population in general, and knowing this could prevent 90% of all tests being sent to labs from GPs or outpatients.

Doing better - CARE

So can we do better in how we think about and evaluate diagnostic testing? Sure we can. First of all we have some excellent examples of how we might go about evaluating tests. The best examples are the Ottawa ankle and knee rules. In each case we had studies that:

The result was a clinical decision rule that worked, and was used, provided a better service to patients, and saved time and money. Similar approaches have been made for rules for discontinuing cardiac resuscitation in hospital.

One of the most exciting new developments in e-medical research is the Clinical Assessment of the Reliability of the Examination (CARE), a collaborative study of the accuracy and precision of the clinical examination. Essentially a group of people get together via the internet to contribute patients to studies of clinical diagnostic testing. A systematic review is first undertaken, and only those features most likely to be important are combined in the final protocol. Because doctors around the world were involved, they were able to collect information on 10 times more people than in the medical literature in just a few weeks, and come up with some simple decision rules that doctors find useful.

Their website is a must. It is simply one of the best things in the world for diagnostic testing. The scandal is that demand for better information is so great, and supply so limited. It's important, too, because testing consumes resources, and getting it wrong can be a disaster for patients and providers.

GPs order blood tests on one in every 25 patients they see. In hospital it's probably more. We know from stories carried in Bandolier that unnecessary tests can be a large proportion of the total, with huge financial implications. We know that many of these tests were not necessary according to guidelines, and that savings in time and cost are possible. We also know from a randomised trial that by having guidelines on GPs' computers, we can actually reduce the number of tests ordered substantially.

The lesson is that doing simple things well makes for a better quality service at a lower cost, if for no other reason than that doing fewer tests means fewer false positive results.

Where do we go from here

If you are in a hole, stop digging. What we are doing now is so awful that we have to scrap most of it and start afresh. Doing systematic reviews of diagnostic tests is a complete waste of time.


One thing is certain. This should be one of the most fertile areas for research in the next few years. Laboratory scientists, clinicians, nurses, pharmacists and others all should be able to take part. It doesn't all need a brain the size of a planet. It doesn't have to be done at some ivory tower, and much could be done in Grimsby on a wet Tuesday afternoon. Watch this space.