
On limitations

The thing about looking at evidence of any sort is that there are likely to be limitations to it. Trials may not be properly conducted, may measure outcomes that are not useful, may be conducted on patients not like ours, or may present results in ways that we cannot easily comprehend; trials may have few events, when not much happens, but make much of not much, as it were. Observational studies, diagnostic studies, and health economic studies all have their own particular sets of limitations, as well as the more pervasive sins of significance chasing, or of finding evidence to support only preconceptions or idées fixes.

Perfection in terms of the overall quality and extent of evidence is never going to happen, if only because the ultimate question, whether this intervention will work in this patient and produce no adverse effects, cannot be answered. The average results we obtain from trials are difficult to extrapolate to individuals, and especially to the patient in front of us.

Acknowledging limitations

Increasingly we have come to expect authors to make some comment about the limitations of their studies, even if it is only a nod in the direction of acknowledging that there are some. This is not easy, because there is an element of subjectivity involved. Authors may also believe, with some reason, that spending too much time rubbishing their own results will lead to rejection by journals, and rejection is not appreciated by pointy-heads.

Even so, the dearth of space given over to the limitations of studies is worrying. A recent survey [1] that examined 400 papers from 2005 in the six most cited research journals and two open-access journals showed that only 17% used at least one word denoting limitations in the context of the scientific work presented. Among the 25 most cited journals, only one (JAMA) asks for a comment section on study limitations; the instructions of most were silent on the matter.

Few events

It is an unspoken rule that, to have a paper published, it helps to have some measure that displays a statistically significant difference. This leads to the phenomenon of significance chasing, in which data are analysed to death, the aim being any test that shows significance at the undemanding level of 5%. One issue arising is correcting for multiple statistical testing, something almost never done, as pointed out in Bandolier 153; the sketch below shows why such correction matters.
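
A minimal numerical sketch of the multiple-testing problem: if every null hypothesis is true and k independent tests are each run at the 5% level, the chance of at least one spurious 'significant' result is 1 - 0.95^k.

```python
# Probability of at least one spurious p < 0.05 among k independent
# tests when every null hypothesis is actually true.
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - 0.05) ** k
    print(f"{k:2d} tests: chance of a false positive = {p_any:.0%}")
```

With 20 looks at the data, a spurious 'significant' finding becomes more likely than not (64%).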

The more important question, not asked anything like often enough, is whether any statistical testing is appropriate at all. Put another way, when can we be sure that we have enough information to be sure of the result, using the mathematical sense of sure: confident, to a stated degree of probability, that we are not being mucked about by the random play of chance? This is not a trivial question, given that many results, especially those concerning rare but serious harm, are driven by very few events.

A few older papers keep being forgotten. When looking at the strengths and weaknesses of smaller meta-analyses versus larger randomised trials, a group from McMaster [2] suggested that with fewer than 200 outcome events, research (meta-analyses in this case) may be useful only for summarising information and generating hypotheses for future research.

A different approach, using simulations of clinical trials and meta-analyses [3], arrived at pretty much the same conclusion: with fewer than 200 events the magnitude and direction of an effect become increasingly uncertain. The sketch below illustrates the idea.
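
The simulation idea can be sketched as follows (this is not the authors' actual code, and the event rates of 10% and 5% are invented for illustration): simulate many trials of increasing size and watch the spread of estimated relative risks shrink only once a few hundred events have accrued.

```python
import numpy as np

rng = np.random.default_rng(42)
p_exp, p_ctl = 0.10, 0.05          # true event rates (illustrative)
true_rr = p_exp / p_ctl

# For each group size, run many simulated trials and record the spread
# of the estimated relative risk and the average number of events.
for n_per_group in (50, 200, 1000, 5000):
    rrs, events = [], []
    for _ in range(2000):
        e_exp = rng.binomial(n_per_group, p_exp)
        e_ctl = rng.binomial(n_per_group, p_ctl)
        if e_ctl > 0:              # skip the rare trial with no control events
            rrs.append(e_exp / e_ctl)
        events.append(e_exp + e_ctl)
    lo, hi = np.percentile(rrs, [2.5, 97.5])
    print(f"n/group={n_per_group:5d}  mean events={np.mean(events):5.0f}  "
          f"95% of RR estimates in [{lo:.2f}, {hi:.2f}]  (true RR={true_rr:.1f})")
```

With 50 patients per group (fewer than ten events on average) the estimated relative risk swings wildly around the true value of 2.0; only at several hundred events does it settle down.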

Just how many events are needed to be reasonably sure of a result when event rates are low (as is the case for rare but serious adverse events) was explored some while ago [4]. Bandolier's best try at explaining lots of maths and tables appears in Table 1. This looks at a number of examples, varying event rates in experimental and control groups, using probability limits of 5% and a more stringent 1%, and with power of 80% and 90% to detect an effect.



Table 1: Examples of numbers of events and numbers of subjects required to be reasonably sure of the direction of a result at various levels of significance and power for rare events



Event rates (probabilities)        Power of 80%                      Power of 90%
                                   p<0.05           p<0.01           p<0.05           p<0.01
Experimental  Control    Mean (%)  Events  Total    Events  Total    Events  Total    Events  Total
0.1           0.01       5.5       12      218      14      255      15      273      21      382
0.01          0.001      0.55      12      2182     14      2545     15      2727     21      3818
0.001         0.0001     0.055     12      21818    14      25455    15      27273    21      38182
0.1           0.05       7.5       67      893      >75     >1000    >75     >1000    >75     >1000
0.01          0.005      0.75      67      8933     >75     >10000   >75     >10000   >75     >10000
0.001         0.0005     0.075     67      89333    >75     >100000  >75     >100000  >75     >100000
0.04          0.01       2.5       23      920      34      1360     29      1160     42      1680
0.004         0.001      0.25      23      9200     34      13600    29      11600    42      16800
0.0004        0.0001     0.025     23      92000    34      136000   29      116000   42      168000
0.03          0.01       2         33      1650     48      2400     42      2100     59      2950
0.003         0.001      0.2       33      16500    48      24000    42      21000    59      29500
0.0003        0.0001     0.02      33      165000   48      240000   42      210000   59      295000
0.02          0.01       1.5       >75     >5000    >75     >5000    >75     >5000    >75     >5000
0.002         0.001      0.15      >75     >50000   >75     >50000   >75     >50000   >75     >50000
0.0002        0.0001     0.015     >75     >500000  >75     >500000  >75     >500000  >75     >500000


Higher power, greater stringency in probability values, lower event rates, and smaller differences in event rates between groups all push up the number of events and the number of patients needed in trials. Once event rates fall to about 1% or so, and differences between experimental and control groups to less than 1%, the number of events needed approaches 100 and the number of patients rises to tens of thousands. A rough calculation along these lines is sketched below.
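
Reference [4] uses an exact binomial method, but the standard normal-approximation formula for comparing two proportions reproduces the flavour of Table 1. A minimal sketch (expect its totals to come out a little smaller than Shuster's exact figures):

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for detecting a difference
    between proportions p1 (experimental) and p2 (control)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# First row of Table 1: event rates 0.1 vs 0.01, 5% significance, 80% power
total = 2 * n_per_group(0.1, 0.01)
events = round(total * (0.1 + 0.01) / 2)    # expected events across both groups
print(total, events)    # about 200 subjects and 11 events; Table 1's exact
                        # binomial method gives 218 and 12
```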

Subgroup analyses

One of the best examples of the dangers of subgroup analysis, due to unknown confounding, comes from a review article [5]. It examined the 30-day outcome of death or myocardial infarction from a meta-analysis of platelet glycoprotein inhibitors. Analysis indicated different results for women and men (Figure 1), with benefits in men but not women. Statistically this was highly significant (p<0.0001).



Figure 1: Subgroup analysis in women and men of death or MI with platelet glycoprotein inhibitors (95% confidence interval)





In fact, it was found that men had higher levels of troponins (a marker of myocardial damage) than women, and when this was taken into account the difference between men and women was understandable, with greater effect where there was greater myocardial damage; sex was not the source of the difference. The hypothetical simulation below shows how such a spurious subgroup effect can arise.
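
A hypothetical simulation (all probabilities invented; these are not the trial data): make the treatment work only in troponin-positive patients and make men more often troponin-positive, and an analysis split by sex duly 'finds' a sex effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

male = rng.random(n) < 0.5
# Invented: men are more often troponin-positive (the real effect modifier)
trop_pos = rng.random(n) < np.where(male, 0.6, 0.2)
treated = rng.random(n) < 0.5                 # randomised 1:1

# Invented outcome model: treatment helps only troponin-positive patients
base_risk = np.where(trop_pos, 0.12, 0.04)
risk = np.where(treated & trop_pos, base_risk * 0.7, base_risk)
event = rng.random(n) < risk

def rr(subgroup):
    """Relative risk of an event, treated versus control, in a subgroup."""
    t, c = subgroup & treated, subgroup & ~treated
    return event[t].mean() / event[c].mean()

print(f"RR in men:            {rr(male):.2f}")      # apparent benefit
print(f"RR in women:          {rr(~male):.2f}")     # apparently little benefit
print(f"RR troponin-positive: {rr(trop_pos):.2f}")  # the real story
print(f"RR troponin-negative: {rr(~trop_pos):.2f}")
```

Split by sex, men appear to do better than women; split by troponin status, the 'sex effect' disappears, just as in the review.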

Trivial differences

It is worth remembering what relative risks tell us in terms of raw data (Table 2). Suppose we have a population in which 100 events occur with our control intervention, whatever that is. If we have 150 events with an experimental intervention, the relative risk is 1.5. That may be statistically significant, but most of the events are those that would have occurred anyway. If there were 250 events, the relative risk would be 2.5, and now most events would occur because of the experimental intervention.



Table 2: What different levels of relative risk actually mean



Relative risk   What this means
< 1.0           The risk of an event is reduced for the experimental intervention compared with the control intervention
1.0             No increased or decreased risk for experimental versus control
1.0 - 2.0       Higher risk of events with the experimental intervention, but most events occur because of underlying factors, like the patient population being studied
> 2.0           Higher risk of events with the experimental intervention, and most events occur because of the experimental intervention
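
The boundary at 2.0 in Table 2 is simple arithmetic: with relative risk RR, the fraction of events in the exposed group attributable to the intervention is (RR - 1)/RR, which passes one half exactly at RR = 2.0. Checking against the 100-event example above:

```python
def attributable_fraction(rr):
    """Fraction of events among the exposed attributable to the exposure."""
    return (rr - 1) / rr

for baseline, exposed_events in ((100, 150), (100, 250)):
    rr = exposed_events / baseline
    af = attributable_fraction(rr)
    print(f"RR = {rr:.1f}: {af:.0%} of the {exposed_events} events are "
          f"attributable to the intervention ({exposed_events - baseline} events)")
```

For RR = 1.5, only a third of the events (50 of 150) are due to the intervention; for RR = 2.5, it is 60% (150 of 250).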


Large relative risks may be important, even with more limited data. Small relative risks, probably those below 2.0, and certainly those below about 1.5, should be treated with caution, especially where the number of events is small, and even more especially outside the context of a randomised trial.

The importance of a relative risk of 2.0 has been accepted in US courts [6]: 'A relative risk of 2.0 would permit an inference that an individual plaintiff's disease was more likely than not caused by the implicated agent. A substantial number of courts in a variety of toxic substance cases have accepted this reasoning.'

Confounding by indication etc.

Bias arises in observational studies when patients with the worst prognosis are allocated preferentially to a particular treatment. These patients are likely to be systematically different from those not treated, or treated with something else (paracetamol rather than an NSAID in asthma, for instance).

Confounding, by factors known or unknown, is potentially a big problem, because we do not know what we do not know, and the unknown can have big effects, like troponin above. When relative risks are small, say below about 1.3, the potential bias created by unknown confounding, or by improperly adjusted confounding by indication, becomes so great that it makes any conclusion at best unreliable. The sketch below shows how easily confounding alone can produce a relative risk of that size.
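
A hypothetical worked example (all probabilities invented): give the treatment no effect at all, but let sicker patients be both more likely to receive it and more likely to have the event; the crude relative risk comes out at about 1.3 anyway.

```python
# Crude relative risk produced by confounding by indication alone.
# All probabilities are invented; the treatment has NO effect on outcome.
p_conf = 0.5                          # prevalence of the unmeasured confounder
p_treat = {True: 0.7, False: 0.3}     # sicker patients get treated more often...
p_event = {True: 0.10, False: 0.05}   # ...and have more events, treated or not

def event_rate(treated):
    """Event rate among treated (or untreated) patients, mixing over the confounder."""
    num = sum((p_conf if c else 1 - p_conf)
              * (p_treat[c] if treated else 1 - p_treat[c])
              * p_event[c]
              for c in (True, False))
    den = sum((p_conf if c else 1 - p_conf)
              * (p_treat[c] if treated else 1 - p_treat[c])
              for c in (True, False))
    return num / den

rr = event_rate(True) / event_rate(False)
print(f"Crude RR = {rr:.2f}")   # about 1.31, from confounding alone
```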

Comment - the uncertainty principle

These are just a few of the limitations Bandolier sees in papers and talks. There are more, obviously. Worst of all is an outcome that fails to reach statistical significance, even at an undemanding level like 5% and despite multiple statistical comparisons, nonetheless being trumpeted as a 'result' and extrapolated to whole populations. If it ain't statistically significant, it don't signify.

The trouble is that we live in an imperfect world, where we never have the truth, the whole truth, and nothing but the truth on which to work and build judgements. We have to make do with what we have, and try our best to exclude the rubbish. Some try a philosophical approach to calculate thresholds above which we can begin to believe [7], but that seems a bit too glib.

References:

  1. JPA Ioannidis. Limitations are not properly acknowledged in the scientific literature. Journal of Clinical Epidemiology 2007 60: 324-329.
  2. MD Flather et al. Strengths and limitations of meta-analysis: larger studies may be more reliable. Controlled Clinical Trials 1997 18: 568-579.
  3. RA Moore et al. Size is everything - large amounts of information are needed to overcome random effects in estimating direction and magnitude of treatment effects. Pain 1998 78: 209-216.
  4. JJ Shuster. Fixing the number of events in large comparative trials with low event rates: a binomial approach. Controlled Clinical Trials 1993 14: 198-208.
  5. SG Thompson, JPT Higgins. Can meta-analysis help target interventions at individuals most likely to benefit? Lancet 2005 365: 341-346.
  6. Reference Manual on Scientific Evidence, 2nd edition, 2005-2006, p539 (ISBN 0820547549).
  7. B Djulbegovic, I Hozo. When should potentially false research findings be considered acceptable? PLoS Medicine 2007 4(2): e26.
