HOW SHOULD WE MEASURE THE OUTCOME?
What is the question we are trying to answer?
Trial size
Opioids in non-cancer pain
Testing individual patients for opioid sensitivity
Judging pain relief
Other tools
Restricting to moderate and severe initial pain intensity
Test Design and Validity
Placebo
Randomising and blinding for the individual patient test
Likelihood ratios
Table 1. Likelihood ratios with CAGE scores in 821 patients attending general medical outpatients
Figure 1. Nomogram for pre- and post test probability with likelihood ratio for alcohol abuse
Judging adverse effects
Table 2. Adverse effects on oral morphine 15
Figure 2. Benefit Harm Ratios
Alternative outcomes
Conclusion
References





Henry McQuay, DM

Clinical Reader in Pain Relief


Pain Research

Nuffield Department of Anaesthetics

University of Oxford

The Churchill

Oxford Radcliffe Hospital

Headington

Oxford OX3 7LJ, UK


Correspondence to Dr H J McQuay, Pain Research

Tel: +44 1865 226161

Fax: +44 1865 226160

email: henry.mcquay@pru.ox.ac.uk


For a difference to be a difference it has to make a difference


Attributed to Gertrude Stein, ca 1920


What is the question we are trying to answer?

Of the strands which ran through this meeting, two caused some confusion: establishing the academic answers, for instance to the question 'Do opioids work in neuropathic pain?', and establishing the clinical answers, for instance to the question 'Will opioids work without too many adverse effects in this patient?'. The approach to the academic question may differ from the approach to the clinical conundrum. Dealing first with the academic question, 'Are opioids effective in this particular syndrome?', we need to consider the design and size of any clinical trials we might undertake.


Trial size


We have learned many lessons about clinical trial design recently, and some of these lessons have emerged from the process of systematic review. Perhaps one of the most fundamental is that the size of trials is critically important 1 . For many years, whenever a small trial (group size less than 50) has thrown up a statistically significant result, for instance that women responded better than men to the particular intervention, we have burrowed around trying to explain this unexpected difference. We have to accept that such variations can occur by random chance alone, and that the chance of throwing up an unexpected result is much greater if we are working with small trials.


A simple example is to ask if we have a very large box of balls, half red and half blue, how many balls do we need to sample to be 99% sure that the true proportion of red balls is between 49% and 51%? The answer is 20,000. Of course the number required would decrease if we were to relax our criteria, but the example shows how much bigger our samples need to be than our current trials. If we wish to produce clinically credible results, not just statistical significance which gives us the direction but not the magnitude of any difference, then in analgesic trials we may need group sizes as large as 500 patients 1 .
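The order of magnitude in the box-of-balls example can be checked with a short calculation. The sketch below assumes the usual normal approximation to the binomial for a two-sided confidence interval; that method is an assumption made here for illustration and is not necessarily how the figure quoted above was derived. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence):
    """Sample size so that an estimate of proportion p lies within
    +/- margin with the given two-sided confidence.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2
    (an assumption of this sketch, not taken from the source).
    """
    # Two-sided critical value, e.g. about 2.58 for 99% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Half red, half blue; true proportion to be pinned down to 49%-51%
n_strict = sample_size_for_proportion(p=0.5, margin=0.01, confidence=0.99)

# Relaxing the criterion to 45%-55% shrinks the requirement sharply
n_relaxed = sample_size_for_proportion(p=0.5, margin=0.05, confidence=0.99)

print(n_strict, n_relaxed)
```

Under this approximation the strict criterion needs somewhere over sixteen thousand balls, the same order as the 20,000 quoted, and the requirement falls steeply as the margin is widened.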


Opioids in non-cancer pain

Traditional reasons for not using opioids in non-cancer pain include the political (opioids made unavailable for medical use for fear that medical availability would increase street problems), fear of toxicity, and fear of creating an addict and of inculcating abuse behaviour. Each of these can be dealt with separately. When oral opioids were introduced in Sweden for cancer pain there was no evidence of any increase in street opioid problems as a result. Toxicity with chronic pethidine use is indeed a danger, but with other opioids adverse effects (rather than toxic effects) can be managed for most patients. The historic anecdotal evidence of individuals who used opioids long term to control their pain, for instance Florence Nightingale, shows that people can function long term on opioids without becoming antisocial in their behaviour. There are obvious parallels with alcohol use.


In an ideal world one might argue that we should all be teetotal. The reality is that many adults use alcohol as a recreational drug, and do not develop addictive behaviour which is a problem to them or to society. Some of course do develop such problems. The tension then is between banning alcohol to protect this minority, and living with the knowledge that some cannot cope. In the pain world the tension we face is between denying patients opioids to relieve their pain and thus protecting the few from addictive behaviour, and allowing access to opioids in the knowledge that there will be some who do badly.


The bottom line is that opioids should be considered for use in non-malignant pain if no other remedy is effective and if the opioids are effective. This of course begs the question as to how we prove that the opioids are effective.


Testing individual patients for opioid sensitivity


The common clinical puzzle is to determine whether it is sensible to continue increasing the opioid dose for a patient whose pain has not yet responded to upward titration of the opioid dose. There are several different circumstances when this decision has to be made:


1. Pain relief with adverse effects

2. No pain relief : no adverse effects

3. No pain relief with adverse effects


In the presence of intolerable or unmanageable adverse effects further increase in dose is unlikely to yield benefit, and the decision then has to be whether to change opioid, change route of delivery or change method of pain relief to something other than opioid. In the absence of either pain relief or adverse effects (condition 2) most of us would continue to titrate the dose upward.


Where then is the problem? The problem is that things are rarely as clear as these conditions make out. The patient may well be unsure in condition 1 that the drug is indeed making an analgesic difference. Ideally one might wish to administer an opioid 'challenge', which would determine once and for all whether or not the pain was indeed sensitive to opioid. Unfortunately the logistics of such a challenge are not straightforward. If the patient has been taking large doses of an oral controlled release formulation, how is one to give the challenge? If by injection, how big should the dose be?


These problems have been considered at length in the parallel but different context of designing clinical trials to determine opioid sensitivity of particular pain syndromes 2 3 . There are thus multiple problems in designing such a test for an individual patient before we get to the interesting question about outcomes. We need to know for that patient:


How long have they been on the present dose (and previous doses) of opioid?

How much relief have they achieved?

Which adverse effects (if any) are they experiencing, and what steps (if any) have they taken to control those adverse effects?


Judging pain relief


The commonest tools used are categorical and visual analogue scales. Categorical scales use words to describe the magnitude of the pain. They were the earliest pain measure 4 . The patient picks the most appropriate word. Most research groups use four words (none, mild, moderate and severe). Scales to measure pain relief were developed later. The commonest is the five category scale (none, slight, moderate, good or lots, and complete).


For analysis numbers are given to the verbal categories (for pain intensity, none=0, mild=1, moderate=2 and severe=3, and for relief none=0, slight=1, moderate=2, good or lots=3 and complete=4). Data from different subjects is then combined to produce means (rarely medians) and measures of dispersion (usually standard errors of means). The validity of converting categories into numerical scores was checked by comparison with concurrent visual analogue scale measurements. Good correlation was found, especially between pain relief scales using cross-modality matching techniques 5-7 . Results are usually reported as continuous data, mean or median pain relief or intensity. Few studies present results as discrete data, giving the number of participants who report a certain level of pain intensity or relief at any given assessment point. The main advantages of the categorical scales are that they are quick and simple. The small number of descriptors may force the scorer to choose a particular category when none describes the pain satisfactorily.
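The scoring convention just described can be sketched in a few lines. The ratings below are hypothetical values invented purely for illustration, and the function name is an assumption of this sketch; only the category-to-number mapping comes from the text.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# Category-to-score mappings as given in the text
PAIN_INTENSITY = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
PAIN_RELIEF = {"none": 0, "slight": 1, "moderate": 2, "good": 3, "complete": 4}


def summarise(responses, scale):
    """Return (mean score, standard error of the mean) for a list of
    categorical responses scored on the given scale."""
    scores = [scale[r] for r in responses]
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Hypothetical relief ratings from six patients (illustrative only)
ratings = ["none", "slight", "moderate", "moderate", "good", "complete"]

mean_relief, sem_relief = summarise(ratings, PAIN_RELIEF)

# The less commonly reported discrete view: how many patients per category
counts = Counter(ratings)

print(mean_relief, round(sem_relief, 2), counts["moderate"])
```

The same data thus yield both the continuous summary (mean and standard error) and the discrete summary (patients per category) contrasted in the paragraph above.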


Visual analogue scales (VAS), lines with left end labelled "no relief of pain" and right end labelled "complete relief of pain", seem to overcome this limitation. Patients mark the line at the point which corresponds to their pain relief. The scores are obtained by measuring the distance between the no relief end and the patient's mark, usually in millimetres. The main advantages of VAS are that they are simple and quick to score, avoid imprecise descriptive terms and provide many points from which to choose. More concentration and coordination are needed, which can be difficult post-operatively or with neurological disorders.


Pain relief scales are perceived as more convenient than pain intensity scales, probably because patients have the same baseline relief (zero) whereas they could start with different baseline intensity (usually moderate or severe). Relief scale results are then easier to compare. They may also be more sensitive than intensity scales 7 8 . A theoretical drawback of relief scales is that the patient has to remember what the pain was like to begin with.


Other tools


Verbal numerical scales and global subjective efficacy ratings are also used. Verbal numerical scales are regarded as an alternative or complementary to the categorical and VAS scales. Patients give a number to the pain intensity or relief (for pain intensity 0 usually represents no pain and 10 the maximum possible, and for pain relief 0 represents none and 10 complete relief). They are very easy and quick to use, and correlate well with conventional visual analogue scales 9 .


Global subjective efficacy ratings, or simply global scales, are designed to measure overall treatment performance. Patients are asked questions like "How effective do you think the treatment was?" and answer using a labelled numerical or a categorical scale. Although these judgements probably include adverse effects they can be the most sensitive discriminant between treatments. One of the oldest scales was the binary question "Is your pain half gone?". Its advantage is that it has a clearer clinical meaning than a 10 mm shift on a VAS. The disadvantage, for the small trial intensive measure pundits at least, is that all the potential intermediate information (1 to 49% or greater than 50%) is discarded.


Judgement by the patient rather than by the carer is the ideal. Carers overestimate the pain relief compared with the patient's version.

Restricting to moderate and severe initial pain intensity


The trail blazers of analgesic trial methodology found that if patients had no pain to begin with, it was impossible to assess analgesic efficacy, because there was no pain to relieve. To optimise test sensitivity a rule developed, which was that only those patients with moderate or severe pain intensity at baseline would be studied. Those with mild or no pain would not. In a sense this is obvious within the context of testing for opioid sensitivity, because we should only be testing patients who have got pain of at least moderate intensity, but the rule is worth reiterating because it is broken so often. We know that if a patient records a baseline VAS score in excess of 30 mm they would probably have recorded at least moderate pain on a four point categorical scale 10 .


Test Design and Validity


Pain measurement is one of the oldest and most studied of the subjective measures, and pain scales have been used for over 40 years. Even in the early days of pain measurement there was understanding that the design of studies contributed directly to the validity of the result obtained. Trial designs which lack validity produce information that is at best difficult to use, and at worst is useless.


Placebo


People in pain respond to placebo treatment. Some patients given placebo obtain 100% pain relief. The opioid sensitivity test has to include some form of control for the placebo response. This might be either a 'positive' control, such as using two different doses of the challenge drug to see if there was a dose-response, or a 'negative' control, a true placebo, with due provision of escape analgesia if needed.


Randomising and blinding for the individual patient test


One way to produce a purist individual patient test would be to design it as a classic single-patient or N-of-1 randomised design. Using five paired treatments, control(s) and test, it is possible to derive a statistical and perhaps even clinical indication of test drug efficacy. The reality however is that such tests are logistically complicated and time-consuming.
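One way to see why five pairs is about the minimum for a statistical indication is a simple sign test. This is an assumption made here for illustration; the text does not specify how such an N-of-1 design would be analysed. The sketch supposes that, if test and control were truly equivalent, each pair would be equally likely to favour either, with no ties.

```python
from math import comb


def sign_test_p_value(n_pairs, n_favouring_test):
    """One-sided probability of at least n_favouring_test pairs favouring
    the test drug if test and control were equally likely to win each pair
    (ties are assumed not to occur in this sketch)."""
    favourable = sum(comb(n_pairs, k) for k in range(n_favouring_test, n_pairs + 1))
    return favourable / 2 ** n_pairs


p_five_of_five = sign_test_p_value(5, 5)  # 1/32
p_four_of_five = sign_test_p_value(5, 4)  # 6/32

print(p_five_of_five, p_four_of_five)
```

With five pairs, only a clean sweep (all five pairs favouring the test drug) falls below the conventional 5% level; a single discordant pair does not, which illustrates how demanding the design is for the individual patient.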


Likelihood ratios


Most of us become hopelessly confused when trying to remember the sensitivity and specificity of a diagnostic test. Here, however, we are thinking about a diagnostic test - will this challenge dose of opioid produce analgesia or won't it? - so we have to think a little about the properties, the sensitivity and specificity, of our putative diagnostic test.


The point of this digression into likelihood ratios is to emphasise the power of the clinical history in generating high pre-test probability to exploit any additional power of a test. This is well put by Sackett and colleagues 11 using the example of angina:


"Look at the relative size of the likelihood ratios for a brief, immediate, relatively cheap history and a much longer, delayed, and relatively expensive exercise electrocardiogram. There is no contest. Likelihood ratios for key points in the history and physical examination, both for this and for most other target disorders, are mammoth and dwarf those derived from most excursions through high technology."

My suspicion is that the clinical judgement of opioid sensitivity from the history is far more important in generating pre-test probability than any test we could devise.


The example that follows is of the use of likelihood ratios in diagnosing alcohol abuse 12 . The likelihood ratio (LR) can be calculated from the sensitivity and specificity of a test expressed as ratios rather than percentages. It expresses the odds that a given finding would occur in a patient with, as opposed to without, the target disorder or condition. It is derived as:


LRpos = sensitivity / (1 - specificity)


With the LR above 1, the probability of the disease or condition being present goes up; when it is below 1 the probability of it being present goes down, and when it is exactly 1 the probability is unchanged.


LR can also be calculated for a negative result, as well as for a positive one. To find the odds that a negative finding would occur in a patient with, as opposed to without, the target disorder or condition, LR is derived as:


LRneg = (1 - sensitivity) / specificity


Example from alcohol abuse

Patients can be screened systematically for drinking problems with a simple questionnaire. There are just four CAGE questions, scoring one point for each positive answer:

  1. Have you ever felt you should Cut down on your drinking?
  2. Have people Annoyed you by criticising your drinking?
  3. Have you ever felt bad or Guilty about your drinking?
  4. Have you ever had a drink first thing in the morning to steady your nerves or to get rid of a hangover ( Eye-opener )?


Researchers in Virginia applied the questions to patients aged over 17 attending the outpatient medical practice of an urban teaching hospital 13 . Eight hundred and thirty six patients who met the inclusion criteria were asked to participate, and 98% agreed. Of these, 36% met criteria for a lifetime history of alcohol abuse or dependence using a gold standard instrument.


The results are shown in Table 1, with likelihood ratios calculated for patients who scored 0, 1, 2, 3, or 4 questions answered with yes. These can be used on the nomogram (Figure 1) to help determine the post-CAGE probability of alcohol problems.


Table 1. Likelihood ratios with CAGE scores in 821 patients attending general medical outpatients


CAGE score    Alcoholic    Non-alcoholic    Likelihood ratio
0             33           428              0.14
1             45           54               1.50
2             86           34               4.50
3             74           10               13.00
4             56           1                100.00


Patients were defined as alcoholic or non-alcoholic according to Diagnostic and Statistical Manual of Mental Disorders criteria
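The likelihood ratios in Table 1 can be reproduced from the counts. For a test with several result levels, the ratio for each level is the proportion of alcoholic patients with that score divided by the proportion of non-alcoholic patients with that score. The sketch below uses only the counts from the table; the variable and function names are illustrative.

```python
# (alcoholic, non-alcoholic) counts for each CAGE score, from Table 1
cage_counts = {0: (33, 428), 1: (45, 54), 2: (86, 34), 3: (74, 10), 4: (56, 1)}

total_alcoholic = sum(alc for alc, _ in cage_counts.values())
total_non_alcoholic = sum(non for _, non in cage_counts.values())


def likelihood_ratio(score):
    """Proportion of alcoholic patients with this score divided by the
    proportion of non-alcoholic patients with this score."""
    alc, non = cage_counts[score]
    return (alc / total_alcoholic) / (non / total_non_alcoholic)


for score in sorted(cage_counts):
    print(score, round(likelihood_ratio(score), 2))

prevalence = total_alcoholic / (total_alcoholic + total_non_alcoholic)
```

The computed values agree with the table after rounding, and the column totals give the 821 patients and the prevalence of about 36% reported for the study.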



The pre-test probability without knowing the patient (or taking a history) could be taken from prevalence figures. Figures from the USA using the same gold standard instrument as in the paper suggest a prevalence of 25% for men and 4.5% for women 14 . On an individual patient basis one would have the pre-test probability from the history.


The analogy with opioid sensitivity testing then is that, just as was posed at the start of the chapter, we have to know what it is we are looking for. My contention is that we do have some idea of what is likely in the history to indicate that a pain is unlikely to have 'normal' opioid sensitivity. We know, for instance, that sensitivity may well be reduced in neuropathic pain. If the pain is a pain in a numb area then our pre-test probability should be high (that the pain has reduced opioid sensitivity). Similarly if the dose has been titrated up to the point of adverse effects with no glimmer of efficacy, again our pre-test probability should be high.


Figure 1. Nomogram for pre- and post test probability with likelihood ratio for alcohol abuse




To determine post-test probability draw a line through pre-test probability and likelihood ratio for CAGE score as indicated: e.g. if pre-test probability was 70% and CAGE score was 2 (likelihood ratio 4.5) then post-test probability is about 90%.
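The nomogram is a graphical shortcut for a short piece of arithmetic: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back to a probability. A minimal sketch follows, with a hypothetical function name and probabilities expressed as fractions rather than percentages.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Worked example from the figure legend: pre-test 70%, CAGE score 2 (LR 4.5)
p_cage_2 = post_test_probability(0.70, 4.5)

# Same patient with a CAGE score of 0 (LR 0.14): probability falls
p_cage_0 = post_test_probability(0.70, 0.14)

print(round(p_cage_2, 2), round(p_cage_0, 2))
```

For the worked example the arithmetic gives roughly 0.91, close to the value read off the nomogram, while a CAGE score of 0 pulls the same pre-test probability down to about 0.25. A likelihood ratio of exactly 1 leaves the probability unchanged.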


An obvious next stage would be to see whether we could devise a set of questions analogous to the CAGE ones for alcohol abuse, but aimed at giving us similar points on the nomogram to fix our post-test probability. If this could be achieved without the need for 'invasive' challenge testing, with all its difficulties as laid out above, then that would be marvellous.



Judging adverse effects


Moulin and colleagues provided some intriguing data on adverse effects after six weeks oral morphine dosing 15 . Forty-six chronic non-cancer pain patients were studied in a randomised crossover, with three weeks titration and two six week treatment periods with a two week washout. The mean daily dose of the controlled release morphine was 84 mg/day, against benztropine 'active' control.


Their adverse effect reporting can be summarised in Table 2.


Table 2. Adverse effects on oral morphine 15


Adverse Effect          Morphine %    Placebo %
vomiting                39            2
dizziness               37            2
constipation            41            4
poor appetite/nausea    39            7
abdominal pain          22            4
dose limiting AEs       28            2


These incidences, derived in the context of a randomised trial, show that there is considerable potential adverse effect burden from chronic opioid use. Our focus is often on the short term, indeed in the context of an opioid challenge to determine sensitivity we are focused very much on the short term adverse effect burden. Patients on long term opioids may have a different view, particularly if they have to take other medications to manage the adverse effects.


There are therefore at least two major issues here. The first is that our concern is whether a patient should be using opioids long term, and our thoughts about adverse effects should be long term rather than short term. The second is that it is the patient's thoughts about the adverse effects, and not ours, about which we should be concerned. The Moulin data is the best we have in this context, but it is obvious from the Table that there is no mention of severity of the adverse effects. This is a common feature of adverse effect reporting, that we report the incidence but not the severity (although that information is often collected). Superimposed on this important shortcoming is that we have little idea of which adverse effects are most important to the patient. This information can be collected by using focus groups of current opioid users.


Two different challenges then, the challenge of adverse effect reporting as part of determining our pre-test probability in the context of the opioid challenge, and the challenge of real-life assessment of adverse effects on long term opioids. For both challenges there is a further hurdle, which is that patients who are obtaining pain relief from the opioid may well tolerate adverse effects which the patient who receives no benefit would not. While it is clearly useful and important to know the absolute incidence and severity of adverse effects, the real-life assessment has to take account of the compromise between pain relief and adverse effect. This of course is just as true in other therapeutic areas, but this is not a problem with an easy solution. It is the adverse effect burden relative to the degree of relief which will determine the patient's decision whether or not to continue with the drug.


Figure 2. Benefit Harm Ratios




Figure 2 attempts to put this balance in a two dimensional form, with the most desirable outcome, less pain and less harm, in the upper left, and the least desirable, more pain and more harm, in the lower right. This is obviously an area which cries out for more thought. How do we best measure the benefit and risk, and how do we best combine them to then allow us to contrast different therapies?


Alternative outcomes


There is a real and justifiable concern that by focusing on the purist outcome of pain relief or reduction in pain intensity we are excluding other dimensions. An example is the use of TENS machines. In acute postoperative pain and in childbirth the evidence of a true effect in reduction of pain intensity is really very thin 16-18 . The lack of pain reduction does not exclude the possibility that these devices make people feel better. By focussing on the pain dimension to the exclusion of the 'make me feel better' dimension we are (arguably) not providing the full picture, and in other contexts the 'make me feel better' dimension may be very important. Of course reducing the pain should be our primary concern, but we need to be inclusive.


The satisfaction and quality of life scales are very fashionable as a way of bridging this gap, but there is another approach which extends the efficacy dimension into real-life. In one sense this is a way of answering the tricky question of how much change on one of our measures of pain intensity or relief constitutes a worthwhile change for the patient. One method is to allow the patient to nominate a number of different areas in their life, or activities in their life, which have been adversely affected by their pain 19 . The Ruta Patient-Generated Index (PGI) specifies as its first stage that the patient should list the five most important areas or activities. A sixth item is all other areas or activities. The second stage is that each of these is then scored from 0 to 100 in multiples of ten. A score of 0 means that this is the worst they could imagine for themselves, and a score of 100 represents the ideal, where they would like to be in that area or activity. The third and final stage is that the patient has sixty points to improve their score in any of the areas or activities mentioned. They can award these points between the areas in any weighting that they choose, but cannot exceed the sixty point total. The score is then calculated by multiplying each of the six ratings by the proportion of points allocated to that area and summing.
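The three-stage scoring can be sketched as follows. The sketch assumes that the patient spends all sixty points, so that the proportion for each area is its points divided by sixty; that reading, the function name and the example ratings are assumptions made here for illustration and are not taken from the published instrument.

```python
def pgi_score(ratings, points, total_points=60):
    """Patient-Generated Index as described in the text: each of the six
    ratings (0-100 in multiples of ten) is weighted by the proportion of
    points the patient allocated to that area, and the products are summed.

    Assumes all points are spent (an assumption of this sketch).
    """
    if len(ratings) != 6 or len(points) != 6:
        raise ValueError("expected five named areas plus one 'all other' item")
    if any(r not in range(0, 101, 10) for r in ratings):
        raise ValueError("ratings must be 0-100 in multiples of ten")
    if any(p < 0 for p in points) or sum(points) != total_points:
        raise ValueError("points must be non-negative and sum to the total")
    return sum(r * p for r, p in zip(ratings, points)) / total_points


# Hypothetical patient: ratings for five chosen areas plus 'all other areas'
ratings = [20, 40, 50, 30, 60, 70]
# Points spent where the patient most wants improvement
points = [20, 15, 10, 10, 5, 0]

score = pgi_score(ratings, points)
print(score)
```

Because the weights come from the patient, a low rating in an area the patient cares about pulls the index down more than the same rating in an area to which they allocate no points.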


The validation of this scale was achieved by comparing the results in 359 patients with back pain against the SF-36 19 , and the results of the study seemed sensible. Patients referred to hospital had significantly lower (ie worse) PGI scores than those the general practitioners managed themselves, and the general practitioners' assessments of symptom severity tallied well with the patients' PGI scores. Used serially the PGI might well prove a useful audit tool in chronic pain generally, and specifically in this difficult area of opioid use in non-malignant pain, widening the efficacy focus into the areas and activities of the patients' lives which matter most to them.


Conclusion

In the difficult area of opioid use in non-malignant pain it is easy to confuse two separate themes, establishing the relative opioid sensitivity of different pain syndromes, and the issue of an individual patient. The clinical trial approaches needed for the academic question of the relative opioid sensitivity of different pain syndromes are different from those needed for the clinical question of the individual patient. The clinical trials perhaps should shift focus from the intensive study of small numbers of patients to simpler protocols studying much larger numbers of patients. For the individual patient question there is the further problem of distinguishing the idea of a one-off opioid challenge from the longer term question of the balance between risk and benefit. The chance of developing a worthwhile diagnostic one-off opioid challenge is slim. We are unlikely to produce a test which would improve on best clinical judgement. An area where we could make substantial improvement is the balance between risk and benefit for long term opioids compared with other therapies.


References


1. Moore RA, Gavaghan D, Tramèr MR, Collins SL, McQuay HJ. Size is everything - large amounts of information are needed to overcome random effects in estimating direction and magnitude of treatment effects. Pain 1998; 78:209-16.

2. Jadad AR, Carroll D, Glynn CJ, Moore RA, McQuay HJ. Morphine responsiveness of chronic pain: double-blind randomised crossover study with patient-controlled analgesia. Lancet 1992; 339(8806):1367-71.

3. McQuay HJ, Jadad AR, Carroll D et al. Opioid sensitivity of chronic pain: a patient-controlled analgesia method. Anaesthesia 1992; 47(9):757-67.

4. Keele KD. The pain chart. Lancet 1948; 2:6-8.

5. Scott J, Huskisson EC. Graphic representation of pain. Pain 1976; 2:175-84.

6. Wallenstein SL, Heidrich IG, Kaiko R, Houde RW. Clinical evaluation of mild analgesics: The measurement of clinical pain. British Journal of Clinical Pharmacology 1980; 10:319S-27S.

7. Littman GS, Walker BR, Schneider BE. Reassessment of verbal and visual analogue ratings in analgesic studies. Clin Pharmacol Ther 1985; 38:16-23.

8. Sriwatanakul K, Kelvie W, Lasagna L. The quantification of pain: an analysis of words used to describe pain and analgesia in clinical trials. Clin Pharmacol Ther 1982; 32:141-8.

9. Murphy DF, McDonald A, Power C, Unwin A, MacSullivan R. Measurement of pain: a comparison of the visual analogue with a nonvisual analogue scale. Clinical Journal of Pain 1988; 3:197-9.

10. Collins SL, Moore RA, McQuay HJ. The visual analogue pain intensity scale: what is moderate pain in millimetres? Pain 1997; 72:95-7.

11. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: a basic science for clinical medicine. Boston: Little, Brown, 1991.

12. Moore RA, McQuay HJ. Bandolier [Web Page]. Available at http://www.ebandolier.com.

13. Buchsbaum DG, Buchanan RG, Centor RM, Schnoll SH, Lawton MJ. Screening for alcohol abuse using CAGE scores and likelihood ratios. Annals of Internal Medicine 1991; 115(10):774-7.

14. Edwards G. Drug problems as everyday doctor's business. Oxford Textbook of Medicine. 3rd edition. Oxford: OUP, 1996: 4623-5.

15. Moulin DE, Iezzi A, Amireh R, Sharpe WK, Boyd D, Merskey H. Randomised trial of oral morphine for chronic non-cancer pain. Lancet 1996; 347(8995):143-7.

16. Carroll D, Moore RA, Tramèr MR, McQuay HJ. Transcutaneous electrical nerve stimulation does not relieve labour pain: updated systematic review. Contemporary Reviews in Obstetrics and Gynecology 1997; Sept:195-205.

17. Carroll D, Tramer M, McQuay H, Nye B, Moore A. Randomization is important in studies with pain outcomes: Systematic review of transcutaneous electrical nerve stimulation in acute postoperative pain. British Journal of Anaesthesia 1996; 77(6):798-803.

18. Reeve J, Menon D, Corabian P. Transcutaneous electrical nerve stimulation (TENS): a technology assessment. International Journal of Technology Assessment in Health Care 1996; 12:299-324.

19. Ruta DA, Garratt AM, Leng M, Russell IT, MacDonald LM. A new approach to the measurement of quality of life. The Patient-Generated Index. Med Care 1994; 32(11):1109-26.