PsyDactic

Psychometrics - The Dangers of Rating Scales and Screeners

Episode 60


Dr. O'Leary discusses a variety of concerns that all clinicians should have in mind when using psychometrics.  In the end, he hopes you come away with some level of agreement with the statement: “Our primary concern should not be with the quantity of data, but with the quality of the data.”  Statistics are conceptual machines that will produce results no matter what you feed them.  These results can be truly helpful and informative.  But statistics are also poop-in, poop-out machines, and adding more malarkey does not magically convert the results into something other than BS.

Please leave feedback at https://www.psydactic.com.

References and readings (when available) are posted at the end of each episode transcript, located at psydactic.buzzsprout.com. All opinions expressed in this podcast are exclusively those of the person speaking and should not be confused with the opinions of anyone else. We reserve the right to be wrong. Nothing in this podcast should be treated as individual medical advice.

Welcome to PsyDactic.  Today is Wednesday, June 26, and I am a 4th year psychiatry resident in the national capital region who, by the time you hear this episode, will be a fellow in child and adolescent psychiatry.  This is a podcast about psychiatry and neuroscience.  I do it as a project to help me learn and stay current, so that I can better understand and treat my patients.  My assumption is that psychiatrists and psychiatry residents will relate best to this content, but all are welcome.  I try to explain things in less than technical terms when I can.  Everything I say here is my opinion and my opinion alone.

Today’s episode was supposed to be a quick take, where I talked about the common pitfalls of psychometric scales and screening instruments.  It is not going to be as quick as I had hoped, but I think it is much better than I expected.  In clinical practice, rating scales are used extensively.  In fact, they have become the standard of practice.  The Joint Commission demands that we use them.  It publishes what are called Elements of Performance, and there are three of them with regard to standardized measurement tools.

The first Element of Performance requires organizations to use a standardized tool or instrument to monitor an individual’s progress.  This is reasonable on the surface.  It attempts to add some objectivity to what we do, instead of clinicians just willy-nilly declaring that a patient is worse, better, or unchanged.  The criteria are predetermined and standardized.

Next, JCo requires organizations to QUOTE “analyze the data generated by this activity and use the results to inform the individual’s goals and objectives as needed.”  So, not only do we have to gather data, we also have to actually use it to inform care.  We are not only measuring progress, we are making decisions based on the results of the scales.

The third requirement is that organizations use their data to QUOTE, “evaluate outcomes of care, treatment, or services provided to the population(s) they serve.”  Not only do clinicians have to gather data from standardized instruments and use it to inform an individual's care, this same data should be used in aggregate to measure outcomes at a higher level.  Adding this requirement means that organizations have a strong incentive to require clinicians to use very particular measures, whether the clinician feels like these are appropriate for their patients or not.  If clinicians each get to decide what to use, then there may be such a variety of different outcome measures that it is hard to aggregate them at the organization level.

On the surface, all of this sounds very reasonable, and it is reasonable.  So what is the big deal, Dr. O’Leary?  Why did you name this episode “The Dangers of Rating Scales and Screening Instruments”?  I am going to discuss a variety of concerns that all clinicians should have in mind when using these tools.  In the end, I hope you come away from this podcast with some level of agreement with the statement: “Our primary concern should not be with the quantity of data, but with the quality of the data we are measuring.”  Statistics are conceptual machines that will produce results no matter what you feed them.  These results can be truly helpful and informative.  But statistics are also poop-in, poop-out machines, and adding more malarkey does not magically convert the results into something other than BS.

Today I am going to discuss depression rating scales, but the basic principles I am applying here are generalizable to just about any kind of rating system.

There are many depression rating scales in use today.  You might recognize the HAM-D (also called the HDRS).  It is the Hamilton Depression Rating Scale, created by Max Hamilton and first published in 1960, and it is commonly used today, especially in research protocols.  In the 1970s, Aaron Beck published his BDI, or Beck Depression Inventory, which has gone through multiple iterations.  The most current is the BDI-II, revised in 1996.  It is used today, but you have to pay to use it, so that limits its spread.  Another tool, maybe the most common in clinical practice, is the PHQ-9, or Patient Health Questionnaire 9.  It was originally developed in the 1990s by Pfizer as part of a large battery of tests called the PRIME-MD, but it is generally used separately now.  There are many other scales that I have not mentioned, and you can find a good summary of them at APA.org.  Today I only have time to compare the HAM-D and the PHQ-9.  I chose these two because they are very different from each other.

Let me start with the PHQ-9 because it is a straightforward assessment.  By that I mean, all it does is take the 9 diagnostic features of depression that are listed in the DSM and use them sequentially, asking patients to rate how often each of these symptoms has been present over the past 2 weeks.  If at least 5 have been present for more than half of the days, then depression can be diagnosed if there is also some reported dysfunction.  This encourages people to use the PHQ as a diagnostic tool.  There is also the PHQ-2, which is a quick screening tool for depression, so it seems to make sense that after using a screening tool, you would then use a diagnostic tool, but I am going to argue here that the PHQ-9 is also merely a screening tool.
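To make that concrete, here is a minimal sketch in Python of the kind of rule I just described, assuming each of the 9 items is rated 0 to 3 for frequency over the past two weeks, with 2 meaning “more than half the days.”  The real PHQ-9 scoring algorithm has additional wrinkles (for example, how the cardinal mood items and the suicidality item are handled), so treat this purely as an illustration.

def phq9_provisional_flag(item_scores, reports_dysfunction):
    # item_scores: nine integers from 0 to 3; reports_dysfunction: True or False.
    # Count symptoms endorsed at "more than half the days" (2) or worse.
    frequent = sum(1 for score in item_scores if score >= 2)
    # The rule of thumb from above: at least 5 frequent symptoms plus some dysfunction.
    return frequent >= 5 and reports_dysfunction

# Example: five items at "more than half the days," with reported impairment.
print(phq9_provisional_flag([2, 2, 2, 2, 2, 0, 0, 1, 0], True))  # True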

The PHQ-9 has a sensitivity of 88% and a specificity of 88% for major depression according to one very large study.  If we estimate a population prevalence of about 8% for depression, then using only PHQ-9 scores to diagnose MDD, out of 10,000 people taking the test, we expect to get about 1,100 false positives while missing only about 100 cases.  You can think about it this way: by merely asking patients superficially whether or not they meet at least 5 of the DSM criteria for depression, we are going to diagnose a total of about 18% of the population with depression, even though only 8% actually have depression.  The PHQ-9’s ability to rule out depression is much better, because only about 1.2% of the negatives are false negatives who actually have depression.  Another way to think about this is that we will be treating about 1,800 people for depression, but only about 700 of those 1,800 actually have depression.  Roughly 60% of the people we are treating do not have major depression.
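For anyone who wants to check my arithmetic, here is a minimal sketch in Python of the two-by-two table behind those numbers, assuming 88% sensitivity, 88% specificity, and an 8% prevalence among 10,000 people tested.

population = 10_000
prevalence = 0.08
sensitivity = 0.88
specificity = 0.88

depressed = population * prevalence                   # 800 truly have MDD
not_depressed = population - depressed                # 9,200 do not
true_positives = depressed * sensitivity              # ~704 correctly flagged
false_negatives = depressed - true_positives          # ~96 missed cases
false_positives = not_depressed * (1 - specificity)   # ~1,104 flagged in error
true_negatives = not_depressed - false_positives      # ~8,096 correctly cleared

screen_positive = true_positives + false_positives    # ~1,808 screen positive (~18%)
ppv = true_positives / screen_positive                # ~0.39, so ~60% of positives are false
fn_among_negatives = false_negatives / (false_negatives + true_negatives)  # ~0.012

print(round(screen_positive), round(ppv, 2), round(fn_among_negatives, 3))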

This flood of false positives is not actually all that bad if you plan to do a more thorough interview that includes validating what was on the PHQ-9.  That will increase the specificity substantially and reduce the number of false positives.  However, many studies use PHQ-9 cut-offs as diagnostic criteria for inclusion in a study, so it could be that studies claiming to show effect sizes of treatments for Major Depressive Disorder (or MDD) are actually showing effect sizes for something other than MDD.  They lack construct validity.  Something has construct validity if it is actually measuring aspects of one thing and not something else.  At least that is my highly oversimplified definition.

If you want a more in-depth discussion of validity, then I have an episode called, “In a Word - Validity” that was published on July 11, 2023.  Now moving on…

As a measure of the severity of depression, the PHQ-9 can measure how often the DSM criterion symptoms are present over time (and a higher total score supposedly means worse depression).  In my opinion (and, by the way, this entire podcast is my opinion and only my opinion), the only thing the PHQ-9 can really tell us with any confidence is whether someone still meets criteria for MDD or not.  Remember, it is actually pretty good at ruling out MDD.  It fails at that only about 1.2% of the time in the general population.  The false negative rate will increase in the patient population because the prevalence of MDD is much higher, but it is not going to misattribute remission most of the time.

I don’t think that the severity measures in the PHQ-9 are very useful, because self-reports of the frequency of a subset of all possible vaguely defined depressive symptoms cannot possibly be very reliable (at least not with any precision).  Also, depression symptoms are not limited to only the diagnostic criteria of depression.  This means that the PHQ-9 lacks content validity.  It includes some but not all of the features that can be measured in depressed states.  It is missing so, so much.

Now let me jump back to 1960 and talk about the Hamilton Depression Rating Scale.  The HAM-D is explicitly a clinician-administered scale, though someone could make a self-report version.  The PHQ-9, in my experience, is most often given as a patient self-report scale.  The HAM-D generally requires a physician to rate each item.  There are structured interview guides available for clinicians.  See the show transcript at psydactic.buzzsprout.com for my references.

The HAM-D has 17 items, each with its own set of response options, so it does not merely regurgitate the DSM criteria for MDD the way the PHQ-9 does.  It was published just 8 years after the first DSM, which was simply called the DSM but is now referred to as DSM-I.  Eight years after the HAM-D was published, the DSM-II was published.  I feel like I have to give a little more history now, because those facts by themselves are not very meaningful.  The first and second DSM are very different from the DSM-5 and also very different from each other.

The original DSM did not describe diagnostic entities as disorders, but instead as “reactions.”  For example, manic-depressive reactions (also called manic-depressive psychosis) were divided into depressive or manic types.  The depressive type was described like this:  “Here will be classified those cases with outstanding depression of mood and with mental and motor retardation and inhibition; in some cases there is much uneasiness and apprehension. Perplexity, stupor or agitation may be prominent symptoms, and may be added to the diagnosis as manifestations.”

16 years later, the DSM-II replaced the term “reaction” with “illness.”  It describes the depressive subtype of manic-depressive illness like this:  “This disorder consists exclusively of depressive episodes. These episodes are characterized by severely depressed mood and by mental and motor retardation progressing occasionally to stupor. Uneasiness, apprehension, perplexity and agitation may also be present. When illusions, hallucinations, and delusions (usually of guilt or of hypochondriacal or paranoid ideas) occur, they are attributable to the dominant mood disorder. Because it is a primary mood disorder, this psychosis differs from the Psychotic depressive reaction, which is more easily attributable to precipitating stress.”

DSM-II also described a “Depressive neurosis”:  “300.4 Depressive neurosis.  This disorder is manifested by an excessive reaction of depression due to an internal conflict or to an identifiable event such as the loss of a love object or cherished possession. It is to be distinguished from Involutional melancholia (q.v.) and Manic-depressive illness (q.v.). Reactive depressions or Depressive reactions are to be classified here.”

There was also an entity called Involutional melancholia.  I am going to save you a long quote and just report that this was essentially depression in people in the last half of a normal lifespan which was also not attributable to some recent event.  It was not a depressive neurosis.

Why am I bringing all this up?  Because I find it fascinating that a tool to measure depression that is still widely used today, or at least was widely used until recently, was conceived during a time when psychoanalysts were still the prominent theorists defining mental disorders.  The HAM-D originally had 17 items that could be scored.  Various versions since then have expanded to up to 29 items, but the 17-item test has been the most commonly used version.  There is also the SIGH-D, which is a Structured Interview Guide for the Hamilton Depression Rating Scale.

Each item of the HAM-D attempts to measure a possible dimension of depression and can be scored up to 4 points, though some items have only 2 possible points.  The HAM-D covers, at some point, all of the SIGECAPS (and by that I mean the current DSM criteria) and often groups them into other, broader categories.  For example, low energy is a possible symptom of General Somatic Symptoms.  Anhedonia is measured by a patient's relative engagement in work and other activities.  It also contains measures of hypochondriasis and insight, as well as anxiety, which it splits between psychic anxiety, like worry, apprehension, or fear, and somatic anxiety, in which a patient describes bodily sensations such as palpitations or hyperventilation, but maybe also an increase in sighing or flatulence.

One might be able to see how measures like this could be problematic.  A patient could receive a lower HAM-D score at follow-up for no other reason than that they had eaten beans for dinner prior to the first encounter.  Other versions of the test attempt to measure things like diurnal variation in symptoms, depersonalization and derealization, paranoia, and obsessive and compulsive symptoms.

Even with just the 17-item test, it appears that the HAM-D is actually measuring a lot of different things, some of which are merely associated with major depression but have very different etiologies.  It is also possible that the HAM-D overall has more content validity than the PHQ-9, in that it may be measuring more of the things that are causally associated with depression than the PHQ-9 is.
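To make the multidimensionality problem concrete, here is a hypothetical sketch in Python.  The item names echo the ones I just mentioned, but the scores are invented for the example; the point is only that very different clinical pictures can sum to the same total.

# Two invented patients with the same HAM-D-style total but very different profiles.
patient_a = {"depressed mood": 3, "insomnia": 2, "somatic anxiety": 4,
             "hypochondriasis": 3, "work and activities": 1}
patient_b = {"depressed mood": 4, "work and activities": 4, "psychic anxiety": 2,
             "general somatic symptoms": 3}

print(sum(patient_a.values()), sum(patient_b.values()))  # 13 and 13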

However, the criterion validity of the HAM-D subscores is questionable.  Criterion validity refers to how well an item actually describes an aspect of the construct, which in this case is depression.  It can also refer to how well it predicts the course of depression; that subset of criterion validity is called predictive validity.  Does the particular criterion have both diagnostic (concurrent) and prognostic (predictive) value?

Scales like the HAM-D that have an excessive number of items are, if used for diagnostic or predictive purposes, likely to suffer from a very low positive predictive value.  What that means is that they can become overly sensitive: they cast too broad a net.  They predict so many things that any one of the predictions is meaningless.

Let me read a summary from a paper called “The Hamilton Depression Rating Scale: Has the Gold Standard Become a Lead Weight?” which was published in 2004 after 44 years of the scale's use.

QUOTE
“Evaluation of item response shows that many of the individual items are poorly designed and sum to generate a total score whose meaning is multidimensional and unclear. The problem of multidimensionality was highlighted in the evaluation of factorial validity, which showed a failure to replicate a single unifying structure across studies. Although the unstable factor structure of the Hamilton depression scale may be partly attributable to the diagnostic diversity of population samples, well-designed scales assessing clearly defined constructs produce factor structures that are invariant across different populations (88). Finally, the Hamilton depression scale is measuring a conception of depression that is now several decades old and that is, at best, only partly related to the operationalization of depression in DSM-IV.”
UNQUOTE

Their final conclusion is that the HAM-D is so flawed that it should be systematically abandoned instead of merely revised.  They give a shout-out to the Inventory of Depressive Symptomatology (95) and the Montgomery-Åsberg Depression Rating Scale (96), because they claim these were explicitly designed to address the limitations of the Hamilton depression scale.  The Montgomery-Åsberg, or MADRS, contains only 10 items that are broadly consistent with DSM-IV criteria.  It reminds me of the PHQ-9.  The Inventory of Depressive Symptomatology contains 30 items that appear to try to get granular about particular modern DSM criteria for depression.  There is also the QIDS, which is the Quick Inventory of Depressive Symptomatology.

For those of you who have listened to my episode on the STAR*D trial (published December 10th 2023), you may remember that the QIDS was used in place of the HAM-D, even though the HAM-D was the original measure.  The study was criticized for this primarily because the use of the QIDS resulted in the physicians themselves collecting the data, which unblinded the assessors, but I’m not going down that rabbit hole today.  See my December 10th episode for that.

Let me return instead to a criticism of the HAM-D itself: that it is a multifactorial assessment and therefore the final scores are invalid.  The authors assert that QUOTE “well-designed scales assessing clearly defined constructs produce factor structures that are invariant across different populations.”  But is depression a, quote, clearly defined construct?  In theory, maybe, but in practice, I highly doubt it.

This brings me to another elephant in the room: What is depression actually?  If we reduce our scales to more or less listing the DSM criteria, what is it that we are missing?  If scales are all basically asking the same questions in a slightly different way, then they are likely to have convergent validity independent of whether they have content validity or not.  They are just different species of duck, all quacking and walking like ducks.

Relying on scales can give us a false sense of knowledge, make us hyperfocus on only some aspects of a patient, and overall cripple our ability to identify and treat patients appropriately, or hamper a researcher's ability to ask meaningful questions.

Let me give an example of what has been called anhedonic depression.  This is an entity that meets the criteria for depression AND has a strong component of the inability to experience pleasure and often other positive emotions.  Negative emotions might also be blunted.  One consequence of anhedonia is amotivation and a reduced ability to benefit from reward-based learning.  Dr. Diego Pizzagalli, a Professor of Psychiatry at Harvard Medical School, has been studying how anhedonia affects patients and how our usual first-line treatments, such as SSRIs and cognitive behavioral therapy, have reduced efficacy for this phenotype of depression.  His research strongly suggests that this type of depression can be treated better by targeting the mesolimbic and mesocortical dopaminergic pathways instead of the serotonergic pathways.

I don’t have time to explain this in detail today, but the point of this example is to show that a composite score on a rating scale for a complex and multifaceted construct, which may itself be several different things, may not contain much actionable information.  It is too noisy.  Studies that include patients primarily on the basis of a score on a composite scale may include more patients without the condition they are studying than patients who actually have the condition.  It would be as if they just randomly picked a large portion of a healthy control group to be classified instead as having the condition.

In a busy clinic, relying on scales for diagnostic purposes can result in treating far more people for a condition than actually have that condition.  If the scale is measuring many different things, then these things would be better conceptualized separately than as a part of a larger measure.  The diagnostic process needs to be separate from the scale itself.

I understand the Joint Commission’s insistence that standardized measures be used, but if these convenience measures are mistaken for real medicine, then true improvements in care are going to remain elusive.  I am not recommending that we stop using scales.  I am advising that we do more to understand what each scale can and cannot tell us.

I am Dr. O’Leary and this has been an episode of PsyDactic.


https://www.apa.org/depression-guideline/assessment

https://www.jointcommission.org/what-we-offer/accreditation/health-care-settings/behavioral-health-care/outcome-measures-standard/ 

https://en.wikipedia.org/wiki/Beck_Depression_Inventory 

https://brain.harvard.edu/?people=diego-pizzagalli 


Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001 Sep;16(9):606-13. doi: 10.1046/j.1525-1497.2001.016009606.x. PMID: 11556941; PMCID: PMC1495268.

Hamilton M. Development of a rating scale for primary depressive illness. Br J Soc Clin Psychol 1967; 6(4):278–96.

Williams JB. A structured interview guide for the Hamilton Depression Rating Scale. Arch Gen Psychiatry 1988; 45(8):742–7.

Bagby RM, Ryder AG, Schuller DR, Marshall MB. The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry. 2004 Dec;161(12):2163-77. doi: 10.1176/appi.ajp.161.12.2163. PMID: 15569884.
