In a Word - Validity

July 11, 2023 T. Ryan O'Leary Episode 35

Today I discuss the term “validity.”  Let’s say we wanted to develop a test that identifies pathological character traits or quantifies depression symptom burden on a patient.  A good test is going to do more than simply list the diagnostic criteria for various diagnoses and then ask the patient if they think that sounds like them.  A test needs to have a few things.  First it needs to have a defined purpose.  Is it to be used for diagnosis in a clinic or for research?  Is it going to measure symptoms in already diagnosed patients and track their response to therapy?  Is it meant to predict if a person would be a good candidate for something like being an astronaut or a member of the military?  Once the purpose is determined, then you need to define something called a construct, and then you have to determine the validity of that construct.

Please leave feedback at

References and readings (when available) are posted at the end of each episode transcript, located at All opinions expressed in this podcast are exclusively those of the person speaking and should not be confused with the opinions of anyone else. We reserve the right to be wrong. Nothing in this podcast should be treated as individual medical advice.

Welcome to PsyDactic - Residency.  I am Dr. O’Leary, a 4th year psychiatry resident in the national capital region.  This is a podcast about psychiatry, psychology and neuroscience from my perspective.  I try to teach anyone interested in these topics, but I plan my content with the expectation that other behavioral health providers, residents and medical students will be the most drawn to this content.  There is nothing special about me, so feel free to question or disbelieve anything I tell you.  I am often wrong.  Also, I am not giving medical advice here; see your doctor for that.  I am also not speaking for anyone else or for any institution of which I was or am a member.

Today I am going to use an intermittent series I have been calling “In a Word” to review one of the words that will likely find its way onto boards and inservice exams, but that is also useful to know about anyway.  That word is “validity.”

Let’s say we wanted to develop a test that identifies pathological character traits or quantifies depression symptom burden on a patient.  A good test is going to do more than simply list the diagnostic criteria for various diagnoses and then ask the patient if they think that sounds like them, though some screening instruments that we use like the PHQ9 pretty much do just that in a way that has been tested for validity.

A test needs to have a few things.  First it needs to have a defined purpose.  Is it to be used for diagnosis in a clinic or for research?  Is it going to measure symptoms in already diagnosed patients and track their response to therapy?  Is it meant to predict if a person would be a good candidate for something like being an astronaut or a member of the military?  Once the purpose is determined, then you need to define something called a construct.

The construct is the thing that you are attempting to measure.  Ideally, the construct will be a real thing.  Something that exists beyond the confines of a hypothetical.  It can be hard to measure a construct or even to determine if it is a real thing.  I can measure the dimensions of a house easily because it is a concrete and discrete object.  But how do you measure a home?   In psychology, most constructs are rather complex and are not discrete entities.  If they were, then we wouldn’t have to call them constructs.  We would just call them what they obviously were.  The DSM defines many syndromes (or groups of symptoms) that when they occur together, and cause sufficient distress, and can’t be explained by something else, become a named disorder.  That thing that has a name, like Binge Eating Disorder, may have very different causes in different people, so the construct that you are diagnosing is not necessarily a single thing itself but often a bunch of associated things, that when taken together construct a theoretical entity.

A construct is valid if it is sufficiently truthy, and by that, I mean that it reflects some legitimate reality.  It doesn’t have to be corporeal, but it does need to truly describe real phenomena.  Prior constructs in psychology, like psychasthenia, have been abandoned, because although these old fashioned diagnoses were composed of real things that patients experienced, they were grouped together and conceived of in a way that did not reflect reality.  People don’t become obsessive or develop compulsions because they have exhausted or overwhelmed their psychic energy.  While the symptoms that describe psychasthenia may be real, the construct itself is not valid.

So when developing a test for something, the test needs, at the most basic level, as much construct validity as it can have.  That means both that the construct being measured or tested for is a real thing, and that the things you are measuring actually describe that construct and not something else.  While we can never be entirely sure about our constructs, we can design our tests in ways that help us to be more realistic.

Most behavioral health providers will construct their own impromptu tests for patients during interviews by asking questions that we think are getting patients to reveal their core symptom clusters, and we tailor our questioning toward the rabbit holes that we see appearing in front of us.  When we construct an impromptu test in this way, we are doing micro scale rational test construction.  Rational test construction is when you choose items for a test because they just seem to make obvious sense. If what I suspect is true, then I would also expect this, so I am going to ask about that too.  This is based on something called face validity, where what you are proposing seems to make sense without much effort or additional explanation.  It looks like a duck and it quacks like a duck.  Rational test construction relies a lot on face validity.  If you constructed a test that puts the diagnostic criteria of any disorder into easy to understand questions that a patient can answer, you have just done rational test construction.

However, if you want a method of testing that more fully and accurately measures what you want, then you would need to add an empirical component to your test construction.  To be empirical means to include a broad range of documented observations that can be tested.  This requires surveying a large number of people, both those suspected of having a disorder or condition and those who appear unaffected.  Test developers will cast a very broad net by asking people a large number of items, and then zero in on the items that seem to be more reliably associated with whatever it is they are trying to measure.  This is called empirical test construction, because you are starting with much more than you will need in the end and narrowing down your test items by throwing out the ones that are inconsistent, redundant, or not very specific or sensitive.  During the process you might note that two or more items are always correlated with each other.  By keeping all of these highly correlated items in the test you might be, in effect, measuring the same thing twice, so you may be able to narrow down to just one of them.
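The idea of throwing out redundant items can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the item names, the ratings, and the 0.9 cutoff are all made up, and real test development uses far larger samples and more careful statistics than a single pairwise correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def drop_redundant_items(responses, threshold=0.9):
    """Keep an item only if it is not highly correlated with an item already kept.

    `threshold` is an arbitrary, hypothetical cutoff chosen for illustration.
    """
    kept = []
    for name, scores in responses.items():
        if all(abs(pearson(scores, responses[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical 0-3 ratings from six respondents (invented data).
responses = {
    "low_mood": [0, 1, 2, 3, 3, 1],
    "feeling_down": [0, 1, 2, 3, 3, 1],  # tracks low_mood exactly, so it is redundant
    "poor_sleep": [3, 0, 1, 0, 2, 3],
}

kept_items = drop_redundant_items(responses)
print(kept_items)
```

In this toy data set, "feeling_down" is dropped because it adds nothing beyond "low_mood", while "poor_sleep" is kept because it varies independently enough to carry its own information.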

Good test construction requires both a rational and empirical approach.  The rational approach is going to guide what kind of items are included in the empirical testing, but it is important to stretch the limits of your rationality in the beginning so that you can find as many hidden but real or rare associations as possible.  It is also important to include many things that appear to be similar so that the test developer can drill into which one of these things is the best to include as a single item.  This ensures the test will not only have construct validity, but also have content validity, which means that it is measuring all of the important or meaningful features of the construct.  Content validity means you are capturing all the features of the construct that you can. By combining rational and empirical methods you get better content validity and ideally this process should continue in order to inform new iterations of any test or to see how a test performs in special populations or whether it remains a useful tool over time.

So far I have mentioned face validity (that you don’t have to strain to understand what a test is measuring), construct validity (that we are measuring a real thing and that thing is what we think we are measuring and not something else), and content validity (that we are measuring as much about that thing as is important or feasible).

Ultimately, it would be great to be able to compare the result of any new test you make to some kind of gold standard.  If you are the first to create a test, then that test, if validated, will likely become a gold standard.  When looking for some standard to compare your test to, you are asking “by what criterion is my test measured?”  This is called criterion validity.  A test that is quick and easy to implement might be compared to a lengthy clinical evaluation, and the results of the quick test could be compared with the results that independent clinicians agree on.  The problem in psychiatry and psychology is that there are very few golden gold standards.  The gold standard by itself may be a test that became the gold standard by virtue of being the first one made and generally accepted.  Criterion validity is important, but not always available in a strict way.

There are at least two kinds of criterion validity that measure whether different tests agree on their outcomes.  One is concurrent validity and the other is convergent validity.  They are basically the same thing with one major difference.  Concurrent validity compares a new test to an already established standard.  If your test produces results that are similar to that already established test, then it has concurrent validity: it concurs with an already existing test.  Convergent validity measures how well the results of a new test converge with another new test that is different in some respect.  If two different kinds of tests developed concurrently converge on similar results, this helps to support the construct validity of what you are testing.  This construct is more likely to be a real thing, because if we measure it from different angles, we still seem to be measuring the same thing.  The tests were both new and yet converge on the same construct.

Imagine a test that a patient completes on a computer versus a test that is given by a clinician during an interview.  If the results are very similar among similar test subjects taking the two different versions, and the clinician-administered test already existed, then the new computer test has concurrent validity.  If both tests were created as new tests and they agree, then the tests have convergent validity.  Convergent or concurrent validity might help to justify giving patients the option to merely answer questions on a computer prior to an appointment rather than waiting until they are in the room with the clinician to be tested.  The results are in agreement, so you can choose one or the other based on the needs of the clinic or the patient.  A patient with impaired vision may not be able to use the self-report computer version of the test.
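One simple way to put a number on that kind of agreement is to correlate the two sets of scores.  The sketch below assumes eight hypothetical patients who each took both versions; the scores are invented, and a high correlation alone would not catch a case where one version consistently scores a few points higher than the other.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical total scores for the same eight patients (invented data).
clinician_scores = [4, 9, 12, 15, 7, 20, 3, 11]  # established, clinician-administered
computer_scores = [5, 8, 13, 14, 8, 19, 4, 10]   # new, self-report on a computer

r = pearson(clinician_scores, computer_scores)
print(f"correlation between the two versions: {r:.2f}")
```

If the clinician-administered version is the established one, a correlation close to 1 would be evidence of concurrent validity for the computer version.  If both versions were new, the same number would be read as evidence of convergent validity.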

Another type of validity based on agreement between tests is predictive validity, when the results of one test, usually a screening instrument, predict the results of a future test or predict an outcome.  For example, infants or toddlers may display some traits of autism or ADHD, but have not yet reached a developmental stage where they can be reliably diagnosed.  If there are screening instruments that can predict that they are likely to be given a diagnosis in the future when a more comprehensive evaluation is possible, then that test has predictive validity.  The difference between concurrent and predictive validity is that concurrent validity describes the present.  The same patient at the same time would have the same results on two different tests.  Predictive validity is concerned with future results.  The results of today's test will likely be durable over time or will predict a future new state.  A test with high predictive validity can help guide future decisions to seek more diagnostic information, or guide admissions offices on whether a student with a high or low ACT or SAT score will do well in the program that they applied to.  One can argue about the predictive validity of college admissions tests, but that is not what I am doing today.
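For a yes-or-no screening instrument, predictive validity is often summarized with sensitivity, specificity and positive predictive value, each computed against the later outcome.  The sketch below uses ten hypothetical children with invented screening results and invented later diagnoses, only to show the arithmetic.

```python
def screening_metrics(screen_positive, later_diagnosed):
    """Compare an early screening result with a later diagnosis.

    Both arguments are equal-length lists of booleans, one entry per person.
    """
    pairs = list(zip(screen_positive, later_diagnosed))
    tp = sum(1 for s, d in pairs if s and d)          # screened positive, later diagnosed
    fp = sum(1 for s, d in pairs if s and not d)      # screened positive, never diagnosed
    fn = sum(1 for s, d in pairs if not s and d)      # screened negative, later diagnosed
    tn = sum(1 for s, d in pairs if not s and not d)  # screened negative, never diagnosed
    return {
        "sensitivity": tp / (tp + fn),  # share of later diagnoses the screen caught
        "specificity": tn / (tn + fp),  # share of unaffected children the screen cleared
        "ppv": tp / (tp + fp),          # share of positive screens that were later diagnosed
    }


# Hypothetical results for ten children (invented data).
screen = [True, True, True, False, False, False, True, False, False, False]
later = [True, True, False, False, False, False, True, True, False, False]

metrics = screening_metrics(screen, later)
print(metrics)
```

The higher these numbers are in a real follow-up sample, the better the case that today's screening result says something about tomorrow's diagnosis.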

So far, the kinds of validity that I have talked about describe different methods that agree with each other, meaning they converge or concur on the same answer or they predict a future result.  There is a kind of validity that relies not on agreement or sameness, but on being different.  It is called divergent validity.  Part of having construct validity is making sure that your test measures one thing and only one thing.  You are not conflating multiple things.  Ideally your test won’t give a different result because something other than the construct you are interested in changes.  Its result should diverge from the results of a test measuring something different.  Think about how concentration is impaired in many different disorders.  Anxiety, ADHD, depression, traumatic brain injuries, strokes, insomnias and many other disorders may present with complaints about impaired concentration.  If you are measuring concentration, it might actually be a correlate of anxiety or depression or obstructive sleep apnea, etc., and so the scores on your concentration test might then correlate with scores on, for example, a depression inventory.  Your construct then might not be a thing in and of itself.  It is not divergent from depression.  You would want a test that can distinguish between different things that might share features.
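Divergent validity can be checked with the same kind of correlation used for agreement, only here a low number is the good result.  The sketch below assumes six hypothetical patients who each completed a new concentration test, an established concentration measure, and a depression inventory; every score is invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same six patients (invented data).
new_concentration = [10, 14, 8, 20, 16, 12]
established_concentration = [11, 13, 9, 19, 17, 12]
depression_inventory = [12, 5, 9, 7, 14, 6]

# Agreement with a measure of the same construct should be high.
r_same_construct = pearson(new_concentration, established_concentration)
# Agreement with a measure of a different construct should be low.
r_different_construct = pearson(new_concentration, depression_inventory)

print(f"concentration vs concentration: {r_same_construct:.2f}")
print(f"concentration vs depression:    {r_different_construct:.2f}")
```

In this made-up sample the new test tracks the other concentration measure closely and barely tracks the depression inventory, which is the pattern you would hope for.  If the second correlation were also high, the concentration test might really be picking up depression.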

It would be nice to stop here, but concepts of validity are not just related to the constructs themselves.  Many tests have items incorporated into them that attempt to measure whether the results of the test appear to be falsified, uninterpretable, or biased in some way.  It is not infrequent when I am reading the results of psychological testing for my patients that the psychologist declares the results of part or all of the testing to be invalid.  Patients might have given indications that they likely exaggerated or under-reported symptoms.  Maybe they did not appear to have given adequate effort, or may even have intentionally misled the examiner.  This type of validity is not of the test construct in general, but instead of the specific instance of the test.  The aberrant results don’t invalidate the construct that the test is measuring; they invalidate the results that the test achieved.  The construct of the test is valid, but the results of the test are not.  Patients are not necessarily trying to game the test, but may be reporting extremely severe symptoms in many domains because they lack insight into what the questions mean or they want to communicate that they are in severe distress overall.  This result can be useful to a clinician to understand the patient, even if, in that instance, the test did not measure its construct.  I have had one patient whose attentional capacity was so poor that his neuropsych testing was determined invalid, except to the extent that they could rule out borderline intellectual functioning because his scores, invalid as they were to establish his true levels, were at least high enough to rule out gross impairment in intellect.

I could also get into test reliability, but I am going to defer a deep discussion of that for now, because I think that reliability is more intuitive and easily measured.  Is the test going to give consistent results when given under similar circumstances to similar patients?  If yes, then it is reliable.  If not, then it is not reliable.  If you are interested in reliability, then there is much more to be said about that, so look into it.  A good test needs to be both reliable and valid in as many ways as it can be.

I made this episode because I struggle to understand what all these terms mean, and I want to be able to better choose and understand how to refer patients for psychological or neuropsychological testing.  I need to understand what a test is actually measuring, to have a basic understanding of what the construct is, in order to determine whether or not I want to refer my patient for that test.  Simply getting testing because things are weird and I don’t know what is going on, will likely give unreliable or spurious results.  Is the test a valid test for what I am interested in?  What is my clinical question?

In the next episode, I will discuss the differences between psychological testing and neuropsychological testing.  I hope it will help me better understand where to direct a patient when I need much more information about what is going on.  What kinds of questions can psychological testing answer, and how is that different from the questions that neuropsychological testing is trying to answer?

I am Dr. O’Leary and this has been an episode of PsyDactic - Residency Edition.