Levels of Evidence

If you read (or produce) scientific literature, you are hopefully aware of, and use, some system to judge how good a particular type of study is. Most “Levels of Evidence” charts organize different types of evidence into some sort of five-tier system, with the highest level of evidence, Level 1, being the best. Level 1 evidence might include systematic reviews of homogeneous randomized controlled trials, for example, or simply a single high-quality RCT. Here’s an example:

This is useful information, in the same way that a poker player might need a chart to remember that four of a kind beats a full house. But it doesn’t tell the whole story.

Charts like this one have facilitated misuse of data and unfounded claims. A sort of pseudo-academic intellectualism has sprung up where any evidence less than Level 1 is invalid and all Level 1 evidence is the gospel truth. But nothing could be further from the truth. Let’s expand this chart with some examples.

Let’s think in more detail about the actual quality of each level of evidence, not just its relative rank.

Foundational evidence. This type of evidence doesn’t even make the cut. It’s below Level 5. But it includes a lot of findings that make sense to people and seem scientific and objective. For example, a researcher notices that there seem to be more aluminum deposits in certain parts of the brain in children with autism. To many antivaxers, this is high-quality evidence from the basic sciences. The problem, of course, is the conclusions drawn from the evidence. One piece of data, even if correct, one single observation, even if accurate, does not allow for such grand conclusions.

Nor does animal research mean much. Animals are not humans, and we have learned this time and time again. Major findings in animal studies rarely translate directly, or even in any significant way, to human studies.

Foundational evidence is important, but it cannot be used to support larger premises (e.g., the aluminum-autism link). Grander ideas based on foundational evidence are almost always wrong.

We live in a vast Universe with a vast number of stars and planets; surely some of those have produced life, and life more advanced than us, and surely they have figured out how to travel near the speed of light or beyond it … Yes! We have been visited by aliens in UFOs!!

A statement like this is meaningless and is evidence of nothing except a good imagination.

Level 5. This level of evidence is anecdote: a personal observation or experience; it is a mere opinion. In the published literature, this is a case report. In Ufology, it is a slightly intoxicated bar patron stumbling home who notices a bright light in the sky moving in a weird way.

Virtually every case report ever published that reports something novel is false. Anecdote is full of error. Many case reports come about because the authors misinterpreted a patient’s findings or misdiagnosed the patient. Or, they might have mistakenly attributed success or failure to something. All anecdotal evidence is fundamentally flawed and most can be explained by regression to the mean. Next time you read a case report, remember the drunk guy who spotted a UFO.
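Regression to the mean is easy to see in a small simulation. The sketch below is not drawn from any study. It assumes a hypothetical population in which each subject has a stable true value plus independent measurement noise, and the function name and its parameters (n, cutoff, seed) are invented for illustration. Subjects who look extreme on a first measurement are simply measured again, with no intervention at all.

```python
import random
import statistics


def simulate_regression_to_mean(n=10_000, cutoff=1.5, seed=0):
    """Return (mean first measurement, mean second measurement) for
    subjects selected because their first measurement was extreme."""
    rng = random.Random(seed)
    # Hypothetical model: stable true value + independent noise each time.
    true_values = [rng.gauss(0, 1) for _ in range(n)]
    first = [t + rng.gauss(0, 1) for t in true_values]
    second = [t + rng.gauss(0, 1) for t in true_values]
    # The "anecdotes": subjects who stood out on the first measurement.
    selected = [i for i in range(n) if first[i] > cutoff]
    mean_first = statistics.mean(first[i] for i in selected)
    mean_second = statistics.mean(second[i] for i in selected)
    return mean_first, mean_second


mean_first, mean_second = simulate_regression_to_mean()
print(f"selected group, first measurement:  {mean_first:.2f}")
print(f"selected group, second measurement: {mean_second:.2f}")
```

Under these assumptions the selected group looks much less extreme the second time even though nothing was done to it, which is the same apparent "response" an anecdote would credit to whatever happened in between.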

Level 4. Level 4 evidence is a group of UFO spotters at a convention; it is a collection of anecdotes. A collection of anecdotes doesn’t solve the regression to the mean problem and usually does little to account for other explanations of the observations or alternate hypotheses. These are case series and uncontrolled cohort series. The group of UFO spotters at the convention isn’t making any serious attempt to offer up other explanations for what they saw; they are simply piling on “evidence,” and this is also the case with most case series.

Level 3. With Level 3 evidence, we see the first attempt to control for the regression to the mean problem, but it is done retrospectively. This is equivalent to the UFO spotters walking down the hall to the Texas Hold ‘Em Poker convention, finding people similar to themselves, and asking whether they have ever seen a UFO, or something like one for which they might have another explanation. It is still very poor data, subject to collection and recollection bias, a lack of control for unknown variables, and a lack of alternate hypotheses. This corresponds to retrospective cohort studies and the like. It is still very poor evidence, but it is certainly better than mere anecdote. It at least tries to see if the anecdote is unusual or has some other explanation.

Level 2. It is not until Level 2 evidence that any serious attempt is made to solve the regression to the mean problem, recall bias, and some other major issues. Now we are dealing with prospectively collected and recorded data. But there is still a tremendous amount of bias present in most of these studies, and the only factors controlled for are those that are known to the authors (which often aren’t many). For example, one might go to Roswell and ask tourists to take a card with them on their trip; if they see a UFO, they should check ‘Yes’ and if they do not they should check ‘No’. Seems simple.

But what’s a UFO?

Perhaps you tell them that it is any moving light in the sky; you have now biased them.

Perhaps you tell them just to record anything strange they might see in detail, without telling them that they are looking for UFOs. Then, you read the responses and you decide which observation was a UFO and which was a normal event explained by man-made or meteorologic phenomena; in this case, you are biased.

Perhaps you don’t tell them what they are looking for and you let a third party, unknown to you, read the reports and decide what they mean. But since you gave the cards to people visiting Roswell for their vacation, you selected a bunch of UFO pareidoliacs; now your sample is biased.

It may not be a belief in UFOs that drives someone to perform a study, but most who undertake a study set out to ‘prove’ what they already believe – that is, they want to prove their hypothesis. This type of bias is rampant. Remember, the scientific method requires you to try your best to disprove what you believe – not the other way around.

Level 1. From all these failings of Level 2 through Level 5 evidence was born the RCT, the randomized controlled trial. But these come in a wide, wide, WIDE variety of quality. Is it appropriately powered? Is it double-blinded or triple-blinded? I could go on, but I already have here and here. The point is, there is a lot of variation in Level 1 evidence.

At the low end is a poorly designed and poorly implemented, double-blinded RCT with a low pre-study probability that the alternate hypothesis is true.

At the high end is a meta-analysis of several well-designed, well-implemented, triple-blinded RCTs with homogeneous results and a high pre-study probability that the alternate hypothesis is true.
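Why pre-study probability matters so much can be sketched with Bayes' rule. The function below is a minimal illustration, not a model of any particular trial: it computes the chance that a statistically significant result reflects a real effect, given the pre-study probability, the power, and the significance threshold. The input values for the "low end" and "high end" are assumptions chosen only to show the contrast, and the calculation ignores bias entirely, which would make matters worse.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Probability that a significant result reflects a true effect.

    prior: pre-study probability that the alternate hypothesis is true
    power: probability of detecting a true effect
    alpha: probability of a false positive when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative (assumed) inputs, not figures from any real study.
low_end = positive_predictive_value(prior=0.10, power=0.20)
high_end = positive_predictive_value(prior=0.50, power=0.90)
print(f"low end:  {low_end:.2f}")   # 0.31
print(f"high end: {high_end:.2f}")  # 0.95
```

With these assumed inputs, roughly two of every three "positive" findings at the low end would be false before any allowance for bias, while the high end is right about 95% of the time.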

Most studies are closer to the low end, and we think that about 80% of those studies are wrong and will not hold up over time. That is, 80% will be reversed or will not have their results reproduced by other quality evidence over time. I talk about this crisis more here. So what are the implications of this observation?

For starters, if the bad studies among Level 1 evidence are wrong upwards of 80% of the time, then Level 2 through Level 5 evidence isn’t even this good. Have you ever been surprised when something you thought was well-established science (like using 17-hydroxyprogesterone caproate for prevention of recurrent preterm labor) was reversed or not supported by later data? Don’t be. It happens all of the time. Medical science is a saga of reversal and replacement. Pick up a textbook of OB/Gyn from 30 years ago. Very few concepts from that text are supported by current science. Don’t be surprised if the same statement is made 30 years from now.

Some Level 1 evidence undoubtedly is much better than this. Again, a meta-analysis of several high-quality studies containing homogeneous results is very likely to be true. But how many meta-analyses of multiple RCTs with homogeneous results have you read? I’d say not many. First, we have a reproducibility crisis in the biological sciences. Good papers don’t get replicated, they get adopted (often in error). The scientific method demands retesting, even when favorable results occur. Second, most meta-analyses are performed simply because the studies in the field contain heterogeneous results; this creates methodological nightmares and tells us immediately that the results of one or more of the given studies are definitely wrong.

What do I mean? Let’s say that three studies say that magnesium sulfate doesn’t reduce the risk of cerebral palsy (CP) in surviving preterm newborns without also increasing the risk of their mortality, but one subset analysis from a poorly designed and methodologically flawed study says that it does. The very reason a meta-analysis is being undertaken is that the studies disagree with one another. Either the one is right and the three are wrong, or the three are right and the one is wrong. This question is not solved by a bastardized meta-analysis, full of statistical trickery that seeks to amalgamate the data to fit the authors’ purpose. This isn’t science, nor is it statistics. It’s weaselry, especially when the authors examine the data unblinded. These types of publications must end. More important, because it is a meta-analysis of heterogeneous data, it would be considered Level 2 evidence, not Level 1, meaning that it would be trumped in the Level of Evidence poker game by any one of the three RCTs that it seeks to analyze (but not the subset analysis, since that analysis wasn’t powered correctly).
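Whether a set of studies disagrees more than chance alone would allow is commonly summarized with Cochran's Q and the I² statistic. The sketch below uses fixed-effect, inverse-variance weights. The four effect sizes and variances are hypothetical numbers chosen to mimic "three studies near no effect and one outlier"; they are not the magnesium trial data, and the function name is my own.

```python
def heterogeneity(effects, variances):
    """Return (pooled effect, Cochran's Q, I-squared) using
    fixed-effect inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled effect.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributed to heterogeneity.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical log risk ratios: three near zero, one strongly negative.
effects = [0.0, 0.05, -0.05, -0.80]
variances = [0.04, 0.04, 0.04, 0.04]
pooled, q, i_squared = heterogeneity(effects, variances)
print(f"pooled effect: {pooled:.2f}")
print(f"Q = {q:.3f} on {len(effects) - 1} degrees of freedom")
print(f"I-squared = {i_squared:.0%}")
```

In this made-up example the pooled estimate lands at a value that none of the four studies actually reported, and about three quarters of the variation is attributed to heterogeneity. The average papers over the disagreement; it does not tell you which studies are wrong.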

Bottom line: Levels of evidence are powerful for comparative purposes, but just because something has the support of a high level of evidence doesn’t mean it is correct.

Bottom line 2: Believing that magnesium prevents CP in newborns born preterm is statistically akin to a belief in UFOs in terms of probability and levels of evidence. Recent surveys reveal that nearly half of Americans believe in UFOs, and I suspect that nearly half of obstetricians are using magnesium. C’est la vie.

More on meta-analysis here, here, here, and here.

PS: Do you believe in UFOs? Consider these problems:

The closest planet that might have life on it (of any form) is likely about 14 light-years away. The average planet is about 1,000 light-years away. Our current best-in-class and best-in-concept (unmanned) technology would allow us to travel one light-year in about 18,449 years, or about 258,286 years to the closest inhabitable planet and about 18.5 million years to the average inhabitable one.
The belief in UFOs emerged broadly in the 1950s when people talked about Martians; that is, people from Mars. There aren’t people on Mars or in our Solar System.
If you believe in UFOs, then you believe the following premises:
- A super advanced civilization exists somewhere in the Universe;
- They have perfected technologies that not only travel near the speed of light but which conserve energy nearly perfectly, both fundamentally in violation of our knowledge of physics;
- They decided long ago to come towards our star (the Sun) out of all the billions of stars they might have picked, just to visit us;
- They continued this journey for thousands of years;
- They did this long before any of our radio waves might have penetrated to their neck of the woods to let them know we were here;
- And, after abducting a guy in Nebraska, they decided to turn around and start the billions of miles trek back home.
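The travel times in the first problem above are simple multiplication, and a few lines are enough to check them. The figure of 18,449 years per light-year and the two distances are taken from the text; the constant and function names are my own.

```python
YEARS_PER_LIGHT_YEAR = 18_449   # travel time per light-year, as quoted above
NEAREST_LY = 14                 # closest planet that might have life
AVERAGE_LY = 1_000              # the "average" planet


def travel_years(distance_ly, years_per_ly=YEARS_PER_LIGHT_YEAR):
    """Years needed to cover a distance given in light-years."""
    return distance_ly * years_per_ly


print(f"{travel_years(NEAREST_LY):,} years")   # 258,286 years
print(f"{travel_years(AVERAGE_LY):,} years")   # 18,449,000 years
```

The second figure is the "about 18.5 million years" quoted above.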

Hmm. Stop reading fiction and get back to work.