OPINION
ROTH, Circuit Judge:
This case involves allegations that the anti-depressant drug Zoloft, manufactured by Pfizer, causes cardiac birth defects when taken during early pregnancy. In support of their position, plaintiffs, through a Plaintiffs’ Steering Committee (PSC), depended upon the testimony of Dr. Nicholas Jewell, Ph.D. Dr. Jewell used the “Bradford Hill” criteria
to analyze existing literature on the causal connection between Zoloft and birth defects. The District Court excluded this testimony and granted summary judgment to defendants. The PSC now appeals these orders, alleging that 1) the District Court erroneously held that an expert opinion on general causation must be supported by replicated observational studies reporting a statistically significant association between the drug and the adverse effect, and 2) it was an abuse of discretion to exclude Dr. Jewell’s testimony. Because we find that the District Court did not establish such a legal standard and did not abuse its discretion in excluding Dr. Jewell’s testimony, we will affirm the District Court’s orders.
I.
This case arises from multi-district litigation involving 315 product liability claims against Pfizer, alleging that Zoloft, a selective serotonin reuptake inhibitor (SSRI), causes cardiac birth defects. The PSC introduced a number of experts in order to establish causation. The testimony of each of these experts was excluded in whole or in part. In particular, the court excluded all of the testimony of Dr. Anick Bérard (an epidemiologist), which relied on the “novel technique of drawing conclusions by examining ‘trends’ (often statistically non-significant) across selected studies.”
The PSC filed a motion for partial reconsideration of the decision to exclude the testimony of Dr. Bérard, which the District Court denied. The PSC then moved to admit Dr. Jewell (a statistician) as a general causation witness. Pfizer filed a motion to exclude Dr. Jewell, and the District Court conducted a Daubert
hearing.
The District Court considered Dr. Jewell’s application of various methodologies, reviewing his expert report, rebuttal reports, party briefs, and oral testimony. The District Court first examined how Dr. Jewell applied the traditional methodology of analyzing replicated, significant results. While Dr. Jewell discussed many groupings of cardiac birth defects, he focused on the significant findings for all cardiac defects and septal defects. Dr. Jewell presented two studies reporting a significant association between Zoloft and all cardiac defects (Kornum (2010)
and Jimenez-Solem (2012)
). He also presented five studies reporting a significant association between Zoloft and septal defects (Kornum (2010), Jimenez-Solem (2012), Louik (2007),
Pedersen (2009),
and Bérard (2015)
). After excluding two studies from its consideration,
the District Court expressed two concerns with the remaining studies: Jimenez-Solem (2012), Kornum (2010), and Pedersen (2009). First, despite the fact that the remaining studies produced consistent results, the District Court did not consider them to be independent replications because they used overlapping Danish populations. Second, a larger study, Furu (2015),
included almost all the data from Jimenez-Solem (2012), Kornum (2010), and Pedersen (2009) and did not replicate the findings of those studies. Dr. Jewell did not explain the reasons why this attempted replication produced different results or why the new study did not contradict his opinion.
The court then examined Dr. Jewell’s reliance on insignificant results, noting that it was very similar to Dr. Bérard’s methodology. The court noted that Dr. Jewell did not provide any evidence that the epidemiology or teratology
communities value statistical significance
any less than it has traditionally been understood.
The court also expressed concern that Dr. Jewell inconsistently applied his “technique” of multiplying p-values
and his trend analysis.
The District Court critiqued several other techniques Dr. Jewell used in analyzing the evidence. First, Dr. Jewell rejected meta-analyses on which he had previously relied in a lawsuit against another SSRI, Prozac. The meta-analyses reported insignificant associations with birth defects for Zoloft but not for Prozac. Dr. Jewell rationalized his decision to ignore these meta-analyses because the “heterogeneity”
within its Zoloft studies was significant; the District Court accepted this explanation but questioned why Dr. Jewell “fails to statistically calculate the heterogeneity” across other studies instead of relying on trends.
Second, Dr. Jewell reanalyzed two studies, Jimenez-Solem (2012) and Huybrechts (2014),
both of which had originally concluded that there was no significant effect attributable to Zoloft.
The District Court questioned his rationale for conducting, and tactics for implementing, this reanalysis. Finally, Dr. Jewell conducted a meta-analysis with
Huybrechts (2014) and Jimenez-Solem (2012). The District Court questioned why he used only those particular studies.
Based on this analysis, the District Court found that Dr. Jewell, tasked with explaining his opinion about Zoloft’s effect on birth defects and reconciling contrary studies, “failed to consistently apply the scientific methods he articulates, has deviated from or downplayed certain well-established principles of his field, and has inconsistently applied methods and standards to the data so as to support his
a priori
opinion.”
For this reason, on December 2, 2015, the District Court entered an order, excluding Dr. Jewell’s testimony, and on April 5, 2016, the court granted Pfizer’s motion for summary judgment. The PSC appeals the exclusion of Dr. Jewell and the grant of summary judgment.
II.
In general, courts serve as gatekeepers for expert witness testimony. “A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if,”
inter alia,
“the testimony is the product of reliable principles and methods[ ] and ... the expert has reliably applied the principles and methods to the facts of the case.”
In determining the reliability of novel scientific methodology, courts can consider multiple factors, including the testability of the hypothesis, whether it has been peer reviewed or published, the error rate, whether standards controlling the technique’s operation exist, and whether the methodology is generally accepted.
Both an expert’s methodology and the application of that methodology must be reviewed for reliability.
A court should not, however, usurp the role of the fact-finder; instead, an expert should only be excluded if “the flaw is large enough that the expert lacks the ‘good grounds’
for his or her conclusions.”
Central to this case is the question of whether statistical significance is necessary to prove causality. We decline to state a bright-line rule. Instead, we reiterate that plaintiffs ultimately must prove a causal connection between Zoloft and birth defects. A causal connection may exist despite the lack of significant findings, due to issues such as random misclassification or insufficient power.
Conversely, a causal connection may not exist despite the presence of significant findings. If a causal connection does not actually exist, significant findings can still occur due to,
inter alia,
inability to control for a confounding effect or detection bias. A standard based on replication of statistically significant findings obscures the essential issue: a causal connection. Given this, the requisite proof necessary to establish causation will vary greatly case by case. This is not to suggest, however, that statistical significance is irrelevant. Despite the problems with treating statistical significance as a magic criterion, it remains an important metric to distinguish between results supporting a true association and those resulting from mere chance. Discussions of statistical significance should thus not understate or overstate its importance.
With this in mind, we proceed to the issues at hand. The PSC raises two issues on appeal: 1) whether the District Court erroneously concluded that reliability requires replicated, statistically significant findings, and 2) whether Dr. Jewell’s testimony was properly excluded.
A.
The PSC argues that the District Court erroneously held that replicated, statistically significant findings are necessary to satisfy reliability. This argument seems to have been originally raised in the motion for reconsideration of Dr. Bérard’s exclusion. Explaining its decision to exclude Dr. Bérard, the District Court cited a previous case,
Wade-Greaux v. Whitehall Labs, Inc.,
for the proposition that the teratology community generally requires replicated, significant epidemiological results before inferring causality.
The PSC claims that in so doing, the District Court was asserting a legal standard that required replicated, significant findings for reliability.
Pfizer contends that the District Court merely made a factual finding about what the teratology community generally accepts.
Upon review, it is clear that the District Court was not creating a legal standard, but merely making a factual finding. The PSC argues that the District Court must have created a legal standard because it did not cite any sources other than
Wade-Greaux
to support its assertion that the teratology community generally requires replicated, significant epidemiological findings. However, in its initial exclusion of Dr. Bérard, the District Court noted that it looked to the standards adopted by “other epidemiologists, even the very researchers [Dr. Bérard] cites in her report.”
Similarly, in its order denying general reconsideration of Dr. Bérard’s exclusion, the District Court clarified that it “made this factual finding after review of the published literature relied upon by Dr. Bérard and other experts, as well as its review of the reports and testimony of both parties”
and merely used this factual finding as part of its FRE 702 analysis.
While the District Court does cite
Wade-Greaux,
it uses it merely to show “that other courts have made similar findings regarding the prevailing standards for scientists in Dr. Bérard’s field.”
Second, the course of the proceedings makes clear that the replication of significant results was not dispositive in establishing whether the testimony of either Dr. Bérard or Dr. Jewell was reliable. In fact, the District Court expressly rejected Pfizer’s argument that the existence of a statistically significant, replicated result is a threshold issue before an expert can conduct the Bradford-Hill analysis.
In doing so, the District Court was clear that it was not requiring a threshold showing of statistical significance. Similarly, the District Court did not end its inquiry after analyzing whether there were replicated, significant results. Instead, the District Court examined other techniques of general trend analysis, reanalysis of other studies, and meta-analysis. Even though it ultimately rejected the application of these techniques as unreliable, it did not categorically reject alternative techniques, suggesting that it did not make a legal standard requiring replicated, significant results.
For these reasons, we find that the District Court did not require replication of
significant results to establish reliability. Instead, it merely made a factual finding that teratologists generally require replication of significant results, and this factual finding did not prevent it from considering other evidence of reliability.
B.
The second issue on appeal is whether it was an abuse of discretion for the District Court to exclude Dr. Jewell’s testimony. Dr. Jewell utilized a combination of two methods: the “weight of the evidence” analysis and the Bradford Hill criteria. The “weight of the evidence” analysis involves a series of logical steps used to “infer[] to the best explanation[.]”
The Bradford Hill criteria are metrics that epidemiologists use to distinguish a causal connection from a mere association. These metrics include strength of the association, consistency, specificity, temporality, coherence, biological gradient, plausibility, experimental evidence, and analogy.
In his expert report, Dr. Jewell seems to utilize numerous “techniques” in implementing the weight of the evidence methodology. Dr. Jewell discusses whether the conclusions drawn from these techniques satisfy the Bradford Hill criteria and support the existence of a causal connection.
Pfizer does not seem to contest the reliability of the Bradford Hill criteria or weight of the evidence analysis generally; the dispute centers on whether the specific methodology implemented by Dr. Jewell is reliable. Flexible methodologies, such as the “weight of the evidence,” can be implemented in multiple ways; despite the fact that the methodology is generally reliable, each application is distinct and should be analyzed for reliability. In
In re Paoli R.R. Yard PCB Litigation,
this Circuit noted that while differential diagnosis—also a flexible methodology—is generally accepted, “no particular combination of techniques chosen by a doctor to assess an individual patient is likely to have been generally accepted.”
Accordingly, we subjected the expert’s specific differential diagnosis process to a
Daubert
inquiry.
We noted that “to the extent that a doctor utilizes standard diagnostic techniques in gathering this information, the more likely we are to find that the doctor’s methodology is reliable.”
While we did not require the expert to run specific tests or ascertain full information in order for the differential diagnosis to be reliable, we did require
him to explain why his conclusion remained reliable in the face of alternate causes.
This standard, while articulated with respect to differential diagnoses, applies to the weight of the evidence analysis. We have briefly encountered the Bradford Hill criteria/weight of the evidence methodology in
Magistrini v. One Hour Martinizing Dry Cleaning,
a nonprecedential affirmance of the District of New Jersey’s exclusion of an expert.
The expert followed the weight of the evidence methodology, including epidemiological findings assessed using the Bradford Hill criteria. The District Court acknowledged that although the weight of the evidence methodology was generally reliable, “[t]he particular combination of evidence considered and weighed here has not been subjected to peer review.”
Similar concerns are arguably present for the Bradford Hill criteria, which are neither an exhaustive nor a necessary list.
An expert can theoretically assign the most weight to only a few factors, or draw conclusions about one factor based on a particular combination of evidence. The specific way an expert conducts such an analysis must be reliable; “all of the relevant evidence must be gathered, and the assessment or weighing of that evidence must not be arbitrary, but must itself be based on methods of science.”
To ensure that the Bradford Hill/weight of the evidence criteria “is truly a methodology, rather than a mere conclusion-oriented selection process ... there must be a scientific method of weighting that is used and explained.”
For this reason, the specific techniques by which the weight of the evidence/Bradford Hill methodology is conducted must themselves be reliable according to the principles articulated in
Daubert.
In short, despite the fact that both the Bradford Hill and the weight of the evidence analyses are generally reliable, the “techniques” used to implement the analysis must be 1) reliable and 2) reliably applied. In discussing the conclusions produced by such techniques in light of the Bradford Hill criteria, an expert must explain 1) how conclusions are drawn for each Bradford Hill criterion and 2) how the criteria are weighed relative to one another. Here, we accept that the Bradford Hill and weight of the evidence analyses are generally reliable. We also assume that the “techniques” used to implement the analysis (here, meta-analysis, trend analysis, and reanalysis) are themselves reliable. However, we find that Dr. Jewell did not 1) reliably apply the “techniques” to the body of evidence or 2) adequately explain how this analysis supports specified Bradford Hill criteria. Because
“any
step that renders the analysis unreliable under the
Daubert factors renders the expert’s testimony
inadmissible,”
this is sufficient to show that the District Court did not abuse its discretion in excluding Dr. Jewell’s testimony.
1.
It was not an abuse of discretion for the District Court to find Dr. Jewell’s application of trend analysis, reanalysis, and meta-analysis to the body of evidence to be unreliable. Here, we assume the techniques listed are generally reliable and rest on the fact that they were unreliably applied. As stated in
In re Paoli,
use of standard techniques bolsters the inference of reliability;
nonstandard techniques need to be well-explained. Additionally, if an expert applies certain techniques to a subset of the body of evidence and other techniques to another subset without explanation, this raises an inference of unreliable application of methodology.
First, we find no abuse of discretion in the District Court’s determination that Dr. Jewell unreliably analyzed the trend in insignificant results. Dr. Jewell applied this technique by qualitatively discussing the probative value of multiple positive, insignificant results. In justifying this approach, he relied on a quantitative method by which one can calculate the likelihood of seeing multiple positive but insignificant results if there were actually no true effect.
However, after alluding to this presumably reliable mathematical calculation technique for analyzing trends in even insignificant results, Dr. Jewell did not actually implement it; instead he qualitatively discussed the general trend in the data. In light of the opportunity to actually conduct such quantitative analysis, his refusal to do so—without explanation—suggests that he did not reliably apply his stated methodology.
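The kind of quantitative calculation alluded to above can be pictured with a simple sign test. The following is only an illustrative sketch and is not drawn from the record: it assumes that the studies are independent and that, absent any true effect, each study is equally likely to report an estimate above or below the null. The function name and the study counts are hypothetical.

```python
from math import comb


def prob_at_least_k_positive(n_studies: int, k_positive: int) -> float:
    """Chance that at least k of n studies point in the 'positive' direction
    when there is no true effect.

    Assumptions (illustrative only): studies are independent, and under the
    null each study lands above or below the null value with probability 1/2.
    """
    favorable = sum(comb(n_studies, j) for j in range(k_positive, n_studies + 1))
    return favorable / 2 ** n_studies


# Hypothetical example: 8 of 10 studies report odds ratios above 1.
print(prob_at_least_k_positive(10, 8))  # 56/1024, about 0.055
```

The independence assumption matters. Studies drawn from overlapping populations would not each count as a separate trial in a calculation of this kind.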
Even assuming the reliability of Dr. Jewell’s version of trend analysis, Dr. Jewell identified trends and interpreted insignificant results differently based on the outcome of the study. The District Court concluded that Dr. Jewell “selectively emphasize[d] observed consistency ... only when the consistent studies support his opinion.”
Dr. Jewell emphasized the insignificance of results reporting odds ratios below 1 but not the insignificance of those reporting odds ratios above 1. He also paid attention to the upper bounds of the confidence intervals associated with odds ratios below 1, but not to the lower bounds.
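For readers unfamiliar with the terms, an odds ratio and its confidence interval can be computed from a two-by-two table as sketched below. This is a minimal illustration using the standard Wald approximation on the log scale; the counts are hypothetical and are not taken from any study in the record.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases, z=1.96):
    """Return (odds_ratio, lower_bound, upper_bound) for a 2x2 table.

    Uses the Wald interval on the log odds ratio; z=1.96 corresponds to an
    approximate 95% confidence level. All cell counts must be positive.
    """
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)
    se_log = sqrt(1 / exposed_cases + 1 / exposed_noncases
                  + 1 / unexposed_cases + 1 / unexposed_noncases)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: the point estimate is above 1, but the interval
# spans 1, so the result is not statistically significant at this level.
estimate, lower, upper = odds_ratio_with_ci(12, 988, 80, 9920)
print(estimate, lower, upper)
```

An interval that spans 1 is compatible with an increased risk, no effect, or a decreased risk. Reading only one of its two bounds, depending on the direction of the point estimate, treats like results differently.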
Second, we interpret the District Court’s discussion of heterogeneity as raising the concern that Dr. Jewell selectively used meta-analyses. He did this in two ways: First, without explanation, Dr. Jewell performed a meta-analysis on two studies but not on any of the other studies. The District Court questioned why Dr. Jewell did not conduct a meta-analysis on the remaining studies instead of using the qualitative general trend analysis. While Dr. Jewell was not required to do specific tests, the lack of explanation made his inconsistent application of meta-analysis to certain studies unreliable.
Second, when he did perform a meta-analysis, Dr. Jewell only included two studies utilizing “exposed” and “paused” groups even though each had a different definition of “paused,” without an adequate explanation for why these studies can be lumped together. He also inexplicably excluded another study (Kornum (2010)) utilizing similar methodology. Again, while there may have been legitimate reasons for these inconsistencies, the fact that he did not give an adequate explanation for doing so makes his testimony unreliable.
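To make concrete what it means to pool studies and to calculate heterogeneity statistically, the sketch below shows one common textbook approach: inverse-variance fixed-effect pooling of log odds ratios, with Cochran's Q and the I-squared statistic as heterogeneity measures. It is an assumption that this resembles any calculation at issue here; the record as summarized above does not specify the formulas used, and the inputs are hypothetical.

```python
from math import exp, log, sqrt


def pool_fixed_effect(studies):
    """Pool (odds_ratio, se_of_log_odds_ratio) pairs by inverse variance.

    Returns (pooled_odds_ratio, pooled_se_log, cochran_q, i_squared).
    Illustrative only: a fixed-effect model assumes one common true effect.
    """
    log_ors = [log(odds_ratio) for odds_ratio, _ in studies]
    weights = [1 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    pooled_log = sum(w * y for w, y in zip(weights, log_ors)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled_log) ** 2 for w, y in zip(weights, log_ors))
    degrees_of_freedom = len(studies) - 1
    # I-squared: share of variation beyond what chance alone would produce.
    i_squared = max(0.0, (q - degrees_of_freedom) / q) if q > 0 else 0.0
    return exp(pooled_log), pooled_se, q, i_squared


# Hypothetical inputs: three studies with differing estimates and precision.
print(pool_fixed_effect([(1.8, 0.40), (1.1, 0.15), (2.4, 0.55)]))
```

Whatever formula is chosen, the same calculation can be run on every candidate subset of studies, which is why an unexplained choice to pool only some of them invites scrutiny.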
Finally, Dr. Jewell reanalyzed two studies to control for confounding by indication. The need for conducting this reanalysis on Huybrechts (2014) was unclear. Dr. Jewell said that he wanted to control for indication by comparing the outcomes for “paused” Zoloft users to “exposed” Zoloft users; however, the study already controlled for indication. If Dr. Jewell wanted to correct for misclassification, the original study already controlled for that as well through extensive sensitivity analyses.
Given that the study originally concluded that Zoloft was not associated with a statistically significant increase in the likelihood of birth defects, this reanalysis seems conclusion-driven.
Ultimately, the fact that Dr. Jewell applied these techniques inconsistently, without explanation, to different subsets of the body of evidence raises real issues of reliability. Conclusions drawn from such unreliable application are themselves questionable.
2.
Using the techniques discussed above, Dr. Jewell went on to evaluate the Bradford Hill criteria. While Dr. Jewell did discuss the applicable Bradford Hill criteria and how he weighed the factors together, he did not explain how he drew conclusions for certain criteria, namely the strength of association and consistency.
Dr. Jewell concluded that the strength of association weighs in favor of causality. In doing so, he focused on studies reporting odds ratios between two and three (Colvin (2011),
Jimenez-Solem (2012), Malm (2011),
Pedersen (2009), and Louik
(2007)). He rationalized that such a large association is unlikely to be associated with confounding alone.
He later bolstered this argument by estimating the percent of the effect generally attributable to confounding by indication. He estimated this percent by observing the percent decrease in odds ratios after controlling for indication over a few studies. When pressed by counsel at the
Daubert
hearing, Dr. Jewell admitted that this was not a scientifically rigorous adjustment.
Such reliance on ad hoc adjustments supports the District Court’s decision to exclude Dr. Jewell’s testimony.
Similarly, while Dr. Jewell found that the causal effect of Zoloft on cardiac birth defects is consistent, it is not clear how he drew this conclusion. As noted above, Dr. Jewell classified insignificant odds ratios above one as supporting a “consistent” causality result, downplaying the possibility that they support
no
association between Zoloft use and cardiac birth defects. While an insignificant result
may
be consistent with a causal effect, Dr. Jewell’s discussion is too far-reaching, sometimes understating the importance of statistical significance. For example, Furu (2015)—a study that incorporated almost all the data in Pedersen (2009), Jimenez-Solem (2012), and Kornum (2010)—included a larger sample but, unlike the former three studies, reported no significant association between Zoloft and cardiac birth defects. Insignificant results can occur merely because a study lacks power to produce a significant result, and, all else being equal, a larger sample size increases the power of a test.
Unless there are other significant differences, we would expect Furu to be better able to capture a true effect than the preceding three studies. While an insignificant result from a low-powered study does not necessarily undermine a statistically significant result from a higher-powered study, the opposite argument (i.e., that an insignificant finding from a presumably better-powered study is evidence of consistency with significant findings from lower-powered studies) requires further explanation.
While there may be a reason that such a result could be consistent with the past significant effects, Dr. Jewell did not meaningfully discuss why this may be.
Without adequate explanation, this argument understates the importance of statistical significance. Like the expert in
Magistrini,
Dr. Jewell should have “sufficiently discredit[ed] other studies that found
no association
or a negative association with much more precise confidence intervals, [or] sufficiently
explain[ed] why he did not accord weight to those studies.”
Claiming a consistent result without meaningfully addressing these alternate explanations, as noted in
In re Paoli,
undermines reliability.
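The relationship between sample size and power noted above can be illustrated numerically. The sketch below uses a normal approximation for a two-sided comparison of two proportions with equal group sizes. The defect rates and group sizes are hypothetical and chosen only to show the direction of the effect; they do not describe Furu (2015) or any other study.

```python
from math import erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


def approx_power(rate_unexposed, rate_exposed, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test.

    Assumes equal group sizes and a normal approximation; z_alpha=1.96
    corresponds to a 5% two-sided significance level.
    """
    se = sqrt(rate_unexposed * (1 - rate_unexposed) / n_per_group
              + rate_exposed * (1 - rate_exposed) / n_per_group)
    shift = abs(rate_exposed - rate_unexposed) / se
    return normal_cdf(shift - z_alpha) + normal_cdf(-shift - z_alpha)


# Hypothetical rates: 1.0% versus 1.5%, at two different sample sizes.
for n in (2000, 20000):
    print(n, approx_power(0.010, 0.015, n))
```

With the same assumed true difference, the larger sample detects it far more often. That is why a null finding from a larger study calls for more explanation than a null finding from a smaller one.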
For these reasons, the District Court determined that Dr. Jewell did not consistently assess the evidence supporting each criterion or explain his method for doing so. Thus, it was not an abuse of discretion to find that Dr. Jewell’s application of the Bradford Hill criteria was unreliable.
This is not to suggest that all of the District Court’s criticisms were necessarily justified. For example, the fact that in his reanalysis Dr. Jewell drew a different conclusion from a study than its authors did is not necessarily a problem. Similarly, his imposition of a different assumption about the “exposed” group in Huybrechts (2014) did not require expert knowledge about psychology; he was merely testing the robustness of the results to Huybrechts’ original assumption. Similarly, the District Court credited the claim that overlapping samples did not provide replicated results, despite the fact that Dr. Jewell claimed it provided some informational value.
These inquiries are more appropriately left to the jury.
On the whole, however, the District Court did not improperly usurp the jury’s role in assessing Dr. Jewell’s credibility. There is sufficient reason to find Dr. Jewell’s testimony was unreliable. Indeed,
“any
step that renders the analysis unreliable under the
Daubert factors renders the expert’s testimony
inadmissible.”
The fact that Dr. Jewell unreliably applied the techniques underlying the weight of the evidence analysis and the factors of the Bradford Hill analysis satisfies this standard for inadmissibility.
III.
This case involves complicated facts, statistical methodology, and competing claims of appropriate standards for assessing causality from observational epidemiological studies. Ultimately, however, the issue is quite clear. As gatekeepers, courts are supposed to ensure that the testimony given to the jury is reliable and will be more informative than confusing. Dr. Jewell’s application of his purported methods does not satisfy this standard. By applying different techniques to subsets of the data and inconsistently discussing statistical significance, Dr. Jewell does not reliably analyze the weight of the evidence. Selecting these conclusions to discuss certain Bradford Hill factors also contributes to the unreliability. While the District Court may have flagged a few issues that are not necessarily indicative of an unreliable application of methods, there is certainly sufficient evidence on the record to suggest that the court did not abuse its discretion in excluding Dr. Jewell as an expert on the basis of the unreliability of his methods. For these reasons, we will affirm the orders of the District Court, excluding the testimony of Dr. Jewell and granting summary judgment in favor of Pfizer.