The Problem with Running a Thousand Tests
A p-value of 0.05 means there is a 5% chance of seeing a result at least this extreme if the underlying effect is zero. That is a reasonable threshold for a single test. It becomes a trap when you run hundreds.
If you test 200 random strategies against historical data, approximately 10 will pass a p < 0.05 filter by chance alone. No edge, no signal, no exploitable pattern — just noise that happened to line up. The math is not subtle. It is arithmetic. And yet this is one of the most consistently underweighted risks in systematic research.
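The arithmetic is easy to check by simulation. The sketch below (noise scale, sample size, and seed are arbitrary) backtests 200 pure-noise strategies and counts how many clear a naive two-sided 5% significance bar:

```python
import numpy as np

rng = np.random.default_rng(0)

n_strategies = 200   # strategies with zero true edge
n_days = 1000        # daily returns per backtest

# Simulate pure-noise daily returns; flag a strategy as "significant" when
# the t-statistic of its mean return exceeds the two-sided 5% critical
# value (~1.96 under a normal approximation).
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))
t_stats = returns.mean(axis=1) / (returns.std(axis=1, ddof=1) / np.sqrt(n_days))
false_positives = int(np.sum(np.abs(t_stats) > 1.96))

print(false_positives)  # expectation: 200 * 0.05 = 10
```

None of these strategies has any edge, yet roughly a dozen of them will look significant on any given run.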
In our own early work, we ran broad parameter sweeps across dozens of signal variants and selected the ones with the strongest t-statistics. The results looked compelling. The out-of-sample performance was mediocre. It took longer than it should have to recognize that we were harvesting noise, not discovering signal. That experience changed how we structure every research workflow that followed.
Why It Feels Like Discovery
False discoveries do not feel false. That is the core problem.
When a strategy passes a significance test, the researcher sees a clean equity curve, a favorable t-statistic, and a plausible narrative for why it works. The brain constructs a story. Momentum works because of behavioral anchoring. Mean reversion works because of liquidity provision. The pattern makes sense — and once it makes sense, it feels real.
But the narrative was constructed after the result, not before it. This is the distinction between hypothesis-driven research and data-mined discovery. In hypothesis-driven research, you decide what to test and then test it. In data mining, you test everything and then explain the winners. The second approach is dramatically more vulnerable to false discovery because every test increases the odds that noise will pass the filter.
The statistical term is multiple comparisons. The practical consequence is that the more strategies you test, the higher your false discovery rate — unless you adjust for the number of tests.
Where Multiple Testing Hides
Multiple testing is obvious when a researcher runs 500 backtests in a loop. It is less obvious — and more dangerous — when it accumulates implicitly.
Parameter sweeps are the most common source. Testing a moving average crossover with windows of 5, 10, 15, 20, 25, 30, 40, 50, 60, and 100 days is ten tests, not one. If you then vary the holding period, the universe, and the entry condition, you may have run hundreds of implicit tests before arriving at the “final” strategy.
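One way to keep yourself honest is to enumerate the sweep explicitly. The snippet below (the sweep dimensions and their values are hypothetical) counts the configurations implied by varying window, holding period, universe, and entry rule together:

```python
from itertools import product

# Hypothetical sweep dimensions for a moving-average crossover study.
windows = [5, 10, 15, 20, 25, 30, 40, 50, 60, 100]
holding_periods = [1, 5, 10, 20]
universes = ["large_cap", "small_cap", "futures"]
entry_rules = ["cross_above", "cross_with_volume"]

# Every combination is a distinct implicit test.
configs = list(product(windows, holding_periods, universes, entry_rules))
print(len(configs))  # 10 * 4 * 3 * 2 = 240 tests, not one
```

The count multiplies, not adds: each new varied dimension scales the effective number of tests by its number of values.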
Feature selection is another hidden multiplier. If you screen 50 candidate features and select the five with the best predictive power, you have effectively run 50 tests. The five survivors are biased toward those that happened to correlate with the target in-sample, whether or not they carry real predictive information.
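A small simulation makes the bias concrete. Every feature and the target below are independent noise (sizes and seed are arbitrary), yet a top-5 screen still surfaces features that look predictive in-sample and revert to nothing on fresh data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_features = 500, 50

# All 50 candidate features are pure noise, and so is the target.
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# In-sample correlation of each feature with the target.
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

# Screening keeps the 5 features that happened to correlate best in-sample.
top5 = np.argsort(-np.abs(corrs))[:5]

# On fresh data, the same 5 features show noise-level correlation again.
X_new = rng.normal(size=(n_obs, n_features))
y_new = rng.normal(size=n_obs)
new_corrs = np.array([np.corrcoef(X_new[:, j], y_new)[0, 1] for j in top5])

print(np.abs(corrs[top5]).mean(), np.abs(new_corrs).mean())
```

The in-sample correlations of the survivors are biased upward purely by the selection step; the out-of-sample correlations are not.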
Even informal exploration counts. Every time a researcher looks at a result, adjusts something, and re-runs, the effective number of tests increases. This is sometimes called researcher degrees of freedom or the garden of forking paths. It is nearly impossible to track precisely, which is why structural corrections matter more than after-the-fact accounting.
| Source | Effective Tests | Visibility |
|---|---|---|
| Explicit backtest loop | Equal to loop count | High — visible in code |
| Parameter sweep | Product of all varied parameters | Medium — easy to undercount |
| Feature selection (50 candidates → 5 chosen) | 50 | Low — feels like a single model |
| Informal exploration / re-runs | Unknown, often dozens | Very low — not tracked |
| Strategy selection across team members | Sum of all tests by all researchers | Near zero — organizational blind spot |
The Corrections That Exist
Three practical approaches reduce false discovery risk. They differ in conservatism, and the right choice depends on how many tests you are correcting for and how much you can afford to miss.
Bonferroni correction is the simplest: divide your significance threshold by the number of tests. If you ran 200 tests, your new threshold is 0.05 / 200 = 0.00025. This controls the family-wise error rate — the probability that any single false positive slips through. It is conservative by design. In practice, it is often too conservative: it suppresses real discoveries along with false ones, especially when tests are correlated.
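A minimal sketch of the arithmetic, using illustrative p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses survive a Bonferroni-corrected threshold."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values], threshold

# Five "promising" results among 200 tests (values are illustrative).
p_values = [0.0001, 0.004, 0.019, 0.03, 0.047] + [0.2] * 195
rejections, threshold = bonferroni(p_values)

print(threshold)        # 0.05 / 200 = 0.00025
print(sum(rejections))  # only the p = 0.0001 result survives
```

Four of the five results that looked significant at the uncorrected 0.05 level fail the corrected bar.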
Benjamini-Hochberg (BH) procedure controls the false discovery rate (FDR) rather than the family-wise error rate. Instead of asking “what is the chance that any one result is false?”, it asks “what proportion of my discoveries are likely to be false?” The procedure sorts p-values from smallest to largest, then compares each to a threshold that scales with its rank. It is less conservative than Bonferroni and more appropriate when you expect some genuine effects to exist among the noise.
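A from-scratch sketch of the BH step-up procedure (p-values are illustrative; in practice a library implementation such as the one in statsmodels is preferable):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean list: which hypotheses are rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= (k/m) * fdr ...
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * fdr:
            k_max = rank

    # ... and reject every hypothesis whose p-value ranks at or below k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.012, 0.02, 0.03, 0.2, 0.35, 0.5, 0.7, 0.9]
rejected = benjamini_hochberg(p_values)
print(sum(rejected))  # 4 discoveries at 5% FDR
```

On the same p-values, Bonferroni at 0.05 / 10 = 0.005 would reject only the first, which is the conservatism gap the text describes.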
Deflated Sharpe ratio, proposed by Bailey and López de Prado, adjusts the required Sharpe ratio upward based on the number of strategies tested. If the research community has tried 300 strategies to find one that works, the Sharpe threshold should be much higher than the standard 2.0. This approach is particularly useful in quantitative finance because it operates directly on the performance metric researchers care about, rather than on abstract p-values.
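The exact deflated Sharpe formula is involved, but its core idea can be approximated by Monte Carlo: estimate the distribution of the best Sharpe you would observe among N pure-noise strategies, and require a real candidate to clear its upper quantile. A sketch with arbitrary simulation sizes and noise parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

n_trials = 300   # strategies the research process effectively tried
n_days = 252     # one year of daily returns per backtest
n_sims = 500     # Monte Carlo repetitions (illustrative)

# For each repetition: draw n_trials pure-noise strategies, compute their
# annualized Sharpe ratios, and keep the best one. The quantiles of this
# "best Sharpe under the null" distribution set a selection-aware bar.
max_sharpes = np.empty(n_sims)
for s in range(n_sims):
    r = rng.normal(0.0, 0.01, size=(n_trials, n_days))
    sharpe = r.mean(axis=1) / r.std(axis=1, ddof=1) * np.sqrt(252)
    max_sharpes[s] = sharpe.max()

# A candidate strategy should clear this bar, not the single-test threshold.
print(np.quantile(max_sharpes, 0.95))
```

The bar grows with `n_trials`, which is the point: the more you search, the better your best-looking strategy must be before it counts as evidence.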
In our workflow, we default to BH at a 5% FDR for systematic screening and use Bonferroni when evaluating a small number of pre-specified hypotheses. Neither is perfect. Both are dramatically better than ignoring the problem.
What Corrections Cannot Fix
Multiple testing corrections adjust the significance threshold. They do not fix the underlying research design.
If a strategy was built by mining a parameter space, correcting the p-value after the fact is better than nothing — but it does not undo the fact that the strategy was designed to fit the data. A strategy that emerged from a sweep of 500 parameter combinations is structurally different from one that was specified before looking at data, even if both have the same corrected p-value.
This is why corrections are necessary but not sufficient. The deeper protection is research process design: pre-specifying hypotheses, using hold-out samples, running walk-forward validation, and treating out-of-sample performance as the primary evidence.
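The hold-out and walk-forward ideas can be sketched as a splitter that only ever evaluates on data that comes after the fitting window (the window sizes below are illustrative):

```python
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_idx, test_idx) windows that move forward through time,
    so every evaluation uses only data unseen during fitting."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = np.arange(start, start + train_size)
        test = np.arange(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

splits = list(walk_forward_splits(n_obs=1000, train_size=500, test_size=100))
print(len(splits))  # 5 non-overlapping forward test windows
```

Unlike a shuffled cross-validation split, each test window here sits strictly after its training window, so a strategy fitted on the past is always judged on its future.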
A useful mental model: corrections are the seatbelt. Good research design is not driving into walls. You want both, but if you had to choose, you would choose not driving into walls.
A Practical Research Standard
The goal is not to eliminate false discoveries — that would require never testing anything. The goal is to keep the false discovery rate at a level where the surviving results are worth acting on.
In practice, this means three things.
First, count your tests. Before reporting a result, estimate how many tests — explicit and implicit — were run to arrive at it. If you cannot estimate the number, assume it is higher than you think.
Second, apply a correction. BH at 5% FDR is a reasonable default for exploratory research. Bonferroni for confirmatory tests with a small number of pre-specified hypotheses. Report the correction method alongside the result.
Third, demand out-of-sample confirmation. A corrected p-value on in-sample data is a filter, not a proof. The strategy still needs to demonstrate edge on data it has never seen. If it passes the correction in-sample and then fails out-of-sample, the correction did its job — the result was probably noise.
False Discovery Risk Checklist
- Can you estimate the total number of tests (explicit + implicit) that led to this result?
- Has a multiple testing correction been applied? Which one, and at what FDR level?
- Was the hypothesis pre-specified, or did it emerge from data exploration?
- Does the result survive in a true out-of-sample period (not a reshuffled in-sample)?
- If a parameter sweep was involved, does the edge persist across a range of parameters or only at the optimum?
- Would you still trust this result if the number of tests were 3x higher than your estimate?
Takeaway
Statistical significance is not the finish line. It is the first filter.
When that filter is applied once, it works reasonably well. When it is applied hundreds of times — across parameter sweeps, feature screens, and informal exploration — it lets through noise at a rate that can dominate the results. The strategies that look best are often the ones that got luckiest.
The fix is not to stop testing. It is to adjust the threshold for the number of tests, demand out-of-sample confirmation, and treat every uncorrected significant result with the suspicion it deserves. That discipline is not exciting. It is what separates research that holds up from research that disappoints.