Context
The previous notes in this series have addressed several ways a backtest can mislead: fitting the data too tightly (Methodology Notes #7), contaminating the research process through repeated exposure (#8), failing to account for how many strategies were tested (#5), and mistaking a narrow parameter optimum for a real effect (#10). Each of those failure modes has a common upstream dependency: the dataset must contain enough information to support the conclusion being drawn. If it does not, every downstream diagnostic is compromised.
Sample size is the most fundamental constraint in quantitative research, and it is the one most frequently hand-waved away. A researcher with 20 years of daily data has roughly 5,000 trading days. That sounds like a lot. But the effective number of independent observations is almost always much smaller -- sometimes dramatically so -- and the gap between the nominal sample size and the effective sample size is where false confidence originates.
This note examines the relationship between sample size, effect size, and statistical power in the context of systematic research. It covers why financial data yields fewer independent observations than calendar time suggests, how to estimate effective sample size in the presence of autocorrelation and regime structure, what practical minimum sample sizes look like for different research frequencies, and how to determine whether a dataset is large enough to detect the effect you are looking for -- or whether you are running an experiment that cannot succeed.
The core question is simple: does your data contain enough information to distinguish a real effect from noise at the magnitude you expect? If the answer is no, no amount of methodological rigor in later stages can rescue the conclusion.
Why Nominal Sample Size Overstates Information Content
A dataset of 5,000 daily observations does not contain 5,000 independent pieces of information. Financial time series exhibit serial dependence -- today's return is correlated with yesterday's, this week's volatility is correlated with last week's, and the market regime operating on any given day is likely the same regime that was operating the day before. Each of these dependencies reduces the effective sample size below the nominal count.
Autocorrelation is the most direct mechanism. If daily returns have a first-order autocorrelation of 0.05 -- which is modest by financial standards -- the effective sample size is reduced by a factor of approximately (1 - 0.05) / (1 + 0.05), or roughly 0.90. That is a 10% reduction for a nearly imperceptible level of serial dependence. At higher frequencies, where autocorrelation can be 0.15-0.30 in volatility or spread measures, the effective sample shrinks by 25-45%. The researcher who reports '5,000 observations' may in practice have the informational equivalent of 3,000.
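The arithmetic above can be sketched in a few lines. This is the textbook AR(1) adjustment, not a substitute for estimating the full autocorrelation structure:

```python
# Effective sample size under AR(1) serial dependence:
#   N_eff = N * (1 - rho) / (1 + rho)

def ar1_effective_n(n_nominal: int, rho: float) -> float:
    """Effective sample size for a series with lag-1 autocorrelation rho,
    assuming an AR(1) dependence structure."""
    return n_nominal * (1.0 - rho) / (1.0 + rho)

for rho in (0.05, 0.15, 0.30):
    print(f"rho = {rho:.2f}: N_eff = {ar1_effective_n(5000, rho):,.0f}")
# rho = 0.05 keeps ~90% of the nominal sample; rho = 0.30 keeps ~54%
```

At rho = 0.30 the 5,000-day sample shrinks to roughly 2,700 effective observations, consistent with the 25-45% reduction noted above.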
Clustering compounds the problem. Financial returns are not uniformly distributed across time. Volatility clusters, liquidity clusters, and regime persistence mean that large blocks of adjacent observations carry similar information. Ten consecutive days in a low-volatility trending environment are not ten independent data points about how the strategy performs -- they are closer to one or two observations of 'how the strategy behaves in this specific regime.' The nominal count of days is not the relevant unit of measurement.
The practical consequence is that significance tests, confidence intervals, and Sharpe ratio estimates computed using nominal sample sizes are systematically too optimistic. The standard errors are too small, the confidence bands are too narrow, and the apparent significance is overstated. A t-statistic of 2.5 computed with 5,000 nominal observations may correspond to a t-statistic of 1.8 or lower once effective sample size is accounted for -- which may not clear standard significance thresholds.
Statistical Power and Effect Size
Statistical power is the probability that a test will correctly detect a real effect when one exists. A test with 80% power will detect a true effect 80% of the time; the remaining 20% of the time, it will fail to reject the null hypothesis even though the effect is genuine. Power depends on three quantities: the size of the effect, the noise level in the data, and the number of independent observations.
In systematic research, the effect size is typically small. A strategy that generates 3-5% annualized excess return with 15% annualized volatility has a Sharpe ratio of roughly 0.2-0.3. Detecting an effect of that magnitude with statistical confidence requires a large number of independent observations. The required sample size scales inversely with the square of the effect size: halving the expected Sharpe ratio quadruples the data requirement. This is why strategies with modest edges need decades of data to validate, while the typical backtest covers five to ten years.
The power curve illustrates this relationship. The t-statistic of a mean-return test grows roughly as the annualized Sharpe ratio times the square root of the track length in years, so for a true Sharpe ratio of 0.3, achieving 80% power at a two-sided 5% significance threshold requires on the order of ((1.96 + 0.84) / 0.3)^2, or roughly 87 years of data -- even a one-sided test needs roughly 69 years, and autocorrelation adjustments push the requirement higher still. Most backtests use 10-15 years. They are running underpowered experiments whose conclusions rest on sample sizes that cannot reliably distinguish the claimed effect from noise.
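A minimal sketch of this power calculation, using the normal approximation t ~ SR * sqrt(years) and ignoring autocorrelation and non-normality, so the outputs are best-case lower bounds:

```python
from statistics import NormalDist

def required_years(sharpe: float, alpha: float = 0.05, power: float = 0.80,
                   two_sided: bool = True) -> float:
    """Years of data needed for a mean-return t-test to reach the given
    power at the given significance level (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_sided else nd.inv_cdf(1 - alpha)
    z_power = nd.inv_cdf(power)
    return ((z_alpha + z_power) / sharpe) ** 2

print(f"SR = 0.3, two-sided 5%: {required_years(0.3):.0f} years")
print(f"SR = 0.3, one-sided 5%: {required_years(0.3, two_sided=False):.0f} years")
```

Note the inverse-square scaling: halving the Sharpe ratio quadruples the requirement, exactly as described above.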
An underpowered test does not simply fail to detect real effects. It produces a distorted picture of reality. Among the results that do achieve significance in an underpowered setting, the estimated effect sizes are systematically inflated -- a phenomenon known as the winner's curse or Type M error. The effects that clear the significance bar in small samples tend to be the ones that were lucky, not the ones that were large. This connects directly to the false discovery risk framework from Methodology Notes #5: underpowered tests amplify the false discovery rate.
Autocorrelation and the Effective Sample Size Calculation
Estimating effective sample size requires accounting for the serial dependence structure of the data. The standard adjustment uses the autocorrelation function of the series being tested. For a stationary time series with autocorrelations rho_1, rho_2, ..., rho_k at lags 1 through k, the effective sample size is approximately N / (1 + 2 * sum of rho_i for i = 1 to k), where N is the nominal sample size and the sum runs over all significant lags.
In practice, the adjustment can be substantial. For a strategy that generates returns with significant autocorrelation at lags 1 through 5 -- which is common for strategies with multi-day holding periods or overlapping observation windows -- the denominator can easily reach 1.5-2.0, cutting the effective sample in half. Strategies that use rolling windows for signal construction are particularly susceptible because the overlapping windows mechanically induce autocorrelation in the return series, regardless of the underlying market dynamics.
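A direct implementation of the formula above, using the sample autocorrelation function. This is a sketch: in practice the lag cutoff should come from inspecting the ACF rather than being fixed in advance.

```python
import numpy as np

def effective_sample_size(returns, max_lag: int) -> float:
    """N / (1 + 2 * sum of sample autocorrelations at lags 1..max_lag)."""
    x = np.asarray(returns, dtype=float)
    n = len(x)
    d = x - x.mean()
    var = np.dot(d, d) / n
    rho_sum = sum(np.dot(d[:-k], d[k:]) / (n * var)
                  for k in range(1, max_lag + 1))
    # conservative guard: never report more than N effective observations
    return n / max(1.0 + 2.0 * rho_sum, 1.0)

# demo on a simulated AR(1) series with rho = 0.3
rng = np.random.default_rng(42)
x = np.empty(5000)
x[0] = rng.standard_normal()
for t in range(1, 5000):
    x[t] = 0.3 * x[t - 1] + rng.standard_normal()
print(effective_sample_size(x, max_lag=10))  # theory: ~2692 of 5000
```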
A practical approach is to compute the Newey-West adjustment for the standard error of the mean return, using a lag length proportional to the cube root of the sample size. The ratio of the naive standard error to the Newey-West standard error gives an estimate of the effective sample size deflator. If the Newey-West standard error is 1.4 times the naive standard error, the effective sample size is roughly half the nominal size. This calculation takes minutes and should be a standard step in any research pipeline.
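The Newey-West version can be sketched with a hand-rolled Bartlett-kernel estimator. Libraries such as statsmodels provide equivalent HAC corrections; writing it out keeps the arithmetic visible. The cube-root lag rule below is one common convention, used here as an assumption:

```python
import numpy as np

def nw_ess_deflator(returns) -> float:
    """N_eff / N implied by the Newey-West standard error of the mean.
    Bartlett kernel, lag length ~ N ** (1/3). The deflator equals
    (naive SE / Newey-West SE) ** 2."""
    x = np.asarray(returns, dtype=float)
    n = len(x)
    d = x - x.mean()
    gamma0 = np.dot(d, d) / n                 # lag-0 autocovariance
    lags = max(1, round(n ** (1.0 / 3.0)))
    lrv = gamma0                              # long-run variance estimate
    for lag in range(1, lags + 1):
        w = 1.0 - lag / (lags + 1.0)          # Bartlett weight
        lrv += 2.0 * w * np.dot(d[:-lag], d[lag:]) / n
    return gamma0 / max(lrv, 1e-12)

# effective sample size: n_eff = len(returns) * nw_ess_deflator(returns)
```

A deflator of 0.5 means the Newey-West standard error is about 1.4 times the naive one, matching the worked example above.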
Ignoring autocorrelation does not make it go away. It simply means the researcher reports confidence intervals that are too tight and significance levels that are too generous. The effect is not subtle: we have observed cases in our own pipeline where adjusting for serial dependence moved a result from statistically significant at the 1% level to insignificant at the 10% level. The underlying data was the same. Only the honesty of the inference changed.
Regime Changes and the Relevance Decay of Historical Data
Even after adjusting for autocorrelation, there is a deeper problem with using long historical samples: older data may not be relevant to the current market environment. Market microstructure evolves. Decimalization, the rise of electronic trading, changes in margin requirements, the growth of passive investing, and shifts in monetary policy regimes all mean that the market of 2006 is structurally different from the market of 2026. Data from an earlier era may be statistically valid but economically uninformative.
This creates a painful trade-off. Longer samples increase statistical power by adding observations. But they also dilute the sample with data from regimes that may no longer apply. A 30-year backtest has more statistical power than a 10-year backtest, but if the first 20 years reflect a market structure that no longer exists, the added power is illusory -- the test is well-powered to detect an effect that is no longer present.
Regime detection methods, discussed in Methodology Notes #6, can help identify structural breaks in the data. When a regime break is identified, the researcher faces a choice: use the full sample and accept that the early data may be from a different generating process, or truncate the sample at the regime break and accept the reduced statistical power. Neither choice is comfortable. The full sample is more powerful but potentially contaminated. The truncated sample is cleaner but may be too short to support any conclusion.
A pragmatic approach is to weight observations by recency -- applying an exponential decay or a step function that gives recent data more influence than older data. This preserves some of the statistical benefit of a longer sample while reducing the contamination from stale regimes. The decay rate itself becomes a parameter, but it is a parameter with economic meaning (how fast does market structure evolve?) rather than a fitting parameter. In our work, we typically apply a half-life of 5-7 years for daily strategies, meaning data from 15 years ago receives roughly one-eighth to one-quarter of the weight of recent data.
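One way to sketch the weighting scheme. The half-life value is the illustrative 5-7 year range from the text, expressed in trading days:

```python
import numpy as np

def half_life_weights(n_obs: int, half_life: float) -> np.ndarray:
    """Exponential-decay observation weights, oldest first.
    A weight halves every `half_life` observations; weights sum to 1."""
    ages = np.arange(n_obs - 1, -1, -1, dtype=float)  # 0 = most recent
    w = 0.5 ** (ages / half_life)
    return w / w.sum()

# 6-year half-life on daily data (~252 trading days per year)
w = half_life_weights(20 * 252, 6 * 252)
# an observation 15 years old carries 0.5 ** (15 / 6), about 0.18,
# of the newest observation's weight
```

The recency-weighted mean return is then `np.dot(w, returns)`; the same weights can feed weighted variance or regression estimates.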
Practical Minimum Sample Sizes by Strategy Frequency
The minimum sample size required to draw a reliable conclusion depends on the expected effect size, the noise level, and the degree of serial dependence -- all of which vary by strategy frequency. The following are rough empirical guidelines based on standard power analysis assumptions (80% power, 5% significance, moderate autocorrelation adjustment).
For daily strategies with expected Sharpe ratios of 0.5 or above, the standard power calculation (80% power, two-sided 5% test) already implies roughly 30 years of data -- before any autocorrelation adjustment. For Sharpe ratios in the 0.3 range -- which is more realistic for most systematic approaches -- the requirement approaches 90 years, and below a Sharpe of 0.2 it approaches two centuries. For modest edges, in other words, the required sample exceeds what most researchers have available, which means the backtest cannot reliably distinguish the effect from noise regardless of how clean the methodology is.
Sampling frequency changes these numbers less than intuition suggests. The precision of an annualized Sharpe estimate is governed primarily by the calendar span of the sample, not by how finely that span is sliced, so weekly and monthly strategies face essentially the same calendar-time requirements as daily ones -- slightly worse at the margin, because the variance itself must be estimated from fewer observations. A monthly strategy with a Sharpe of 0.5 still needs on the order of 30 years of history, and most monthly strategies operate at lower Sharpe ratios than that.
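For reference, the strict requirement under the stated assumptions can be tabulated directly. A sketch under the normal approximation; the second column applies an illustrative effective-sample-size deflator of 0.7 as a mild autocorrelation adjustment:

```python
from statistics import NormalDist

nd = NormalDist()
z = nd.inv_cdf(0.975) + nd.inv_cdf(0.80)   # two-sided 5%, 80% power

print("Sharpe | years (iid) | years (ESS deflator 0.7)")
for sr in (0.5, 0.3, 0.2):
    base = (z / sr) ** 2
    print(f"  {sr:.1f}  | {base:>11.0f} | {base / 0.7:>24.0f}")
```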
These numbers are sobering. They imply that a large fraction of published backtests -- and an even larger fraction of informal research -- operate with insufficient data to support their conclusions at standard confidence levels. The research may still be directionally informative, but presenting a 5-year backtest as statistically validated evidence of a 0.3-Sharpe effect is not supported by power analysis. Acknowledging this limitation explicitly is more honest than ignoring it.
A Practical Assessment Framework
Before trusting a backtest result, assess whether the dataset is large enough to support the conclusion. The following checklist provides a structured approach.
Sample Size Adequacy Checklist
- Have you computed the effective sample size after adjusting for autocorrelation (e.g., Newey-West correction)?
- Is the effective sample size large enough to detect the expected effect at 80% power and 5% significance?
- Have you estimated the minimum detectable effect size given your actual sample -- and is that threshold meaningful?
- Does the sample span multiple distinct market regimes, or is it dominated by a single environment?
- If using data older than 10 years, have you assessed whether structural market changes reduce the relevance of early observations?
- Have you applied recency weighting or regime-conditional analysis to avoid diluting conclusions with stale data?
- If the power analysis indicates the test is underpowered, have you acknowledged this limitation rather than proceeding as if the sample were sufficient?
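The minimum-detectable-effect item in the checklist can be answered by inverting the power formula. A sketch under the normal approximation; `years` should be effective years, i.e. the nominal span multiplied by the autocorrelation deflator:

```python
from statistics import NormalDist

def min_detectable_sharpe(years: float, alpha: float = 0.05,
                          power: float = 0.80) -> float:
    """Smallest annualized Sharpe ratio detectable at the given two-sided
    significance level and power (normal approximation)."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) / years ** 0.5

print(f"{min_detectable_sharpe(10):.2f}")   # 10 effective years
print(f"{min_detectable_sharpe(40):.2f}")   # 40 effective years
```

If the strategy's expected Sharpe is well below the returned threshold, the test is underpowered by construction and the checklist's final item applies.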
Takeaway
Sample size is not a detail. It is the foundation on which every statistical conclusion rests. A backtest with insufficient data cannot be rescued by better methodology -- it can only be made to look more rigorous while remaining unreliable. The diagnostics from earlier notes in this series -- overfitting checks, snooping defenses, sensitivity analysis -- all assume that the underlying dataset contains enough independent information to support inference. When it does not, those diagnostics are testing noise against noise.
The effective sample size in financial research is almost always smaller than the nominal count of observations. Autocorrelation, regime clustering, and overlapping observation windows reduce the information content of the data, often by a factor of two or more. Researchers who do not adjust for this are implicitly claiming more precision than their data supports.
The most uncomfortable implication of power analysis is that many research questions cannot be answered with the data available. A strategy with a true Sharpe of 0.25 and 10 years of daily data sits in a regime where a significance test has power well below 50 percent. The honest response is not to test anyway and hope for significance. It is to acknowledge the limitation, seek additional data or alternative validation methods, and resist the temptation to treat an underpowered result as evidence.