Context
Every researcher who has moved a systematic strategy from backtest to live deployment has encountered the same experience: the live results are worse. Sometimes marginally, sometimes materially. Rarely better. The gap between what the model showed and what the market delivered is not a sign that the research was wrong -- it is a structural feature of how backtesting works.
Backtests operate on a simplified model of execution. Prices are assumed to be available at the moment the research logic fires. Orders are assumed to fill at the price on the chart. Costs are assumed to be known, symmetric, and consistent. None of these assumptions hold in production, and each violation chips away at expected performance.
This note -- the ninth in our Methodology Notes series -- examines the anatomy of the execution gap: where it comes from, how each component contributes, and what can be done to narrow the distance between modeled and realized performance. Earlier notes covered survivorship bias (#1), walk-forward validation (#2), transaction cost modeling (#3), look-ahead bias (#4), false discovery risk (#5), regime detection (#6), overfitting diagnostics (#7), and data snooping (#8). Execution gap is a natural continuation: even a methodologically clean backtest will overstate live performance if execution mechanics are poorly modeled.
The goal is not to make the backtest pessimistic. It is to make the modeled performance honest -- close enough to realistic expectations that the live deployment does not produce a structural surprise.
The Structural Reasons Backtests Overstate Live Performance
The execution gap is not a single mistake. It is a stack of small optimistic assumptions that compound across every observation in the dataset. Taken individually, each looks minor. Taken together across thousands of historical observations, they can produce a performance gap of several hundred basis points annually on strategies with moderate turnover -- and much more on high-frequency or capacity-constrained approaches.
The most fundamental reason is that a backtest observes prices but does not participate in price formation. It uses historical market data as if that data was produced independently of the strategy being tested. In reality, any order placed in the market changes the available liquidity, moves the price, and interacts with the order book in ways that the backtest cannot see. The backtest is a passive observer of a market that, in live trading, it would be actively perturbing.
A second structural issue is that backtests typically assume 100% fill rates. Every observation that meets research criteria is assumed to result in a full position at the assumed price. In practice, orders are partially filled, queued behind other participants, skipped entirely, or filled at materially worse prices during fast-moving conditions. Missed fills are not random -- they are systematically correlated with the cases where the trade would have been most valuable, because those cases are also the ones with the most adverse price movement.
Third, the sequence of operations is compressed in backtests. Research criteria evaluation, decision logic, order generation, and execution are treated as instantaneous. In production, each step takes time, and that time has a cost. By the time an order reaches the market, conditions have shifted. The price the model observed is already history.
Slippage: The Gap Between Observed Price and Fill Price
Slippage is the difference between the price at which a research system identifies an observation and the price at which the resulting order actually fills. It is one of the most consistently underestimated costs in systematic research, partly because it does not appear as a fee on a brokerage statement and partly because it varies with conditions in ways that are difficult to model from historical data alone.
The simplest form of slippage is bid-ask spread crossing. Any market order that crosses the spread pays the half-spread as an immediate execution cost. For liquid large-cap equities, this may be only a few basis points. For small-caps, thinly traded instruments, or periods of elevated volatility, the spread can be ten to fifty basis points wide, and crossing it on every round-trip imposes a significant structural drag.
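The arithmetic of this drag is worth making concrete. The sketch below uses assumed spread widths and turnover figures, chosen only for illustration; the model is simply that each market order pays the half-spread, so a round trip pays the full quoted spread:

```python
# Back-of-the-envelope spread-crossing drag. Each market order pays
# the half-spread; a round trip (entry + exit) pays the full spread.

def annual_spread_drag_bps(spread_bps: float, round_trips_per_year: int) -> float:
    """Total spread-crossing cost per year, in basis points."""
    return spread_bps * round_trips_per_year

# Assumed figures: a liquid large-cap with a ~4 bps spread vs. a thin
# small-cap with a ~30 bps spread, both traded 50 round trips a year.
liquid = annual_spread_drag_bps(4.0, 50)    # 200 bps/year
thin = annual_spread_drag_bps(30.0, 50)     # 1500 bps/year
print(liquid, thin)
```

At identical turnover, the thin name pays 15% a year in spread alone -- before slippage or impact -- which is why a flat spread assumption calibrated on liquid names is so misleading.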
Beyond the spread, slippage includes the movement of the price against the order between the moment of observation and the moment of execution. Over even a one-minute delay, this can be trivial in calm conditions and severe during high-momentum or news-driven moves -- precisely the conditions that generate the most compelling research observations. Backtests do not account for this adversarial dynamic. They see the price at the observation moment and assume it is available. Markets, in practice, move away from favorable prices.
A well-calibrated slippage model should vary by instrument, by liquidity regime, by order size relative to average daily volume, and by time of day. A single flat slippage assumption applied uniformly across the dataset will understate costs in exactly the cases where costs are highest: small names, thin liquidity, large relative order sizes, volatile periods. This is not a conservative error -- it is a systematically optimistic one.
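A minimal sketch of a conditional slippage model might look like the following. The size coefficient, the multipliers, and the functional form are all assumptions for illustration -- in practice each would be calibrated against real fill data per instrument:

```python
import math

def slippage_bps(spread_bps: float,
                 order_size: float,
                 adv: float,
                 vol_multiplier: float = 1.0,
                 time_of_day_multiplier: float = 1.0) -> float:
    """Hypothetical per-trade slippage estimate (all coefficients assumed).

    Half-spread crossing plus a size-dependent term, scaled by
    liquidity-regime and time-of-day multipliers.
    """
    participation = order_size / adv
    size_term = 10.0 * math.sqrt(participation)  # assumed coefficient
    return (spread_bps / 2.0 + size_term) * vol_multiplier * time_of_day_multiplier

# Liquid name, small order, calm midday conditions:
calm = slippage_bps(spread_bps=4.0, order_size=10_000, adv=5_000_000)
# Thin name, large order, volatile open:
stressed = slippage_bps(spread_bps=30.0, order_size=100_000, adv=500_000,
                        vol_multiplier=2.0, time_of_day_multiplier=1.5)
print(f"calm: {calm:.1f} bps, stressed: {stressed:.1f} bps")
```

The point is structural, not the specific numbers: the same model produces costs that differ by an order of magnitude across conditions, which is exactly the variation a flat assumption erases.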
Market Impact: Capacity Constraints and Price Perturbation
Market impact is the price movement caused by the order itself. Unlike slippage, which is primarily about the gap between observation and execution, market impact is about what happens to the market as a direct result of participating in it. For small orders relative to average daily volume, market impact is negligible. For orders that represent a meaningful fraction of daily volume, it becomes the dominant execution cost.
The square-root impact model -- widely used in institutional execution research -- suggests that market impact scales roughly with the square root of participation rate (order size divided by average daily volume). An order representing 1% of ADV might incur 5 basis points of impact; an order representing 10% of ADV might incur 15-20 basis points. These numbers are not precise, but the qualitative point is robust: although per-unit impact grows only with the square root of size, total impact cost -- size times per-unit impact -- grows faster than linearly. Under this model, doubling order size multiplies the total impact cost by roughly 2^1.5, close to 2.8.
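The scaling can be checked in a few lines. The coefficient k = 50 below is an assumed calibration constant, chosen so that 1% of ADV costs about 5 bps, matching the figures above:

```python
import math

def sqrt_impact_bps(order_size: float, adv: float, k: float = 50.0) -> float:
    """Square-root impact model: per-unit impact = k * sqrt(size / ADV).

    k is an assumed calibration constant; real values come from broker
    or empirical transaction-cost data.
    """
    return k * math.sqrt(order_size / adv)

one_pct = sqrt_impact_bps(0.01, 1.0)   # 1% of ADV  -> 5.0 bps per unit
ten_pct = sqrt_impact_bps(0.10, 1.0)   # 10% of ADV -> ~15.8 bps per unit
# Total cost scales as size^1.5: doubling size ~2.8x's the total cost.
ratio = (0.02 * sqrt_impact_bps(0.02, 1.0)) / (0.01 * one_pct)
print(one_pct, ten_pct, ratio)
```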
For systematic strategies, this creates a capacity constraint that backtests cannot reveal. A strategy may appear excellent in research -- where it is modeled as if it can execute at any size without market consequence -- but deteriorate or become unviable at the capital levels where it would actually be deployed. The backtest result is implicitly conditioned on infinitely small order sizes. Live trading is not.
Capacity constraints are particularly acute in strategies that concentrate in smaller names, trade during low-volume periods, or require rapid simultaneous execution across many positions. Stress testing a strategy's expected market impact across a range of deployment sizes is a necessary step before treating backtest results as representative of what live execution at scale would look like. A result that holds only at fractional size is not a production-ready result.
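One way to run such a stress test is to sweep deployment size against a modeled edge and watch impact consume it. Every number here -- the gross edge, the ADV, the impact coefficient -- is an assumed illustrative value, not a calibrated one:

```python
import math

GROSS_EDGE_BPS = 40.0        # assumed gross edge per round trip
ADV_DOLLARS = 20_000_000.0   # assumed average daily dollar volume
K = 50.0                     # assumed square-root impact coefficient

def net_edge_bps(deployment_dollars: float) -> float:
    """Modeled net edge after paying impact on both entry and exit."""
    participation = deployment_dollars / ADV_DOLLARS
    impact = K * math.sqrt(participation)   # bps, per side
    return GROSS_EDGE_BPS - 2.0 * impact    # round trip crosses twice

for size in (200_000, 1_000_000, 4_000_000):
    print(f"${size:>9,}: net edge {net_edge_bps(size):6.1f} bps")
```

Under these assumed parameters, the edge is healthy at $200k, roughly halved at $1M, and negative at $4M -- a viable-at-fractional-size result, not a production-ready one.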
Fill Assumptions: Close, Mid, and VWAP
Among the most consequential and least-examined choices in backtest construction is the fill assumption: which price in the historical record is used as the assumed execution price. The three most common assumptions are the closing price, the mid-price (midpoint of bid and ask), and VWAP (volume-weighted average price). Each introduces a different form of distortion.
Closing price fills assume that all executions occur at the official market close. This is convenient because close prices are clean, widely available, and unambiguous. But it introduces an optimism problem: most strategies observe their criteria during the trading session and then retroactively assume they filled at the close -- a price they could not have known in advance. Even when research logic is constructed to observe criteria at the prior close and act at the current close, there is still an implicit assumption that the close price was fully available and that there was no adverse selection in reaching it.
Mid-price fills are systematically optimistic in a different way. The mid-price is the midpoint between the best bid and best ask. Actual transactions do not occur at the mid-price -- they occur at the bid (when selling) or the ask (when buying). Using the mid-price as the assumed fill therefore omits the entire cost of crossing the spread, which is typically the largest single component of execution cost for small-to-medium orders.
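The omitted cost is easy to quantify. A small sketch, with an assumed quoted bid and ask:

```python
# A mid-price backtest silently omits the full quoted spread: actual
# buys fill at the ask, actual sells at the bid.

def round_trip_spread_cost_bps(bid: float, ask: float) -> float:
    """Cost of buying at the ask and selling at the bid, vs. the mid."""
    mid = (bid + ask) / 2.0
    return (ask - bid) / mid * 1e4

# A 10-cent spread on a $50 stock (assumed quote):
print(round_trip_spread_cost_bps(49.95, 50.05))  # ~20 bps per round trip
```

Twenty basis points per round trip, on every trade, is the cost a mid-price fill assumption removes from the modeled record.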
VWAP fills are conceptually appealing because VWAP represents the average execution price achievable by a patient trader participating proportionally throughout the day. However, using historical VWAP as a backtest fill price assumes that the strategy's orders are small enough to not move VWAP, that the trader has the full trading session available for execution, and that no information leakage occurs during the execution window. For concentrated, time-sensitive, or capacity-constrained strategies, VWAP is an upper bound on execution quality, not a realistic expectation.
Timing Delays: The Latency Tax
Even if fill price assumptions were perfectly calibrated, there would still be a timing gap between the moment a research system identifies an observation and the moment an order reaches the market. This latency operates at multiple levels: data latency (how stale is the market data the system is reading?), computation latency (how long does it take to evaluate criteria and generate orders?), order routing latency (how long does the order take to reach the exchange?), and queue position latency (how long does the order wait before executing?).
For daily-resolution strategies, timing latency is typically measured in minutes and its direct cost is small. The more important timing issue for daily systems is the overnight gap: a strategy may observe criteria at the prior close and enter at the open, but the open price can be materially different from the close. This is particularly true during earnings seasons, macroeconomic announcements, or periods of overnight news flow. Backtests that assume open fills do capture this gap, because the gap itself is in the data -- but they still assume the fill at the open is guaranteed, when in practice the open can be a chaotic, low-liquidity period.
For intraday strategies, timing latency can be the dominant execution cost. A strategy that reacts to a condition at the 1-minute mark may find that the edge it identified has already been captured by faster participants by the time its order arrives. The market has moved. The expected fill price is no longer achievable. The backtest, which saw the price at the observation moment and assumed it was available, shows a clean fill that never occurred in practice.
Measuring timing sensitivity is a practical diagnostic. Re-running a backtest with a 1-bar delay inserted between observation and execution is a revealing test: if performance degrades sharply, the strategy is timing-sensitive in a way that live trading will penalize. A robust strategy should show gradual, not catastrophic, degradation as delay is increased. Sharp degradation at even a 1-bar delay is a warning that execution quality in the model is carrying more of the apparent edge than the research logic itself.
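A minimal version of this diagnostic is a signal shift. The return series and signal below are synthetic and purely illustrative -- the point is the mechanics of inserting the delay, not the toy strategy:

```python
import numpy as np

def delayed_strategy_returns(signal: np.ndarray,
                             returns: np.ndarray,
                             delay_bars: int) -> np.ndarray:
    """Per-bar strategy returns when execution lags the signal.

    signal[t] is the position decided at bar t; with a delay of d bars
    it is not in force until bar t + 1 + d.
    """
    pos = np.roll(signal, 1 + delay_bars)
    pos[: 1 + delay_bars] = 0.0          # no position before the first signal
    return pos * returns

# Toy autocorrelated return series (assumed AR(1), mild persistence):
rng = np.random.default_rng(7)
noise = rng.normal(0.0, 0.01, 5000)
rets = np.empty_like(noise)
rets[0] = noise[0]
for t in range(1, len(rets)):
    rets[t] = 0.3 * rets[t - 1] + noise[t]

sig = np.sign(rets)                      # causal one-bar momentum signal
for d in (0, 1, 2, 5):
    cum = delayed_strategy_returns(sig, rets, d).sum()
    print(f"delay {d} bars: cumulative return {cum:+.2f}")
```

On a persistent toy series like this, performance should decay gradually with delay; a real strategy that instead collapses at a one-bar delay is flagging that its edge lives in the fill assumption.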
Measuring and Modeling the Execution Gap
The most direct way to measure the execution gap is to maintain parallel records of modeled fills and actual fills from the beginning of live deployment, then decompose the difference systematically. Each live trade produces an actual fill price. The research model produces a modeled fill price for the same observation. The difference, accumulated across all trades and broken down by instrument, time of day, order size, and market condition, reveals which components of the execution model are least accurate.
Before live deployment is available, an alternative is to construct a realistic execution simulation using available market microstructure data. This means: replacing flat spread assumptions with instrument-specific historical spread data, applying a market impact estimate based on order size relative to ADV, inserting a configurable execution delay and observing sensitivity, and applying a fill rate model that probabilistically misses or degrades fills in adverse conditions. This is more work than a flat-cost assumption, but it typically produces a significantly more calibrated performance expectation.
A practical benchmark for evaluating execution model quality is the implementation shortfall metric, widely used in institutional trading. Implementation shortfall measures the total cost of going from a decision to a completed position, including the price movement from decision to first fill, the spread crossing cost, the market impact during execution, and any opportunity cost from missed or partial fills. Decomposing implementation shortfall across a strategy's full trade history identifies which cost component is largest and therefore where modeling effort will have the highest return.
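A decomposition in the spirit of the standard implementation shortfall framework can be sketched as follows; the field names and sign conventions are illustrative choices, not a fixed standard:

```python
def implementation_shortfall_bps(decision_price: float,
                                 arrival_price: float,
                                 avg_fill_price: float,
                                 filled_qty: float,
                                 intended_qty: float,
                                 close_price: float,
                                 side: str = "buy") -> dict:
    """Decompose shortfall into delay, execution, and opportunity cost.

    All components in bps of the decision price; positive = cost.
    """
    sgn = 1.0 if side == "buy" else -1.0
    w = filled_qty / intended_qty                      # fraction completed
    delay = sgn * (arrival_price - decision_price) / decision_price * 1e4
    execution = sgn * (avg_fill_price - arrival_price) / decision_price * 1e4
    opportunity = (1 - w) * sgn * (close_price - decision_price) / decision_price * 1e4
    total = w * (delay + execution) + opportunity
    return {"delay": delay, "execution": execution,
            "opportunity": opportunity, "total": total}

# A buy decided at 100.00, arriving at 100.05, filled 800/1000 at an
# average of 100.12, with the unfilled remainder marked at a 100.50 close:
print(implementation_shortfall_bps(100.0, 100.05, 100.12, 800, 1000, 100.50))
```

Run across a full trade history and grouped by instrument or condition, the component with the largest average magnitude is where execution-modeling effort pays off first.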
The practical goal is not to eliminate the execution gap entirely -- some gap is structural and irreducible. The goal is to model it accurately enough that the performance expectation set before deployment is close to the performance realized after it. A researcher who launches live trading expecting backtest-level returns and encounters materially worse results is not experiencing bad luck. They are experiencing the predictable consequence of an execution model that did not account for the full cost of participating in real markets.
A Practical Checklist
Before treating a backtest result as a realistic performance expectation for live deployment, apply the following questions to the execution model.
Execution Gap Pre-Deployment Checklist
- Does the fill assumption (close/mid/VWAP) reflect achievable execution, or is it an optimistic proxy?
- Is slippage modeled per-instrument and per-liquidity-regime, or applied as a flat uniform cost?
- Has market impact been estimated across a range of deployment sizes, not just at fractional scale?
- Is fill rate assumed at 100%, or is there a realistic miss/partial-fill model?
- Has a 1-bar execution delay been inserted and sensitivity tested?
- Has the strategy's capacity been stress-tested -- does it remain viable at the intended deployment size?
- Is there a plan to track modeled vs. actual fills from day one of live deployment?
Takeaway
The execution gap is one of the most reliable sources of disappointment in systematic research. It is not a sign of dishonest modeling or strategic failure -- it is a structural consequence of the gap between the simplified world a backtest inhabits and the adversarial, latency-constrained, impact-generating world that live trading occupies.
Closing the gap requires replacing optimistic default assumptions with instrument-specific, regime-aware, size-sensitive models of execution quality. It requires acknowledging that 100% fill rates are fictional, that spread crossing is a real cost, that market impact grows nonlinearly with order size, and that the price on the chart at the moment of observation is rarely the price available when the order arrives.
The discipline of honest execution modeling is not a pessimism exercise. It is a calibration exercise. A strategy that survives realistic execution modeling -- where costs reflect what the market actually charges for participation -- is a strategy that has earned the right to be deployed. A strategy that only looks good under optimistic fill assumptions has not cleared that bar.