Methodology · 17 May 2026 · 6 min read

What Backtesting Actually Measures and What It Does Not

A backtest is a precise measurement of one historical path, not an estimate of future returns, and the distinction governs every downstream decision.

A backtest is a measurement instrument, not a time machine. It quantifies how a specific rule set would have transformed a specific historical price series into a specific equity curve, and nothing more. Treating its output as an estimate of future performance is among the most expensive mistakes systematic traders make, and the statistics most reports emphasize invite exactly that reading. Understanding what a backtest actually measures clarifies which decisions it can support and which it cannot.

What a backtest measures

A backtest measures the path-dependent outcome of applying a deterministic decision function to a finite historical sample, conditional on a specific execution model. That outcome is a sequence of trades, a P&L stream, and a set of summary statistics derived from both. Every number in the report — CAGR, max drawdown, win rate, profit factor — is a sample statistic of that single realized path through history.
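To make "sample statistic of a single realized path" concrete, here is a minimal sketch that derives the four headline numbers from one list of per-trade returns. The function name `summary_stats` and the five-trade list are hypothetical, and profit factor is computed on fractional returns rather than currency P&L, which is a simplification.

```python
import math

def summary_stats(trade_returns, years):
    """Summary statistics of one realized path of per-trade fractional returns."""
    equity, peak, max_dd = 1.0, 1.0, 0.0
    for r in trade_returns:
        equity *= 1.0 + r                      # compound each trade into the equity curve
        peak = max(peak, equity)               # running high-water mark
        max_dd = max(max_dd, 1.0 - equity / peak)
    wins = [r for r in trade_returns if r > 0]
    losses = [r for r in trade_returns if r < 0]
    gross_loss = -sum(losses)
    return {
        "cagr": equity ** (1.0 / years) - 1.0,
        "max_drawdown": max_dd,
        "win_rate": len(wins) / len(trade_returns),
        "profit_factor": sum(wins) / gross_loss if gross_loss > 0 else math.inf,
    }

# Hypothetical trades covering one year; any reordering changes max_drawdown
# but leaves win_rate and profit_factor untouched.
trades = [0.10, -0.05, 0.02, -0.10, 0.08]
stats = summary_stats(trades, years=1.0)
print(stats)
```

Every value in the returned dictionary is a function of this one ordered list. A different historical path through the same strategy logic would produce a different dictionary.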

This is a real measurement. It tells you whether the strategy logic compiles into trades that would have survived the historical regime, whether your position sizing produces tolerable drawdowns, and whether your assumed costs leave any edge intact. It also reveals structural defects: lookahead bias, survivorship in the universe, mismatched bar timing, fills at impossible prices. These are diagnostics, and backtests are excellent diagnostic tools.

What a backtest does not measure

A backtest does not measure the distribution of future returns. It measures one draw from a process whose generating distribution is unknown, non-stationary, and only partially overlapping with whatever distribution will produce tomorrow's prices. The standard error of any backtest statistic — including Sharpe — is large enough that two strategies with materially different reported performance often cannot be distinguished statistically.

It does not measure your edge. It measures the joint outcome of your edge, your overfitting, your data choices, and the specific regime that prevailed during the sample window. Separating these contributions requires out-of-sample testing, walk-forward analysis, and an honest accounting of how many configurations were tried before the reported one was selected.
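Walk-forward analysis, mentioned above, can be sketched as a generator of index windows in which every test window strictly follows its training window. The function name `walk_forward_splits` and the window sizes are illustrative assumptions, not a prescribed scheme.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_range, test_range) pairs; each test window follows its training window."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one test window

# Hypothetical sizes: 10 observations, fit on 4, evaluate on the next 2.
splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train, test in splits:
    print(list(train), list(test))
```

Parameters are refit on each training range and scored only on the matching test range, so no test observation is ever seen by the fit that is evaluated on it. The count of configurations tried across all windows still has to be tracked separately.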

A backtest with a Sharpe of 2.0 on ten years of daily data has a 95% confidence interval on the true Sharpe of roughly 1.4 to 2.6 under the standard i.i.d. approximation, wider when returns are autocorrelated, and that is before accounting for selection bias from parameter search. The point estimate is far less certain than its decimal places suggest.
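The width of that interval can be approximated with the usual large-sample standard error of a Sharpe ratio, sqrt((1 + SR²/2) / T) in per-period units. The sketch below assumes i.i.d. normal returns and 252 trading days per year. Under those assumptions the interval for a Sharpe of 2.0 over ten years comes out at roughly 1.4 to 2.6, and autocorrelated returns widen it further. The function name is hypothetical.

```python
import math

def sharpe_confidence_interval(annual_sharpe, years, periods_per_year=252, z=1.96):
    """Approximate CI for an annualized Sharpe ratio, assuming i.i.d. normal returns."""
    n_obs = years * periods_per_year
    per_period = annual_sharpe / math.sqrt(periods_per_year)
    # Large-sample standard error of the per-period Sharpe estimate.
    se_per_period = math.sqrt((1.0 + 0.5 * per_period ** 2) / n_obs)
    se_annual = se_per_period * math.sqrt(periods_per_year)
    return annual_sharpe - z * se_annual, annual_sharpe + z * se_annual

low, high = sharpe_confidence_interval(2.0, years=10)
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Because the standard error falls only with the square root of the sample length, halving the interval's width takes about four times as much data.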

The selection bias problem

The reported performance of a strategy is conditioned on the fact that you chose to report it. If you tested 200 parameter combinations and reported the best, the expected forward Sharpe is dramatically lower than the in-sample Sharpe even if the underlying edge is genuine. This is not a flaw in any specific backtest — it is a property of the search procedure that produced it.

E[Sharpe_oos] ≈ Sharpe_is − sqrt(2 · ln(N) / T) · σ_Sharpe

Here N is the number of configurations tested, T is the number of independent return observations, and σ_Sharpe is taken as the dispersion of the Sharpe estimator per unit of sample, so that σ_Sharpe / sqrt(T) is its standard error. The penalty grows with the breadth of the search, as sqrt(ln N), and shrinks slowly with sample size, as 1 / sqrt(T). A backtest tells you Sharpe_is; it cannot tell you N unless you track it yourself. Kestrel Signal logs every configuration evaluated against a dataset precisely because this number is the difference between a credible result and a fitted one.
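A small simulation shows the effect in its purest form. The sketch below assumes 200 configurations that all have zero true edge, each producing 1,260 i.i.d. Gaussian daily returns, and assumes the configurations are independent of one another (real parameter grids are correlated, so their effective N is smaller). It picks the best in-sample Sharpe, then scores the winner on fresh data. The penalty line evaluates the formula above with σ_Sharpe set to 1 in per-period units, which is also an assumption. All names are hypothetical.

```python
import math
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe ratio of a return series (no annualization)."""
    return statistics.fmean(returns) / statistics.stdev(returns)

def best_of_n_noise(n_configs, n_obs, rng):
    """Best in-sample Sharpe among zero-edge strategies, plus a fresh-sample Sharpe."""
    best_is = max(
        sharpe([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_configs)
    )
    # The winner has no true edge, so its out-of-sample returns are fresh noise.
    oos = sharpe([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
    return best_is, oos

rng = random.Random(0)
n_configs, n_obs = 200, 1260
best_is, oos = best_of_n_noise(n_configs, n_obs, rng)
penalty = math.sqrt(2.0 * math.log(n_configs) / n_obs)  # sigma_Sharpe taken as 1
print(f"best in-sample {best_is:.3f}, out-of-sample {oos:.3f}, penalty {penalty:.3f}")
```

The best in-sample Sharpe is clearly positive even though no configuration has any edge, and it sits in the neighborhood of the penalty term. The sqrt(2 · ln N) factor is a large-N approximation that tends to overstate the expected maximum somewhat at N = 200. Multiplying the per-period figures by sqrt(252) puts them on the annualized scale most reports use.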

What the equity curve actually represents

The equity curve is not a forecast of wealth. It is a single sample path from a stochastic process, conditional on a specific sequence of historical innovations. Resampling the trades — via bootstrap or block bootstrap — produces a distribution of equity curves consistent with the same trade-level statistics, and that distribution is typically much wider than the visual single line suggests.

The max drawdown shown on the chart is the realized max drawdown of one path. The expected max drawdown of the strategy, even assuming the trade distribution is stationary, is materially worse. A useful heuristic: the drawdown you should plan for is approximately 1.5 to 2 times the worst drawdown observed in a sufficiently long backtest.
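The gap between the realized drawdown and the range of drawdowns consistent with the same trades can be examined by resampling. The sketch below uses a plain i.i.d. bootstrap over a hypothetical list of 300 trade returns. It deliberately ignores serial dependence between trades, which a block bootstrap would preserve, so it is a simplification rather than a full treatment. The function names and the synthetic trade list are assumptions.

```python
import random

def max_drawdown(trade_returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def bootstrap_drawdowns(trade_returns, n_paths, rng):
    """Sorted max drawdowns of equity curves built by resampling trades with replacement."""
    n = len(trade_returns)
    return sorted(
        max_drawdown(rng.choices(trade_returns, k=n)) for _ in range(n_paths)
    )

rng = random.Random(1)
# Hypothetical trade list standing in for a real backtest's output.
trades = [rng.gauss(0.004, 0.03) for _ in range(300)]
realized = max_drawdown(trades)
draws = bootstrap_drawdowns(trades, n_paths=2000, rng=rng)
median_dd = draws[len(draws) // 2]
p95_dd = draws[int(0.95 * len(draws))]
print(f"realized {realized:.1%}, bootstrap median {median_dd:.1%}, 95th pct {p95_dd:.1%}")
```

Planning around an upper percentile of the resampled distribution, rather than the single realized figure, is one way to put a number on the planning margin discussed above.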

The most informative number in a backtest report is often the one not displayed: the count of distinct strategy configurations that were evaluated to produce it. Without that denominator, the reported Sharpe is uninterpretable.

Using backtests for the decisions they support

Backtests are well-suited to falsification, not confirmation. A strategy that fails on clean historical data is very unlikely to succeed forward; this is a strong negative result. A strategy that succeeds on historical data may or may not succeed forward; this is a weak positive result that requires further evidence: out-of-sample windows, cross-asset robustness, parameter stability surfaces, and forward paper trading.

Use the backtest to eliminate strategies, to size positions to historically observed risk, and to detect implementation bugs. Do not use it to estimate forward returns, to compare two strategies whose Sharpe ratios differ by less than 0.5, or to justify capital deployment on the strength of in-sample numbers alone. The instrument is precise about the past and silent about the future, and confusing those two outputs is the dominant failure mode of systematic retail trading.