Methodology · 17 May 2026 · 5 min read

In-Sample Performance Is Not Evidence of Edge

Backtests fit on the data used to design them measure optimization artifacts, not strategy edge, and require multiplicity-adjusted out-of-sample validation.

A backtest that performs well on the data used to design it tells you almost nothing about whether the strategy has edge. The procedure of fitting rules, parameters, or signal thresholds to a historical sample guarantees that the chosen configuration looks good on that sample — that is what optimization does. The question is whether the apparent performance reflects a stable property of the market or an artifact of the fitting process. Most of the time, it is the latter, and the cost of confusing the two is paid in live capital.

What in-sample performance actually measures

When you run a backtest over a window of data you used to select indicators, tune parameters, or filter trades, you are measuring the maximum of a noisy function over a search space. The reported Sharpe, CAGR, or hit rate is not an estimate of the strategy's true performance — it is an estimate of the best-case realization across every variant you considered, including the ones you discarded silently. This is true even if you only tried "a few" combinations, because the brain is an aggressive optimizer that prunes branches without logging them.

The correct mental model: in-sample metrics are upward-biased estimates of what you would have seen had you committed to a single specification before looking at the data. Any departure from that pre-commitment (adjusting a lookback, adding a regime filter, dropping a bad year) inflates the bias further. The inflation is not small, and it scales with the flexibility of your search.
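A small simulation makes the selection effect concrete. The sketch below is illustrative only: it assumes i.i.d. normal daily returns with zero mean, a zero risk-free rate, 252 trading days per year, and a made-up 1% daily volatility. The function names are hypothetical, not part of any library. Every simulated variant has no edge by construction, yet the best of them still posts a respectable Sharpe.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe of a per-period return series (risk-free rate assumed zero)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return mean / stdev * math.sqrt(periods_per_year)


def simulate_noise_search(n_variants=100, n_periods=1000, seed=0):
    """Return the Sharpe of each of n_variants strategies whose true edge is zero."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_variants):
        # Zero-mean daily returns with 1% volatility: no edge by construction.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_periods)]
        sharpes.append(annualized_sharpe(returns))
    return sharpes


sharpes = simulate_noise_search()
best = max(sharpes)
average = statistics.fmean(sharpes)
print(f"average Sharpe across all variants: {average:+.2f}")
print(f"best Sharpe (the one that gets reported): {best:+.2f}")
```

The average across variants sits near zero, which is the truth. The maximum is what a researcher who keeps only the winner would report.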

The selection bias is quantifiable

If you test N independent strategy variants on the same data and each has a true Sharpe of zero, the expected maximum observed Sharpe still grows with N. For independent normal returns sampled over T periods, a reasonable approximation for the best observed Sharpe is:

E[max Sharpe] ≈ sqrt(2 · ln(N) / T) · annualization_factor

Taking the annualization factor as sqrt(252) for daily data, N = 100 variants and T = 1000 daily observations give an expected maximum annualized Sharpe of roughly 1.5 from pure noise. For N = 1000 it rises to about 1.9. The sqrt(2 · ln(N)) term is a leading-order approximation and runs somewhat high for small N, so treat these as rough magnitudes. These numbers also assume independence; correlated variants (parameter sweeps on a single rule) reduce the effective N but do not eliminate the bias. The implication: a backtest Sharpe of 1.5 after a moderate parameter search is statistically indistinguishable from noise.
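The approximation is easy to evaluate directly. The helper below is a minimal sketch of the formula as written, with the annualization factor assumed to be sqrt(252) for daily data; the function name and the `periods_per_year` parameter are illustrative choices, not an established API.

```python
import math


def expected_max_sharpe(n_variants, n_periods, periods_per_year=252):
    """Leading-order estimate of the best annualized Sharpe among
    n_variants independent zero-edge strategies observed over n_periods."""
    per_period = math.sqrt(2.0 * math.log(n_variants) / n_periods)
    return per_period * math.sqrt(periods_per_year)


# How the noise-only "best" Sharpe scales with the size of the search.
for n in (10, 100, 1000):
    print(f"N = {n:>4}: {expected_max_sharpe(n, 1000):.2f}")
```

Because the dependence on N is logarithmic, trimming a search from 1000 variants to 100 buys relatively little; lengthening the sample T does more.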

If you cannot state — before running the backtest — how many configurations you have implicitly or explicitly evaluated, your in-sample Sharpe is uninterpretable. "I only tested a few" is not a defense; it is an admission that the multiplicity correction is unknown.

Why out-of-sample testing is necessary but insufficient

Holding out a test set is the standard remedy, and it is the right starting point. But the holdout only works once. The moment you look at out-of-sample results and decide whether to keep the strategy, modify it, or try a new variant, the holdout has been consumed. It is now part of the training set, and its statistical validity is gone.

This is one of the most common failure modes in retail systematic research. A trader reserves the last two years, finds the strategy underperforms there, adds a volatility filter that "fixes" the bad period, and reports the improved out-of-sample numbers. Those numbers are in-sample. The filter was selected because it improved the holdout. The only honest path forward is a new holdout, meaning data the modified strategy has never touched, and you only get one shot at each.
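One way to enforce the one-shot rule is to make the code refuse a second look. The wrapper below is a hypothetical sketch, not a feature of any particular library: `OneShotHoldout` and `HoldoutConsumedError` are invented names, and the return values are placeholders.

```python
class HoldoutConsumedError(RuntimeError):
    """Raised when a holdout set is evaluated more than once."""


class OneShotHoldout:
    """Wraps holdout data so that it can be scored exactly once."""

    def __init__(self, data):
        self._data = list(data)
        self._consumed = False

    @property
    def consumed(self):
        return self._consumed

    def evaluate(self, metric):
        """Apply metric to the holdout data, then lock the holdout."""
        if self._consumed:
            raise HoldoutConsumedError(
                "holdout already used; any further result is in-sample"
            )
        self._consumed = True
        return metric(self._data)


def mean_return(returns):
    return sum(returns) / len(returns)


holdout = OneShotHoldout([0.004, -0.002, 0.001, -0.003])  # placeholder returns
first_result = holdout.evaluate(mean_return)

try:
    holdout.evaluate(mean_return)  # e.g. after adding a volatility filter
    second_call_blocked = False
except HoldoutConsumedError:
    second_call_blocked = True

print("second evaluation blocked:", second_call_blocked)
```

A guard like this cannot stop someone from reloading the raw file, but it turns a silent habit into an explicit, visible decision.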

What evidence of edge actually looks like

Edge is a claim about the data-generating process, not about a backtest curve. Credible evidence requires at least three things working together: a prior hypothesis about why the effect should exist, performance that survives multiplicity-adjusted significance testing, and stability across data the strategy was not designed against. None of these alone is sufficient.

The prior matters because it constrains the search space before you see the data. A strategy motivated by a documented microstructure effect, a behavioral bias with independent empirical support, or a structural feature of the instrument is testing a narrow hypothesis. A strategy discovered by sweeping 50 indicators across 20 lookbacks is testing 1000 hypotheses, and the multiplicity correction is brutal.
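To see how hard the correction bites, the sketch below counts the hypotheses in a 50-by-20 sweep and applies a Bonferroni adjustment to a single backtest result. It rests on stated assumptions: i.i.d. normal returns, a one-sided test of zero true Sharpe, the common approximation that the t-statistic equals the annualized Sharpe times the square root of the sample length in years, and 252 periods per year. The indicator names, lookback values, and example Sharpe of 1.5 are placeholders.

```python
import math
from itertools import product
from statistics import NormalDist


def sharpe_p_value(annual_sharpe, n_periods, periods_per_year=252):
    """One-sided p-value for H0: true Sharpe is zero (i.i.d. normal returns assumed)."""
    t_stat = annual_sharpe * math.sqrt(n_periods / periods_per_year)
    return 1.0 - NormalDist().cdf(t_stat)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Placeholder search space: 50 indicators crossed with 20 lookbacks.
indicators = [f"indicator_{i}" for i in range(50)]
lookbacks = range(5, 105, 5)
grid = list(product(indicators, lookbacks))
n_tests = len(grid)

raw_p = sharpe_p_value(annual_sharpe=1.5, n_periods=1000)
adjusted_p = bonferroni(raw_p, n_tests)

print(f"hypotheses tested: {n_tests}")
print(f"raw p-value:       {raw_p:.4f}")
print(f"adjusted p-value:  {adjusted_p:.4f}")
```

A result that looks comfortably significant on its own stops being significant once the size of the search is charged against it. The same Sharpe from a single pre-specified hypothesis would keep its raw p-value.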

A useful diagnostic: if you cannot describe your strategy's economic mechanism in one sentence without referencing the backtest, you have a curve-fit, not a hypothesis. The backtest should confirm a prior, not generate one.

Operational discipline

The practical implication for research workflow is that the experiment design must precede the experiment. Specify the universe, the parameter grid, the evaluation metric, the holdout window, and the decision rule before running anything. Log every variant tested, including the ones that fail. Apply a multiplicity correction (Bonferroni is conservative but defensible; the deflated Sharpe ratio is more refined) and treat the corrected number as your estimate, not the raw one.
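A pre-registered plan and a variant log can be as simple as two small data structures. The sketch below is one hypothetical way to do it in plain Python; `ExperimentPlan` and `ResearchLog` are invented for illustration, and the tickers, dates, and metric values are placeholders rather than real results.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ExperimentPlan:
    """Fixed before any backtest runs; frozen so fields cannot be reassigned later."""
    universe: tuple
    parameter_grid: dict
    metric: str
    holdout_start: str
    decision_rule: str


@dataclass
class ResearchLog:
    """Append-only record of every variant evaluated under a plan."""
    plan: ExperimentPlan
    entries: list = field(default_factory=list)

    def record(self, params, metric_value):
        self.entries.append({"params": params, "metric": metric_value})

    @property
    def n_variants(self):
        """The N to feed into the multiplicity correction."""
        return len(self.entries)

    def to_json(self):
        return json.dumps({"plan": asdict(self.plan), "entries": self.entries}, indent=2)


plan = ExperimentPlan(
    universe=("ES", "NQ"),                      # placeholder instruments
    parameter_grid={"lookback": [20, 50, 100]},
    metric="annualized_sharpe",
    holdout_start="2024-01-01",                 # placeholder date
    decision_rule="trade only if the adjusted p-value is below 0.05",
)

log = ResearchLog(plan)
for lookback, sharpe in [(20, 0.3), (50, 1.1), (100, -0.2)]:  # made-up results
    log.record({"lookback": lookback}, sharpe)  # failures are logged too

print("variants evaluated:", log.n_variants)
```

The value of the log is that `n_variants` is counted rather than remembered, so the correction is applied to the search that actually happened.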

In Kestrel Signal, the research log captures every parameter sweep and configuration evaluated against a dataset, so the effective N is observable rather than guessed. This does not make the bias disappear, but it makes it accountable. The goal is not to produce backtests that look good. The goal is to produce backtests whose results you can defend statistically when they go live and the noise reasserts itself.