Methodology · 20 May 2026

Why most retail backtests are lying to you

A backtest is a hypothesis test in disguise. The hypothesis is "this strategy makes money". The test is "let's see what would have happened if we'd run it on past data". When that test passes, we deploy. When it fails, we don't. So the test needs to be a good one — and almost all retail backtests fail at being a good one in at least three predictable ways.

If you've spent any time on r/algotrading, you've seen the pattern. Someone posts a screenshot of a beautiful equity curve. Sharpe of 2.1. Max drawdown of -8%. Compounded annual return of 23%. The strategy is some mean-reversion idea on SPY with three parameters they tuned by hand. Comments split into "looks great, what's the strategy?" and "this is overfit". The poster says no, they only ran the backtest a couple of times.

Six months later: another post, same person, asking why their live performance is nothing like the backtest. They've drawn down 14% in two months. The strategy that looked so clean in sample is bleeding money in production.

This is not a new phenomenon. Academic finance has known about it for decades. There are three specific statistical failures that nearly every retail backtest exhibits, and each one inflates your apparent edge in a different way. Let's walk through them.

Failure one: You tested twenty strategies and only kept the winner

Suppose you have absolutely no edge. You generate random trading signals from a coin flip — buy on heads, sell on tails. You run this strategy on SPY over the last decade. By chance, you get a backtest result. Maybe it's good, maybe it's bad. You shrug and try again with a new coin.

Do this twenty times and the odds are good that at least one of those random strategies will look profitable. At a 5% significance threshold, the expected number of "winners" from twenty independent no-edge tests is 20 × 0.05 = 1, and the chance of getting at least one is 1 - 0.95^20, or about 64%. So if you keep testing and keep the best one, you'll eventually find a "winning" strategy, even when there isn't one.

This is multiple-comparison bias, and it's arguably the single largest source of overfit backtests in the retail world. Every time you adjust a parameter, swap an indicator, or change a stop-loss and re-run the backtest, you're paying a hidden statistical cost. The more variants you test, the more likely it is that the best of them is just noise that looks like signal.
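The arithmetic is easy to check for yourself. Here is a minimal sketch, assuming the tests are independent and that each no-edge strategy "passes" with probability equal to the significance level. The function names are illustrative, not from any library.

```python
import random


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent no-edge tests passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_false_discoveries(
    n_tests: int = 20, alpha: float = 0.05, n_trials: int = 20_000, seed: int = 0
) -> float:
    """Monte Carlo check of the same quantity.

    Assumption: a strategy with no edge clears the threshold with
    probability alpha, independently of the other strategies.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # One "research session": n_tests no-edge strategies, keep any winner.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


print(f"{familywise_error_rate(20):.3f}")  # 0.642
print(f"{simulate_false_discoveries():.3f}")  # close to the exact value
```

Real strategy variants are rarely independent (they share data and often logic), so treat the exact figure as a rough guide, not a precise rate.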

The fix: Bonferroni correction

The cleanest fix is also the oldest. If you tested k strategies on the same data, divide your significance threshold by k. So if you wanted 5% significance and you tested 20 variants, your actual threshold needs to be 5% / 20 = 0.25%. The Bonferroni correction is conservative — it'll occasionally reject a real edge that didn't quite clear the higher bar — but for retail traders that's exactly the right kind of conservative. You want to err on the side of rejecting strategies that might be noise.

In practice, this means every additional variant you check raises the bar for all of them, including the ones you've already run. You can't get away with running 50 backtests and showing me the best one. The maths refuses to let you.
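As a rough sketch, the correction itself is one line of arithmetic. The helper names below are hypothetical, and the p-value is assumed to come from whatever significance test your backtest already runs.

```python
def bonferroni_threshold(alpha: float, n_variants: int) -> float:
    """Per-variant significance threshold after testing n_variants on the same data."""
    if n_variants < 1:
        raise ValueError("n_variants must be at least 1")
    return alpha / n_variants


def passes_bonferroni(p_value: float, alpha: float, n_variants: int) -> bool:
    """True if a variant's p-value clears the corrected threshold."""
    return p_value <= bonferroni_threshold(alpha, n_variants)


# The example from the text: 5% significance, 20 variants tested.
print(f"{bonferroni_threshold(0.05, 20):.4f}")  # 0.0025, i.e. 0.25%

# A p-value of 0.01 would pass an uncorrected 5% test but fails after correction.
print(passes_bonferroni(0.01, alpha=0.05, n_variants=20))  # False
```

The important input is an honest count of variants: every parameter tweak you re-ran counts, not just the ones you remember.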

Try it for yourself: if your favourite backtesting tool doesn't surface the Bonferroni-corrected significance threshold automatically, ask it to. Most don't, which is a big part of why so many retail strategies fail in live trading.

Failure two: You reported a single number and called it a result

A backtest produces dozens of trades. Each trade is a random sample from an unknown distribution of "what this strategy would have returned in similar conditions". When you compute the mean — the headline return number — you're estimating a population parameter from a small sample. And every undergraduate statistics course teaches the same lesson: a single point estimate without a confidence interval is meaningless. (The Sharpe ratio is no exception — that's the same problem in a different costume.)

Here's a concrete example. You run a backtest of a simple SMA(50)/SMA(200) crossover on SPY over ten years. You get an annualised return of 8.4%, beating cash by a decent margin. Looks great.

Now, that 8.4% is built from a dozen round-trip trades. Resample those trades (randomly pick twelve with replacement from the original twelve, recompute the annualised return, and repeat ten thousand times) and you discover that the 95% confidence interval on the annualised return is [-2.1%, +18.9%]. In other words, the data are consistent with anything from losing about 2% a year to making about 19% a year. The point estimate of 8.4% sits in the middle, but the range is enormous.

Now ask yourself: would you deploy real money against a strategy whose 95% confidence interval includes losing money? Most people would not. But that's what every retail backtest is asking you to do when it shows only the point estimate.

The fix: percentile bootstrap on returns

The bootstrap is one of the most powerful tools in applied statistics. The idea is simple: instead of assuming a parametric distribution for your data, you treat your observed sample as a proxy for the population and resample from it with replacement, recomputing your statistic on each resample. Do it ten thousand times. Take the 2.5th and 97.5th percentiles of the resulting distribution of statistics. Those are your 95% confidence interval bounds.
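A minimal sketch of the percentile bootstrap, using only the standard library. The trade returns below are made-up numbers for illustration, not the SPY backtest from the example, and the statistic is the mean per-trade return, not an annualised figure. To bootstrap an annualised return, Sharpe, or win rate, you would pass a different `statistic` function.

```python
import random
import statistics
from typing import Callable, Sequence


def percentile_bootstrap_ci(
    sample: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = statistics.fmean,
    n_resamples: int = 10_000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for `statistic` of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample n observations with replacement and recompute the statistic.
    stats = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = (1.0 - confidence) / 2.0
    # Approximate percentile lookup by index into the sorted statistics.
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper


# Hypothetical per-trade returns for twelve round trips (illustrative only).
hypothetical_trades = [
    0.042, -0.031, 0.115, -0.018, 0.007, 0.064,
    -0.052, 0.023, 0.088, -0.009, -0.027, 0.051,
]

low, high = percentile_bootstrap_ci(hypothetical_trades)
print(f"mean per-trade return: {statistics.fmean(hypothetical_trades):+.3f}")
print(f"95% bootstrap CI:      [{low:+.3f}, {high:+.3f}]")
```

Resampling individual trades assumes they are roughly independent. If your trades cluster in time or overlap, a block bootstrap is the usual alternative, which this sketch does not cover.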

Applied to backtests, this gives you concrete answers to questions you should have been asking:

"What's the realistic range of outcomes if I deploy this?"
"Does the lower bound exclude zero, or do I need to assume I might just lose money?"
"How wide is the uncertainty around the Sharpe? The drawdown? The win rate?"

For our SMA crossover example above, the 95% bootstrap CI of [-2.1%, +18.9%] tells you that the data can't rule out a zero or negative edge, that the estimate is noisy, and that if you trade it at all you should size positions accordingly. The point estimate of 8.4% doesn't tell you any of that. The point estimate is a lie of omission.

Failure three: You tuned and tested on the same data

In machine learning, this would be called data snooping or in-sample bias. In trading, it shows up everywhere — even when traders think they're being careful.

The setup is innocuous. You decide to test an RSI mean-reversion strategy. You pick RSI(14) — the default. You set the entry threshold at 30 because that's standard. You backtest. The result is mediocre. You try RSI(7) instead. Better. You move the threshold to 25. Better still. You add a 50-day SMA filter. The Sharpe climbs.

Each of those parameter changes was informed by looking at the same data. You didn't run twenty independent strategies — you ran one strategy and re-optimised it against the data it was being tested on, twenty times. The final version looks great, but its performance is at least partly a function of how well its parameters fit the noise of that specific historical period.

If you deploy it live, the noise won't repeat. The signal might. Hopefully. Probably not enough to recover the edge you thought you had.
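You can watch this happen on data that has no signal at all. The sketch below is a toy illustration, not the RSI strategy described above: it generates pure-noise daily returns, tunes a single lookback parameter on the first half, and then scores the "best" lookback on the second half. The strategy rule, the parameter range, and the noise model are all assumptions chosen for brevity.

```python
import random
import statistics


def make_noise_returns(n_days: int, seed: int) -> list[float]:
    """Daily returns with zero mean and no structure, so any 'edge' is luck."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n_days)]


def strategy_mean_return(returns: list[float], lookback: int) -> float:
    """Toy mean reversion: hold for one day after a negative trailing sum."""
    daily = []
    for t in range(lookback, len(returns)):
        # The signal uses only returns before day t, so there is no look-ahead.
        signal = 1.0 if sum(returns[t - lookback:t]) < 0 else 0.0
        daily.append(signal * returns[t])
    return statistics.fmean(daily)


returns = make_noise_returns(2000, seed=42)
in_sample, out_of_sample = returns[:1000], returns[1000:]

# Twenty variants, all tuned against the same in-sample data.
scores = {lb: strategy_mean_return(in_sample, lb) for lb in range(2, 22)}
best = max(scores, key=scores.get)

print(f"best lookback in-sample: {best}")
print(f"in-sample mean daily return:     {scores[best]:+.5f}")
print(f"out-of-sample mean daily return: {strategy_mean_return(out_of_sample, best):+.5f}")
```

Because the best in-sample score is by construction the maximum of twenty noisy numbers, it is biased upwards. The out-of-sample figure has no such bias, which is the gap you see when a tuned strategy goes live.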

The fix: walk-forward validation

The cleanest defence is to physically separate the data you tune on from the data you evaluate on. Split your history chronologically into three windows:

Train — earliest portion. You can do anything you want here. Tune freely, explore, fit, optimise.
Validation — middle portion. Touch this only to compare candidate strategies. Pick a winner here.
Holdout — most recent portion. Run the chosen strategy here exactly once. The number you get is your honest estimate of how the strategy will perform live.

Traders often call this walk-forward validation (strictly, a single split like this is its simplest form); in machine learning it's a train/validation/test split. The discipline is simple: once you've looked at the holdout, you cannot go back and re-tune. The window is "spent". If you do go back, the holdout window is no longer independent, and you've fallen back into in-sample bias.
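The split itself takes a few lines. A minimal sketch, assuming your history is already sorted oldest-first; the 60/20/20 proportions are an arbitrary choice for illustration, not a recommendation from this post.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chronological_split(
    rows: Sequence[T], train_frac: float = 0.6, validation_frac: float = 0.2
) -> tuple[Sequence[T], Sequence[T], Sequence[T]]:
    """Split time-ordered rows into train, validation and holdout windows.

    Rows must already be in chronological order. Nothing is shuffled,
    so the holdout is always the most recent data.
    """
    if train_frac <= 0 or validation_frac <= 0 or train_frac + validation_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a holdout")
    n = len(rows)
    train_end = round(n * train_frac)
    validation_end = train_end + round(n * validation_frac)
    return rows[:train_end], rows[train_end:validation_end], rows[validation_end:]


# Ten placeholder bars standing in for ten periods of price history.
bars = list(range(10))
train, validation, holdout = chronological_split(bars)
print(train)       # [0, 1, 2, 3, 4, 5]
print(validation)  # [6, 7]
print(holdout)     # [8, 9]
```

The code can't enforce the hard part, which is only evaluating on `holdout` once. That is a matter of discipline, or of tooling that records when the window has been used.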

The full mechanics of walk-forward — expanding versus rolling windows, what the procedure actually proves, and the three common ways retail tools get it wrong — are covered in detail in the walk-forward vs train/test split post.

Most retail backtests skip the holdout entirely. The Sharpe number you see is computed on the data the strategy was tuned on. That's not a prediction of future performance — it's just a measure of how well the parameters fit the past.

What honest backtesting looks like

Putting the three checks together, an honest backtest pipeline looks like this:

1. Define the strategy in plain English. Resist the urge to tune at the start. Pick a specific idea you have a hypothesis about.
2. Test it on the train window. If it doesn't work in-sample, it almost certainly won't work out-of-sample. Stop here if so.
3. Evaluate candidate variants on the validation window. Pick the best one. Count how many variants you tested.
4. Run the chosen variant on the holdout window once. This is your honest performance estimate.
5. Bootstrap the holdout result. Resample trades ten thousand times. Report a 95% confidence interval, not just a point estimate.
6. Apply Bonferroni correction. Take the number of variants you tested in step 3, divide alpha by that number, and report the corrected confidence interval too.
7. Compare to a buy-and-hold baseline. If your strategy doesn't beat buy-and-hold by a meaningful margin in the holdout, the rest doesn't matter.
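The Bonferroni-corrected interval in the pipeline above is the same bootstrap read at a stricter confidence level. A small sketch of that conversion, with hypothetical helper names:

```python
def bonferroni_confidence_level(alpha: float, n_variants: int) -> float:
    """Confidence level after dividing alpha across n_variants tested."""
    if n_variants < 1:
        raise ValueError("n_variants must be at least 1")
    return 1.0 - alpha / n_variants


def percentile_bounds(confidence: float) -> tuple[float, float]:
    """Lower and upper percentiles to read off a bootstrap distribution."""
    tail = (1.0 - confidence) / 2.0
    return 100.0 * tail, 100.0 * (1.0 - tail)


# Three variants tested at a nominal 5% significance level.
level = bonferroni_confidence_level(0.05, 3)
print(f"{level:.1%}")  # 98.3%

low_pct, high_pct = percentile_bounds(level)
print(f"read the bootstrap at percentiles {low_pct:.2f} and {high_pct:.2f}")
```

With more variants the percentiles move further into the tails, so the corrected interval can only get wider than the plain 95% one.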

At the end you should have something like:

Strategy:           SMA(50) × SMA(200) crossover
Holdout window:     2023-01-01 → 2024-12-31
Annualised return:  +8.4%
95% bootstrap CI:   [-2.1%, +18.9%]
Bonferroni 98.3% CI (k=3 variants tested):  [-5.4%, +22.1%]
vs Buy-and-hold:    -1.8 pp/yr

Verdict: INCONCLUSIVE  (95% LB ≤ 0; Bonferroni LB ≤ 0)

That output is more honest than nearly every retail backtest you'll see. It tells you the strategy can't be distinguished from chance even at the standard 95% threshold (the lower bound is below zero), that the more conservative Bonferroni interval is wider still, and that it underperforms a buy-and-hold benchmark. Whether to deploy or not is now an informed decision instead of a hope.

This is what EdgeAudit does

EdgeAudit is a Discord bot that runs the entire pipeline above on every backtest. You describe a strategy in plain English; the bot parses it, fetches market data, runs the backtest, applies Bonferroni correction for the number of variants you've submitted, bootstraps the confidence intervals, splits the data chronologically for walk-forward validation, and posts the result back to the channel.

The methodology is what you'd find in a peer-reviewed asset-pricing paper. The interface is a slash command in Discord. That's all it is, and that's all it needs to be.

Try it. Free tier: 3 backtests a day, any equity ticker, full statistical suite. No card needed. See pricing →

Comments, corrections, or war stories? Email hello@edgeaudit.app. We read every one.