Backtesting data sources: where to get clean OHLCV data
A backtest is only as honest as the data it runs on, and bad data quietly produces beautiful, fictional results. The right historical data — clean, complete, free of survivorship bias and look-ahead leaks — is the unglamorous foundation that decides whether your backtest means anything. This guide covers where to actually get OHLCV data (free exchange APIs through ccxt, paid vendors, and the gaps in each), the data traps that fabricate fake edges, the tick-versus-bar decision, and how to clean a feed before you trust a single equity curve.
Why the data source matters
Every metric a backtest produces — return, win rate, drawdown, Sharpe — is computed from the price series you feed it. Gaps, bad ticks, wrong timestamps or a missing delisted asset don’t just add noise; they systematically bias the result, almost always in the flattering direction. Clean data is not a nicety, it is the precondition for the backtest meaning anything at all.
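To make the direction of the bias concrete, here is a toy calculation. The five assets, their returns and the `alive_today` flag are invented for illustration, not drawn from any real dataset. It compares the average return of a universe that keeps only today's survivors with one that also keeps the assets that died.

```python
import pandas as pd

# Hypothetical universe: total return over the test window, plus a flag
# for whether the asset still trades today. All numbers are made up.
universe = pd.DataFrame({
    'asset': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'total_return': [0.40, 0.10, -1.00, 0.25, -0.90],
    'alive_today': [True, True, False, True, False],
})

# What a survivors-only data feed would show.
survivors_only = universe.loc[universe['alive_today'], 'total_return'].mean()

# What actually happened across everything that was tradable at the time.
full_universe = universe['total_return'].mean()

print(f"survivors only: {survivors_only:.2%}")
print(f"full universe:  {full_universe:.2%}")
```

With these invented numbers the survivors-only average is clearly positive while the full-universe average is clearly negative, even though nothing about the strategy changed. Only the data did.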
Free data sources
For crypto, the exchanges themselves are the best free source: Binance, Bybit, Kraken and others expose historical OHLCV through public APIs, and ccxt normalises all of them. For stocks, free feeds (Yahoo Finance, Stooq, the free tiers of Alpha Vantage or Tiingo) are fine for daily bars but thin on intraday and notorious for unadjusted splits and survivorship gaps. Free is enough to start; just know its limits.
Paid data vendors
When you need clean intraday history, point-in-time fundamentals, or properly adjusted, survivorship-bias-free equity data, paid vendors (Polygon, Tiingo's paid tier, Norgate, or Kaiko for crypto) earn their fee. The value is not just more data but correct data: adjusted, gap-filled and including the assets that later delisted. For a serious equity backtest this is usually non-negotiable.
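As a rough illustration of what "adjusted" means, the sketch below back-adjusts a closing-price series for a stock split. The helper name `back_adjust`, the prices and the 4-for-1 split date are all hypothetical. Real vendors also adjust volume and handle dividends, which this sketch ignores.

```python
import pandas as pd

def back_adjust(close, splits):
    """Divide every price before each split date by that split's ratio.

    `splits` maps the first post-split trading date to the split ratio
    (4.0 for a 4-for-1 split). Hypothetical helper, splits only.
    """
    factor = pd.Series(1.0, index=close.index)
    for date, ratio in splits.items():
        factor.loc[close.index < date] *= ratio
    return close / factor

# Made-up raw closes around a 4-for-1 split that takes effect on 2023-01-03.
raw = pd.Series(
    [200.0, 204.0, 51.0, 52.0],
    index=pd.date_range('2023-01-01', periods=4, freq='D'),
)
adjusted = back_adjust(raw, {pd.Timestamp('2023-01-03'): 4.0})

print(raw.pct_change().min())       # the unadjusted series shows a fake crash
print(adjusted.pct_change().min())  # the adjusted series does not
```

On the raw series the split looks like a 75% one-day loss, which a backtest would happily trade on. The adjusted series removes that artefact.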
Pulling OHLCV with ccxt
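python · fetch_data.py
import time
import ccxt
import pandas as pd

ex = ccxt.binance()
since = ex.parse8601('2023-01-01T00:00:00Z')  # start time in epoch milliseconds
rows = []
while since < ex.milliseconds():
    # Each call returns at most `limit` bars, oldest first
    batch = ex.fetch_ohlcv('BTC/USDT', '1h', since, limit=1000)
    if not batch:
        break
    rows += batch
    since = batch[-1][0] + 1  # page forward from the last bar's timestamp
    time.sleep(ex.rateLimit / 1000)  # respect rate limits (rateLimit is in ms)
# Columns: timestamp (ms), open, high, low, close, volume
df = pd.DataFrame(rows, columns=['t', 'o', 'h', 'l', 'c', 'v'])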
The key detail is pagination: fetch_ohlcv returns a capped batch, so you loop forward from the last timestamp and sleep to honour the exchange’s rate limit, exactly as in the crypto bot guide.
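Once the rows are collected, a small tidy-up step is usually worth adding before anything else touches them. The sketch below is one possible version, using synthetic rows in place of a live fetch. The helper name `rows_to_frame` and its `timeframe_ms` and `now_ms` parameters are assumptions for illustration, not part of ccxt. It removes duplicate bars, drops the final bar if it is still forming, and converts the millisecond timestamps to a UTC index.

```python
import pandas as pd

def rows_to_frame(rows, timeframe_ms, now_ms):
    """Turn raw [t, o, h, l, c, v] rows into a UTC-indexed DataFrame."""
    df = pd.DataFrame(rows, columns=['t', 'o', 'h', 'l', 'c', 'v'])
    # Overlapping pages can return the same bar twice; keep the latest copy.
    df = df.drop_duplicates(subset='t', keep='last').sort_values('t')
    # A bar whose window has not fully elapsed is still forming, so drop it.
    df = df[df['t'] + timeframe_ms <= now_ms].copy()
    df['t'] = pd.to_datetime(df['t'], unit='ms', utc=True)
    return df.set_index('t')

# Synthetic hourly rows standing in for fetch_ohlcv output.
HOUR = 3_600_000
raw = [
    [0 * HOUR, 100, 105, 99, 104, 12.0],
    [1 * HOUR, 104, 106, 103, 105, 8.5],
    [1 * HOUR, 104, 106, 103, 105, 8.5],  # duplicate from overlapping pages
    [2 * HOUR, 105, 107, 104, 106, 3.1],  # opened one minute before now_ms
]
clean = rows_to_frame(raw, timeframe_ms=HOUR, now_ms=2 * HOUR + 60_000)
print(clean)
```

In a live script `now_ms` would come from `ex.milliseconds()` and `timeframe_ms` would match the timeframe requested from the exchange. Dropping the unfinished bar matters because its close and volume keep changing until the hour ends.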
The data traps
The two killers are survivorship bias and look-ahead bias. Survivorship bias: testing only on assets that still exist today silently deletes every coin or stock that went to zero, inflating returns, so your universe must include the dead. Look-ahead bias: using data that wasn’t available at decision time (a split-adjusted price before the split, a bar’s close to trade at that bar’s open) fabricates an edge that vanishes live. Both produce gorgeous backtests that fail instantly in forward testing.
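The close-versus-open version of look-ahead bias is easy to reproduce. The following sketch uses five invented daily bars and a deliberately naive signal (long when a bar closes above its open). It then scores that signal twice: once applied to the same bar's open, which leaks the close, and once shifted forward one bar, which does not.

```python
import pandas as pd

# Five made-up daily bars; only open and close are needed here.
bars = pd.DataFrame(
    {'open':  [100.0, 102.0, 101.0, 105.0, 104.0],
     'close': [102.0, 101.0, 105.0, 104.0, 108.0]},
    index=pd.date_range('2023-01-01', periods=5, freq='D', tz='UTC'),
)

# Toy signal: 1 when the bar closes above its open, else 0.
# It is only knowable once the bar has closed.
signal = (bars['close'] > bars['open']).astype(int)

# Return from holding between one bar's open and the next bar's open.
open_to_open = bars['open'].shift(-1) / bars['open'] - 1

# Leaky: the bar-t signal is applied at the bar-t open, before its close exists.
leaky = (signal * open_to_open).dropna()

# Honest: the bar-t signal is acted on at the bar t+1 open.
honest = (signal.shift(1) * open_to_open).dropna()

print(f"leaky total:  {leaky.sum():.4f}")
print(f"honest total: {honest.sum():.4f}")
```

On this toy data the leaky version is profitable and the honest version loses money. The single `shift(1)` is the whole difference, which is why a misplaced index is such a common source of fictional edges.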
Cleaning the data
Before any backtest: check for and fill or flag gaps in the timestamp series, drop or repair zero-volume and obviously bad bars, confirm timezone and timestamp alignment, and for stocks use split- and dividend-adjusted prices. Validate the cleaned series visually, then run it on the backtester. Garbage in, fictional edge out — the cleaning step is where most of the honesty lives.
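The checks above can be sketched as a single audit pass. This is a minimal version under stated assumptions: the function name `audit_bars` and the column names `open`, `high`, `low`, `close` and `volume` are illustrative, and the function only flags problems rather than repairing them, since the right repair depends on the strategy.

```python
import pandas as pd

def audit_bars(df, freq):
    """Reindex to a complete time grid and flag missing or suspicious bars."""
    full = pd.date_range(df.index[0], df.index[-1], freq=freq)
    out = df.reindex(full)
    # Rows created by the reindex have no data: these are the gaps.
    out['missing'] = out['close'].isna()
    out['zero_volume'] = out['volume'].eq(0)
    # A consistent bar has high >= max(open, close) and low <= min(open, close).
    body_hi = out[['open', 'close']].max(axis=1)
    body_lo = out[['open', 'close']].min(axis=1)
    out['bad_range'] = (out['high'] < body_hi) | (out['low'] > body_lo)
    return out

# Synthetic hourly bars with one gap, one zero-volume bar and one bad high.
step = pd.Timedelta(hours=1)
idx = pd.date_range('2023-01-01', periods=5, freq=step, tz='UTC')
bars = pd.DataFrame({
    'open':   [100.0, 101.0, 102.0, 103.0, 104.0],
    'high':   [101.5, 102.5, 103.5, 104.5, 104.2],  # last high is below its close
    'low':    [99.5, 100.5, 101.5, 102.5, 103.5],
    'close':  [101.0, 102.0, 103.0, 104.0, 105.0],
    'volume': [10.0, 12.0, 9.0, 0.0, 11.0],
}, index=idx).drop(idx[2])  # remove the third bar to simulate a gap

report = audit_bars(bars, step)
print(report[['missing', 'zero_volume', 'bad_range']])
```

Keeping the flags as columns, instead of silently dropping rows, makes it possible to see how many bars were touched and to decide per strategy whether to forward-fill, skip or exclude them.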
Frequently asked questions
Where can I get free data for backtesting?
For crypto, the exchanges themselves (Binance, Bybit, Kraken and others) offer free historical OHLCV through public APIs that ccxt normalises. For stocks, free feeds like Yahoo Finance, Stooq and the free tiers of Alpha Vantage or Tiingo work for daily bars but are thin on intraday data and often suffer from unadjusted splits and survivorship gaps.
What is survivorship bias in backtesting data?
Survivorship bias is the error of testing only on assets that still exist today, which silently removes every coin or stock that went to zero or delisted. Because failures are deleted from the universe, returns are inflated and the strategy looks far better than it really is. A correct backtest must include the assets that later died.
What is look-ahead bias?
Look-ahead bias is using information that was not actually available at the moment of the trading decision — for example a split-adjusted price before the split happened, or using a bar’s closing price to trade at that same bar’s open. It fabricates an edge that cannot exist in live trading and produces a beautiful backtest that fails immediately in forward testing.
Do I need paid data to backtest a trading bot?
Not to start — free exchange APIs via ccxt are enough for crypto and free daily feeds work for basic stock testing. But for clean intraday history, properly split- and dividend-adjusted prices, and survivorship-bias-free equity universes, paid vendors like Polygon, Tiingo, Norgate or Kaiko are usually worth it for any serious backtest.