Backtesting data sources: where to get clean OHLCV data

A backtest is only as honest as the data it runs on, and bad data quietly produces beautiful, fictional results. The right historical data — clean, complete, free of survivorship bias and look-ahead leaks — is the unglamorous foundation that decides whether your backtest means anything. This guide covers where to actually get OHLCV data (free exchange APIs through ccxt, paid vendors, and the gaps in each), the data traps that fabricate fake edges, the tick-versus-bar decision, and how to clean a feed before you trust a single equity curve.

On this page
  1. Why the source matters
  2. Free sources
  3. Paid vendors
  4. Pulling data with ccxt
  5. The data traps
  6. Cleaning the data
  7. FAQ

Why the data source matters

Every metric a backtest produces (return, win rate, drawdown, Sharpe) is computed from the price series you feed it. Gaps, bad ticks, wrong timestamps or a missing delisted asset don’t just add noise; they systematically bias the result, almost always in the flattering direction. Clean data is not a nicety; it is the precondition for the backtest meaning anything at all.

Free data sources

For crypto, the exchanges themselves are the best free source: Binance, Bybit, Kraken and others expose historical OHLCV through public APIs, and ccxt normalises all of them. For stocks, free feeds (Yahoo Finance, Stooq, the free tiers of Alpha Vantage or Tiingo) are fine for daily bars but thin on intraday and notorious for unadjusted splits and survivorship gaps. Free is enough to start; just know its limits.
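To make "unadjusted splits" concrete, here is a minimal sketch of back-adjusting a daily series for a single split. The helper name `back_adjust_for_split`, the bar layout and all prices are hypothetical and chosen only for illustration; real feeds also need dividend adjustments, which this sketch ignores.

```python
def back_adjust_for_split(bars, split_date, ratio):
    """Return a copy of `bars` with pre-split closes and volumes rescaled.

    bars:       list of dicts with 'date' (ISO string), 'close', 'volume'
    split_date: first trading date at the post-split price (ISO string)
    ratio:      new shares per old share, e.g. 2.0 for a 2-for-1 split
    """
    adjusted = []
    for bar in bars:
        if bar["date"] < split_date:  # ISO dates sort correctly as strings
            bar = {
                **bar,
                "close": bar["close"] / ratio,
                "volume": bar["volume"] * ratio,
            }
        adjusted.append(bar)
    return adjusted


# Made-up prices around a hypothetical 2-for-1 split on 2024-01-04.
raw = [
    {"date": "2024-01-02", "close": 200.0, "volume": 1000},
    {"date": "2024-01-03", "close": 202.0, "volume": 1100},
    {"date": "2024-01-04", "close": 101.5, "volume": 2300},
]
adj = back_adjust_for_split(raw, "2024-01-04", 2.0)

# Return across the split date, before and after adjustment.
raw_return = raw[2]["close"] / raw[1]["close"] - 1  # looks like a crash
adj_return = adj[2]["close"] / adj[1]["close"] - 1  # small real move
```

On the unadjusted series the split shows up as a drop of roughly half, which a backtest would read as a real loss (or a spectacular short). On the adjusted series the same day is a small move.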

Paid vendors

When you need clean intraday history, point-in-time fundamentals, or properly adjusted, survivorship-bias-free equity data, paid vendors (Polygon, Tiingo's paid tier, Norgate, Kaiko for crypto) earn their fee. The value is not just more data but correct data: adjusted, gap-filled and including the assets that later delisted. For a serious equity backtest this is usually non-negotiable.

Pulling OHLCV with ccxt

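python · fetch_data.py
import time

import ccxt
import pandas as pd

ex = ccxt.binance()
since = ex.parse8601('2023-01-01T00:00:00Z')    # start time, ms since epoch
rows = []
while since < ex.milliseconds():
    # each call returns at most `limit` candles starting at `since`
    batch = ex.fetch_ohlcv('BTC/USDT', '1h', since, limit=1000)
    if not batch:
        break
    rows += batch
    since = batch[-1][0] + 1           # page forward past the last candle
    time.sleep(ex.rateLimit / 1000)    # respect rate limits (rateLimit is in ms)

# columns: timestamp (ms), open, high, low, close, volume
df = pd.DataFrame(rows, columns=['t','o','h','l','c','v'])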

The key detail is pagination: fetch_ohlcv returns a capped batch, so you loop forward from the last timestamp and sleep to honour the exchange’s rate limit, exactly as in the crypto bot guide.

The data traps

Survivorship bias and look-ahead leaks

The two killers. Survivorship bias: testing only on assets that still exist today silently deletes every coin or stock that went to zero, inflating returns — your universe must include the dead. Look-ahead bias: using data that wasn’t available at decision time (a split-adjusted price before the split, a bar’s close to trade at that bar’s open) fabricates an edge that vanishes live. Both produce gorgeous backtests that fail instantly in forward testing.
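A toy calculation shows how large the survivorship effect can be. The five symbols and their returns below are made up, and `equal_weight_return` is a hypothetical helper; the point is only the gap between the two averages.

```python
from statistics import mean

# symbol -> (total return over the test window, still listed today?)
# All values are invented for illustration.
universe = {
    "AAA": (0.80, True),
    "BBB": (0.30, True),
    "CCC": (0.10, True),
    "DDD": (-1.00, False),  # went to zero
    "EEE": (-0.90, False),  # delisted after a collapse
}


def equal_weight_return(assets):
    """Average return of an equal-weight basket of (return, listed) pairs."""
    return mean(r for r, _ in assets)


survivors = [asset for asset in universe.values() if asset[1]]

survivors_only = equal_weight_return(survivors)         # dead assets dropped
true_universe = equal_weight_return(universe.values())  # dead assets included

print(f"survivors only: {survivors_only:+.0%}")
print(f"true universe:  {true_universe:+.0%}")
```

With these numbers the survivors-only basket averages about +40% while the full universe averages about -14%. The strategy logic is identical in both cases; only the asset list changed.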

[Figure: survivors only → inflated; true universe → real]
Drop the assets that died and the backtest looks far better than the strategy ever was.
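Look-ahead bias, the second trap above, is easy to reproduce on a handful of bars. The sketch below uses made-up (open, close) prices and a deliberately naive rule, "go long when the bar closes above its open". The function names are hypothetical. The leaky version earns the return of the same bar whose close produced the signal; the honest version acts one bar later.

```python
# (open, close) per bar; hypothetical prices for illustration only.
bars = [
    (100.0, 102.0),
    (102.0, 101.0),
    (101.0, 104.0),
    (104.0, 103.0),
    (103.0, 106.0),
]


def signals(bars):
    """1 if the bar closed above its open, else 0. Known only at the close."""
    return [1 if close > open_ else 0 for open_, close in bars]


def bar_returns(bars):
    """Open-to-close return of each bar."""
    return [close / open_ - 1 for open_, close in bars]


def total_return(positions, returns):
    """Compound the returns of the bars where a position was held."""
    equity = 1.0
    for pos, ret in zip(positions, returns):
        equity *= 1 + pos * ret
    return equity - 1


sig = signals(bars)
rets = bar_returns(bars)

# Leaky: bar t's close decides a trade entered at bar t's open.
leaky = total_return(sig, rets)

# Honest: the signal from bar t can only be traded on bar t+1.
honest = total_return([0] + sig[:-1], rets)

print(f"leaky:  {leaky:+.2%}")
print(f"honest: {honest:+.2%}")
```

On this series the leaky version is profitable and the honest one loses money, because the leaky rule only ever holds bars it already knows went up. Shifting signals by one bar before applying them is a cheap guard against this class of bug.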

Cleaning the data

Before any backtest: check for and fill or flag gaps in the timestamp series, drop or repair zero-volume and obviously bad bars, confirm timezone and timestamp alignment, and for stocks use split- and dividend-adjusted prices. Validate the cleaned series visually, then run it on the backtester. Garbage in, fictional edge out — the cleaning step is where most of the honesty lives.
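The checklist above can be sketched with pandas roughly as follows. It assumes the `t, o, h, l, c, v` column layout from the ccxt example, with `t` in milliseconds, and a fixed bar size. The function name `clean_ohlcv`, the choice to drop bad bars and flag (rather than fill) gaps, and the sample rows are all assumptions for illustration, not the only reasonable policy.

```python
import pandas as pd


def clean_ohlcv(df, bar=pd.Timedelta(hours=1)):
    """Deduplicate, drop impossible bars, and flag gaps and zero-volume bars."""
    out = df.copy()

    # Timestamps: milliseconds since epoch -> timezone-aware UTC index.
    out["t"] = pd.to_datetime(out["t"], unit="ms", utc=True)
    out = out.drop_duplicates(subset="t", keep="last").set_index("t").sort_index()

    # Obviously bad bars: non-positive prices, or high/low that do not
    # bracket the rest of the bar.
    prices = out[["o", "h", "l", "c"]]
    bad = (
        (prices <= 0).any(axis=1)
        | (out["h"] < out[["o", "l", "c"]].max(axis=1))
        | (out["l"] > out[["o", "h", "c"]].min(axis=1))
    )
    out = out[~bad]

    # Reindex onto a complete grid so missing bars become explicit NaN rows.
    grid = pd.date_range(out.index[0], out.index[-1], freq=bar)
    out = out.reindex(grid)

    # Flag rather than silently fill; the backtest decides how to treat them.
    out["gap"] = out["c"].isna()
    out["zero_volume"] = out["v"].fillna(0).eq(0) & ~out["gap"]
    return out


# Made-up hourly rows: one duplicate, one missing hour, one impossible bar.
HOUR = 3_600_000
sample = pd.DataFrame(
    [
        [0 * HOUR, 10.0, 11.0, 9.0, 10.5, 100],
        [1 * HOUR, 10.5, 12.0, 10.0, 11.0, 0],
        [1 * HOUR, 10.5, 12.0, 10.0, 11.5, 50],   # duplicate timestamp
        [3 * HOUR, 11.0, 12.0, 10.5, 11.2, 80],   # hour 2 is missing
        [4 * HOUR, 11.0, 10.0, 12.0, 11.0, 70],   # high below low
        [5 * HOUR, 11.0, 11.5, 10.8, 11.1, 0],    # zero volume
    ],
    columns=["t", "o", "h", "l", "c", "v"],
)
cleaned = clean_ohlcv(sample)
print(cleaned[["c", "v", "gap", "zero_volume"]])
```

Flagging keeps the decision visible: a strategy can skip flagged bars, or a later step can forward-fill them, but nothing is invented without a record of it. Plotting the close and the `gap` column afterwards covers the visual check.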

Not financial advice. This content is educational. Automated and algorithmic trading carries a real risk of financial loss. Never trade money you cannot afford to lose. Review the SEC investor.gov and CFTC resources before trading.

Frequently asked questions

Where can I get free data for backtesting?

For crypto, the exchanges themselves (Binance, Bybit, Kraken and others) offer free historical OHLCV through public APIs that ccxt normalises. For stocks, free feeds like Yahoo Finance, Stooq and the free tiers of Alpha Vantage or Tiingo work for daily bars but are thin on intraday data and often suffer from unadjusted splits and survivorship gaps.

What is survivorship bias in backtesting data?

Survivorship bias is the error of testing only on assets that still exist today, which silently removes every coin or stock that went to zero or delisted. Because failures are deleted from the universe, returns are inflated and the strategy looks far better than it really is. A correct backtest must include the assets that later died.

What is look-ahead bias?

Look-ahead bias is using information that was not actually available at the moment of the trading decision — for example a split-adjusted price before the split happened, or using a bar’s closing price to trade at that same bar’s open. It fabricates an edge that cannot exist in live trading and produces a beautiful backtest that fails immediately in forward testing.

Do I need paid data to backtest a trading bot?

Not to start — free exchange APIs via ccxt are enough for crypto and free daily feeds work for basic stock testing. But for clean intraday history, properly split- and dividend-adjusted prices, and survivorship-bias-free equity universes, paid vendors like Polygon, Tiingo, Norgate or Kaiko are usually worth it for any serious backtest.

Mustafa Bilgic

Algorithmic trading practitioner · Founder, AITradingBot.us

Mustafa builds and backtests automated trading systems and writes about them without the hype. Every tool on this site is free and runs entirely in your browser.