Survivorship-bias-free S&P 500 constituent lists
Do you just want the data? Here ya go - https://github.com/riazarbi/sp500-scraper
Even a cursory search for equity backtesting techniques leads to warnings about biases. Common biases are:
- look ahead
- survivorship
- optimization
You can read up on what these mean in your own time. This blog post is about avoiding survivorship bias when backtesting against a universe of stocks defined by the S&P 500 index constituents. It turns out that this is mostly a question of which data source you use.
Using tidyquant
It’s pretty easy to get a list of S&P 500 constituents. You can use the R package `tidyquant` to get a current list.
library(tidyquant)
tq_index("SP500")
Getting holdings for SP500
# A tibble: 503 × 8
symbol company identifier sedol weight sector shares_held local_currency
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 AAPL Apple Inc. 03783310 2046251 0.0667 Information Technology 168775410 USD
2 MSFT Microsoft Corporation 59491810 2588173 0.0545 Information Technology 83322060 USD
3 AMZN Amazon.com Inc. 02313510 2000019 0.0258 Consumer Discretionary 99044216 USD
4 GOOGL Alphabet Inc. Class A 02079K30 BYVY8G0 0.0169 Communication Services 67004332 USD
5 BRK.B Berkshire Hathaway Inc. Class B 08467070 2073390 0.0165 Financials 20164068 USD
6 UNH UnitedHealth Group Incorporated 91324P10 2917766 0.0153 Health Care 10452738 USD
7 TSLA Tesla Inc 88160R10 B616C79 0.0153 Consumer Discretionary 29763936 USD
8 GOOG Alphabet Inc. Class C 02079K10 BYY88Y7 0.0152 Communication Services 59917096 USD
9 XOM Exxon Mobil Corporation 30231G10 2326618 0.0139 Energy 46572650 USD
10 JNJ Johnson & Johnson 47816010 2475833 0.0138 Health Care 29380180 USD
# … with 493 more rows
However, it turns out that you can’t get a list of S&P 500 constituents for a particular date from tidyquant. We can look at the `tq_index` source code to find out where they are getting the data.
> tq_index
function (x, use_fallback = FALSE)
{
x <- clean_index(x)
verified <- verify_index(x)
if (!verified$is_verified) {
warning(verified$err)
return(tibble::tibble())
}
if (use_fallback)
return(index_fallback(x))
x_spdr <- spdr_mapper(x)
dload <- index_download(x_spdr, index_name = x)
if (!is.null(dload$err)) {
warning(dload$err)
return(tibble::tibble())
}
df <- clean_holdings(dload$df)
df
}
<bytecode: 0x55dc301520e0>
<environment: namespace:tidyquant>
It looks like they pass it to another function, `index_download`, which is not exported to the user namespace. Let’s look at that source code.
> tidyquant:::index_download
function (x, index_name)
{
spdr_link <- paste0("https://www.ssga.com/us/en/institutional/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-",
tolower(x), ".xlsx")
res <- list(df = NULL, err = NULL)
message("Getting holdings for ", index_name)
tryCatch({
httr::GET(spdr_link, httr::write_disk(tf <- tempfile(fileext = ".xlsx")))
suppressMessages({
res$df <- readxl::read_xlsx(tf, skip = 3)
})
unlink(tf)
return(res)
}, error = function(e) {
res$err <- paste0("Error at ", index_name, " during download. \n",
e)
return(res)
})
}
<bytecode: 0x55dc3170cdb8>
<environment: namespace:tidyquant>
Looks like they are getting the holdings of the SPDR S&P 500 ETF. Unfortunately, there is no ‘historical’ tab for downloading the holdings history.
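The URL pattern in `index_download` is easy to replicate outside R. A minimal Python sketch of the same URL construction, assuming the SSGA scheme shown in the R source above is still current:

```python
# Rebuild the SPDR daily-holdings URL that tidyquant's index_download uses.
# The URL scheme is copied from the R source above; SSGA may change it at any time.
def spdr_holdings_url(fund_ticker: str) -> str:
    base = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
            "products/fund-data/etfs/us/holdings-daily-us-en-")
    return base + fund_ticker.lower() + ".xlsx"

# For the S&P 500, tidyquant maps the index name to the "spy" fund ticker.
print(spdr_holdings_url("SPY"))
```

Fetching that URL gives you today’s holdings spreadsheet, but only today’s, which is exactly the limitation we’re trying to get around.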
Scraping iShares
Maybe we can use this approach with a different ETF. A little googling shows that the iShares ETF website has a drop-down list that lets you select past dates, though it only goes back about 12 months. This blog shows us how to modify the download query to specify a particular date, and then we can download all the holdings going back to 2007!
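The gist of that technique is that the holdings-download link accepts a date in its query string. A hypothetical Python sketch of building such a URL (the product path and parameter names here are placeholders, not the real iShares values; the linked blog has the details):

```python
from urllib.parse import urlencode

def ishares_holdings_url(product_path: str, as_of: str) -> str:
    """Build a dated iShares holdings-download URL.

    product_path and the query parameter names are illustrative placeholders;
    the real values come from inspecting the download link on the fund page.
    """
    params = urlencode({
        "fileType": "csv",
        "dataType": "fund",
        "asOfDate": as_of,  # YYYYMMDD, e.g. "20070131"
    })
    return f"https://www.ishares.com/{product_path}?{params}"

print(ishares_holdings_url("us/products/fund.ajax", "20070131"))
```

Loop that over month-end dates and you get one holdings file per month, which is what the parquet files below are.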
> list.files("ishares/sp500/parquet/")
[1] "20061031.parquet" "20061130.parquet" "20061229.parquet" "20070131.parquet" "20070228.parquet" "20070330.parquet" "20070430.parquet"
[8] "20070531.parquet" "20070629.parquet" "20070731.parquet" "20070831.parquet" "20070928.parquet" "20071031.parquet" "20071130.parquet"
[15] "20071231.parquet" "20080131.parquet" "20080229.parquet" "20080331.parquet" "20080430.parquet" "20080530.parquet" "20080630.parquet"
[22] "20080731.parquet" "20080829.parquet" "20080930.parquet" "20081031.parquet" "20081128.parquet" "20081231.parquet" "20090130.parquet"
...
...
...
Well, then. Mission accomplished! No, not quite. We still need to check that this is good data. In particular, we need to know whether we can join it to market data and fundamental data.
> open_dataset("ishares/sp500/parquet/") |> glimpse() |> collect()
FileSystemDataset with 189 Parquet files
95,417 rows x 18 columns
$ symbol <string> "PMCS", "PGL", "ADCT", "NAV", "DDS", "CIEN", "GAS.2", "NOVL", "MDP…
$ name <string> "PMC-SIERRA INC.", "PEOPLES ENERGY CORPORATION", "ADC TELECOMMUNIC…
$ sector <string> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ asset_class <string> "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equit…
$ market_value <double> 1960776, 2388226, 2399000, 2438687, 2649198, 2835141, 2909682, 291…
$ weight_pct <double> 0.010976, 0.013369, 0.013429, 0.013652, 0.014830, 0.015871, 0.0162…
$ notional_value <double> 1960776, 2388226, 2399000, 2438687, 2649198, 2835141, 2909682, 291…
$ shares <double> 295743, 54663, 167645, 87944, 87809, 120593, 63309, 485322, 56752,…
$ CUSIP <string> "69344F106", "711030106", "000886309", "63934E108", "254067101", "…
$ ISIN <string> "US69344F1066", "US7110301063", "US0008863096", "US63934E1082", "U…
$ SEDOL <string> "2807492", "2681403", "B08G9Z7", "2624464", "2269768", "B1FLZ21", …
$ price <double> 6.63, 43.69, 14.31, 27.73, 30.17, 23.51, 45.96, 6.00, 52.50, 18.20…
$ location <string> "United States", "United States", "United States", "United States"…
$ exchange <string> "NASDAQ", "New York Stock Exchange Inc.", "NASDAQ", "New York Stoc…
$ market_currency <string> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "US…
$ fx_rate <double> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ accrual_date <string> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ date <date32[day]> 2006-10-31, 2006-10-31, 2006-10-31, 2006-10-31, 2006-10-31, 2006-1…
Looks like we get a name, CUSIP, ISIN, and SEDOL. If we have a data provider that maps to one of those, we are golden. If we want to map directly to the SEC’s EDGAR data, though, we need a CIK.
Why can’t we just use the company symbol? Well, symbols are not standardised.
For instance, take Visa, the credit card company. iShares uses the `VISA` symbol for it. Seems reasonable. What’s the EDGAR symbol? It’s `V`. So you’d fail to match that one.
What if you want to get Facebook’s fundamentals from EDGAR? You need to search for `META`. But if you’re using iShares holdings to extract symbols and you’re backtesting in, say, 2019, the `META` symbol didn’t exist yet; back then it was `FB`. The CIK, however, hasn’t changed.
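To make the problem concrete, here is a toy join in Python. The symbols and CIKs are real EDGAR identifiers, but the two tiny "datasets" are illustrative: matching a 2019 holdings list to a present-day symbol map loses Facebook, while the CIK survives the rename.

```python
# Toy illustration of why symbols make a poor join key across time.
# The CIKs are the real EDGAR identifiers; the tables are illustrative.
holdings_2019 = [
    {"symbol": "FB",  "cik": "0001326801"},  # Facebook, pre-rename
    {"symbol": "JNJ", "cik": "0000200406"},
]
edgar_today = {
    "META": {"cik": "0001326801"},  # same company, new symbol
    "JNJ":  {"cik": "0000200406"},
}

# Join on symbol: FB finds no match in today's table.
matched_by_symbol = [h["symbol"] for h in holdings_2019
                     if h["symbol"] in edgar_today]

# Join on CIK: both rows match, rename or not.
ciks_today = {v["cik"] for v in edgar_today.values()}
matched_by_cik = [h["symbol"] for h in holdings_2019
                  if h["cik"] in ciks_today]

print(matched_by_symbol)  # FB is lost
print(matched_by_cik)     # both rows survive
```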
Wikipedia
Ok. Where can we get an S&P 500 list that includes the CIK? Turns out you can get it from Wikipedia!
Even better, you can get revisions of the page, to see what the page looked like at particular points in time - it’s guaranteed to avoid survivorship and look-ahead bias.
Python’s `pywikibot` does a great job of retrieving all the revisions of a Wikipedia page. We can then use `wikitextparser` to parse the constituent table into pandas and save it to disk.
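In practice `pywikibot` fetches each revision’s wikitext and `wikitextparser` turns its tables into rows. As a rough stdlib-only illustration of the format being parsed, here is a deliberately naive pass over a hypothetical wikitext table snippet (real revisions are far messier; use `wikitextparser`, not this):

```python
# Naive sketch of the wikitext table format that wikitextparser handles
# properly. The snippet is hand-written for illustration.
snippet = """{|
! Symbol !! Security !! CIK
|-
| MMM || 3M Company || 0000066740
|-
| AOS || A. O. Smith || 0000091142
|}"""

rows = []
for line in snippet.splitlines():
    if line.startswith("| "):  # data row; "|-" and "|}" are table markup
        rows.append([cell.strip() for cell in line[2:].split("||")])

print(rows)
```

Running the real parser over every revision of the S&P 500 page yields one table per revision date, which is the dataset cleaned up below.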
These tables are pretty messy though, so I’ve used R to clean them up a bit. I’m not going to go through the details of what I did because this was very much a one-shot job, and the code is gnarly, but the results are pretty good.
> open_dataset("wikipedia/sp500/parquet/pre20221101.parquet") |> collect() |> sample_n(100) |> glimpse()
Rows: 100
Columns: 13
$ Security <chr> "JPMorgan Chase & Co.", "Prudential Financial", "Flowserve Corporation", "Co…
$ `SEC filings` <chr> "https://www.sec.gov/cgi-bin/browse-edgar?CIK=JPM&action=getcompany reports"…
$ Industry <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ date <date> 2021-03-29, 2022-02-15, 2019-04-11, 2020-04-08, 2013-08-09, 2017-09-05, 202…
$ Symbol <chr> "JPM", "PRU", "FLS", "CXO", "CPB", "DHI", "ILMN", "INTU", "JNJ", "ZBH", "RMD…
$ `GICS Sector` <chr> "Financials", "Financials", "Industrials", "Energy", "Consumer Staples", "Co…
$ Exchange <chr> "Nyse", "Nyse", "Nyse", "Nyse", "Nyse", "Nyse", "Nasdaq", "Nasdaq", "Nyse", …
$ `Headquarters Location` <chr> "New York, New York", "Newark, New Jersey", "Irving, Texas", "Midland, Texas…
$ `Date first added` <chr> "1975-06-30", "2002-07-22", "2008-10-02", "2016-02-22", NA, NA, "2015-11-19"…
$ `GICS Sub-Industry` <chr> "Diversified Banks", "Life & Health Insurance", "Industrial Machinery", "Oil…
$ CIK <chr> "0000019617", "0001137774", "0000030625", "0001358071", NA, "0000882184", "0…
$ Founded <chr> "2000 (1799 / 1871)", "1875", NA, NA, NA, NA, NA, NA, NA, NA, "1989", NA, NA…
$ Location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Bellevue, Washington", NA, NA, …
Check out that CIK!
It turns out that the CIK column was only added in 2014, so you can’t go further back than that, but that’s almost 10 years of data so it’s a pretty good starting point for a backtest universe.
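One small gotcha when joining on it: EDGAR identifies filers by a numeric CIK, and Wikipedia stores it in the zero-padded 10-character form, so whatever your other source uses needs to be normalised to match. A minimal sketch:

```python
def normalise_cik(cik) -> str:
    """Zero-pad a CIK to the 10-character form used in the Wikipedia table
    (and in EDGAR URLs). Accepts ints or strings, padded or not."""
    return str(int(cik)).zfill(10)

# Both spellings of the same filer normalise to one join key.
assert normalise_cik("19617") == normalise_cik("0000019617") == "0000019617"
```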
Going forward
I’ve set up a GitHub repo that automatically pulls S&P 500 constituent data from iShares, Wikipedia and tidyquant, cleans it, and saves it to disk. So, in future, I shouldn’t ever have to do this again.
Check it out here.