Survivorship-bias-free S&P 500 constituent lists
Do you just want the data? Here ya go - https://github.com/riazarbi/sp500-scraper
Even a cursory search for equity backtesting techniques leads to warnings about biases. Common biases are:
- look ahead
- survivorship
- optimization
You can read up on what these mean in your own time. This blog post is about avoiding survivorship bias when backtesting against a universe of stocks defined by the S&P 500 index constituents. It turns out that this is mostly a question of which data source you use.
Using tidyquant
It’s pretty easy to get a list of S&P 500 constituents. You can use the R package `tidyquant` to get a current list.
library(tidyquant)
tq_index("SP500")
Getting holdings for SP500
# A tibble: 503 × 8
symbol company identifier sedol weight sector shares_held local_currency
<chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 AAPL Apple Inc. 03783310 2046251 0.0667 Information Technology 168775410 USD
2 MSFT Microsoft Corporation 59491810 2588173 0.0545 Information Technology 83322060 USD
3 AMZN Amazon.com Inc. 02313510 2000019 0.0258 Consumer Discretionary 99044216 USD
4 GOOGL Alphabet Inc. Class A 02079K30 BYVY8G0 0.0169 Communication Services 67004332 USD
5 BRK.B Berkshire Hathaway Inc. Class B 08467070 2073390 0.0165 Financials 20164068 USD
6 UNH UnitedHealth Group Incorporated 91324P10 2917766 0.0153 Health Care 10452738 USD
7 TSLA Tesla Inc 88160R10 B616C79 0.0153 Consumer Discretionary 29763936 USD
8 GOOG Alphabet Inc. Class C 02079K10 BYY88Y7 0.0152 Communication Services 59917096 USD
9 XOM Exxon Mobil Corporation 30231G10 2326618 0.0139 Energy 46572650 USD
10 JNJ Johnson & Johnson 47816010 2475833 0.0138 Health Care 29380180 USD
# … with 493 more rows
However, it turns out that you can’t get a list of S&P 500 constituents for a particular date from tidyquant. We can look at the `tq_index` source code to find out where they are getting the data.
> tq_index
function (x, use_fallback = FALSE)
{
x <- clean_index(x)
verified <- verify_index(x)
if (!verified$is_verified) {
warning(verified$err)
return(tibble::tibble())
}
if (use_fallback)
return(index_fallback(x))
x_spdr <- spdr_mapper(x)
dload <- index_download(x_spdr, index_name = x)
if (!is.null(dload$err)) {
warning(dload$err)
return(tibble::tibble())
}
df <- clean_holdings(dload$df)
df
}
<bytecode: 0x55dc301520e0>
<environment: namespace:tidyquant>
It looks like they pass it to another function, `index_download`, which is not exported to the user namespace. Let’s look at that source code.
> tidyquant:::index_download
function (x, index_name)
{
spdr_link <- paste0("https://www.ssga.com/us/en/institutional/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-",
tolower(x), ".xlsx")
res <- list(df = NULL, err = NULL)
message("Getting holdings for ", index_name)
tryCatch({
httr::GET(spdr_link, httr::write_disk(tf <- tempfile(fileext = ".xlsx")))
suppressMessages({
res$df <- readxl::read_xlsx(tf, skip = 3)
})
unlink(tf)
return(res)
}, error = function(e) {
res$err <- paste0("Error at ", index_name, " during download. \n",
e)
return(res)
})
}
<bytecode: 0x55dc3170cdb8>
<environment: namespace:tidyquant>
Looks like they are getting the holdings of the SPDR S&P 500 ETF. Unfortunately, there is no ‘historical’ tab for downloading the holdings history.
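The URL pattern in `index_download` is easy to replicate outside R. A minimal Python sketch of the same URL construction, assuming the SSGA scheme shown in the R source above is still current:

```python
# Rebuild the SPDR daily-holdings URL that tidyquant's index_download uses.
# The URL scheme is copied from the R source above; SSGA may change it at any time.
def spdr_holdings_url(fund_ticker: str) -> str:
    base = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
            "products/fund-data/etfs/us/holdings-daily-us-en-")
    return base + fund_ticker.lower() + ".xlsx"

# For the S&P 500, tidyquant maps the index name to the "spy" fund ticker.
print(spdr_holdings_url("SPY"))
```

Fetching that URL gives you today’s holdings spreadsheet, but only today’s, which is exactly the limitation we’re trying to get around.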
Scraping iShares
Maybe we can use this approach with a different ETF. A little googling shows that the iShares ETF website has a drop-down list that lets you select past dates, though it only goes back about 12 months. This blog shows us how to modify the download query to specify a particular date, and then we can download all the holdings going back to 2007!
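The gist of that technique is that the holdings-download link accepts a date in its query string. A hypothetical Python sketch of building such a URL (the product path and parameter names here are placeholders, not the real iShares values; the linked blog has the details):

```python
from urllib.parse import urlencode

def ishares_holdings_url(product_path: str, as_of: str) -> str:
    """Build a dated iShares holdings-download URL.

    product_path and the query parameter names are illustrative placeholders;
    the real values come from inspecting the download link on the fund page.
    """
    params = urlencode({
        "fileType": "csv",
        "dataType": "fund",
        "asOfDate": as_of,  # YYYYMMDD, e.g. "20070131"
    })
    return f"https://www.ishares.com/{product_path}?{params}"

print(ishares_holdings_url("us/products/fund.ajax", "20070131"))
```

Loop that over month-end dates and you get one holdings file per month, which is what the parquet files below are.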
> list.files("ishares/sp500/parquet/")
[1] "20061031.parquet" "20061130.parquet" "20061229.parquet" "20070131.parquet" "20070228.parquet" "20070330.parquet" "20070430.parquet"
[8] "20070531.parquet" "20070629.parquet" "20070731.parquet" "20070831.parquet" "20070928.parquet" "20071031.parquet" "20071130.parquet"
[15] "20071231.parquet" "20080131.parquet" "20080229.parquet" "20080331.parquet" "20080430.parquet" "20080530.parquet" "20080630.parquet"
[22] "20080731.parquet" "20080829.parquet" "20080930.parquet" "20081031.parquet" "20081128.parquet" "20081231.parquet" "20090130.parquet"
...
...
...
Well, then. Mission accomplished! No, not quite. We still need to check that this is good data. In particular, we need to know whether we can join it to market data and fundamental data.
> open_dataset("ishares/sp500/parquet/") |> glimpse() |> collect()
FileSystemDataset with 189 Parquet files
95,417 rows x 18 columns
$ symbol <string> "PMCS", "PGL", "ADCT", "NAV", "DDS", "CIEN", "GAS.2", "NOVL", "MDP…
$ name <string> "PMC-SIERRA INC.", "PEOPLES ENERGY CORPORATION", "ADC TELECOMMUNIC…
$ sector <string> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ asset_class <string> "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equit…
$ market_value <double> 1960776, 2388226, 2399000, 2438687, 2649198, 2835141, 2909682, 291…
$ weight_pct <double> 0.010976, 0.013369, 0.013429, 0.013652, 0.014830, 0.015871, 0.0162…
$ notional_value <double> 1960776, 2388226, 2399000, 2438687, 2649198, 2835141, 2909682, 291…
$ shares <double> 295743, 54663, 167645, 87944, 87809, 120593, 63309, 485322, 56752,…
$ CUSIP <string> "69344F106", "711030106", "000886309", "63934E108", "254067101", "…
$ ISIN <string> "US69344F1066", "US7110301063", "US0008863096", "US63934E1082", "U…
$ SEDOL <string> "2807492", "2681403", "B08G9Z7", "2624464", "2269768", "B1FLZ21", …
$ price <double> 6.63, 43.69, 14.31, 27.73, 30.17, 23.51, 45.96, 6.00, 52.50, 18.20…
$ location <string> "United States", "United States", "United States", "United States"…
$ exchange <string> "NASDAQ", "New York Stock Exchange Inc.", "NASDAQ", "New York Stoc…
$ market_currency <string> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "USD", "US…
$ fx_rate <double> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ accrual_date <string> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "…
$ date <date32[day]> 2006-10-31, 2006-10-31, 2006-10-31, 2006-10-31, 2006-10-31, 2006-1…
Looks like we get a name, CUSIP, ISIN, and SEDOL. If we have a data provider that maps to one of those, we are golden. If we want to map directly to the SEC’s EDGAR data, though, we need a CIK.
Why can’t we just use the company symbol? Well, symbols are not standardised.
For instance, take Visa, the credit card company. iShares uses the `VISA` symbol for it. Seems reasonable. What’s the EDGAR symbol? It’s `V`. So you’d fail to match that one.
What if you want to get Facebook’s fundamentals from EDGAR? You need to search for `META`. But if you’re using iShares holdings to extract symbols and you’re backtesting in, say, 2019, the `META` symbol didn’t exist yet; back then it was `FB`. The CIK, however, hasn’t changed.
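To make the problem concrete, here is a toy join in Python. The symbols and CIKs are real EDGAR identifiers, but the two tiny "datasets" are illustrative: matching a 2019 holdings list to a present-day symbol map loses Facebook, while the CIK survives the rename.

```python
# Toy illustration of why symbols make a poor join key across time.
# The CIKs are the real EDGAR identifiers; the tables are illustrative.
holdings_2019 = [
    {"symbol": "FB",  "cik": "0001326801"},  # Facebook, pre-rename
    {"symbol": "JNJ", "cik": "0000200406"},
]
edgar_today = {
    "META": {"cik": "0001326801"},  # same company, new symbol
    "JNJ":  {"cik": "0000200406"},
}

# Join on symbol: FB finds no match in today's table.
matched_by_symbol = [h["symbol"] for h in holdings_2019
                     if h["symbol"] in edgar_today]

# Join on CIK: both rows match, rename or not.
ciks_today = {v["cik"] for v in edgar_today.values()}
matched_by_cik = [h["symbol"] for h in holdings_2019
                  if h["cik"] in ciks_today]

print(matched_by_symbol)  # FB is lost
print(matched_by_cik)     # both rows survive
```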
Wikipedia
Ok. Where can we get an S&P 500 list that includes the CIK? Turns out you can get it from Wikipedia!
Even better, you can get revisions of the page, to see what the page looked like at particular points in time - it’s guaranteed to avoid survivorship and look-ahead bias.
Python’s `pywikibot` does a great job of retrieving all the revisions of a Wikipedia page. We can then use `wikitextparser` to parse the constituent table into pandas and save it to disk.
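In practice `pywikibot` fetches each revision’s wikitext and `wikitextparser` turns its tables into rows. As a rough stdlib-only illustration of the format being parsed, here is a deliberately naive pass over a hypothetical wikitext table snippet (real revisions are far messier; use `wikitextparser`, not this):

```python
# Naive sketch of the wikitext table format that wikitextparser handles
# properly. The snippet is hand-written for illustration.
snippet = """{|
! Symbol !! Security !! CIK
|-
| MMM || 3M Company || 0000066740
|-
| AOS || A. O. Smith || 0000091142
|}"""

rows = []
for line in snippet.splitlines():
    if line.startswith("| "):  # data row; "|-" and "|}" are table markup
        rows.append([cell.strip() for cell in line[2:].split("||")])

print(rows)
```

Running the real parser over every revision of the S&P 500 page yields one table per revision date, which is the dataset cleaned up below.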
These tables are pretty messy though, so I’ve used R to clean them up a bit. I’m not going to go through the details of what I did because this was very much a one-shot job, and the code is gnarly, but the results are pretty good.
> open_dataset("wikipedia/sp500/parquet/pre20221101.parquet") |> collect() |> sample_n(100) |> glimpse()
Rows: 100
Columns: 13
$ Security <chr> "JPMorgan Chase & Co.", "Prudential Financial", "Flowserve Corporation", "Co…
$ `SEC filings` <chr> "https://www.sec.gov/cgi-bin/browse-edgar?CIK=JPM&action=getcompany reports"…
$ Industry <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ date <date> 2021-03-29, 2022-02-15, 2019-04-11, 2020-04-08, 2013-08-09, 2017-09-05, 202…
$ Symbol <chr> "JPM", "PRU", "FLS", "CXO", "CPB", "DHI", "ILMN", "INTU", "JNJ", "ZBH", "RMD…
$ `GICS Sector` <chr> "Financials", "Financials", "Industrials", "Energy", "Consumer Staples", "Co…
$ Exchange <chr> "Nyse", "Nyse", "Nyse", "Nyse", "Nyse", "Nyse", "Nasdaq", "Nasdaq", "Nyse", …
$ `Headquarters Location` <chr> "New York, New York", "Newark, New Jersey", "Irving, Texas", "Midland, Texas…
$ `Date first added` <chr> "1975-06-30", "2002-07-22", "2008-10-02", "2016-02-22", NA, NA, "2015-11-19"…
$ `GICS Sub-Industry` <chr> "Diversified Banks", "Life & Health Insurance", "Industrial Machinery", "Oil…
$ CIK <chr> "0000019617", "0001137774", "0000030625", "0001358071", NA, "0000882184", "0…
$ Founded <chr> "2000 (1799 / 1871)", "1875", NA, NA, NA, NA, NA, NA, NA, NA, "1989", NA, NA…
$ Location <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Bellevue, Washington", NA, NA, …
Check out that CIK!
It turns out that the CIK column was only added in 2014, so you can’t go further back than that, but that’s almost 10 years of data so it’s a pretty good starting point for a backtest universe.
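One small gotcha when joining on it: EDGAR identifies filers by a numeric CIK, and Wikipedia stores it in the zero-padded 10-character form, so whatever your other source uses needs to be normalised to match. A minimal sketch:

```python
def normalise_cik(cik) -> str:
    """Zero-pad a CIK to the 10-character form used in the Wikipedia table
    (and in EDGAR URLs). Accepts ints or strings, padded or not."""
    return str(int(cik)).zfill(10)

# Both spellings of the same filer normalise to one join key.
assert normalise_cik("19617") == normalise_cik("0000019617") == "0000019617"
```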
Going forward
I’ve set up a GitHub repo that automatically pulls S&P 500 constituent data from iShares, Wikipedia and tidyquant, cleans it, and saves it to disk. So, in future, I shouldn’t ever have to do this again.
Check it out here.