Quickly diffing data frames in R
I’m going to use this post to explain the rationale behind my new R package diffdfs, and walk you through how to use it with practical examples.
If you want to install the package, you can use devtools::install_github("riazarbi/diffdfs") in an R session. You can also visit riazarbi/diffdfs to poke around the code.
One very common operation across my data pipelines is calculating what has changed between an old version and a new version of a data frame. I call this operation ‘diffing’.
Why caching data is important
Generally, if I need to do something with data that resides in a remote source, my first step is to cache it locally. This means pulling it from the remote source and saving it to a local disk or database. I like to keep copies of the data that drives an analytical product because it facilitates reproducibility and transparency. Also, what happens if your data source disappears? Then you’ll be left with a model but no backing data.
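To make that concrete, here is a minimal sketch of the cache-first pattern, assuming a hypothetical remote CSV and a local cache/ directory:

# a minimal caching sketch; the URL and cache path are hypothetical
library(readr)

cache_dir <- "cache"
dir.create(cache_dir, showWarnings = FALSE)

# pull from the remote source...
remote_data <- read_csv("https://example.com/data.csv")

# ...and keep a local copy to drive the analysis from
write_csv(remote_data, file.path(cache_dir, "data.csv"))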
However, how you cache that data can end up painting you into a corner down the road. I’ll try to demonstrate how this can happen in the practical example below.
# load libraries
suppressPackageStartupMessages({
library(dplyr)
library(readr)
library(purrr)
library(diffdfs)
})
# generate a sample dataset
mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl$model <- rownames(mtcars)
mtcars_tbl <- select(mtcars_tbl, model, everything())
rownames(mtcars_tbl) <- NULL
Caching strategy tradeoffs
Ok, let’s say you’ve successfully connected to a data source and pulled in some data. Well done! Here’s what your current data frame looks like.
# naive first pull
data <- mtcars_tbl[1:5, 1:6]
data
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9
## 2 Mazda RX4 Wag 21 6 160 110 3.9
## 3 Datsun 710 22.8 4 108 93 3.85
## 4 Hornet 4 Drive 21.4 6 258 110 3.08
## 5 Hornet Sportabout 18.7 8 360 175 3.15
Doubtless you’ve used the data to generate amazing insights for your work or research department or client. You save the data to your hard drive along with the code so that you can refer to it later if needed.
A few months later, they want you to re-run the analysis with updated data.
Approach 1: Overwrite
What happens when you want to re-run a job? You need to pull the updated dataset and cache it as well. How do you cache it? Well, you could overwrite the old cached table (for example, by simply re-running your script with hard-coded save paths), but then you won’t be able to reproduce your older results.
# overwriting your original data
data <- mtcars_tbl[2:6, 1:6]
# this is now different
data
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 Wag 21 6 160 110 3.9
## 2 Datsun 710 22.8 4 108 93 3.85
## 3 Hornet 4 Drive 21.4 6 258 110 3.08
## 4 Hornet Sportabout 18.7 8 360 175 3.15
## 5 Valiant 18.1 6 225 105 2.76
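On disk, this overwrite typically looks something like the line below; re-running a script with a hard-coded save path silently clobbers the previous cache. The cache/data.csv path is hypothetical.

# overwriting the cached file in place; the path is hypothetical
write_csv(data, "cache/data.csv")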
Approach 2: Save every copy
You could save a new complete copy each time you run the code, maybe using a date-driven file.path() filename scheme that produces names like 20220714-data.csv. In the code block below we’ve saved each pull as a new R object.
This is the simplest non-destructive strategy, but it means you’ll be saving lots of redundant data. If you’ve got a 1 million row data frame that adds 1000 rows every day, about 99.9% of a daily dump will be redundant. This gets expensive fast for frequently refreshed data.
# saving each pull as its own object
data1 <- mtcars_tbl[1:5, 1:6]
data2 <- mtcars_tbl[2:6, 1:6]
data3 <- mtcars_tbl[3:7, 1:6]
data4 <- mtcars_tbl[4:8, 1:6]
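If you were writing these to disk instead of keeping them as R objects, the date-driven filename scheme mentioned above might look like this sketch; the cache/ directory is hypothetical.

# a sketch of the date-stamped filename scheme described above
snapshot_path <- file.path("cache", paste0(format(Sys.Date(), "%Y%m%d"), "-data.csv"))
# e.g. "cache/20220714-data.csv"
write_csv(data1, snapshot_path)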
In the above set of data frames, each successive pull removes a single row and adds a new row. This adds up to 11 row-level operations (5 initial row creations in data1, plus one deletion and one addition for each of data2, data3 and data4). So theoretically we only need 11 rows to characterize all the row-level changes in the above group of data frames.
However, the ‘save everything’ approach records the same information as 5 rows in each of 4 data frames, for a total of 20 rows of data. That’s nearly double the space usage, and this is for a data source that mutates at 20% per pull! Imagine how bad the redundancy ratio gets for larger, more slowly mutating datasets.
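Back-of-envelope, for the 1-million-row example from earlier:

# share of a daily dump that is redundant for a 1 million row
# table gaining 1000 rows per pull
new_rows   <- 1000
total_rows <- 1e6 + new_rows
1 - new_rows / total_rows
## [1] 0.999001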
Approach 3: Save just the difference
Ok, so you’ve been running this analysis as a long-running project, saving every copy, and now your hard drive is full.
You’re willing to invest a bit more effort in reducing your data redundancy. The diffing approach is much more space-efficient, and you can refresh as frequently as you want with very little storage penalty. The diffdfs package makes this easy.
The basic idea is that you can compute the diff between two data frames, and then ‘chain’ the diffs to reconstruct the dataset at any point in time.
# computing a diff
diffdfs(data1, data2, key_cols = "model")
## operation model mpg cyl disp hp drat
## 1 new Mazda RX4 21.0 6 160 110 3.90
## 2 deleted Valiant 18.1 6 225 105 2.76
In the example above, we see that there were two row-level changes between data2 and data1: we got data on a new model, the Mazda RX4, and we lost the data for one model, the Valiant. (Note the argument order: the first argument to diffdfs is treated as the newer data frame, as the loop further down also shows, so this diff describes moving from data2 to data1.)
We can compute the diff between each successive dataset, creating a chain of diffs that contain all the information necessary to reconstruct any version of the data.
Using diffdfs to create a time-travelling dataset
Let’s implement a quick and dirty versioned dataset using diffdfs.
First we combine all the data versions into a list. The first element of the list is NA because when we start, we don’t have any data. That is, data1 is the first addition to our dataset.
all_data <- list(NA, data1, data2, data3, data4)
Now, let’s work through that list, computing the diff between each successive dataset.
diffs <- list()
# diff each version against the one that came before it
for (i in seq_len(length(all_data) - 1)) {
  diffs[[i]] <- diffdfs(all_data[[i + 1]], all_data[[i]], key_cols = "model")
}
We can tag each of these diffs with a version number so that we can keep them in order. In a real example you could use a timestamp for the diff_num value.
for (i in seq_along(diffs)) {
  diffs[[i]]$diff_num <- i
}
Now let’s combine them into a single data frame. We will do this with bind_rows; in a cache-to-disk scenario you could write the diffs to a single directory and read the entire directory in as one dataset using the arrow::open_dataset function.
diff_df <- map_df(diffs, bind_rows)
diff_df
## operation model mpg cyl disp hp drat diff_num
## 1 new Mazda RX4 21.0 6 160.0 110 3.90 1
## 2 new Mazda RX4 Wag 21.0 6 160.0 110 3.90 1
## 3 new Datsun 710 22.8 4 108.0 93 3.85 1
## 4 new Hornet 4 Drive 21.4 6 258.0 110 3.08 1
## 5 new Hornet Sportabout 18.7 8 360.0 175 3.15 1
## 6 new Valiant 18.1 6 225.0 105 2.76 2
## 7 deleted Mazda RX4 21.0 6 160.0 110 3.90 2
## 8 new Duster 360 14.3 8 360.0 245 3.21 3
## 9 deleted Mazda RX4 Wag 21.0 6 160.0 110 3.90 3
## 10 new Merc 240D 24.4 4 146.7 62 3.69 4
## 11 deleted Datsun 710 22.8 4 108.0 93 3.85 4
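As an aside, here’s a sketch of the cache-to-disk variant mentioned above: write each diff to a shared directory as a parquet file, then read the whole directory back as one dataset. The diffs/ directory name is hypothetical.

# write each diff to its own parquet file in a shared directory
library(arrow)
dir.create("diffs", showWarnings = FALSE)
for (i in seq_along(diffs)) {
  write_parquet(diffs[[i]], file.path("diffs", paste0(i, ".parquet")))
}

# read the whole directory back as a single dataset
diff_df_from_disk <- collect(open_dataset("diffs"))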
So we have a single data frame with a complete record of all the changes that have happened to the dataset. Let’s create a function to reconstruct the dataset at any point in the past. This sounds daunting, but it can be accomplished in 8 lines of piped code.
resurrect_df <- function(diff_df, key_cols, version) {
  diff_df %>%
    # order by version so that `last` picks the most recent row per key
    arrange(diff_num, across(all_of(key_cols))) %>%
    filter(diff_num <= version) %>%
    # group by the key column(s)
    group_by(across(all_of(key_cols))) %>%
    # take only the last row for each key
    summarise(across(everything(), last), .groups = "keep") %>%
    ungroup() %>%
    # drop any rows where the last operation was 'deleted'
    filter(operation != "deleted") %>%
    # drop the operation and diff columns
    select(-diff_num, -operation)
}
Let’s test it out.
resurrect_df(diff_df, "model", 3)
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsun 710 22.8 4 108 93 3.85
## 2 Duster 360 14.3 8 360 245 3.21
## 3 Hornet 4 Drive 21.4 6 258 110 3.08
## 4 Hornet Sportabout 18.7 8 360 175 3.15
## 5 Valiant 18.1 6 225 105 2.76
Let’s verify it works.
identical(resurrect_df(diff_df, "model", 3), arrange(data3, model))
## [1] TRUE
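The same check should hold for any version; reconstructing the latest state, for instance, should likewise return TRUE:

# version 4 is the most recent diff, so this should match data4
identical(resurrect_df(diff_df, "model", max(diff_df$diff_num)), arrange(data4, model))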
Well, there you have it.
Final comments
The diffdfs package does just one thing, which is to compute diffs between data frames. It’s just one building block of a complete diff-based caching strategy, but from the example above you can see that a simple diffing solution isn’t hard to whip up.
I’m using diffdfs as the core diffing engine in my more comprehensive data versioning package, dataversionr, but that’s a far more opinionated project, so I wanted to keep diffdfs separate because it has much wider general applicability.