Quickly diffing data frames in R
I’m going to use this post to explain the rationale behind my new R package diffdfs, and walk you through how to use it with practical examples.
If you want to install the package, you can use devtools::install_github("riazarbi/diffdfs") in an R session. You can also visit riazarbi/diffdfs to poke around the code.
One very common operation across my data pipelines is calculating what has changed between an old version and a new version of a data frame. I call this operation ‘diffing’.
Why caching data is important
Generally, if I need to do something with data that resides in a remote source, my first step is to cache it locally. This means pulling it from the remote source and saving it to a local disk or database. I like to keep copies of the data that drives an analytical product because it facilitates reproducibility and transparency. Also, what happens if your data source disappears? Then you’ll be left with a model but no backing data.
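To make that concrete, here is a minimal sketch of the cache-first pattern, assuming a hypothetical remote CSV and a local cache/ directory:

# a minimal caching sketch; the URL and cache path are hypothetical
library(readr)

cache_dir <- "cache"
dir.create(cache_dir, showWarnings = FALSE)

# pull from the remote source...
remote_data <- read_csv("https://example.com/data.csv")

# ...and keep a local copy to drive the analysis from
write_csv(remote_data, file.path(cache_dir, "data.csv"))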
However, how you cache that data can end up painting you into a corner down the road. I’ll try to demonstrate how this can happen in the practical example below.
# load libraries
suppressPackageStartupMessages({
library(dplyr)
library(readr)
library(purrr)
library(diffdfs)
})
# generate a sample dataset
mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl$model <- rownames(mtcars)
mtcars_tbl <- select(mtcars_tbl, model, everything())
rownames(mtcars_tbl) <- NULL
Caching strategy tradeoffs
Ok, let’s say you’ve successfully connected to a data source and pulled in some data. Well done! Here’s what your current data frame looks like.
# naive first pull
data <- mtcars_tbl[1:5, 1:6]
data
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9
## 2 Mazda RX4 Wag 21 6 160 110 3.9
## 3 Datsun 710 22.8 4 108 93 3.85
## 4 Hornet 4 Drive 21.4 6 258 110 3.08
## 5 Hornet Sportabout 18.7 8 360 175 3.15
Doubtless you’ve used the data to generate amazing insights for your work or research department or client. You save the data to your hard drive along with the code so that you can refer to it later if needed.
A few months later, they want you to re-run the analysis with updated data.
Approach 1: Overwrite
What happens when you want to re-run a job? You need to pull the updated dataset and cache it as well. How do you cache it? Well, you could overwrite the old cached table (for example, by simply re-running your script with hard-coded save paths), but then you won’t be able to reproduce your older results.
# overwriting your original data
data <- mtcars_tbl[2:6, 1:6]
# this is now different
data
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 Wag 21 6 160 110 3.9
## 2 Datsun 710 22.8 4 108 93 3.85
## 3 Hornet 4 Drive 21.4 6 258 110 3.08
## 4 Hornet Sportabout 18.7 8 360 175 3.15
## 5 Valiant 18.1 6 225 105 2.76
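On disk, this overwrite typically looks something like the line below; re-running a script with a hard-coded save path silently clobbers the previous cache. The cache/data.csv path is hypothetical.

# overwriting the cached file in place; the path is hypothetical
write_csv(data, "cache/data.csv")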
Approach 2: Save every copy
You could save a new complete copy each time you run the code, maybe using a date-driven file.path() filename scheme that produces names like 20220714-data.csv. In the code block below we’ve saved each pull as a new R object.
This is the simplest non-destructive strategy, but it means you’ll be saving lots of redundant data. If you’ve got a 1 million row data frame that adds 1000 rows every day, about 99.9% of a daily dump will be redundant. This gets expensive fast for frequently refreshed data.
# saving each pull as its own object
data1 <- mtcars_tbl[1:5, 1:6]
data2 <- mtcars_tbl[2:6, 1:6]
data3 <- mtcars_tbl[3:7, 1:6]
data4 <- mtcars_tbl[4:8, 1:6]
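If you were writing these to disk instead of keeping them as R objects, the date-driven filename scheme mentioned above might look like this sketch; the cache/ directory is hypothetical.

# a sketch of the date-stamped filename scheme described above
snapshot_path <- file.path("cache", paste0(format(Sys.Date(), "%Y%m%d"), "-data.csv"))
# e.g. "cache/20220714-data.csv"
write_csv(data1, snapshot_path)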
In the above set of data frames, each successive pull removes a single row and adds a new row. This adds up to 11 row-level operations (5 initial row creations in data1, plus one deletion and one addition for each of data2, data3 and data4). So theoretically we only need 11 rows to characterize all the row-level changes in the above group of data frames.
However, the ‘save everything’ approach records the same information as 5 rows in each of 4 data frames, for a total of 20 rows of data. That’s nearly double the space usage, and this is for a data source that mutates at 20% per pull! Imagine how bad the redundancy ratio gets for larger, more slowly mutating datasets.
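Back-of-envelope, for the 1-million-row example from earlier:

# share of a daily dump that is redundant for a 1 million row
# table gaining 1000 rows per pull
new_rows   <- 1000
total_rows <- 1e6 + new_rows
1 - new_rows / total_rows
## [1] 0.999001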
Approach 3: Save just the difference
Ok, so you’ve been running this analysis as a long-running project, saving every copy, and now your hard drive is full.
You’re willing to invest a bit more effort in reducing your data redundancy. The diffing approach is much more space-efficient, and you can refresh as frequently as you want with very little storage penalty. The diffdfs package makes this easy.
The basic idea is that you can compute the diff between two data frames, and then ‘chain’ the diffs to reconstruct the dataset at any point in time.
# computing a diff
diffdfs(data1, data2, key_cols = "model")
## operation model mpg cyl disp hp drat
## 1 new Mazda RX4 21.0 6 160 110 3.90
## 2 deleted Valiant 18.1 6 225 105 2.76
In the example above, we see that there were two row-level changes between data2 and data1: we got data on a new model, the Mazda RX4, and we lost the data for one model, the Valiant. (Note the argument order: the first argument to diffdfs is treated as the newer data frame, as the loop further down also shows, so this diff describes moving from data2 to data1.)
We can compute the diff between each successive dataset, creating a chain of diffs that contain all the information necessary to reconstruct any version of the data.
Using diffdfs to create a time-travelling dataset
Let’s implement a quick and dirty versioned dataset using diffdfs.
First we combine all the data versions into a list. The first element of the list is NA because when we start, we don’t have any data. That is, data1 is the first addition to our dataset.
all_data <- list(NA, data1, data2, data3, data4)
Now, let’s work through that list, computing the diff between each successive dataset.
diffs <- list()
# diff each version against the one that came before it
for (i in seq_len(length(all_data) - 1)) {
  diffs[[i]] <- diffdfs(all_data[[i + 1]], all_data[[i]], key_cols = "model")
}
We can tag each of these diffs with a version number so that we can keep them in order. In a real example you could use a timestamp for the diff_num value.
for (i in seq_along(diffs)) {
  diffs[[i]]$diff_num <- i
}
Now let’s combine them into a single data frame. We will do this with bind_rows; in a cache-to-disk scenario you could write the diffs to a single directory and read the entire directory in as one dataset using the arrow::open_dataset function.
diff_df <- map_df(diffs, bind_rows)
diff_df
## operation model mpg cyl disp hp drat diff_num
## 1 new Mazda RX4 21.0 6 160.0 110 3.90 1
## 2 new Mazda RX4 Wag 21.0 6 160.0 110 3.90 1
## 3 new Datsun 710 22.8 4 108.0 93 3.85 1
## 4 new Hornet 4 Drive 21.4 6 258.0 110 3.08 1
## 5 new Hornet Sportabout 18.7 8 360.0 175 3.15 1
## 6 new Valiant 18.1 6 225.0 105 2.76 2
## 7 deleted Mazda RX4 21.0 6 160.0 110 3.90 2
## 8 new Duster 360 14.3 8 360.0 245 3.21 3
## 9 deleted Mazda RX4 Wag 21.0 6 160.0 110 3.90 3
## 10 new Merc 240D 24.4 4 146.7 62 3.69 4
## 11 deleted Datsun 710 22.8 4 108.0 93 3.85 4
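As an aside, here’s a sketch of the cache-to-disk variant mentioned above: write each diff to a shared directory as a parquet file, then read the whole directory back as one dataset. The diffs/ directory name is hypothetical.

# write each diff to its own parquet file in a shared directory
library(arrow)
dir.create("diffs", showWarnings = FALSE)
for (i in seq_along(diffs)) {
  write_parquet(diffs[[i]], file.path("diffs", paste0(i, ".parquet")))
}

# read the whole directory back as a single dataset
diff_df_from_disk <- collect(open_dataset("diffs"))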
So we have a single data frame with a complete record of all the changes that have happened to the dataset. Let’s create a function to reconstruct the dataset at any point in the past. This sounds daunting, but it can be accomplished in 8 lines of piped code.
resurrect_df <- function(diff_df, key_cols, version) {
  diff_df %>%
    # order by version so that `last` picks the most recent row per key
    arrange(diff_num, across(all_of(key_cols))) %>%
    filter(diff_num <= version) %>%
    # group by the key column(s)
    group_by(across(all_of(key_cols))) %>%
    # take only the last row for each key
    summarise(across(everything(), last), .groups = "keep") %>%
    ungroup() %>%
    # drop any rows where the last operation was 'deleted'
    filter(operation != "deleted") %>%
    # drop the operation and diff columns
    select(-diff_num, -operation)
}
Let’s test it out.
resurrect_df(diff_df, "model", 3)
## # A tibble: 5 × 6
## model mpg cyl disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Datsun 710 22.8 4 108 93 3.85
## 2 Duster 360 14.3 8 360 245 3.21
## 3 Hornet 4 Drive 21.4 6 258 110 3.08
## 4 Hornet Sportabout 18.7 8 360 175 3.15
## 5 Valiant 18.1 6 225 105 2.76
Let’s verify it works.
identical(resurrect_df(diff_df, "model", 3), arrange(data3, model))
## [1] TRUE
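The same check should hold for any version; reconstructing the latest state, for instance, should likewise return TRUE:

# version 4 is the most recent diff, so this should match data4
identical(resurrect_df(diff_df, "model", max(diff_df$diff_num)), arrange(data4, model))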
Well, there you have it.
Final comments
The diffdfs package does just one thing, which is to compute diffs between data frames. It’s just one building block of a complete diff-based caching strategy, but from the example above you can see that a simple diffing solution isn’t hard to whip up.
I’m using diffdfs as the core diffing engine in my more comprehensive data versioning package, dataversionr, but that’s a far more opinionated project, so I wanted to keep diffdfs separate because it has much wider general applicability.