dataversionr: time versioned data frames in R
Yesterday I found out that dataversionr, my time versioning package for R, has landed on CRAN. This means that install.packages("dataversionr") will work in any R console. If you want to install directly from my development branch, you can use devtools::install_github("riazarbi/dataversionr"), and you can visit riazarbi/dataversionr to poke around the code.
dataversionr formalizes functionality that I have been developing in my pipelines on and off for the last four years. It makes it easy to write a data frame to disk and keep a record of how the data frame has changed over time.
Because I’m using arrow for the heavy lifting, any backend supported by arrow will work. At this time, that includes:
- local disk
- S3-compatible object storage (AWS S3, minio, etc.; see the sketch after this list)
- GCS
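For example, here is a hedged sketch of creating a dv on S3 rather than local disk. The bucket name is a placeholder, credentials are assumed to come from your environment, and I’m assuming create_dv accepts an arrow filesystem location (built via arrow::s3_bucket()) just as it accepts a local path:

library(arrow)
library(dataversionr)
# Hypothetical bucket; AWS credentials are read from the environment
bucket <- s3_bucket("my-dataversionr-bucket")
df <- data.frame(key = 1:3, value = c("a", "b", "c"))
# Assumption: an arrow SubTreeFileSystem works as a location here
create_dv(df, bucket$path("my_dv"), key_cols = "key", diffed = TRUE)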
Time versioning is a useful property in a range of situations. I use it extensively for maintaining datasets where look-ahead bias is a risk: being able to retrieve the dataset as it was at the time of ingestion allows us to train on data free from this bias.
You can also use it as part of a larger workflow to monitor data mutation rates, and insert tripwires in your pipeline that abort a write when the dataset changes in an anomalous manner (I sketch one way to do this further down). This is really useful when you are ingesting data from an untrusted source.
I’ll do a more in-depth write-up at some point in the future, but for now I just want to give you a high-level ‘getting started’ overview of the functionality. This sticks pretty close to the README in the repo.
Getting started
The highest-level functions in this package are intended to introduce as little additional overhead as possible over the base R read, write and unlink functions:
- create_dv: create a time versioned dataset on either a local hard drive or at an S3 location.
- read_dv: retrieve a dataset from the location. Specify a time in the past to obtain an historical version of the dataset.
- update_dv: write a new version of the dataset to the location.
- destroy_dv: completely erase the files at the location.
These four functions will be all that you need 99% of the time. The other functions in the package are the building blocks of the *_dv functions, but I’ve exported them to the user on the basis that at some point someone will want fine-grained access for some edge case.
Here’s an example of how to use the *_dv functions:
library(dataversionr)
library(dplyr)
location <- tempfile()
new_df <- iris[1:5,3:5] %>% mutate(key = 1:nrow(.))
Create a dv:
create_dv(new_df,
          location,
          key_cols = "key",    # column(s) that uniquely identify a row
          diffed = TRUE,       # store row-level diffs between versions
          backup_count = 10L)  # retain up to 10 backups of the dataset
Checking that new_df can be diffed...
Diff test passed.
[1] TRUE
Update a dv:
newer_df <- new_df
newer_df[1, 1] <- 2
update_dv(newer_df,
location)
[1] TRUE
If we try to update again:
update_dv(newer_df,
location)
No changes detected. Exiting.
[1] FALSE
Delete a row and update:
newest_df <- newer_df[2:5,]
update_dv(newest_df,
location)
[1] TRUE
Read a dv:
read_dv(location)
Petal.Length Petal.Width Species key
1 2.0 0.2 setosa 1
2 1.4 0.2 setosa 2
3 1.3 0.2 setosa 3
4 1.5 0.2 setosa 4
5 1.4 0.2 setosa 5
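This is also where the time-travel read from earlier comes in. A hedged sketch, assuming the historical-read argument is named as_of and that stored timestamps are UTC (check ?read_dv); the timestamp is taken from the diff summary below:

# Read the dataset as it stood just after the first write
read_dv(location, as_of = as.POSIXct("2022-08-16 12:59:00", tz = "UTC"))
# should return the original five rows, before the Petal.Length edit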
Summarise diffs:
summarise_diffs(location)
diff_timestamp new modified deleted
1 2022-08-16 12:58:29 5 NA NA
2 2022-08-16 12:59:14 NA 1 NA
3 2022-08-16 13:04:15 NA NA 1
Or connect directly to the diff dataset:
get_diffs(location)
diff_timestamp operation Petal.Length Petal.Width Species key
1 2022-08-16 12:58:29 new 1.4 0.2 setosa 1
2 2022-08-16 12:58:29 new 1.4 0.2 setosa 2
3 2022-08-16 12:58:29 new 1.3 0.2 setosa 3
4 2022-08-16 12:58:29 new 1.5 0.2 setosa 4
5 2022-08-16 12:58:29 new 1.4 0.2 setosa 5
6 2022-08-16 12:59:14 modified 2.0 0.2 setosa 1
7 2022-08-16 13:04:15 deleted 2.0 0.2 setosa 1
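These diff tables are what make the tripwire idea from earlier practical. Here’s a minimal sketch of one way to do it with plain dplyr, outside anything the package itself provides: diff the incoming data frame against the current version and refuse to write if the mutation rate looks anomalous.

library(dplyr)
incoming <- newest_df                  # the data frame you are about to write
current <- read_dv(location)
# rows that are new or modified relative to the current version
changed <- anti_join(incoming, current, by = names(incoming))
# keys that would disappear entirely
deleted <- anti_join(current, incoming, by = "key")
mutation_rate <- (nrow(changed) + nrow(deleted)) / nrow(current)
if (mutation_rate > 0.1) {
  stop("Anomalous mutation rate; refusing to update.")
}
update_dv(incoming, location)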
Destroy a dv:
destroy_dv(location, prompt = FALSE)
[1] TRUE
Further Development
I took a lot of care in building robust tests for the package, so everything is working as expected. There are a few rough edges (chatty messages that escaped my verbose flags) that I’ll clean up in my next release.
I want to put together a few vignettes to help users get started with more exotic use cases (using an S3 or GCS backend; using put_diff to build append-only timeseries datasets), and to create a package site using pkgdown. But these must wait a few weeks while I get my teeth into my backtester.
I haven’t run any benchmarks yet, but that’s also on the radar. The package handles large amounts of data quite well, but I know from unrelated work that if you’re going to be rapidly calling different versions of a dataset (say, if you’re stepping through slices in time and building a model at each time period), it makes sense to move the entire dv into RAM via tmpfs or similar. Maybe I’ll add a load_dv function if I can figure out the cross-platform edge cases. You can’t get into CRAN unless your code runs correctly on, like, 10 different operating systems.
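In the meantime, on Linux you can get much the same effect by hand, since /dev/shm is a tmpfs mount on most distributions. A sketch (not cross-platform, which is exactly the problem a load_dv would have to solve):

# Copy the whole dv directory into tmpfs, then read versions from RAM
file.copy(location, "/dev/shm", recursive = TRUE)
ram_location <- file.path("/dev/shm", basename(location))
read_dv(ram_location)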