dataversionr: time versioned data frames in R
Yesterday I found out that dataversionr, my time versioning package for R, has landed on CRAN. This means that install.packages("dataversionr") will work in any R console. If you want to install directly from my development branch, you can use devtools::install_github("riazarbi/dataversionr"), and you can visit riazarbi/dataversionr to poke around the code.
dataversionr formalizes functionality that I have been developing in my pipelines on and off for the last four years. It makes it easy to write a data frame to disk and keep a record of how the data frame has changed over time.
Because I’m using arrow for the heavy lifting, any backend supported by arrow will work. At this time, that includes:
- local disk
- S3-compatible object storage (AWS S3, minio, etc.; see the sketch after this list)
- GCS
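For example, here is a hedged sketch of creating a dv on S3 rather than local disk. The bucket name is a placeholder, credentials are assumed to come from your environment, and I’m assuming create_dv accepts an arrow filesystem location (built via arrow::s3_bucket()) just as it accepts a local path:

library(arrow)
library(dataversionr)
# Hypothetical bucket; AWS credentials are read from the environment
bucket <- s3_bucket("my-dataversionr-bucket")
df <- data.frame(key = 1:3, value = c("a", "b", "c"))
# Assumption: an arrow SubTreeFileSystem works as a location here
create_dv(df, bucket$path("my_dv"), key_cols = "key", diffed = TRUE)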
Time versioning is a useful property in a range of situations. I use it extensively for maintaining datasets where look-ahead bias is a risk: being able to retrieve the dataset as it was at the time of ingestion allows us to train on data free from this bias.
You can also use it as part of a larger workflow to monitor data mutation rates, and insert tripwires in your pipeline that abort a write when the dataset changes in an anomalous manner (I sketch one way to do this further down). This is really useful when you are ingesting data from an untrusted source.
I’ll do a more in-depth write-up at some point in the future, but for now I just want to give you a high-level ‘getting started’ overview of the functionality. This sticks pretty close to the README in the repo.
Getting started
The highest-level functions in this package are intended to introduce as little additional overhead as possible over the base R read, write and unlink functions:
- create_dv: create a time versioned dataset on either a local hard drive or at an S3 location.
- read_dv: retrieve a dataset from the location. Specify a time in the past to obtain an historical version of the dataset.
- update_dv: write a new version of the dataset to the location.
- destroy_dv: completely erase the files at the location.
These four functions will be all that you need 99% of the time. The other functions in the package are the building blocks of the *_dv functions, but I’ve exported them to the user on the basis that at some point someone will want fine-grained access for some edge case.
Here’s an example of how to use the *_dv functions:
library(dataversionr)
library(dplyr)
location <- tempfile()
new_df <- iris[1:5,3:5] %>% mutate(key = 1:nrow(.))
Create a dv:
create_dv(new_df,
          location,
          key_cols = "key",    # column(s) that uniquely identify a row
          diffed = TRUE,       # store row-level diffs between versions
          backup_count = 10L)  # retain up to 10 backups of the dataset
Checking that new_df can be diffed...
Diff test passed.
[1] TRUE
Update a dv:
newer_df <- new_df
newer_df[1, 1] <- 2
update_dv(newer_df,
location)
[1] TRUE
If we try to update again:
update_dv(newer_df,
location)
No changes detected. Exiting.
[1] FALSE
Delete a row and update:
newest_df <- newer_df[2:5,]
update_dv(newest_df,
location)
[1] TRUE
Read a dv:
read_dv(location)
Petal.Length Petal.Width Species key
1 2.0 0.2 setosa 1
2 1.4 0.2 setosa 2
3 1.3 0.2 setosa 3
4 1.5 0.2 setosa 4
5 1.4 0.2 setosa 5
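This is also where the time-travel read from earlier comes in. A hedged sketch, assuming the historical-read argument is named as_of and that stored timestamps are UTC (check ?read_dv); the timestamp is taken from the diff summary below:

# Read the dataset as it stood just after the first write
read_dv(location, as_of = as.POSIXct("2022-08-16 12:59:00", tz = "UTC"))
# should return the original five rows, before the Petal.Length edit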
Summarise diffs:
summarise_diffs(location)
diff_timestamp new modified deleted
1 2022-08-16 12:58:29 5 NA NA
2 2022-08-16 12:59:14 NA 1 NA
3 2022-08-16 13:04:15 NA NA 1
Or connect directly to the diff dataset:
get_diffs(location)
diff_timestamp operation Petal.Length Petal.Width Species key
1 2022-08-16 12:58:29 new 1.4 0.2 setosa 1
2 2022-08-16 12:58:29 new 1.4 0.2 setosa 2
3 2022-08-16 12:58:29 new 1.3 0.2 setosa 3
4 2022-08-16 12:58:29 new 1.5 0.2 setosa 4
5 2022-08-16 12:58:29 new 1.4 0.2 setosa 5
6 2022-08-16 12:59:14 modified 2.0 0.2 setosa 1
7 2022-08-16 13:04:15 deleted 2.0 0.2 setosa 1
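These diff tables are what make the tripwire idea from earlier practical. Here’s a minimal sketch of one way to do it with plain dplyr, outside anything the package itself provides: diff the incoming data frame against the current version and refuse to write if the mutation rate looks anomalous.

library(dplyr)
incoming <- newest_df                  # the data frame you are about to write
current <- read_dv(location)
# rows that are new or modified relative to the current version
changed <- anti_join(incoming, current, by = names(incoming))
# keys that would disappear entirely
deleted <- anti_join(current, incoming, by = "key")
mutation_rate <- (nrow(changed) + nrow(deleted)) / nrow(current)
if (mutation_rate > 0.1) {
  stop("Anomalous mutation rate; refusing to update.")
}
update_dv(incoming, location)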
Destroy a dv:
destroy_dv(location, prompt = FALSE)
[1] TRUE
Further Development
I took a lot of care in building robust tests for the package, so everything is working as expected. There are a few rough edges (chatty messages that escaped my verbose flags) that I’ll clean up in my next release.
I want to put together a few vignettes to help users get started with more exotic use cases (using an S3 or GCS backend; using put_diff to build append-only timeseries datasets), and to create a package site using pkgdown. But these must wait a few weeks while I get my teeth into my backtester.
I haven’t run any benchmarks yet, but that’s also on the radar. The package handles large amounts of data quite well, but I know from unrelated work that if you’re going to be rapidly calling different versions of a dataset (say, if you’re stepping through slices in time and building a model at each time period), it makes sense to move the entire dv into RAM via tmpfs or similar. Maybe I’ll add a load_dv function if I can figure out the cross-platform edge cases. You can’t get into CRAN unless your code runs correctly on, like, 10 different operating systems.
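In the meantime, on Linux you can get much the same effect by hand, since /dev/shm is a tmpfs mount on most distributions. A sketch (not cross-platform, which is exactly the problem a load_dv would have to solve):

# Copy the whole dv directory into tmpfs, then read versions from RAM
file.copy(location, "/dev/shm", recursive = TRUE)
ram_location <- file.path("/dev/shm", basename(location))
read_dv(ram_location)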