Snaphotting partially changing tables with dbt
Ok, this one is mostly for my future self to refer to. I had an interesting issue this week with dbt snapshots.
Basically, we’ve got this situation where we get a report dumped in a Google Sheet at unpredictable times. We are not notified when the report is updated. The sheet has NO date field, so if you look at it a few days later, you can’t even tell when it was changed.
However, as soon as it has changed, we need to create a separate report, with all the keys from that original report.
Some background: How dbt snapshots work
Here is an example original report. Let’s call it source
:
| key | col1 | col2 | col3 |
|-----|------|------|------|
| a | 1 | 4 | 7 |
| b | 2 | 5 | 8 |
| c | 3 | 6 | 9 |
Now, let’s snapshot this using dbt snapshots, using the following SQL:
{% snapshot source_snapshot %}
{{
config(
unique_key='key',
strategy='check',
check_cols='all'
)
}}
select * from {{ source('google_sheets', 'source') }}
{% endsnapshot %}
When we run the snapshot, we’ll create a table that looks something like this:
| key | col1 | col2 | col3 | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | slslsl | t1 | t1 | NULL |
| b | 2 | 5 | 8 | kfenekenv | t1 | t1 | NULL |
| c | 3 | 6 | 9 | scknwcnw | t1 | t1 | NULL |
Don’t worry about the dbt_scd_id
- that’s just a hash that dbt uses. The key columns are the timestamp columns (dbt_updated_at
, dbt_valid_from
, and dbt_valid_to
).
When a new value for, say, key=a
arrives, the snapshot operation will add the current timestamp to the dbt_valid_to
field of the existing key=a
row, and add a new row, with key=a
, the new values, dbt_valid_from
as the current timestamp, and the dbt_valid_to
field NULL.
So, if a new report is dumped in the google sheet that looks like this (note key a
has changed):
| key | col1 | col2 | col3 |
|-----|------|------|------|
| a | 10 | 11 | 12 |
| b | 2 | 5 | 8 |
| c | 3 | 6 | 9 |
The snapshot table will, after the snapshotting operation, look like this:
| key | col1 | col2 | col3 | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | slslsl | t1 | t1 | t2 |
| b | 2 | 5 | 8 | kfenekenv | t1 | t1 | NULL |
| c | 3 | 6 | 9 | scknwcnw | t1 | t1 | NULL |
| a | 10 | 11 | 12 | pwvnwvn | t2 | t2 | NULL |
So you can see that by filtering on the valid_from
and valid_to
fields, you can travel through time. You could get the latest version of your report by simply querying
SELECT *
FROM
{{ref('source_snapshot')}}
WHERE
dbt_valid_to IS NULL
ORDER BY key
And you can plainly see that you will get you:
| key | col1 | col2 | col3 |
|-----|------|------|------|
| a | 10 | 11 | 12 |
| b | 2 | 5 | 8 |
| c | 3 | 6 | 9 |
The problem
Snapshot tables work really well if you want to track row level changes. We can ask things like, ‘show me how key a
has changed over time’:
SELECT
*
FROM
{{ref('source_snapshot')}}
WHERE
key = 'a'
Which returns
| key | col1 | col2 | col3 | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | slslsl | t1 | t1 | t2 |
| a | 10 | 11 | 12 | pwvnwvn | t2 | t2 | NULL |
That is really cool, because now you can do things like track drift over time and ask whether those drifts are expected. Super useful if you have a data source you don’t trust, for instance.
But what we needed was to track report level changes. We wanted to be able to get the full report for any snapshot period.
Let’s just reproduce the snapshot table here to save you some scrolling:
| key | col1 | col2 | col3 | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | slslsl | t1 | t1 | t2 |
| b | 2 | 5 | 8 | kfenekenv | t1 | t1 | NULL |
| c | 3 | 6 | 9 | scknwcnw | t1 | t1 | NULL |
| a | 10 | 11 | 12 | pwvnwvn | t2 | t2 | NULL |
Using the example above, we basically wanted to be able to ask “what did the report look like at time t1”? That’s pretty easy:
SELECT
*
FROM
{{ref('source_snapshot')}}
WHERE
dbt_valid_from = t1
…gives you…
| key | col1 | col2 | col3 | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | slslsl | t1 | t1 | t2 |
| b | 2 | 5 | 8 | kfenekenv | t1 | t1 | NULL |
| c | 3 | 6 | 9 | scknwcnw | t1 | t1 | NULL |
… which is correct!
But how do you ask for what the report looked like at t2? It’s not so simple.
For one thing, keys b
and c
did not change in t2. So they have no record of any modification at that time. You could possibly create some complex query that achieves the following (this is just made up SQL, it’s just illustrative):
SELECT
MAX rows entry
FROM
{{ref('source_snapshot')}}
WHERE
dbt_valid_from <>= t2
ORDER BY
dbt_valid_from
PARTITION BY
key
Basically some sort of ordering combined with time filtering. But this is pretty flakey. For one thing, you might not have a single primary key. Or it might change. Or a record might be removed from the source
.
And, remember, we use this to trigger the generation of another report that needs to execute at the exact time of a report being captured.
The solution
Ok, so what we decided to do was to snapshot the entire report, each time it changed. The report is not so big that this will be expensive. So we needed some way of detecting that the report had changed.
Step 1: Create a fingerprint_table macro
This macro will accept a relation as an argument, and return the farm fingerprint of the farm fingerprints of all the rows. Basically, if a single value in the entire table changes, it will return a different value.
{% macro fingerprint_table(a) %}
WITH col_fingerprints as (
SELECT
CAST(FARM_FINGERPRINT(FORMAT("%T", a)) as string) AS fingerprint
FROM
{{a}} AS a)
SELECT FARM_FINGERPRINT(STRING_AGG(fingerprint ORDER BY FINGERPRINT)) as fingerprint FROM col_fingerprints
{% endmacro %}
Step 2: Create a view that is just a fingerprint of your source
source_fingerprint.sql
:
{{ fingerprint_table(source('google_sheets', 'source')) }}
This view will look something like this:
| fingerprint |
|-------------|
| 3983737829 |
Step 3: Create a model that is just your original report, with the fingerprint added as a new column
source_with_fingerprint.sql
:
SELECT
fingerprint.fingerprint as table_fingerprint,
base.*
FROM
{{ source('google_sheets', 'source') }}base,
{{ ref('source_fingerprint') }} fingerprint
So that view will look something like this:
| key | col1 | col2 | col3 | fingerprint |
|-----|------|------|------|-------------|
| a | 1 | 4 | 7 | 3983737829 |
| b | 2 | 5 | 8 | 3983737829 |
| c | 3 | 6 | 9 | 3983737829 |
Snapshot the fingerprinted view
{% snapshot source_snapshot %}
{{
config(
unique_key='key',
strategy='check',
check_cols='all'
)
}}
select * from {{ ref('source_with_fingerprint') }}
{% endsnapshot %}
Now your snapshot table looks like this:
| key | col1 | col2 | col3 | fingerprint | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|-------------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | 3983737829 | slslsl | t1 | t1 | NULL |
| b | 2 | 5 | 8 | 3983737829 | kfenekenv | t1 | t1 | NULL |
| c | 3 | 6 | 9 | 3983737829 | scknwcnw | t1 | t1 | NULL |
If you upload a new report that just changes one value, like this:
| key | col1 | col2 | col3 |
|-----|------|------|------|
| a | 999 | 4 | 7 |
| b | 2 | 5 | 8 |
| c | 3 | 6 | 9 |
Your snapshot will look like this:
| key | col1 | col2 | col3 | fingerprint | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|-------------|------------|----------------|----------------|--------------|
| a | 1 | 4 | 7 | 3983737829 | slslsl | t1 | t1 | t2 |
| b | 2 | 5 | 8 | 3983737829 | kfenekenv | t1 | t1 | t2 |
| c | 3 | 6 | 9 | 3983737829 | scknwcnw | t1 | t1 | t2 |
| a | 999 | 4 | 7 | 0293883839 | lslkks | t2 | t2 | NULL |
| b | 2 | 5 | 8 | 0293883839 | fjjrnbn | t2 | t2 | NULL |
| c | 3 | 6 | 9 | 0293883839 | dodiis | t2 | t2 | NULL |
Sure, it duplicates information, but it allows me to fulfil the original task very easily
SELECT
key
FROM
{{ref('source_snapshot')}}
WHERE
dbt_valid_from = t2
Gives us
| key | col1 | col2 | col3 | fingerprint | dbt_scd_id | dbt_updated_at | dbt_valid_from | dbt_valid_to |
|-----|------|------|------|-------------|------------|----------------|----------------|--------------|
| a | 999 | 4 | 7 | 0293883839 | lslkks | t2 | t2 | NULL |
| b | 2 | 5 | 8 | 0293883839 | fjjrnbn | t2 | t2 | NULL |
| c | 3 | 6 | 9 | 0293883839 | dodiis | t2 | t2 | NULL |