Access archived data
Sometimes, you may need to access an archived dataset or snapshot and compare it with the current one. Archived steps are no longer kept as code in the repository — instead, dag/archive/*.yml records each step that was once active, with a marker comment carrying the commit where it was last active. You recover a step by checking out that commit.
Git History
The simplest way to access an older dataset is by checking out the commit where the step was last active and running the ETL from there.
- Find the commit of interest:
  - Look up the step in dag/archive/*.yml. Its marker comment gives the recovery commit, e.g. # archived; last active in 4e6b5dfb9cb7 on 2026-05-11.
  - (Alternatively, open the file in GitHub, click History, and copy the SHA of the desired commit.)
- Check out the commit (git checkout followed by the SHA).
- Re-run the ETL for the step from that checkout.
Tip
Run this in a separate folder (e.g., etl2) to retain access to the current datasets. This setup allows you to compare datasets in a notebook.
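The commit lookup in the first step can also be scripted. The following is a minimal sketch, not part of the ETL tooling: it assumes the marker comment sits on the same line as the step and follows the format of the example above. The function name find_recovery_commits and the sample archive entry are hypothetical.

```python
import re

# Assumed marker format, taken from the example marker comment:
#   # archived; last active in <sha> on <YYYY-MM-DD>
MARKER = re.compile(
    r"#\s*archived;\s*last active in\s+(?P<sha>[0-9a-f]{7,40})"
    r"\s+on\s+(?P<date>\d{4}-\d{2}-\d{2})"
)


def find_recovery_commits(yaml_text: str, step_fragment: str) -> list[tuple[str, str, str]]:
    """Return (line, sha, date) for marked lines that mention step_fragment."""
    hits = []
    for line in yaml_text.splitlines():
        match = MARKER.search(line)
        if match and step_fragment in line:
            hits.append((line.strip(), match["sha"], match["date"]))
    return hits


# Hypothetical archive entry; the real layout of dag/archive/*.yml may differ.
sample = """
steps:
  data://garden/climate/latest/weekly_wildfires:  # archived; last active in 4e6b5dfb9cb7 on 2026-05-11
"""

for _, sha, date in find_recovery_commits(sample, "weekly_wildfires"):
    print(sha, date)
```

In practice you would read the text of each file under dag/archive/ and pass it in place of sample. If the marker turns out to live on a separate line from the step, the matching logic would need adjusting.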
Example comparison in Python
from etl.dataset import Dataset
# Load current dataset
tb_current = Dataset("~/projects/etl/data/garden/climate/latest/weekly_wildfires").read_table('wildfires')
# Load dataset from a previous commit
tb_old = Dataset("~/projects/etl2/data/garden/climate/latest/weekly_wildfires").read_table('wildfires')
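Once both tables are loaded, one way to see what differs is a keyed row comparison. The sketch below is an illustration using only the standard library; diff_rows, the column names, and the sample rows are all made up. With real tables, the rows could come from something like tb.reset_index().to_dict("records"), assuming the table behaves like a pandas DataFrame.

```python
def diff_rows(old_rows, new_rows, key_cols):
    """Compare two lists of row dicts, matching rows on key_cols.

    Returns (added, removed, changed), each a sorted list of key tuples.
    """
    def index(rows):
        return {tuple(row[c] for c in key_cols): row for row in rows}

    old, new = index(old_rows), index(new_rows)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


# Made-up rows standing in for tb_old and tb_current.
old_rows = [
    {"country": "Chile", "year": 2023, "area_ha": 100.0},
    {"country": "Spain", "year": 2023, "area_ha": 50.0},
]
new_rows = [
    {"country": "Chile", "year": 2023, "area_ha": 120.0},
    {"country": "Spain", "year": 2023, "area_ha": 50.0},
    {"country": "Spain", "year": 2024, "area_ha": 10.0},
]

added, removed, changed = diff_rows(old_rows, new_rows, ["country", "year"])
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Note that rows containing NaN values would show up as changed, because NaN never compares equal to itself. Fill or drop missing values first if that matters for your comparison.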
Update MD5 for archived Snapshots
If the code hasn’t changed and only new snapshots have been created (e.g., for automatically updated datasets), you can modify the snapshot MD5 in the .dvc file to point to an older snapshot.
- Find the MD5 and size:
  - Locate the desired commit in GitHub.
  - Copy the MD5 and size from the relevant .dvc file (e.g., snapshots/climate/latest/weekly_wildfires.csv.dvc).
- Update the .dvc file locally:
  - Replace the MD5 and size in your local .dvc file.
- Re-run the ETL so that it picks up the updated MD5.
Tip
For chart comparisons, commit the updated .dvc file, open a PR, and use the chart diff tool. Enable "Show all charts" to view them side-by-side.
Comparing Snapshots
To directly compare snapshots, use the etl.snapshot module.
- Load the current snapshot.
- Load an older snapshot:
  - Find its MD5 and size from a previous commit.
  - Update the MD5 and size in your script.
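When juggling several snapshot versions, it helps to confirm which one a local file actually is. The sketch below is separate from the etl.snapshot module. It computes the MD5 and size of a local file so they can be compared with the values in a .dvc file, and it assumes the recorded MD5 is the plain MD5 of the file's bytes. The function name and the demo file are illustrative.

```python
import hashlib
import os
import tempfile


def md5_and_size(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return the hex MD5 digest and the size in bytes of the file at path."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large snapshot files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"country,year\nChile,2023\n")
demo_md5, demo_size = md5_and_size(tmp.name)
os.remove(tmp.name)
print(demo_md5, demo_size)
```

If the computed pair matches the MD5 and size from the older commit's .dvc file, the local file is that older snapshot.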