#008: Dataset history via file dumps #8

Open
opened 2026-04-05 12:58:42 +00:00 by daniel · 0 comments

Blocked by

  • #005 SHACL migration detection (005-shacl-migration-detection.org): history is triggered by migrations

Summary

Save dataset history as serialized RDF dump files, stored in the file storage system (Hetzner storage box) alongside PDF uploads. Each dump is a Turtle or N-Quads file capturing the state of the dataset at a point in time.

Design sketch

When a migration is performed:

  1. Export the affected graph as a Turtle/N-Quads file.
  2. Store it in the team's file storage: data/files/{team}/dumps/{dataset}-{timestamp}.ttl
  3. Record metadata in the meta-dataset (file path, timestamp, SHACL version, reason).
  4. Apply the migration to the dataset.

The dumps can also be registered in the file catalog for discoverability via SPARQL.

Pros

  • Minimal impact on dataset and RDF store size
  • Dumps are portable — can be loaded into any RDF tool
  • Leverages existing file storage infrastructure (storage box)
  • Can handle very large datasets without bloating the store

Cons

  • Not directly queryable — must be loaded into a store first to query historical data
  • File management overhead (cleanup, retention policies)
  • Serialization/deserialization time for large datasets

Configuration

Enabled per-dataset via meta-dataset:

<urn:config:history> concon:historyStrategy concon:FileDumps ;
    concon:dumpFormat "text/turtle" .
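The per-dump metadata from step 3 might then look like the following in the meta-dataset. Every term below beyond the two shown above is an assumption for illustration, not an existing vocabulary:

```turtle
<urn:dump:papers-1712319522> a concon:HistoryDump ;
    concon:dumpFile "data/files/team-a/dumps/papers-1712319522.ttl" ;
    concon:dumpedAt "2026-04-05T12:58:42Z"^^xsd:dateTime ;
    concon:shaclVersion "3" ;
    concon:migrationReason "datatype changed on an existing property" .
```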

Tests

Unit tests

#[test]
fn test_dump_file_created_on_migration() {
    // Configure FileDumps strategy
    // Run migration
    // Verify: Turtle file exists at expected path
}

#[test]
fn test_dump_file_contains_pre_migration_data() {
    // Run migration
    // Parse the dump file
    // Verify: contains the old data, not the migrated data
}

#[test]
fn test_dump_metadata_in_meta_dataset() {
    // Run migration
    // Query meta-dataset for dump metadata
    // Verify: file path, timestamp, format recorded
}

#[test]
fn test_dump_registered_in_file_catalog() {
    // Run migration with catalog registration enabled
    // Query file catalog
    // Verify: dump appears as a file entry
}

Manual tests

  1. Enable FileDumps strategy for a dataset
  2. Run migration, verify dump file appears in storage
  3. Download dump file — verify it's valid Turtle/N-Quads
  4. Load dump into a local store — verify historical data queryable
  5. Check file catalog — verify dump is discoverable
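Manual tests 3 and 4 could be run with Apache Jena's command-line tools, assuming they are installed; the dump path is hypothetical:

```shell
DUMP=data/files/team-a/dumps/papers-1712319522.ttl

if command -v riot >/dev/null 2>&1; then
  # Manual test 3: the dump parses as valid Turtle.
  riot --validate "$DUMP"

  # Manual test 4: load into a throwaway TDB2 store and query the history.
  tdb2.tdbloader --loc /tmp/history-db "$DUMP"
  tdb2.tdbquery --loc /tmp/history-db \
    'SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'
else
  echo "jena tools not installed; skipping"
fi
```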