High level description
Kedro has a data catalog concept, which is absolutely fantastic to use. In a way, this is what ExportCfg does, but only for saving data, only in Parquet format, and only locally.
The purpose of this ticket is to extend this into a catalog that can load and save many files in a single catalog entry, all sharing the same timestamp, making it easy for the engineer to know when each dataset was generated and which data matches which run.
Requirements
Upon configuration, it shall allow versioning by saving data in a timestamped folder (MVP; this can later be extended to other versioning methodologies)
The ExportCfg should be renamed to something more relevant for loading and storing.
It shall support only the local filesystem and the S3 protocol for now, nothing else.
It shall support credentials for S3, like in Kedro
It shall be possible to load many different files from a given version, contrary to Kedro's catalog.
It shall support reading any dataframe format that Rust's arrow crate supports (at a minimum parquet and CSV)
It shall be a serializable structure, either as YAML or as Dhall
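To make the serialization requirement concrete, here is a sketch of what a YAML-serialized catalog entry could look like. All field names mirror the draft config structure in the Design section, and every value (paths, bucket, key names) is purely illustrative, not a final format:

```yaml
# Hypothetical serialized catalog config -- names and layout are illustrative.
versioning: true
storage:
  local_path: ./output/catalog
  s3_path: s3://my-bucket/catalog   # hypothetical bucket
credentials:
  aws_access_key_id: EXAMPLE_KEY_ID         # would be resolved from a secrets store in practice
  aws_secret_access_key: EXAMPLE_SECRET_KEY
files:
  trajectory: null   # entries start empty and are populated per versioned run
  events: null
```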
Test plans
Replace all uses of ExportCfg with this new approach.
Ensure that full scenario data can be reloaded from the catalog.
Design
This should also take inspiration from the MetaFile approach used in ANISE to download data behind URLs. I also wonder whether this should be its own crate!
```rust
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;

#[derive(Serialize, Deserialize, Debug)]
pub struct DataCatalogConfig {
    pub versioning: bool,
    pub storage: StorageConfig,
    pub credentials: Option<Credentials>,
    pub files: BTreeMap<String, Option<Box<dyn LoadedFile>>>,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct StorageConfig {
    pub local_path: Option<String>,
    pub s3_path: Option<String>,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct Credentials {
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
}

pub trait LoadedFile {
    fn load(&self) -> Result<Box<dyn LoadedFile>, Box<dyn std::error::Error>>;
}
```
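The timestamped-folder versioning from the requirements could be as simple as joining a base path, a run timestamp, and a file name, so that every file saved in one run lands in the same folder. This is a minimal sketch using only the standard library; the function name `versioned_path` and the path layout are assumptions, not the final API:

```rust
use std::path::PathBuf;

/// Builds the versioned output path for one file of a catalog run.
/// `timestamp` is pre-formatted once per run (e.g. "2024-01-15T10-30-00"),
/// so all files saved in that run share the same folder.
fn versioned_path(base: &str, timestamp: &str, file_name: &str) -> PathBuf {
    let mut path = PathBuf::from(base);
    path.push(timestamp);
    path.push(file_name);
    path
}

fn main() {
    // Two files from the same run end up under the same timestamped folder.
    let traj = versioned_path("./output", "2024-01-15T10-30-00", "trajectory.parquet");
    let events = versioned_path("./output", "2024-01-15T10-30-00", "events.csv");
    println!("{}", traj.display());
    println!("{}", events.display());
}
```

Loading "many different files from a given version" then reduces to listing one timestamped folder, which is also easy to mirror onto an S3 prefix.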