-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configurable versioning #1871
Configurable versioning #1871
Conversation
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Signed-off-by: nickolasrm <[email protected]>
Thank you @nickolasrm for raising this PR, impressive one. This is a big PR and I think it deserves some discussion and possibly breaking down into smaller pieces. I like that This PR is interesting as this is the 1st attempt I see someone trying to subclass Do I understand it correctly the motivation of this PR is to make Kedro recognize datasets that were generated outside of Kedro / pre-existing data? What happens if you are using only YYYY-mm-dd and there are existing datasets, does Side-note: |
I just want to add this is great work @nickolasrm we just need to think about how it affects some other moving pieces! |
I'm very happy you liked this change. I find everyone implementing the best mlops practices should use Kedro, but I also think for older projects or for companies that are in the early stages of mlops possibly have a different versioning system than Kedro's. In my case, not only do the old projects depend on the current versioned folder structure but other services connected to this data too. About the motivation, yes, we want to ingest old data generated by projects that didn't use Kedro at that moment. However, not only reading old data is necessary, but we'd like to keep our folder structure the way it is after rebuilding the project with Kedro. The partitioned dataset approach I mentioned in my first comment was made by prefixing partitions with the custom version format, but as I said, this is not a very maintainable approach. About the YYYY-mm-dd, the loss of the unique date format is something we can't avoid if the user chooses to do it. Actually, some of our projects require overwriting existing data. For example, when we want to ratify some monthly prediction results. I've added a validation in About the discussion: Very nice to know this was already discussed! I didn't see the thread before, but I think the problem with the |
Sorry for the delayed response, I missed the notification from GitHub, feel free to @mention again if the thread becomes stale for too long. A related thread #1551 about session_id. From my understanding, the version needs to be unique for two reasons.
@AntonyMilneQB @MerelTheisenQB may have more thought on this? |
Hi @nickolasrm, thanks for raising this PR! This is quite a big feature, so the maintainer team would like to discuss and dedicate time to this properly. It might take a while for us to get to it, but know that we haven't forgotten this 🙂 I will convert this into an issue (#2355) and close the PR to keep our repo up to date. I hope that's okay! |
Description
This PR aims to add more customization for
VersionedDataSet
s. There are three main additions made in this PR, the custom format versioning, the customizable version class, and the partial timestamp parsing.Motivation
Because Kedro can only versionate datasets using a predefined path, the data history structure generated by a previous code that wasn't using Kedro would require to be unnecessarily refactored. Because of that, I tried another approach using
PartitionedDataSets
, but its logic is not only hard to maintain but is syntactically different than Kedro's declarative YAML idea. For this reason, I wrote this PR to help turning this need into a feature.Custom format
The first addition enables the use of format codes in the filepath in order to change the default target path of the versioned file.
The example above dataset would have been translated to
data/01_raw/company/car_data/2022/09/25/car_data.csv
if today's date was2022-09-25
Partial timestamp
In order to simplify loading custom versioned datasets, inputting a not fully filled timestamp has also been implemented.
kedro run --load-version "cars:2022-09-25"
or
This is now a possible way of selecting the load version.
Custom version class
If the custom date format is not enough to implement the versioning logic, then the user can subclass the
Version
class in order to override the default parse and unparse behaviour of the timestamps. For example, let's say you want to represent the day as the Sunday of the week every time you run the code. For that, you could do something like this:Development notes
Version class
Instead of using
Version
as a namedtuple,Version
is now a complete class that helps to parse and to unparse filepaths, becoming the former part ofAbstractDataSet
that processes timestamps into paths. This was developed for enabling the custom version manipulation logic.Kept the original behaviour
The default versioning behaviour was kept using the new auxiliar methods
is_custom_format
andis_unique_date_format
of theAbstractDataSet
UnknownDateTime
This class was implemented because of the mocks ['first', 'second'] in unit tests. I'm not sure if these non-timestamp formats were only designed for testing or if they are actual features. If it is only used for testing, this class and its handling logic in
Version._safe_parse
method can be removed, but the unit tests may need to be changed.Custom
Version
class demands paying attentionEven though a custom
Version
class can be specified, itsparsing
,unparsing
, andglob
methods must be implemented safely in order to not break the internal versioning logic. For instance, the example described before would be considered unique byis_unique_date_format
if it implements all ISO format codes. However, because it has changed the%d
behaviour, it shouldn't be considered unique. There is a workaround for this problem in the docs, but this is something the user has to pay attention. Also, because unparsing is called multiple times inside the code, the pattern can't be easily manipulated. For example, if the user wants the unparse to always add the date at the end of the filepath the user has to be careful in order to not add it multiple times (because of the internal logic). These are some examples of this setting limitation.Unit tests
Wrote unit tests for all
kedro.io.date_time
classes, and their methods aiming to reproduce their caller's expectations present in other parts of the code.Wrote unit tests in
test_data_catalog
for testing new warnings and if the files created by datasets using custom versioning were loading and saving correctly.None of the already present tests were changed in order to make sure the default behaviour was preserved.
Checklist
RELEASE.md
file