
Determine graduation process for contributions #581

Closed
merelcht opened this issue Feb 28, 2024 · 7 comments

@merelcht
Member

Description

Datasets within the experimental contributions folder may evolve and improve over time. Successful and well-maintained contributions can graduate from the experimental folder and move to the regular kedro_datasets space.

We need to establish clear criteria and guidelines for determining how an experimental contribution can graduate.

Task

@merelcht
Member Author

merelcht commented Mar 7, 2024

My suggestion for how the graduation process should look for datasets (this of course depends heavily on #583):

Graduation process for experimental datasets

We should consider graduation of a dataset when:

  • The author opens a PR to graduate the dataset, having changed it to meet all requirements for a regular dataset contribution.
  • User demand: people want it to graduate; in this case, the team might decide to update the dataset to meet all requirements for a regular dataset contribution.

Steps to Graduate an Experimental Dataset:

  1. Review by Dataset Owner (or in special cases someone from the Kedro team):
    • The owner of the experimental dataset should review their dataset to ensure it meets the standards and requirements for regular dataset contributions as listed in the guidelines.
  2. Implement Tests:
    • Add tests to achieve 100% test coverage for the dataset.
    • Make sure these tests run in the CI/CD jobs so the dataset is continuously tested on every change.
  3. Update Docstrings:
    • Ensure that all docstrings are informative and provide clear explanations on how to use the dataset.
    • Test that the docstrings pass doctests, unless complex cloud/DB setups are involved (see the sketch after this list).
  4. Compatibility Check:
    • Verify that the dataset can be used with all Python versions supported by kedro-datasets (per NEP 29).
    • Confirm compatibility with both Windows and Linux environments.
  5. Dependency Stability:
    • Check that the dataset uses stable dependencies that are not expected to be discontinued anytime soon.
  6. Submission for Review:
    • Once all the above steps are completed, the dataset owner can submit a request or proposal to the Kedro team for review.
  7. Kedro Team Review:
    • The Kedro team will review the dataset to ensure it meets all the requirements for regular dataset contributions.
    • Any feedback or necessary changes will be communicated to the dataset owner.
    • As part of the review, the team should also consider whether we feel confident maintaining the dataset.
  8. Graduation:
    • If the dataset meets all the criteria, it will be graduated from an experimental dataset to a regular dataset contribution.
    • The Kedro team will take over maintenance responsibilities for the dataset.
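
To make steps 2 and 3 a bit more concrete, here is a minimal, hypothetical sketch of a doctest-friendly docstring. The class is a simplified in-memory stand-in (a real contribution would subclass `kedro.io.AbstractDataset`), and the pytest flags mentioned in the comments assume `pytest-cov` is installed; all names here are illustrative rather than prescriptive.

```python
# Illustrative sketch only: a real contribution would subclass
# kedro.io.AbstractDataset; the class below is a simplified stand-in.
#
# Running something like
#   pytest --doctest-modules --cov=<your_dataset_package> --cov-fail-under=100
# (with pytest-cov installed) exercises the doctest below (step 3) and
# enforces the coverage requirement (step 2) in a single CI job.


class InMemoryUppercaseDataset:
    """Stores a mapping and upper-cases its string values on load.

    Example:
        >>> dataset = InMemoryUppercaseDataset()
        >>> dataset.save({"greeting": "hello"})
        >>> dataset.load()
        {'greeting': 'HELLO'}
    """

    def __init__(self) -> None:
        self._data: dict = {}

    def save(self, data: dict) -> None:
        # Keep a copy so later mutation of the caller's dict has no effect.
        self._data = dict(data)

    def load(self) -> dict:
        return {
            key: value.upper() if isinstance(value, str) else value
            for key, value in self._data.items()
        }
```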

@datajoely
Contributor

Whilst I think this is a perfectly well-defined process, I worry the governance overhead is higher than that of the reactive pattern we adopt today. I would like some clarity on the actual problem we're solving:

  1. Is it discoverability?
  2. Is it providing a 1st party stamp of approval?
  3. Is it maintenance overhead?
  4. Is it monorepo dependency hell?

My worry is that I've seen well-intentioned codified processes fail to sustain themselves the minute priorities change or people move teams. My gut feeling is that this falls into that category.

@noklam
Contributor

noklam commented Mar 12, 2024

From my understanding, the goal is to attract more contributions while not lowering the standard of the 1st party supported datasets.

Who will be responsible for the graduation process: the dataset owner or the core Kedro team? I also think we need to do a better job of documenting "how" to achieve all of the above, regardless of the graduation process.

The goal is that someone who wants to contribute a Polars dataset shouldn't have to worry about fixing a mypy issue they don't understand, or a random RTD or linting failure. Ideally, they should be able to run tests locally without relying on the CI. They also shouldn't have to worry about fixing dependency hell issues (or at least the Kedro core team would help solve these; the dataset owner should only need to care about their own dataset).

@yetudada
Contributor

Additionally, usage should be an important factor in this. I guess this links to the vision of counting dataset usage via Kedro-Telemetry.

@deepyaman
Member

Largely happy with this!

> Verify that the dataset can be used with all versions of Python 3.9 and above.

Nit: reword to "all versions supported by Kedro/under NEP 29".

> Confirm compatibility with both Windows and Linux environments.

As mentioned in #583, on a case-by-case basis I can understand not having support for Windows. But this is the exception rather than the rule, and should also be noted clearly.

> Check that the dataset uses stable dependencies that are not expected to be discontinued anytime soon.

Also, dependencies should be relaxed as much as possible, to avoid conflicts with other core datasets and a poor user experience (a tiny sketch of what "relaxed" means follows at the end of this comment). :)

> The Kedro team will take over maintenance responsibilities for the dataset.

We can help maintain, but we should encourage the author to stay involved as much as possible!
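
To illustrate what "relaxed" could look like in practice, here is a tiny, purely hypothetical sketch; the package name and version bounds are made up, and real constraints would live in the project's packaging metadata rather than in Python code:

```python
# Hypothetical illustration only: the package name and versions below are
# made up, and real constraints belong in the project's packaging metadata.

# A hard pin is easy to write, but it conflicts with any other core dataset
# that needs a different release of the same library:
TIGHT_REQUIREMENTS = ["somebackend==2.3.1"]

# A relaxed range keeps the dataset installable alongside the rest of
# kedro-datasets while still excluding known-incompatible major versions:
RELAXED_REQUIREMENTS = ["somebackend>=2.3, <3.0"]
```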

@ElenaKhaustova
Contributor

The suggested steps seem clear and perfectly defined!

One thought that came to my mind is how we will motivate contributors of experimental datasets to update them for graduation. If we give them an instrument that simplifies contributing, it can be hard to push for further updates, since their goal has already been reached with less effort. At the same time, we want to avoid a situation where experimental datasets keep growing while regular datasets are never updated. So we should probably stay aware of what other users utilise (via telemetry or pip) and take graduation on ourselves if we decide a dataset is in demand.

@merelcht
Member Author

merelcht commented Apr 4, 2024

Closing this in favour of continuing the discussion in #583

The bottom line is that the following makes up the graduation process:

  1. Anyone (TSC members + users) can trigger a graduation process
  2. We need approval from at least half of the TSC to initiate a review and merge into the regular datasets space
  3. A dataset can only graduate when it meets all requirements of a regular dataset

@merelcht merelcht closed this as completed Apr 4, 2024