
Determine graduation process for contributions #581

Closed
merelcht opened this issue Feb 28, 2024 · 7 comments

@merelcht
Member

Description

Datasets within the experimental contributions folder may evolve and improve over time. Successful and well-maintained contributions can graduate from the experimental folder and move to the regular kedro_datasets space.

We need to establish clear criteria and guidelines for determining how an experimental contribution can graduate.

Task

@merelcht
Member Author

merelcht commented Mar 7, 2024

My suggestion for how the graduation process should look for datasets (this of course depends heavily on #583):

Graduation process for experimental datasets

We should consider graduation of a dataset when:

  • The author opens a PR to graduate the dataset, having changed it to meet all requirements for a regular dataset contribution.
  • User demand: people want it to graduate; in this case, the team might decide to update the dataset to meet all requirements for a regular dataset contribution.

Steps to Graduate an Experimental Dataset:

  1. Review by Dataset Owner (or in special cases someone from the Kedro team):
    • The owner of the experimental dataset should review their dataset to ensure it meets the standards and requirements for regular dataset contributions as listed in the guidelines.
  2. Implement Tests:
    • Add tests to achieve 100% test coverage for the dataset.
    • Make sure these tests run in the CI/CD jobs so the dataset is continuously tested on every change.
  3. Update Docstrings:
    • Ensure that all docstrings are informative and provide clear explanations on how to use the dataset.
    • Test that the docstrings pass doctests, unless complex cloud/DB setups are involved (see the sketch after this list).
  4. Compatibility Check:
    • Verify that the dataset can be used with all Python versions supported by kedro-datasets (per NEP 29).
    • Confirm compatibility with both Windows and Linux environments.
  5. Dependency Stability:
    • Check that the dataset uses stable dependencies that are not expected to be discontinued anytime soon.
  6. Submission for Review:
    • Once all the above steps are completed, the dataset owner can submit a request or proposal to the Kedro team for review.
  7. Kedro Team Review:
    • The Kedro team will review the dataset to ensure it meets all the requirements for regular dataset contributions.
    • Any feedback or necessary changes will be communicated to the dataset owner.
    • As part of the review, the team should also consider whether we feel confident maintaining the dataset.
  8. Graduation:
    • If the dataset meets all the criteria, it will be graduated from an experimental dataset to a regular dataset contribution.
    • The Kedro team will take over maintenance responsibilities for the dataset.
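
To make steps 2 and 3 a bit more concrete, here is a minimal, hypothetical sketch of a doctest-friendly docstring. The class is a simplified in-memory stand-in (a real contribution would subclass `kedro.io.AbstractDataset`), and the pytest flags mentioned in the comments assume `pytest-cov` is installed; all names here are illustrative rather than prescriptive.

```python
# Illustrative sketch only: a real contribution would subclass
# kedro.io.AbstractDataset; the class below is a simplified stand-in.
#
# Running something like
#   pytest --doctest-modules --cov=<your_dataset_package> --cov-fail-under=100
# (with pytest-cov installed) exercises the doctest below (step 3) and
# enforces the coverage requirement (step 2) in a single CI job.


class InMemoryUppercaseDataset:
    """Stores a mapping and upper-cases its string values on load.

    Example:
        >>> dataset = InMemoryUppercaseDataset()
        >>> dataset.save({"greeting": "hello"})
        >>> dataset.load()
        {'greeting': 'HELLO'}
    """

    def __init__(self) -> None:
        self._data: dict = {}

    def save(self, data: dict) -> None:
        # Keep a copy so later mutation of the caller's dict has no effect.
        self._data = dict(data)

    def load(self) -> dict:
        return {
            key: value.upper() if isinstance(value, str) else value
            for key, value in self._data.items()
        }
```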

@datajoely
Contributor

Whilst I think this is a perfectly well-defined process, I worry the governance overhead is higher than that of the reactive pattern we adopt today. I would like some clarity on the actual problem we're solving:

  1. Is it discoverability?
  2. Is it providing a 1st party stamp of approval?
  3. Is it maintenance overhead?
  4. Is it monorepo dependency hell?

My worry is that I've seen well-intentioned codified processes fail to sustain themselves the minute priorities change or people move teams. My gut feeling is that this falls into that category.

@noklam
Contributor

noklam commented Mar 12, 2024

From my understanding, the goal is to attract more contributions while not lowering the standard of the 1st party supported datasets.

Who will be responsible for the graduation process: the dataset owner or the core Kedro team? I also think we need to do a better job of documenting "how" to achieve all of the above, regardless of the graduation process.

The goal is that someone who wants to contribute a Polars dataset shouldn't have to worry about fixing a mypy issue they don't understand, or a random RTD or linting failure. Ideally, they should be able to run tests locally without relying on the CI. They also shouldn't have to worry about fixing dependency hell issues (or at least the Kedro core team would help solve these; the dataset owner should only need to care about their own dataset).

@yetudada
Contributor

Additionally, usage should be an important factor in this. I guess this links to the vision of counting dataset usage via Kedro-Telemetry.

@deepyaman
Member

Largely happy with this!

> Verify that the dataset can be used with all versions of Python 3.9 and above.

Nit: reword to "all versions supported by Kedro/under NEP 29".

> Confirm compatibility with both Windows and Linux environments.

As mentioned in #583, on a case-by-case basis I can understand not having support for Windows. But this is the exception rather than the rule, and should also be noted clearly.

> Check that the dataset uses stable dependencies that are not expected to be discontinued anytime soon.

Also, dependencies should be relaxed as much as possible, to avoid conflicts with other core datasets and a poor user experience (a tiny sketch of what "relaxed" means follows at the end of this comment). :)

> The Kedro team will take over maintenance responsibilities for the dataset.

We can help maintain, but we should encourage the author to stay involved as much as possible!
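
To illustrate what "relaxed" could look like in practice, here is a tiny, purely hypothetical sketch; the package name and version bounds are made up, and real constraints would live in the project's packaging metadata rather than in Python code:

```python
# Hypothetical illustration only: the package name and versions below are
# made up, and real constraints belong in the project's packaging metadata.

# A hard pin is easy to write, but it conflicts with any other core dataset
# that needs a different release of the same library:
TIGHT_REQUIREMENTS = ["somebackend==2.3.1"]

# A relaxed range keeps the dataset installable alongside the rest of
# kedro-datasets while still excluding known-incompatible major versions:
RELAXED_REQUIREMENTS = ["somebackend>=2.3, <3.0"]
```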

@ElenaKhaustova
Contributor

The suggested steps seem clear and perfectly defined!

One thought that came to my mind is how we will motivate contributors of experimental datasets to update them for graduation. If we give them an instrument that simplifies contributing, it can be hard to push for further updates, since their goal has already been reached with less effort. At the same time, we want to avoid a situation where experimental datasets keep growing while regular datasets are never updated. So we should probably stay aware of what other users utilise (via telemetry or pip) and take graduation on ourselves if we decide a dataset is in demand.

@merelcht
Member Author

merelcht commented Apr 4, 2024

Closing this in favour of continuing the discussion in #583

The bottom line is that the following makes up the graduation process:

  1. Anyone (TSC members + users) can trigger a graduation process
  2. We need approval from at least half of the TSC to initiate a review and merge into the regular datasets space
  3. A dataset can only graduate when it meets all requirements of a regular dataset

@merelcht merelcht closed this as completed Apr 4, 2024