Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Library should use either "data set" or "dataset" consistently #533

Closed
deepyaman opened this issue Oct 2, 2020 · 10 comments
Closed

Library should use either "data set" or "dataset" consistently #533

deepyaman opened this issue Oct 2, 2020 · 10 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@deepyaman
Copy link
Member

Description

I'm always frustrated when I want to create a new data set... or is it dataset?

Context

  • As a contributor, I do my best to align to the prevailing style/convention.
  • As a user, I try to mimic Kedro convention when I create my own datasets.

In both cases, it bothers/confuses me to see inconsistencies. Maybe it's just me. 🤷‍♂️

List of Inconsistencies

  • All datasets are named SomeDataSet (data set as two words), but the module name is almost always some_dataset.py (dataset as one word).
    • Notable exceptions include the kedro.io module, where everything is some_data_set.py.
      • Except cached_dataset.py...
      • Tests for this module are another thing altogether, where half use the split form and the other half combined, not even aligned with the the module being tested (e.g. partitioned_data_set.py and test_partitioned_dataset.py).
  • I'm (personally) less concerned by the fact. that catalog.datasets is a single word, as it could be argued that it's done for convenient reasons. Similarly, extras.datasets makes sense as a single word, since it's more aligned with Python package naming convention.
  • I'm sure there are inconsistencies in the docs here, too, but I'm less concerned than for the actual codebase.

Much of this is trivial to address, although you do need to update some tests that rely on module name (e.g. test_partitioned_dataset.py).

@deepyaman deepyaman added good first issue Issue: Feature Request New feature or improvement to existing feature labels Oct 2, 2020
@lorenabalan
Copy link
Contributor

I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

@deepyaman
Copy link
Member Author

I vaguely remember there was a discussion on this before we open-sourced, @stichbury any thoughts on "dataset" vs "data set"?

I feel like "dataset" has been adopted by a lot of major players. A few top hits off a search for data set:

Definitions often list both forms:

Here's some data (biased towards books, FWIW): https://books.google.com/ngrams/graph?content=data+set%2Cdataset&year_start=1800&year_end=2019&corpus=26&smoothing=3&case_insensitive=true

Some more, driven by the all-knowing Twitterverse: https://twitter.com/randal_olson/status/824702008007557121?lang=en

However, a consideration against switching to "dataset" is that it messes with all of the classes. You could deprecate the two-word form now and remove it in 0.17, but I'm sure it'll cause some pain. Granted, that pain would be incurred if you ever wanted to make that switch. For me, I'm happy as long as it's (mostly) consistent (i.e. OK with the catalog.datasets and BlahDataSet mismatch, since I can justify it in my head above).

@stichbury
Copy link
Contributor

Yep @lorenabalan @deepyaman It is indeed a dataset :) if you follow our docs style guide.

https://github.com/quantumblacklabs/private-kedro/blob/master/docs/README.md#kedro-lexicon

Use dataset (not data set, or data-set) for a generic dataset.
Use capitalised DataSet when talking about a specific Kedro dataset class e.g. CSVDataSet.
Use data catalog for a generic data catalog.
Use Data Catalog to talk about the Kedro Data Catalog

I think we are pretty much consistent in the docs in using the single word, no hyphenation. I've no opinion of the choice used for classes and in code TBH since I don't believe the code and docs have to follow each other exactly. Consistency is key though.

@yetudada
Copy link
Contributor

I'm just following up, is it okay to close this issue?

@stichbury
Copy link
Contributor

💯 from me

@deepyaman
Copy link
Member Author

I'm just following up, is it okay to close this issue?

@yetudada I don't think it's resolved in the code. At a minimum, can we change:

  • lamda_data_set.py -> lamda_dataset.py
  • memory_data_set.py -> memory_dataset.py
  • partitioned_data_set.py -> partitioned_dataset.py
  • test_lambda_data_set.py -> test_lambda_dataset.py
  • test_memory_data_set.py -> test_memory_dataset.py

This way, at least module naming is consistent/people don't have to question which form to use. I'm happy to do this, given a green light.

I assume we're not good to change SparkDataSet -> SparkDataset, etc., through the codebase lol? I think it would eventually align with the lexicon that way, but it would need to be deprecated in this major release and removed in the next one.

@merelcht
Copy link
Member

Hi @deepyaman, yes that sounds good. We're very happy to accept a PR from you for this 😄

@stale
Copy link

stale bot commented Jul 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@astrojuanlu
Copy link
Member

astrojuanlu commented Jun 26, 2023

Spotted in kedro-org/kedro-starters#137:

f"is invalid: all data set types must extend 'AbstractDataSet'."

@astrojuanlu
Copy link
Member

We've decided to use "dataset" in prose, and FooDataset in class names in this issue.

This was partially achieved in https://github.com/quantumblacklabs/private-kedro/pull/1211 (private), and then a series of other PRs recently, including gh-2500, gh-2673, gh-2724, gh-2735, mostly by @deepyaman.

@noklam collected some context in gh-2740, let's continue the conversation there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

No branches or pull requests

7 participants