Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing leading period from YAML extensions? #990

Open
TheChymera opened this issue Jan 26, 2022 · 10 comments
Open

Removing leading period from YAML extensions? #990

TheChymera opened this issue Jan 26, 2022 · 10 comments
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.

Comments

@TheChymera
Copy link
Collaborator

TheChymera commented Jan 26, 2022

I was wondering whether there is a specific reason for having the extensions recorded with a leading period in the YAML schema data.
Maybe the leading period should be inserted by path constructors based on the procedural information given by the “extension” designation, similarly to how suffixes aren't recorded with a leading underscore?
The “extension” designation is not arbitrary as it stands anyway, since it already defines the position of the string in the path... so I don't think this would make it more fragile in any way.

@tsalo tsalo added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Jan 26, 2022
@tsalo
Copy link
Member

tsalo commented Jan 31, 2022

We included the leading dot because that's how extensions are defined in the common principles and how they're used in pybids.

Here's the description in common principles:

  1. File extension - a portion of the the file name after the left-most period (.) preceded by any other alphanumeric. For example, .gitignore does not have a file extension, but the file extension of test.nii.gz is .nii.gz. Note that the left-most period is included in the file extension.

@effigies
Copy link
Collaborator

Yeah, we went through a period of normalizing different tooling. Reverting now would be pretty annoying to any users of libraries (notably pybids) that started out not including the initial period and went through a FutureWarning/DeprecationWarning process to migrate users and downstream projects.

@TheChymera
Copy link
Collaborator Author

@tsalo @effigies maybe this could still be put a pin in for the next major overhaul that's likely to also break backward-compatibility?

@tsalo
Copy link
Member

tsalo commented Mar 22, 2022

I've come across a solid reason to keep the leading period in #1036. We have some extensions (None, /) that do not have leading periods, so automatically assuming there is one would make these edge cases for us to work around.

@TheChymera
Copy link
Collaborator Author

hmmmm... thanks for the update in any case :)

I mean those have special meanings, and can be easily represented with simple logic — but, are you sure that's not actually a sign of using the concept of “extensions” incorrectly?
What are the edge cases for whether the absence of an extension needs to be noted as the presence of the extension None?
Also, / as an extension sounds like a workaround for not declaring whether a path is actually a directory.

We actually have the / issue in the DANDI extensions for BIDS (.ngff files are in fact directories), and I wanted to ask once I make more progress on the validator and the render tests whether we could encode direcotry-ness in the schema explicitly — / might do the trick, but it sounds like it's overloading the concept of an extension.

yarikoptic added a commit to yarikoptic/pybids that referenced this issue Apr 4, 2022
Primarily to reflect the fact that extensions now include leading . .
Extra references on this subject:
- bids-standard#830 -- about new
  warning from pandas (The frame.append method is deprecate)
- bids-standard/bids-specification#990
  discussion on extensions to include leading .
@yarikoptic
Copy link
Collaborator

I've come across a solid reason to keep the leading period in #1036. We have some extensions (None, /) that do not have leading periods, so automatically assuming there is one would make these edge cases for us to work around.

Well, under assumption that "there MUST NOT be an empty extension" (e.g. just filename.) and the fact that trailing / is a convention, I think it does not really mandate including . within extension, albeit indeed them require more of "special handling in code" while appending to filename. And that is why I agree with observation of @TheChymera that

Also, / as an extension sounds like a workaround for not declaring whether a path is actually a directory.

BUT because it is a matter of the fact that some extension might be a directory or not, and now that there is https://github.com/bids-standard/bids-specification/blob/HEAD/src/schema/objects/extensions.yaml -- I think "extension" notion is not really a "file extension" but rather a "thing" of its own mapping its value to the one in the extensions.yaml where someone might make directory: True; name: mefd (or alike) for a "convention" of .mefd/ to make it "explicit". The question is either it is "worth it" - i.e. should we offload it to tooling code to judge that extension implies directory via .endswith('/') instead of dereferencing into "extensions table" and checking for .directory attribute?

Having trailing / in extension itself also makes it a bit harder for tooling since comparing output of e.g. split_ext - the value should first get rid of trailing /. So making it "simpler" for some usecases might make it harder for others.

We included the leading dot because that's how extensions are defined in the common principles and how they're used in pybids.

well, had to rerun tutorial: bids-standard/pybids#832 since git grep disagreed, but indeed confirming that pybids does that too.

But if you look at result records like

{'acquisition': 'fullbrain',
 'datatype': 'func',
 'extension': '.tsv.gz',
 'run': 1,
 'session': '1',
 'subject': '01',
 'suffix': 'physio',
 'task': 'rest'}

you will see that suffix doesn't come with leading _, as it could have to be consistent with extension to include the leading .. Overall, although consistent with tooling (like os.path.split_ext), inclusion of leading period (.) IMHO should have not been done. But since it was done, now all BIDS tooling should also follow the trend, so I am doomed to adjust BIDSFile of heudiconv (WiP: nipy/heudiconv#544; attn @pvelasco) to do the same for the sake of consistent (across tooling) inconsistency (has leading . whenever e.g. prefix doesn't have leading _, similarly to any other entity).

Having said all that, I don't think there is anything to be done at this point regarding the initial issue/title - i.e. the "leading period". Not yet 100% sure about trailing /. E.g. -- what pybids would return in its extension for .ds/ or future .ngff etc? will it include trailing / ?

@TheChymera
Copy link
Collaborator Author

Here's an update on how the None special value in fact makes the periods useful, but again, I don't think this speaks for the periods, rather against None.

If the list of permitted extensions includes None, even if automatically determining that it means "" and not in fact "None", iterative regex generation would e.g. create ^README\.(|md|rst|txt)$, which will fail to validate README.

Nevertheless even if ^README(|\.md|\.rst|\.txt)$ is considered preferable to ^README(|\.(|md|rst|txt))$ the leading periods can still be added when needed instead of hard-coded in the YAML :)

For the time being I'm going with ^README(|\.md|\.rst|\.txt)$ however :( TheChymera@27bfb55

@effigies
Copy link
Collaborator

If we're seriously considering reverting this change, it would be worth reviewing the original discussion that led to the push for consistency in the current direction: #152

@yarikoptic
Copy link
Collaborator

If we're seriously considering reverting this change, it would be worth reviewing the original discussion that led to the push for consistency in the current direction: #152

I personally think that fortunately or unfortunately we are past that point now since it might be too intrusive to change within 1. series of BIDS, and even for 2. might be a bit too much. So I would vote to keep leading . in extensions. The question which remains and I think, and which we must address is what to do about trailing /?!

https://github.com/bids-standard/bids-specification/blob/master/src/02-common-principles.md#common-principles

  • doesn't mention possibility for / to be a part of a "File extension"
  • we also call it "File extension" there (not just extension)

So, overall we have not defined a notion of "extension" but rather of "File extension". Should we define "Directory extension"? should we generalize into "File or directory name extension" and include / as indicator of being a directory? Will we ever allow for the same extension to be used for a file or directory? (sorry for questions and not suggestions ATM)

@TheChymera
Copy link
Collaborator Author

If we're seriously considering reverting this change

At this point I have temporarily downgraded this to low-priority in my head, but I still think it's a good idea :(

One idea would be:

  • None"" or false value.
  • /directory_extensions parameter

e.g.

README:
  required: true
  directory_extensions:
  - ""
  - .md
  - .rst
  extensions:
  - ""
  - .md
  - .txt

Given how JSON is parsed, ""/false actually evaluates as False (which is correct as it means no extension), whereas None which gets parsed to "None" evaluates as True. I see how encoding everything in the string is a quick shortcut to cover everything and worry about it later (which is what I'm doing atm as well). But I really think this will encourage us to write broken logic with if str.endswith("/") or other such detection methods which will become convoluted and are fragile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema Issues related to the YAML schema representation of the specification. Patch version release.
Projects
None yet
Development

No branches or pull requests

4 participants