Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a way to disable sequence matching #212

Closed
carmenbianca opened this issue Apr 28, 2024 · 6 comments
Closed

Add a way to disable sequence matching #212

carmenbianca opened this issue Apr 28, 2024 · 6 comments
Labels
S: wontfix The issue will not be fixed for the stated reasons. T: feature Feature.

Comments

@carmenbianca
Copy link

carmenbianca commented Apr 28, 2024

I'm writing a specification that needs a very, very, very simple globbing of paths, and I am writing a tool that implements the specification. The globbing needs to:

  • Always use POSIX paths.
  • Match zero or more characters with *, including . at the start of a file, excluding path separators.
  • Match zero or more characters with **, including . at the start of a file, including path separators.
  • Escape characters with \.

Using DOTGLOB | FORCEUNIX | GLOBSTAR in wcmatch, I am like 99% of the way there, which is great! Less work for me, and this library looks amazing.

However, wcmatch comes with two more features that I do not need:

  • ? matches a single character.
  • [seq] and [!seq] matches any character in the sequence.

I am not well bothered by ?—I do not need it, but I can add it to the specification without much fuss. Being able to turn this off would be nice, but it is not needed.

[seq] is far too feature-rich for me to use, however. I do not want to include this in the specification. But if I do not include this in the specification, then my tool must subsequently not implement this. I have considered using a pre-parser to escape all square brackets, but I figured an upstream-first approach would be helpful.

I would love a NOSEQ flag that disables the behaviour I do not need.


Additional context:

This is being discussed in this thread in a PR. Copying my available options from that thread:

  1. Open an issue against wcmatch and plead them to include an option to turn this behaviour off. Or do this ourselves—I can write Python!
  2. Do a bad job of documenting this. See: the referenced code.
  3. Don't document it at all. This isn't great.
  4. Do a good job of documenting this. But: that's a loooot of words spent on a feature we don't even want.
  5. Make an attempt at turning this behaviour off downstream. Whenever we see [ or ], escape that character internally in the REUSE tool, and the seq behaviour will never be triggered. Subsequently, we needn't document this. But: this may have unintended consequences, maybe. I'm not sure.

The relevant bit of the specification is fairly simple. There is a REUSE.toml configuration file that matches path expressions against existing files in a project, and applies metadata to those matched files. For example:

[[annotations]]
path = "**/*.md"
SPDX-License-Identifier = "CC-BY-4.0"

This matches all Markdown files and asserts that they are all licensed under CC-BY-4.0. Super simple, easy peasy, no fancy features needed.

@gir-bot gir-bot added the S: triage Issue needs triage. label Apr 28, 2024
@facelessuser
Copy link
Owner

I am hesitant to implement such behavior. I can understand that some users may have some niche needs, but considering this project aims for Bash-style globbing, the bare minimum requirements are *, **, ?, and [...]. While I can understand that your specific project may desire this reduced functionality, it is most likely in a very small minority. The work required to reduce such functionality is not strongly supported by need from the community at large.

I would also argue that these features don't need a lot of documentation and can be explained using at minimum what is listed below. I state this not to convince you to use these features, but just as an illustration. Certainly, these features could be delved into with greater depth and with examples (I know in my documentation that I can be a bit too wordy). I also understand that these stated features may still not be desired, regardless of how easy it is to explain or not explain the features.

Supported Syntax

Pattern Meaning
* Match zero or more characters, including . at the start of a file, excluding path separators.
** Match zero or more characters, including . at the start of a file, including path separators.
? Matches any single character.
[abc] Matches one character in the bracket .
[a-z] Matches one character from the range given in the bracket
[[:alnum:]] Matches one character within the specified character class in the form [:name:].
[!...] Matches one character not in the bracket, bracket range, or bracket character class . This can be applied via [!abc], [!a-z], and [![:alnum:]]. Character exclusions will also be accepted in the form of [^abc], [^a-z], and [^[:alnum:]].
\ Escapes the character immediately following the backslash. Useful for specifically escaping meta characters with special meaning such as \? or \*.

Supported character class names are found below.

Property Pattern
alnum [a-zA-Z0-9]
alpha [a-zA-Z]
ascii [\x00-\x7F]
blank [ \t]
cntrl [\x00-\x1F\x7F]
digit [0-9]
graph [\x21-\x7E]
lower [a-z]
print [\x20-\x7E]
punct [!\"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{}~]
space [ \t\r\n\v\f]
upper [A-Z]
word [a-zA-Z0-9_]
xdigit [A-Fa-f0-9]

@carmenbianca
Copy link
Author

carmenbianca commented Apr 28, 2024

I won't argue with you that you should adopt this feature if this is not in-scope for the project. I'd like the feature to be adopted, but that is neither here nor there. I understand keeping a strict leash on scope.

Regarding the amount of documentation not being a lot: I don't necessarily disagree, but that's a lot more words than I want to be spending on describing how globs work in a specification that doesn't care for the feature.

Because you're a field expert, though, I'd love your input—if wcmatch does not adopt this feature, how would you recommend I solve my problem? I have a few somewhat-satisfying solutions in mind:

  • Do the aforementioned pre-parsing: escape the square brackets.
  • Implement my own glob-matching mini-library.
  • Use fnmatch stdlib → not viable because it handles paths awkwardly, and doesn't understand the difference between ** and *. (edit: and in fact also includes [seq])

@carmenbianca
Copy link
Author

Well, that was embarrassingly easy. I've made this library unnecessary with a few lines of code:

def translate(path):
    path = re.escape(path)
    path = path.replace(r"\*\*/", r".*")
    path = path.replace(r"\*\*", r".*")
    path = path.replace(r"\*", r"[^/]*")
    return f"^({path})$"

This regex matches everything I want it to match, and nothing I don't want it to. It's a bit naïve, but it 100% answers my needs.

Thank you for making wcmatch. This library was a great inspiration, and helped a lot during development.

@facelessuser facelessuser added S: wontfix The issue will not be fixed for the stated reasons. T: feature Feature. and removed S: triage Issue needs triage. labels Apr 28, 2024
@facelessuser
Copy link
Owner

Since you are happy with your solution, I will close this issue.

@carmenbianca
Copy link
Author

Turned out to be a touch more complicated with escaping. Resulting code:

def translate(path: str) -> str:
    blocks = []
    escaping = False
    globstar = False
    prev_char = ""
    for char in path:
        if char == "\\":
            if prev_char == "\\" and escaping:
                escaping = False
                blocks.append(r"\\")
            else:
                escaping = True
        elif char == "*":
            if escaping:
                blocks.append(re.escape("*"))
                escaping = False
            elif prev_char == "*" and not globstar:
                globstar = True
                blocks.append(r".*")
        elif char == "/":
            if globstar:
                pass
            else:
                globstar = False
                blocks.append("/")
            escaping = False
        else:
            if prev_char == "*" and not globstar:
                blocks.append(r"[^/]*")
            blocks.append(re.escape(char))
            globstar = False
            escaping = False
        prev_char = char
    result = "".join(blocks)
    return f"^({result})$"

With tests and everything it works.

@facelessuser
Copy link
Owner

There is quite a bit of nuance in globbing, both in pattern recognition and in generating the regular expression that always behaves as intended, especially cross-platform. If your needs are simple, it is likely your solution can be simple. Most likely any issues will only arise when stressed with relevant use cases. I have no idea of the extent of your use case so I cannot comment further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S: wontfix The issue will not be fixed for the stated reasons. T: feature Feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants