Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACP: Pattern methods for OsStr without OsStr patterns #311

Open
epage opened this issue Dec 7, 2023 · 17 comments
Open

ACP: Pattern methods for OsStr without OsStr patterns #311

epage opened this issue Dec 7, 2023 · 17 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@epage
Copy link

epage commented Dec 7, 2023

Proposal

Problem statement

With rust-lang/rust#115443, developers, like those writing CLI parsers, can now perform (limited) operations on OsStr but it requires unsafe to get an OsStr back, requiring the developer to understand and follow some very specific safety notes that cannot be checked by miri.

RFC #2295 exists for improving this but its been stalled out. The assumption here is that part of the problem with that RFC is how wide its scope is and that by shrinking the scope, we can get some benefits now.

Motivating examples or use cases

Mostly copied from #306

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

Provide strs Pattern-accepting methods on &OsStr.

Defer out OsStr being used as a Pattern and OsStr indexing support which are specified in RFC #2295.

Example of methods to be added:

impl OsStr {
    pub fn contains<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn starts_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn ends_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn find<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>;

    pub fn rfind<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    // (Note: these should return a concrete iterator type instead of `impl Trait`.
    //  For ease of explanation the concrete type is not listed here.)
    pub fn split<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn split_inclusive<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn splitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn rsplit_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn matches<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rmatches<'a, P>(&self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn match_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>;

    pub fn rmatch_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn trim_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: DoubleEndedSearcher<&'a Self>;

    pub fn trim_start_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>;

    pub fn strip_prefix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn strip_suffix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn trim_end_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn replace<'a, P>(&'a self, from: P, to: &'a Self) -> Self::Owned
    where
        P: Pattern<&'a Self>;

    pub fn replacen<'a, P>(&'a self, from: P, to: &'a Self, count: usize) -> Self::Owned
    where
        P: Pattern<&'a Self>;
}

impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for &String {}
impl Pattern<&OsStr> for &[char] {}
impl Pattern<&OsStr> for &&str {}
impl<const N: usize> Pattern<&OsStr> for &[char; N] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl<const N: usize> Pattern<&OsStr> for [char; N] {}
  • This is meant to match str and if there are any changes between the writing of this ACP and implementation, the focus should be on what str has at the time of implementation (e.g. not adding a deprecated variant but the new one)
  • We likely want to add trim, trim_start, and trim_end to be consistent with trim_start_matches / trim_end_matches
  • for more details, see Add pattern matching API to OsStr rust#109350

This should work because

From an API design perspective, there is strong precedence for it

  • Its copying methods over from str
  • The design is a subset of RFC #2295 (approved) and RFC #1309 (postponed)
    • By deferring support for OsStr as a pattern, we bypass the main dividing point between proposals (split APIs, panic on unpaired surrogates, switching away from WTF-8)

Alternatives

#306 proposes a OsStr::slice_encoded_bytes

  • Still requires writing higher level operations on top, but at least its without unsafe
  • Either takes a performance hit to be consistent across platforms or has per-platform caveats that will be similarly hard to get right for less common platforms among developers (e.g. Windows)
  • As far as I can tell, there isn't precedence for an API design like this meaning more new ground has to be set (naming, deciding the above preconditions, etc)

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@epage epage added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Dec 7, 2023
@joshtriplett
Copy link
Member

This seems like a good idea to me. Putting these methods on OsStr will allow code to do simple parsing/splitting/etc in safe code. And because this does not add OsStr to Pattern, it remains a simple addition to API surface without any representation changes.

@pitaj
Copy link

pitaj commented Dec 8, 2023

What implements Pattern<&OsStr>?

@epage
Copy link
Author

epage commented Dec 8, 2023

I've updated the proposal to call that out

impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &[char] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl Pattern<&OsStr> for &&str {}
impl Pattern<&OsStr> for &String {}

Basically, this is a direct mirror of &strs functionality. I don't think I've seen this called out anywhere in the previous RFCs but my assumption is non-UTF8 byte sequences in OsStr are just considered non-matches.

@pitaj
Copy link

pitaj commented Dec 8, 2023

impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}

IIRC, this is currently forbidden due to coherence rules.

@joshtriplett
Copy link
Member

joshtriplett commented Dec 12, 2023

@pitaj Presumably we'd do this the same way we already did for Pattern.

@pitaj
Copy link

pitaj commented Dec 12, 2023

AFAIK all of today's Patterns are defined in core, but OsStr lives in std. That's where the coherence issues come in.

@joshtriplett
Copy link
Member

@pitaj Ah, I see. I think we have a mechanism that allows us to have impls of core traits in std without regard for the orphan rule, which would address that.

@pitaj
Copy link

pitaj commented Dec 12, 2023

Last I checked, there's a mechanism for inherent impls, but not for traits. That said, it's probably something that could be added.

@programmerjake
Copy link
Member

we could put just the OsStr struct in core as an implementation detail, along with necessary trait impls, and put the rest in std with a re-export of OsStr...this would allow implementing:

impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}

@m-ou-se
Copy link
Member

m-ou-se commented Dec 19, 2023

We briefly discussed this in last week's libs-api meeting. While we agree it'd be good for OsStr to have a more complete API, we're worried about the amount of string types: should CStr and [ascii::Char] have the same api, for example?

It'd be good to first explore solutions that could benefit all string types, before continuing with this proposal to extend OsStr itself. For example, could a trait or another mechanism be used to make this api availabel to all string types?

@epage
Copy link
Author

epage commented Dec 19, 2023

So if I understand correctly, the desire is to explore the design of a Pattern Extension trait with methods like contains, starts_with, etc that can apply to all string-like types?

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString. To be clear, that is expected and part of the design requirements for this?

@programmerjake
Copy link
Member

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString.

i think you meant strip_suffix? strip_prefix should be able to return &CStr

@epage
Copy link
Author

epage commented Dec 20, 2023

Good point, and figuring out how we should handle strip_prefix vs all other slicing operations adds another area for design exploration and discussions. Do we have all slicing operations treated the same, having CStr always return an owned type? Or do we hard code into the trait that only leading/middle/arbitrary content might be owned and trailing content is always borrowed.

For example, CStr::split_once could return either (Owned, &Borrowed) or (Owned, Owned) while split would always return impl Iterator<Item=Owned> (or maybe Cow?).

@programmerjake
Copy link
Member

maybe CStr::split_once could return something more like (&[u8], &CStr)?

@blyxxyz
Copy link

blyxxyz commented Dec 20, 2023

CStr operations could return &[u8] and let the caller decide whether to turn it into CString (at the cost of a null byte check). Or they could mutate the original string to insert null bytes, strtok-style (probably a bad idea). Or there could be a CStrSlice type that has no terminator but is guaranteed to contain no null bytes...

Should function arguments be full-blown CStrs? It's important that they not have internal null bytes but for e.g. a search pattern there's no technical reason for the terminator, and in some cases that could require an allocation. (The recently stabilized C string literals would help with this at least.)

But mainly: Should CStr have this API at all? My experience is very limited but it does include a case where raw C string pointers were passed around arbitrarily when they should have been converted ASAP after the function call. How big of a use case is there for manipulating these instead of converting them immediately at the FFI boundary? Unlike OS strings they're just bytes with a validity requirement, so they're easier to convert.

Should [u8] get this API? Cstr would then get lossless access through to_bytes().

@programmerjake
Copy link
Member

wouldn't CStrSlice just be [NonZeroU8] (except that's kinda hard to work with currently due to lacking lots of byte string methods)

@tgross35
Copy link

tgross35 commented Jan 9, 2024

I like the trait idea, especially since it will allow writing combinators that work for any string type and being able to reuse them on both CStrs from FFI and &strs from users.

Maybe a trait could look something like this (ignoring lifetimes):

trait SliceLikePattern: ToOwned {
    // Yes, we don't have associated type defaults...

    /// Result of splitting items
    type Partial = Self;

    /// Rightmost result of split items if different than `Partial`, e.g. for `CStr`
    type PartialRight = Self::Partial;

    /// Pattern type used when a single element will be extracted. `u8` for `&[u8]`,
    /// `str::pattern::Pattern` for str, maybe `u8` or `u16` for `OsStr`
    /// Or maybe  `FnMut(&u8) -> bool` for slices, as in `split_once`
    type ItemPattern;

    /// Pattern type used when a partial (slice) is expected, `&[u8]` for `&[u8]`
    /// still `str::pattern::Pattern` for `str`
    type PartialPattern;

    /// PartialPattern but if there is a specific right-first search
    /// e.g. str's `<P as Pattern<'a>>::Searcher: ReverseSearcher<'a>`
    type PartialPatternReverse = Self::PartialPattern;
    
    fn split_at(&self, mid: usize) -> (&Self::Partial, &Self::PartialRight);
    fn split_at_mut(&self, mid: usize) -> (&mut Self::Partial, &mut Self::PartialRight);
    fn contains<P: Self::ItemPattern>(&self, pat: P) -> bool;
    fn starts_with<P: Self::PartialPattern>(&self, pat: P) -> bool;
    fn ends_with<P: Self::PartialPatternReverse>(&self, pat: P) -> bool;
    fn find<P: Self::PartialPattern>(&self, pat: P) -> Option<usize>;
    fn rfind<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<usize>;
    fn split<P: Self::PartialPattern>(&self, pat: P) -> Split<P>;
    // ... similar variants of iterating splits and matches
    fn split_once<P: Self::ItemPattern>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    fn rsplit_once<P: Self::ItemPatternReverse>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    // I don't think we can do simple `trim_{start, end}` here or anything else that
    // relies on whitespace knowledge
    fn trim_start_matches<P: Self::PartialPattern>(&self, pat: P) -> &Self::PartialRight;
    fn trim_end_matches<P: Self::PartialPatternReverse>(&self, pat: P) -> &Self::Partial;
    fn strip_prefix<P: Self::PartialPattern>(&self, pat: P) -> Option<&Self::PartialRight>;
    fn strip_suffix<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<&Self::Partial>;
    fn replace<P: Self::PartialPattern>(&'a self, from: P, to: &Self::PartialRight) -> <Self as ToOwned>::ToOwned;
    fn repeat<P: Self::PartialPattern>(&'a self, from: P, repeat: usize) -> <Self as ToOwned>::ToOwned;
}

There probably isn't anything that restricts this to string-like types, I could see a lot of this being beneficial to let this apply to anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

7 participants