Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some additional utility methods to OsString and OsStr #1307

Merged
merged 1 commit into from
Oct 29, 2015

Conversation

eefriedman
Copy link
Contributor

@wthrowe
Copy link

wthrowe commented Oct 5, 2015

I've actually been thinking about OsStr(ing) interfaces recently and was thinking about writing something up, but hadn't gotten around to it yet. In terms of this proposal, is_empty and clear seem obviously reasonable, and I don't think there could be any doubt about what they do. from_string is also fine, although OsString already implements From<String>, which does the same thing.

It isn't clear to me what the length and capacity related methods would do, though. The internal representation of OsString on Windows is not very meaningful and most of the details of it are intentionally not exposed by the interface. Would these methods give values related to the internal WTF-8 representation (which should be useless, because all consequences of that encoding should be hidden), or of the system ill-formed UTF-16 representation (which would require some kind of conversion)?

@eefriedman
Copy link
Contributor Author

It isn't clear to me what the length and capacity related methods would do, though.

len() and capacity() are defined in terms of the number of bytes used internally. That has uses even without being able to index into an OsString. len() and with_capacity() can be used alongside OsString::push to avoid unnecessary reallocations. capacity() can be used as a rough estimate of heap usage. Probably some other uses I'm not thinking of off the top of my head.

Granted, they don't really have any semantic meaning, but they still have uses.

Actually, come to think of it I forgot to add reserve() and reserve_exact().

@wthrowe
Copy link

wthrowe commented Oct 5, 2015

Good points. I can see why we'd want those, now.

So I think I now like everything here except from_string, because the conversion functions related to OsString are already a mess and I don't think we want to add yet another nonstandard one that duplicates already existing functionality. (It doesn't look like anything currently exposed by the standard library has a from_string method.)

@nagisa
Copy link
Member

nagisa commented Oct 5, 2015

We usually add specific conversion methods for conversions that are expected to be done commonly (e.g. str → String). I don’t see people using OsString oftenly, let alone its conversion methods. I’d vote for leaving from_string out here as well.

len and capacity both sound fine, because on all currently supported platforms the OsString representation is byte-based (bytes on unix, WTF-8 on Windows).

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Oct 5, 2015
@eefriedman
Copy link
Contributor Author

Updated: removed from_string, added reserve and reserve_exact, added a brief note in the drawbacks section noting the discussion of the meaning of len().

@Diggsey
Copy link
Contributor

Diggsey commented Oct 6, 2015

Looks good. OsStr should probably also implement AsciiExt.

I had a look through and there weren't any other methods on str that made sense to implement: anything involving substrings can't work correctly in WTF8.

However, it may be worthwhile (although out of scope for this RFC) defining an OsChar type which is a newtype around either u8 or u16.

Substring operations can then be defined in terms of iterators over OsChars and OsString can implement From<Iterator<OsChar>>, so that iter.collect() can produce an OsString.

@BurntSushi
Copy link
Member

For from_string, note that today's unstable from_bytes works with anything that impls Into<Vec<u8>> (String has such an impl).

I feel like these methods won't be used that often, but if you need to use OsString, then these do seem like sensible things to have.

I think this looks good to me.

@BurntSushi BurntSushi self-assigned this Oct 7, 2015
@wthrowe
Copy link

wthrowe commented Oct 7, 2015

There's also the stable OsString::from(s: String)-> OsString.

@alexcrichton
Copy link
Member

🔔 This RFC is now entering its week-long final comment period 🔔


I feel that these methods are quite conventional among collections, so the only particular question in my mind would be whether we want them at all. I originally thought that they're not useful because the length of an OS string doesn't actually mean anything in terms of decoding it, but it does mean something in terms of performance (a key insight here), which I feel is justification enough for their inclusion.

@alexcrichton alexcrichton added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Oct 15, 2015
@sfackler
Copy link
Member

I feel a bit iffy with the naming choice of len - it seems like people will inevitably try to apply the "intuitive" definition of length which will end up being bogus on Windows, no matter how dire any warnings in the documentation are. Maybe rename it to storage_size or something like that instead?

@BurntSushi
Copy link
Member

@sfackler Isn't that the case for String's len method too? Specifically, it refers to the storage size rather than something else like [Insert Favorite Unicode Semantic Here]. (I very well could be missing a subtlety here!)

@sfackler
Copy link
Member

Well, String's storage size is the UTF8 size, which is way more common than WTF8 is :)

@retep998
Copy link
Member

@sfackler Working with strings with Windows API is already a minefield. If someone doesn't understand what len on OsStr does, that's probably the least of their problems. We'd gain a lot more from just having a solid guide on how to work with Windows API, rather than trying to give functions weird names.

@arthurprs
Copy link

I believe len should stick to the existing semantics (string, vec, all collections) and return the number of underlying items, in this case the utf16ish words.

@BurntSushi
Copy link
Member

@arthurprs &str's len returns the number of UTF-8 encoded bytes. Seems to me like OsStr should do the same (return the number of encoded bytes).

@Diggsey
Copy link
Contributor

Diggsey commented Oct 27, 2015

I believe len should stick to the existing semantics (string, vec, all collections) and return the number of underlying items, in this case the utf16ish words.

That would require iterating the OsString to calculate, when the len() functions on strings and vecs operate in constant time. It would also be totally inconsistent across platforms, and therefore unusable from non-platform-specific code.

Returning a byte count is portable, consistent and has practical usefulness for when building OsStrings. Without it, there's no direct relation between reserve and reserve_exact.

I don't think this breaks the encapsulation of the WTF8 encoding because len() is a lower level operation: it's the same reason that having size_of can be useful even when struct layout is ill-defined.

@arthurprs
Copy link

Yeah, I believe you guys are right since the underlying data is actually wtf8.

@alexcrichton
Copy link
Member

The libs team discussed this during triage yesterday and the decision was to merge, thanks again for the RFC @eefriedman!

Tracking issue: rust-lang/rust#29453

@alexcrichton alexcrichton merged commit 736bfce into rust-lang:master Oct 29, 2015
@Centril Centril added A-platform Platform related proposals & ideas A-collections Proposals about collection APIs labels Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Proposals about collection APIs A-platform Platform related proposals & ideas final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants