Recover error strings on Unix from_lossy_utf8 #99609

Merged: 1 commit merged into rust-lang:master on Sep 25, 2022

workingjubilee
Member

Some language settings can result in unreliable UTF-8 being produced.
This can result in failing to emit the error string, panicking instead.
`String::from_utf8_lossy` allows us to assume these strings usually will be fine.

This fixes #99535.
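
As a rough illustration of the failure mode described above, the following sketch contrasts a strict UTF-8 conversion with a lossy one. The function names `error_string_strict`, `error_string_lossy`, and `latin1_message` are hypothetical and are not the standard library's internals; the sample bytes assume a Latin-1 locale in which `0xE9` encodes "é". The lossy conversion uses the standard `String::from_utf8_lossy`.

```rust
use std::ffi::CStr;

/// Hypothetical stand-in for a strict conversion: fails on non-UTF-8 input.
/// Calling `.unwrap()` on this result is what would turn bad bytes into a panic.
fn error_string_strict(msg: &CStr) -> Result<String, std::str::Utf8Error> {
    std::str::from_utf8(msg.to_bytes()).map(str::to_owned)
}

/// Hypothetical stand-in for a lossy conversion: invalid sequences become
/// U+FFFD (the replacement character) and the rest of the message survives.
fn error_string_lossy(msg: &CStr) -> String {
    String::from_utf8_lossy(msg.to_bytes()).into_owned()
}

/// A message as a Latin-1 locale might produce it (assumed sample data):
/// 0xE9 is 'é' in ISO-8859-1, but on its own it is not valid UTF-8.
fn latin1_message() -> &'static CStr {
    CStr::from_bytes_with_nul(b"Op\xE9ration non permise\0")
        .expect("literal ends with exactly one NUL byte")
}
```

With these definitions, `error_string_strict(latin1_message())` returns an `Err`, while `error_string_lossy(latin1_message())` returns a readable string with one replacement character in place of the `0xE9` byte.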

@rustbot rustbot added the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jul 22, 2022
@rustbot
Collaborator

rustbot commented Jul 22, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-highfive
Collaborator

r? @joshtriplett

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 22, 2022
@workingjubilee workingjubilee added the O-unix Operating system: Unix-like label Jul 22, 2022
@the8472
Member

the8472 commented Jul 22, 2022

Aiui some legacy encodings such as Shift-JIS aren't an ASCII superset. So if there's a system using such a locale, they'd still get garbage and this wouldn't really be a fix.

@workingjubilee
Member Author

Alas, "Not output garbage" is not the goal of this PR.
"Not panic when the OS feeds us something we consider garbage" is the goal of this PR.
Emitting garbage is better than an unnecessary panic.
Currently, even a garbage error message is not emitted, because we panic before emitting it.

You are correct that a more complete fix requires more nuanced handling. Immediately after writing this, I started making many smaller changes to improve things further, until I realized that the OsStr interfaces we probably should be using for this case are quite underdeveloped. To vet the implementation, I would also have to test on Windows at minimum, so I need to get my Windows machine booting. I do intend to make such followups, but I saw no reason to delay this any further.

The primary location to find Shift-JIS is in fact on Windows platforms, which have an even more eccentric interpretation of it than you might imagine. I am quite familiar with how Shift-JIS breaks systems, as it is what gives the phenomenon of incorrect character emission the name mojibake.
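
To make the "garbage, but no panic" trade-off concrete, here is a small sketch that feeds Shift-JIS bytes through a lossy UTF-8 conversion. The constant `NIHONGO_SHIFT_JIS` and the helper `decode_as_utf8_lossy` are names invented for this illustration; the bytes are the Shift-JIS encoding of "日本語". Note that the second byte of "本" is `0x7B`, which falls in the ASCII range and so survives as a stray `{`.

```rust
/// "日本語" encoded as Shift-JIS (sample data chosen for illustration).
const NIHONGO_SHIFT_JIS: [u8; 6] = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA];

/// Interpret arbitrary bytes as UTF-8, replacing anything invalid with U+FFFD.
/// This never panics, but for non-UTF-8 encodings the result is mojibake.
fn decode_as_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Calling `decode_as_utf8_lossy(&NIHONGO_SHIFT_JIS)` yields a string made of replacement characters plus the stray `{`: unreadable, but it can still be printed as part of an error report, whereas a strict `std::str::from_utf8` on the same bytes returns an error.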

@ChrisDenton
Member

I'm not understanding the relevance of Windows here? In Rust for Windows, OsStr is WTF-8 encoded.

In Windows itself all local encodings are lossily converted to/from UTF-16. I guess the only issue would be reading the stdout of a third party tool that spits out a locale-specific encoding.
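
For readers unfamiliar with lossy UTF-16 handling, the standard library offers an analogous operation, `String::from_utf16_lossy`. The sketch below is only an analogy: it is not the Windows code-page conversion itself, and the wrapper name `utf16_to_string_lossy` is made up for this example. It shows an unpaired surrogate, which strict UTF-16 decoding rejects, being replaced with U+FFFD.

```rust
/// Decode UTF-16 code units, replacing unpaired surrogates with U+FFFD.
/// Thin wrapper around the standard `String::from_utf16_lossy`.
fn utf16_to_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Strict counterpart: returns `None` if the input contains an unpaired surrogate.
fn utf16_to_string_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

For the input `[0x0048, 0xD800, 0x0069]` ("H", a lone high surrogate, "i"), the strict version returns `None` and the lossy version returns `"H\u{FFFD}i"`.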

@workingjubilee
Member Author

@ChrisDenton The only relevance is that to Actually Fix the Issue so that we don't ever emit garbage would require changing the signature of fn error_string. After doing such a significant change, I would prefer to test it on multiple installations of different operating systems. And to do what I would really prefer might require extending OsStr, which I would want to do further testing to validate, again, on multiple platforms.

And I may or may not get around to doing all that.

However, here the standard library is panicking needlessly while trying to construct a mere error report. Emitting junk data is genuinely preferable. "Really fixing it" can be left to future work.
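
One possible shape for such future work, sketched purely as an assumption and not as what the author intends, would be to return an `OsString` so the raw bytes are preserved and lossy decoding happens only at display time. The function `error_string_os` below is hypothetical, and the example is Unix-only because it relies on `std::os::unix::ffi::OsStringExt`.

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

/// Hypothetical alternative to a `String`-returning error_string:
/// keep the OS-provided bytes exactly as they are (Unix only).
fn error_string_os(raw: &[u8]) -> OsString {
    OsString::from_vec(raw.to_vec())
}

/// Decode lossily only at the point where the message is shown to the user.
fn display_lossy(msg: &OsString) -> String {
    msg.to_string_lossy().into_owned()
}
```

The design trade-off is that callers holding the `OsString` could still recover the original bytes (for example, to re-decode them with the correct locale encoding), while `display_lossy` gives the same best-effort text as the lossy approach in this PR.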

@workingjubilee
Member Author

To be clear: Before I ever touched programming, I did localization. If you have questions about fonts, locales, languages, or encodings, I will happily answer them. But I would rather save any such explanations for the comments and commit messages of my intended future PR.

Member

@thomcc thomcc left a comment

@workingjubilee asked me to take a look at this, and I think this makes sense given the justification in #99609 (comment): this is just to avoid panicking and to produce a good error message on a best-effort basis.

That said I won't snipe the review from the assigned reviewer, but r=me if they don't want to take it.

@ChrisDenton
Member

ChrisDenton commented Jul 23, 2022

I'll also try to be clear. I do not have a problem with this PR at all. I think it's a good improvement over the current behavior. I would however be much more nervous about a more substantial fix that potentially also affects Windows as the original issue demonstrated a *nix specific problem. But I agree, this conversation can be better had at a more appropriate time.

@thomcc
Member

thomcc commented Sep 24, 2022

@bors r+

@bors
Contributor

bors commented Sep 24, 2022

📌 Commit bcf780e has been approved by thomcc

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 24, 2022
@joshtriplett
Member

Sorry to have missed this. LGTM as well. Thanks @thomcc.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Sep 24, 2022
… r=thomcc

Recover error strings on Unix from_lossy_utf8

Some language settings can result in unreliable UTF-8 being produced.
This can result in failing to emit the error string, panicking instead.
from_lossy_utf8 allows us to assume these strings usually will be fine.

This fixes rust-lang#99535.
@bors
Contributor

bors commented Sep 25, 2022

⌛ Testing commit bcf780e with merge 8e9c93d...

@bors
Contributor

bors commented Sep 25, 2022

☀️ Test successful - checks-actions
Approved by: thomcc
Pushing 8e9c93d to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Sep 25, 2022
@bors bors merged commit 8e9c93d into rust-lang:master Sep 25, 2022
@rustbot rustbot added this to the 1.66.0 milestone Sep 25, 2022
@rust-timer
Collaborator

Finished benchmarking commit (8e9c93d): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean [1] | range | count [2] |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -3.3% | [-3.3%, -3.3%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean [1] | range | count [2] |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 4.1% | [4.1%, 4.1%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean [1] | range | count [2] |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.4% | [-2.4%, -2.4%] | 1 |
| Improvements ✅ (secondary) | -3.7% | [-3.7%, -3.7%] | 1 |
| All ❌✅ (primary) | -2.4% | [-2.4%, -2.4%] | 1 |

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

Labels
merged-by-bors This PR was explicitly merged by bors. O-unix Operating system: Unix-like S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Development

Successfully merging this pull request may close these issues.

Printing io::Error panics if current UNIX locale doesn't use UTF-8 encoding
9 participants