Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

Closed
wants to merge 14 commits into from

Conversation

@bf4 bf4 force-pushed the handle_non_utf8_encoding_string branch 2 times, most recently from 728cf6d to ff988ae Compare November 20, 2014 05:03
@@ -21,6 +21,47 @@ module RSpec::Support
end
end

describe '#to_s' do
if RUBY_VERSION == '1.9.2' && RSpec::Support::Ruby.mri?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protip(tm) you can shorten this to Ruby.mri? given you're in the Support module already, however you shouldn't limit it to Ruby version 1.9.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'm switching back and forth between the two, so forgot my context :)

It actually behaves differently on 1.9.2.. https://travis-ci.org/rspec/rspec-support/jobs/41566833#L113

 1) RSpec::Support::EncodedString#to_s does nothing to an invalid byte sequence
     Failure/Error: expect(resulting_string.to_s).to eq("\xEF hi I am not going to work")

       expected: "\xEF hi I am not going to work"
            got: "\xEF hi I am not going to work"

@bf4 bf4 force-pushed the handle_non_utf8_encoding_string branch 2 times, most recently from b70f78e to 5325c3f Compare November 20, 2014 05:53
expect(resulting_string.to_s).to eq("? hi I am not going to work")
end

it 'replaces all characters with either a ? or a unicode ?' do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc string doesn't make sense to me...it seems to claim that EncodedString#to_s returns a string containing all question marks no matter what the original string is. Can you wrap this in a descriptive context that describes when this behavior occurs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It surprised me as well. I just made the test pass from the actual output based on the current behavior. split to_s and << don't result in the same type of string. maybe it should be looked at. Also see @JonRowe in rspec/rspec-core#1760 (comment) and my followup

I thought adding the describe '#to_s' do was sufficient. Otherwise it requires a bit more investigation to understand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought adding the describe '#to_s' do was sufficient. Otherwise it requires a bit more investigation to understand

Well, there are other examples in this group where calling #to_s does not result in a string of all question marks, so clearly it 'replaces all characters with either a ? or a unicode ?' isn't the behavior you always see when #to_s is called. Are UTF-16LE and IBM737 completely incompatible encodings? Is that what causes that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a too-long break I'm back looking to finish the PRs. I'll update the context to make the tests clearer and see if I can clarify the reasons for different encoding behavior. I think that would resolve the blocker.

@bf4 bf4 force-pushed the handle_non_utf8_encoding_string branch from 5325c3f to c8b0521 Compare December 26, 2014 06:52
end
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of this? I kind of like it but I've never seen it before. I found it while looking through the Ruby encoding / transcoding code and tests.

@bf4
Copy link
Contributor Author

bf4 commented Dec 28, 2014

@myronmarston going to sleep now, will answer questions in the morning.. this has been a bit of a slog.. in some cases, to answer your question, it may be in a commit message..

some of your questions are about how we want it to behave (replace with '?' vs. something else) which I don't have an opinion on

@bf4
Copy link
Contributor Author

bf4 commented Dec 28, 2014

I do intend to rebase and clean this up.. esp now that the tests all passed.

@@ -47,8 +47,6 @@ def diff_as_string(actual, expected)
finalize_output(output, hunks.last.diff(format_type).to_s) if hunks.last

color_diff output
rescue Encoding::CompatibilityError
handle_encoding_errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being removed because of https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47 ?

I ask because rspec-mocks uses the differ in RSpec 2.2....but it doesn't rescue this error. It makes me wonder if we should move this rescue back into here and/or add a similar rescue to rspec-mocks.

@JonRowe, give that you authored rspec/rspec-expectations#220, what do you think should be done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm noticing that rspec-expectations doesn't rescue this error anymore, either...so why is it safe to remove, @bf4? (Apologies if your commit messages explain but at 23 commits it's a lot to look through).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading through the rest of the diff it looks like this error isn't possible anymore since EncodedString handles so many more cases internally...is that right, @bf4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myronmarston That's my experience thus far. I have been unable to cause the code to fail with that exception, which is why I tracked down where it came from. Specifically, as I noted in the commit message bf4@96584ba

The Differ no longer catches CompatibilityErrors
This was introduced in
https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47

but is no longer raised either by the the current test code, the
original test code
      @expected="Tu avec carté {count} itém has".encode('UTF-16LE')
      @actual="Tu avec carte {count} item has".encode('UTF-16LE')
or any other variations I've tried

Since I'm moving all the encoding logic into EncodedString, and this
doesn't appear to do anything any more, I am proposing to remove it.

As I commented elsewhere, I want to confirm this isn't still a problem on 1.9.2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how notoriously difficult it is to actually craft an invalidly encoded string when you run a conventually encoded environment I'm uncomfortable with this FYI

@myronmarston
Copy link
Member

Thanks @bf4, this is looking good. I think I understand it now. Squashing commits during a rebase will definitely help keep the history here more understandable so please do that when you're ready.

@bf4
Copy link
Contributor Author

bf4 commented Dec 29, 2014

@myronmarston I think I'm about ready to rebase the commits. I'll push a few first addressing any remaining issues.

I was following the thoughtbot guide where when a PR is being discussed, to continue adding commits so that changes are easy to see, but then rebase when done to tell a better story. I started doing that because I was making so many changes, I wanted a better history of what I was trying, rather than what I ended up with :)

actual_bytes = resulting_string.each_byte.to_a
it 'uses the default external encoding when the two strings have incompatible encodings' do
utf8_string = build_encoded_string("Tu avec carte {count} item has\n", "UTF-8")
utf8_incompatible_string = "Tu avec carté {count} itém has\n".encode('UTF-16LE')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests with the 'Tu avec' came from the differ, since I thought the é was a good scenario we weren't testing here

def diff(actual, expected)
diff = ""
diff = EMPTY_DIFF.dup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was "" insufficient?

Also, is the dup just to guard against mutations (to ensure EMPTY_DIFF isn't mutated)? If there aren't any mutations maybe you could freeze the constant instead and use it w/o duping for fewer allocations.

@myronmarston
Copy link
Member

Hey @bf4, I really appreciate the effort you're putting into this, but I also feel like this PR keeps shifting significantly every time I take a look at it. As lead RSpec maintainer I think it's important that I understand any code I merge (as I'll need to be able to refactor it and fix bugs in it in the future) but there's a lot going on here that I don't understand and every time I look at it it seems like there's new stuff I don't understand and I still don't understand the old stuff :(. At this point it seems like here are many concerns being addressed all in one PR:

  • Improving the existing specs.
  • Improving the build.
  • Handling some encoding edge cases that we didn't already handle.

Having all these concerns is making it much harder for me to understand what the actual affect of this PR is and also to ensure we're not losing coverage in all the spec changes. I hate making more work for you (especially since you've done so much work on this PR already...) but do you think you could open up a few smaller PRs that address a single aspect each? That would make your changes much easier to review and reason about and hopefully we could get the stuff that's not still an open question merged quickly and then be left with whatever the WIP stuff is in a much smaller PR.

Ideally I'd like to keep spec refactorings distinct from implementation refactorings and bug fixes.

@myronmarston
Copy link
Member

For shelling out I've enjoyed using this snippet as it lets me stream the output as the process is running, rather than having to wait.

What does that have to do with your changes here? (I don't see any relation...) Or are you just recommending we change how rspec-support handles shelling out? If it's the latter, please open a separate PR for that -- mixing concerns all in one PR is making my head spin!

@bf4
Copy link
Contributor Author

bf4 commented Jan 6, 2015

@myronmarston thanks for your review. I think you're right that this can now be broken up into smaller PRs. As you know, that happens sometimes that some changes expose others :)

I know the scope has changed a bit, but that's mostly because I've been trying to figure out how to 1) cover the different Encoding exceptions and 2) reliably test them.

I see a few types of comments

  1. About env variables JRUBY_OPTS and LC_etc.

  2. Typos or otherwise unclear comments

  3. Code conventions / practices: e.g. constant lookup, allocations, hash, etc

  4. Understanding the exceptions rescue or other defensive techniques

    I'll answer in place, later, I think, but for now:

  5. JRUBY_OPTS are just a workaround for travis failures regarding the removed cext option causing warnings which cause failures 8aa73ae. The LC_etc changes are to make explicit that we expect our tests to run with external UTF-8 encoding. Travis does this, but our suite does not. So, on my vagrant box (rake-dev-compiler, btw), the default encoding is ISO-8859-1, which causes expectation failures. Rather than debug expectation failures stemming from different external encodings, I just made sure we always test against the same external encoding. bf4@db2c3a4 In the future, we might want to try something like EnvUtil.with_default_external to test these cases explicitly.

  6. Typos.. yeah, sorry, happens

  7. I'll change per your recommendations, which I agree with.

  8. The safe_chr rescue is a RangeError. If the codepoint is 128169 (pile of poo), then 128169.chr raises RangeError: 128169 out of char range in which case it returns ("U+%.4X" % [128169]) #=> "U+1F4A9". Contrariwise, 63.chr returns ?. The idea behind the safe_chr, safe_codepoints, and expect_identical_string is to provide the most useful failure info if the strings aren't the same.

First, it checks if the strings are the same encoding. If not, it's a failure.
Next, it turns them into bytes/codepoints for comparison. If they're the same, success! If not, we try to produce the most useful failure message.

If we did a straight-up bytes comparison, we would just get a diff of numbers which is pretty undreadable. So, what I try to do, is encode all the codepoints I can, then compare the characters in the strings until there is a difference, then output that.

Rescuing the ArgumentError in safe_codepoints is if the string has an invalid byte sequence, it raises that on string.codepoints, so I fall back to just bytes.

There's probably other ways to do it :)

@myronmarston
Copy link
Member

JRUBY_OPTS are just a workaround for travis failures regarding the removed cext option causing warnings which cause failures 8aa73ae.

From what I can tell, it looks like that didn't actually fix the JRuby build? According to the travis issue there was an RVM bug that caused jruby-head to get installed as jruby so we've just been planning to wait until the fix is rolled out. If there was a simple fix like that I'd be in favor but I'd want it made in rspec-dev and pushed out to all rspec repos.

The LC_etc changes are to make explicit that we expect our tests to run with external UTF-8 encoding. Travis does this, but our suite does not. So, on my vagrant box (rake-dev-compiler, btw), the default encoding is ISO-8859-1, which causes expectation failures. Rather than debug expectation failures stemming from different external encodings, I just made sure we always test against the same external encoding. bf4/rspec-support@db2c3a4 In the future, we might want to try something like EnvUtil.with_default_external to test these cases explicitly.

I think that EnvUtil.with_default_external might be a better solution, given that users who don't have UTF-8 encoding probably won't run the travis scripts to run the specs. It also makes it explicit that the encoding is being set for particular specs that rely on it. Lastly, that helper will allow us to write specs for environments that have a different default external coding if we want.

The idea behind the safe_chr, safe_codepoints, and expect_identical_string is to provide the most useful failure info if the strings aren't the same.

Gotcha. Thanks for explaining that. I think the logic you've written could benefit from being refactored into a custom matcher, that uses a simple string equality check for matches?, is diffable, and exposes the safe_chr/safe_codepoints values from failure_message, expected and actual -- that way they will get used in the failure diff. The nice thing about splitting the logic that way is that it keeps the core matching logic simple and easy to reason about, and does the more complicated conversion only when necessary to provide the most useful failure message. If you want help creating such a matcher, let me know.

@bf4
Copy link
Contributor Author

bf4 commented Jan 6, 2015

JRUBY_OPTS just a workaround so i can use ci. We do already modify
it. Will remove before merge or when requested.

LC_ not sure what is most worth my time. Maybe in a refactor? Those
are travis defaults. We could document wwtd?

Byte matcher: pseudo code would help. I haven't written many and it
sounds like you have something in mind.

myronmarston added a commit that referenced this pull request Jan 23, 2015
Extracting EncodedString specs from #134
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
Map string char with invalid encoding to '?'
Format identical string expectation to read easier

Refs:
- rspec/rspec-core#1760
- via rspec#134
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
Map string char with invalid encoding to '?'
Format identical string expectation to read easier

Refs:
- rspec/rspec-core#1760
- via rspec#134
bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015
Map string char with invalid encoding to '?'
Format identical string expectation to read easier

Refs:
- rspec/rspec-core#1760
- via rspec#134

Set to pending for JRuby, opened issue
jruby/jruby#2580
that JRuby is the only Ruby that returns "\x80"
in place of "?"
@bf4
Copy link
Contributor Author

bf4 commented Feb 9, 2015

Looks like the big goals of this 1 through 4 are done. The only things left include

  1. Adding some way to testing for an external encoding like EnvUtil.with_default_external, esp. for tests that only pass when it's UTF-8.
  2. Improving failure messages via the matcher and/or changing the replace string and/or using a fallback method.

And with that, I think we can close this PR.

@bf4 bf4 closed this Feb 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants