Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

bf4 · 2014-11-20T04:30:01Z

UPDATE 2015-02-8: Split into other PRs:

Extracting EncodedString specs from #134 #151
JRuby specs need to be able to pass args to in_sub_process #152
Test Differ#pick_encoding; move to EncodedString #167
Fix invalid byte sequence on EncodedString#split #172
The Differ no longer catches CompatibilityErrors #173 (closed without merging)
Differ tests no longer use Differ to report diff expectation #174
Fix typo rspec-dev#114
For discussion: be explicit about JRuby opts rspec-dev#115 (open)
Skip specs with non-mri-compatible backtrace. rspec-core#1871 (open)
Document how to test #171 (split into other PRs)
- Add RSpec-dev to Contributing; RSpec-Support to repos rspec.github.io#65 (open)
- TBD
Unify be_identical_string/be_diffed_as in spec/string_matcher.rb #176
EnvUtil.with_default_external, esp. for tests that only pass when it's UTF-8.
Improving failure messages via the matcher and/or changing the replace string and/or using a fallback method.

JonRowe · 2014-11-20T05:08:15Z

spec/rspec/support/encoded_string_spec.rb

@@ -21,6 +21,47 @@ module RSpec::Support
        end
      end

+      describe '#to_s' do
+        if RUBY_VERSION == '1.9.2' && RSpec::Support::Ruby.mri?


Protip(tm) you can shorten this to Ruby.mri? given you're in the Support module already, however you shouldn't limit it to Ruby version 1.9.2

Thanks. I'm switching back and forth between the two, so forgot my context :)

It actually behaves differently on 1.9.2.. https://travis-ci.org/rspec/rspec-support/jobs/41566833#L113

1) RSpec::Support::EncodedString#to_s does nothing to an invalid byte sequence Failure/Error: expect(resulting_string.to_s).to eq("\xEF hi I am not going to work") expected: "\xEF hi I am not going to work" got: "\xEF hi I am not going to work"

myronmarston · 2014-11-20T05:54:17Z

spec/rspec/support/encoded_string_spec.rb

+            expect(resulting_string.to_s).to eq("? hi I am not going to work")
+          end
+
+          it 'replaces all characters with either a ? or a unicode ?' do


This doc string doesn't make sense to me...it seems to claim that EncodedString#to_s returns a string containing all question marks no matter what the original string is. Can you wrap this in a descriptive context that describes when this behavior occurs?

It surprised me as well. I just made the test pass from the actual output based on the current behavior. split to_s and << don't result in the same type of string. maybe it should be looked at. Also see @JonRowe in rspec/rspec-core#1760 (comment) and my followup

I thought adding the describe '#to_s' do was sufficient. Otherwise it requires a bit more investigation to understand

I thought adding the describe '#to_s' do was sufficient. Otherwise it requires a bit more investigation to understand

Well, there are other examples in this group where calling #to_s does not result in a string of all question marks, so clearly it 'replaces all characters with either a ? or a unicode ?' isn't the behavior you always see when #to_s is called. Are UTF-16LE and IBM737 completely incompatible encodings? Is that what causes that?

After a too-long break I'm back looking to finish the PRs. I'll update the context to make the tests clearer and see if I can clarify the reasons for different encoding behavior. I think that would resolve the blocker.

bf4 · 2014-12-26T06:53:58Z

lib/rspec/support/encoded_string.rb

          end
+        }


what do you think of this? I kind of like it but I've never seen it before. I found it while looking through the Ruby encoding / transcoding code and tests.

bf4 · 2014-12-28T05:08:45Z

@myronmarston going to sleep now, will answer questions in the morning.. this has been a bit of a slog.. in some cases, to answer your question, it may be in a commit message..

some of your questions are about how we want it to behave (replace with '?' vs. something else) which I don't have an opinion on

bf4 · 2014-12-28T15:17:25Z

I do intend to rebase and clean this up.. esp now that the tests all passed.

myronmarston · 2014-12-29T01:41:12Z

lib/rspec/support/differ.rb

@@ -47,8 +47,6 @@ def diff_as_string(actual, expected)
        finalize_output(output, hunks.last.diff(format_type).to_s) if hunks.last

        color_diff output
-      rescue Encoding::CompatibilityError
-        handle_encoding_errors


Is this being removed because of https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47 ?

I ask because rspec-mocks uses the differ in RSpec 2.2....but it doesn't rescue this error. It makes me wonder if we should move this rescue back into here and/or add a similar rescue to rspec-mocks.

@JonRowe, give that you authored rspec/rspec-expectations#220, what do you think should be done?

Actually, I'm noticing that rspec-expectations doesn't rescue this error anymore, either...so why is it safe to remove, @bf4? (Apologies if your commit messages explain but at 23 commits it's a lot to look through).

After reading through the rest of the diff it looks like this error isn't possible anymore since EncodedString handles so many more cases internally...is that right, @bf4?

@myronmarston That's my experience thus far. I have been unable to cause the code to fail with that exception, which is why I tracked down where it came from. Specifically, as I noted in the commit message bf4@96584ba

The Differ no longer catches CompatibilityErrors This was introduced in https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47 but is no longer raised either by the the current test code, the original test code @expected="Tu avec carté {count} itém has".encode('UTF-16LE') @actual="Tu avec carte {count} item has".encode('UTF-16LE') or any other variations I've tried Since I'm moving all the encoding logic into EncodedString, and this doesn't appear to do anything any more, I am proposing to remove it.

As I commented elsewhere, I want to confirm this isn't still a problem on 1.9.2.

Given how notoriously difficult it is to actually craft an invalidly encoded string when you run a conventually encoded environment I'm uncomfortable with this FYI

myronmarston · 2014-12-29T02:11:44Z

Thanks @bf4, this is looking good. I think I understand it now. Squashing commits during a rebase will definitely help keep the history here more understandable so please do that when you're ready.

bf4 · 2014-12-29T02:27:13Z

@myronmarston I think I'm about ready to rebase the commits. I'll push a few first addressing any remaining issues.

I was following the thoughtbot guide where when a PR is being discussed, to continue adding commits so that changes are easy to see, but then rebase when done to tell a better story. I started doing that because I was making so many changes, I wanted a better history of what I was trying, rather than what I ended up with :)

bf4 · 2014-12-29T02:44:19Z

spec/rspec/support/encoded_string_spec.rb

-            actual_bytes = resulting_string.each_byte.to_a
+          it 'uses the default external encoding when the two strings have incompatible encodings' do
+            utf8_string = build_encoded_string("Tu avec carte {count} item has\n", "UTF-8")
+            utf8_incompatible_string = "Tu avec carté {count} itém has\n".encode('UTF-16LE')


These tests with the 'Tu avec' came from the differ, since I thought the é was a good scenario we weren't testing here

myronmarston · 2015-01-06T08:37:53Z

lib/rspec/support/differ.rb

      def diff(actual, expected)
-        diff = ""
+        diff = EMPTY_DIFF.dup


Why was "" insufficient?

Also, is the dup just to guard against mutations (to ensure EMPTY_DIFF isn't mutated)? If there aren't any mutations maybe you could freeze the constant instead and use it w/o duping for fewer allocations.

myronmarston · 2015-01-06T08:58:12Z

Hey @bf4, I really appreciate the effort you're putting into this, but I also feel like this PR keeps shifting significantly every time I take a look at it. As lead RSpec maintainer I think it's important that I understand any code I merge (as I'll need to be able to refactor it and fix bugs in it in the future) but there's a lot going on here that I don't understand and every time I look at it it seems like there's new stuff I don't understand and I still don't understand the old stuff :(. At this point it seems like here are many concerns being addressed all in one PR:

Improving the existing specs.
Improving the build.
Handling some encoding edge cases that we didn't already handle.

Having all these concerns is making it much harder for me to understand what the actual affect of this PR is and also to ensure we're not losing coverage in all the spec changes. I hate making more work for you (especially since you've done so much work on this PR already...) but do you think you could open up a few smaller PRs that address a single aspect each? That would make your changes much easier to review and reason about and hopefully we could get the stuff that's not still an open question merged quickly and then be left with whatever the WIP stuff is in a much smaller PR.

Ideally I'd like to keep spec refactorings distinct from implementation refactorings and bug fixes.

myronmarston · 2015-01-06T09:00:26Z

For shelling out I've enjoyed using this snippet as it lets me stream the output as the process is running, rather than having to wait.

What does that have to do with your changes here? (I don't see any relation...) Or are you just recommending we change how rspec-support handles shelling out? If it's the latter, please open a separate PR for that -- mixing concerns all in one PR is making my head spin!

bf4 · 2015-01-06T13:58:55Z

@myronmarston thanks for your review. I think you're right that this can now be broken up into smaller PRs. As you know, that happens sometimes that some changes expose others :)

I know the scope has changed a bit, but that's mostly because I've been trying to figure out how to 1) cover the different Encoding exceptions and 2) reliably test them.

I see a few types of comments

About env variables JRUBY_OPTS and LC_etc.
Typos or otherwise unclear comments
Code conventions / practices: e.g. constant lookup, allocations, hash, etc
Understanding the exceptions rescue or other defensive techniques

I'll answer in place, later, I think, but for now:
JRUBY_OPTS are just a workaround for travis failures regarding the removed cext option causing warnings which cause failures 8aa73ae. The LC_etc changes are to make explicit that we expect our tests to run with external UTF-8 encoding. Travis does this, but our suite does not. So, on my vagrant box (rake-dev-compiler, btw), the default encoding is ISO-8859-1, which causes expectation failures. Rather than debug expectation failures stemming from different external encodings, I just made sure we always test against the same external encoding. bf4@db2c3a4 In the future, we might want to try something like EnvUtil.with_default_external to test these cases explicitly.
Typos.. yeah, sorry, happens
I'll change per your recommendations, which I agree with.
The safe_chr rescue is a RangeError. If the codepoint is 128169 (pile of poo), then 128169.chr raises RangeError: 128169 out of char range in which case it returns ("U+%.4X" % [128169]) #=> "U+1F4A9". Contrariwise, 63.chr returns ?. The idea behind the safe_chr, safe_codepoints, and expect_identical_string is to provide the most useful failure info if the strings aren't the same.

First, it checks if the strings are the same encoding. If not, it's a failure.
Next, it turns them into bytes/codepoints for comparison. If they're the same, success! If not, we try to produce the most useful failure message.

If we did a straight-up bytes comparison, we would just get a diff of numbers which is pretty undreadable. So, what I try to do, is encode all the codepoints I can, then compare the characters in the strings until there is a difference, then output that.

Rescuing the ArgumentError in safe_codepoints is if the string has an invalid byte sequence, it raises that on string.codepoints, so I fall back to just bytes.

There's probably other ways to do it :)

myronmarston · 2015-01-06T21:32:58Z

JRUBY_OPTS are just a workaround for travis failures regarding the removed cext option causing warnings which cause failures 8aa73ae.

From what I can tell, it looks like that didn't actually fix the JRuby build? According to the travis issue there was an RVM bug that caused jruby-head to get installed as jruby so we've just been planning to wait until the fix is rolled out. If there was a simple fix like that I'd be in favor but I'd want it made in rspec-dev and pushed out to all rspec repos.

The LC_etc changes are to make explicit that we expect our tests to run with external UTF-8 encoding. Travis does this, but our suite does not. So, on my vagrant box (rake-dev-compiler, btw), the default encoding is ISO-8859-1, which causes expectation failures. Rather than debug expectation failures stemming from different external encodings, I just made sure we always test against the same external encoding. bf4/rspec-support@db2c3a4 In the future, we might want to try something like EnvUtil.with_default_external to test these cases explicitly.

I think that EnvUtil.with_default_external might be a better solution, given that users who don't have UTF-8 encoding probably won't run the travis scripts to run the specs. It also makes it explicit that the encoding is being set for particular specs that rely on it. Lastly, that helper will allow us to write specs for environments that have a different default external coding if we want.

The idea behind the safe_chr, safe_codepoints, and expect_identical_string is to provide the most useful failure info if the strings aren't the same.

Gotcha. Thanks for explaining that. I think the logic you've written could benefit from being refactored into a custom matcher, that uses a simple string equality check for matches?, is diffable, and exposes the safe_chr/safe_codepoints values from failure_message, expected and actual -- that way they will get used in the failure diff. The nice thing about splitting the logic that way is that it keeps the core matching logic simple and easy to reason about, and does the more complicated conversion only when necessary to provide the most useful failure message. If you want help creating such a matcher, let me know.

bf4 · 2015-01-06T23:06:06Z

JRUBY_OPTS just a workaround so i can use ci. We do already modify
it. Will remove before merge or when requested.

LC_ not sure what is most worth my time. Maybe in a refactor? Those
are travis defaults. We could document wwtd?

Byte matcher: pseudo code would help. I haven't written many and it
sounds like you have something in mind.

bf4 · 2015-01-07T06:33:06Z

Let me know what you'd like for next steps here. What's remaining after #151 and #152

Handling the ArgumentError
Adding Differ specs, have differ empty string be an EncodedString
Differ no longer needs to rescue CompatibilityErrors
Making use of getting the compatible encoding before converting, as Differ#pick_encoding
Adding some way to testing for an external encoding like EnvUtil.with_default_external, esp. for tests that only pass when it's UTF-8.
Improving failure messages via the matcher and/or changing the replace string and/or using a fallback method.

And possibly misc Windows/JRuby fixes mixed in

I think we can leave out the Encoding internal/external/etc specs

Extracting EncodedString specs from #134

From: - rspec/rspec-core#1760 - via rspec#134

Map string char with invalid encoding to '?' Format identical string expectation to read easier Refs: - rspec/rspec-core#1760 - via rspec#134

Map string char with invalid encoding to '?' Format identical string expectation to read easier Refs: - rspec/rspec-core#1760 - via rspec#134 Set to pending for JRuby, opened issue jruby/jruby#2580 that JRuby is the only Ruby that returns "\x80" in place of "?"

bf4 · 2015-02-09T04:05:01Z

Looks like the big goals of this 1 through 4 are done. The only things left include

Adding some way to testing for an external encoding like EnvUtil.with_default_external, esp. for tests that only pass when it's UTF-8.

Improving failure messages via the matcher and/or changing the replace string and/or using a fallback method.

And with that, I think we can close this PR.

bf4 force-pushed the handle_non_utf8_encoding_string branch 2 times, most recently from 728cf6d to ff988ae Compare November 20, 2014 05:03

JonRowe reviewed Nov 20, 2014
View reviewed changes

bf4 mentioned this pull request Nov 20, 2014

Handle exception message with invalid UTF-8; For rbx rspec/rspec-core#1760

Merged

bf4 force-pushed the handle_non_utf8_encoding_string branch 2 times, most recently from b70f78e to 5325c3f Compare November 20, 2014 05:53

myronmarston reviewed Nov 20, 2014
View reviewed changes

bf4 force-pushed the handle_non_utf8_encoding_string branch from 5325c3f to c8b0521 Compare December 26, 2014 06:52

bf4 reviewed Dec 26, 2014
View reviewed changes

myronmarston reviewed Dec 29, 2014
View reviewed changes

bf4 reviewed Dec 29, 2014
View reviewed changes

myronmarston reviewed Jan 6, 2015
View reviewed changes

This was referenced Jan 7, 2015

Extracting EncodedString specs from #134 #151

Merged

JRuby specs need to be able to pass args to in_sub_process #152

Merged

myronmarston added a commit that referenced this pull request Jan 23, 2015

Merge pull request #151 from bf4/encoding_specs

eb27b60

Extracting EncodedString specs from #134

myronmarston mentioned this pull request Jan 31, 2015

Spec ./spec/rspec/support/encoded_string_spec.rb:204 will likely never pass on Rubinius #162

Closed

This was referenced Feb 3, 2015

Test Differ#pick_encoding; move to EncodedString #167

Merged

Document how to test #171

Closed

bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split

e327630

From: - rspec/rspec-core#1760 - via rspec#134

bf4 mentioned this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split #172

Merged

bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split

a07480f

From: - rspec/rspec-core#1760 - via rspec#134

bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split

d9645c8

From: - rspec/rspec-core#1760 - via rspec#134

bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split

8260a59

Map string char with invalid encoding to '?' Format identical string expectation to read easier Refs: - rspec/rspec-core#1760 - via rspec#134

bf4 mentioned this pull request Feb 8, 2015

Differ tests no longer use Differ to report diff expectation #174

Merged

bf4 added a commit to bf4/rspec-support that referenced this pull request Feb 8, 2015

Fix invalid byte sequence on EncodedString#split

ff3c0a6

Map string char with invalid encoding to '?' Format identical string expectation to read easier Refs: - rspec/rspec-core#1760 - via rspec#134

bf4 closed this Feb 9, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

bf4 commented Nov 20, 2014

JonRowe Nov 20, 2014

bf4 Nov 20, 2014

myronmarston Nov 20, 2014

bf4 Nov 20, 2014

myronmarston Nov 20, 2014

bf4 Dec 25, 2014

bf4 Dec 26, 2014

bf4 commented Dec 28, 2014

bf4 commented Dec 28, 2014

myronmarston Dec 29, 2014

myronmarston Dec 29, 2014

myronmarston Dec 29, 2014

bf4 Dec 29, 2014

JonRowe Jan 2, 2015

myronmarston commented Dec 29, 2014

bf4 commented Dec 29, 2014

bf4 Dec 29, 2014

myronmarston Jan 6, 2015

myronmarston commented Jan 6, 2015

myronmarston commented Jan 6, 2015

bf4 commented Jan 6, 2015

myronmarston commented Jan 6, 2015

bf4 commented Jan 6, 2015

bf4 commented Jan 7, 2015

bf4 commented Feb 9, 2015

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

Conversation

bf4 commented Nov 20, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bf4 commented Dec 28, 2014

bf4 commented Dec 28, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

myronmarston commented Dec 29, 2014

bf4 commented Dec 29, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

myronmarston commented Jan 6, 2015

myronmarston commented Jan 6, 2015

bf4 commented Jan 6, 2015

myronmarston commented Jan 6, 2015

bf4 commented Jan 6, 2015

bf4 commented Jan 7, 2015

bf4 commented Feb 9, 2015