Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test EncodedSring#to_s for undefined conversion / invalid byte sequence #134

Closed
wants to merge 14 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -22,19 +22,24 @@ rvm:
- 2.2
- ruby-head
- ree
- jruby-18mode
- jruby
- jruby-head
- rbx
matrix:
include:
- rvm: jruby
env: JRUBY_OPTS='--2.0'
allow_failures:
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false -Xcompat.version=2.0'
- rvm: jruby-head
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false'
# These two are temporary until https://github.com/travis-ci/travis-ci/issues/3067 is solved.
- rvm: jruby-18mode
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false'
- rvm: jruby
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all these JRUBY_OPTS changes? I don't understand what these options do and I'd prefer to keep our JRUBY_OPTS simpler without a clear reason to add the complication of these options...especially since we haven't needed these options in the other repos or in this repo before now.

allow_failures:
- rvm: ruby-head
- rvm: rbx
# These two are temporary until https://github.com/travis-ci/travis-ci/issues/3067 is solved.
- rvm: jruby-18mode
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false'
- rvm: jruby
env: JRUBY_OPTS='--server -Xcompile.invokedynamic=false'
fast_finish: true
3 changes: 2 additions & 1 deletion Gemfile
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@ branch = File.read(File.expand_path("../maintenance-branch", __FILE__)).chomp
end

### dep for ci/coverage
gem 'simplecov', '~> 0.8'
gem 'simplecov', '~> 0.9'
gem 'simplecov-html', :github => 'colszowka/simplecov-html'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of your comments mentioned this addresses a failure but I don't understand...we didn't need this gem before so why do we need it now? What failure does it address?


gem 'rubocop', "~> 0.23.0", :platform => [:ruby_19, :ruby_20, :ruby_21]

Expand Down
36 changes: 9 additions & 27 deletions lib/rspec/support/differ.rb
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,14 @@ module RSpec
module Support
# rubocop:disable ClassLength
class Differ
if String.method_defined?(:encoding)
EMPTY_DIFF = EncodedString.new("", Encoding.default_external)
else
EMPTY_DIFF = EncodedString.new("")
end

def diff(actual, expected)
diff = ""
diff = EMPTY_DIFF.dup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was "" insufficient?

Also, is the dup just to guard against mutations (to ensure EMPTY_DIFF isn't mutated)? If there aren't any mutations maybe you could freeze the constant instead and use it w/o duping for fewer allocations.


if actual && expected
if all_strings?(actual, expected)
Expand All @@ -25,12 +31,10 @@ def diff(actual, expected)

# rubocop:disable MethodLength
def diff_as_string(actual, expected)
@encoding = pick_encoding actual, expected

@encoding = EncodedString.pick_encoding(actual, expected)
@actual = EncodedString.new(actual, @encoding)
@expected = EncodedString.new(expected, @encoding)

output = EncodedString.new("\n", @encoding)
output = EncodedString.new("\n", @encoding)

hunks.each_cons(2) do |prev_hunk, current_hunk|
begin
Expand All @@ -47,8 +51,6 @@ def diff_as_string(actual, expected)
finalize_output(output, hunks.last.diff(format_type).to_s) if hunks.last
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being removed because of https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47 ?

I ask because rspec-mocks uses the differ in RSpec 2.2....but it doesn't rescue this error. It makes me wonder if we should move this rescue back into here and/or add a similar rescue to rspec-mocks.

@JonRowe, give that you authored rspec/rspec-expectations#220, what do you think should be done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm noticing that rspec-expectations doesn't rescue this error anymore, either...so why is it safe to remove, @bf4? (Apologies if your commit messages explain but at 23 commits it's a lot to look through).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading through the rest of the diff it looks like this error isn't possible anymore since EncodedString handles so many more cases internally...is that right, @bf4?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@myronmarston That's my experience thus far. I have been unable to cause the code to fail with that exception, which is why I tracked down where it came from. Specifically, as I noted in the commit message bf4@96584ba

The Differ no longer catches CompatibilityErrors
This was introduced in
https://github.com/rspec/rspec-expectations/pull/220/files#diff-9533f5f156a38a3307ecfc610d2282d7R47

but is no longer raised either by the the current test code, the
original test code
      @expected="Tu avec carté {count} itém has".encode('UTF-16LE')
      @actual="Tu avec carte {count} item has".encode('UTF-16LE')
or any other variations I've tried

Since I'm moving all the encoding logic into EncodedString, and this
doesn't appear to do anything any more, I am proposing to remove it.

As I commented elsewhere, I want to confirm this isn't still a problem on 1.9.2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how notoriously difficult it is to actually craft an invalidly encoded string when you run a conventually encoded environment I'm uncomfortable with this FYI


color_diff output
rescue Encoding::CompatibilityError
handle_encoding_errors
end
# rubocop:enable MethodLength

Expand Down Expand Up @@ -188,26 +190,6 @@ def object_to_string(object)
PP.pp(object, "")
end
end

if String.method_defined?(:encoding)
def pick_encoding(source_a, source_b)
Encoding.compatible?(source_a, source_b) || Encoding.default_external
end
else
def pick_encoding(_source_a, _source_b)
end
end

def handle_encoding_errors
if @actual.source_encoding != @expected.source_encoding
"Could not produce a diff because the encoding of the actual string " \
"(#{@actual.source_encoding}) differs from the encoding of the expected " \
"string (#{@expected.source_encoding})"
else
"Could not produce a diff because of the encoding of the string " \
"(#{@expected.source_encoding})"
end
end
end
# rubocop:enable ClassLength
end
Expand Down
69 changes: 56 additions & 13 deletions lib/rspec/support/encoded_string.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,19 @@ module RSpec
module Support
# @private
class EncodedString
MRI_UNICODE_UNKOWN_CHARACTER = "\xEF\xBF\xBD"
if String.method_defined?(:encoding)
# see https://github.com/ruby/ruby/blob/ca24e581ba/encoding.c#L1191
def self.pick_encoding(source_a, source_b)
Encoding.compatible?(source_a, source_b) || Encoding.default_external
end
else
def self.pick_encoding(_source_a, _source_b)
end
end

# Ruby's default replacement string for is U+FFFD ("\xEF\xBF\xBD") for Unicode encoding forms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence doesn't make grammatical sense ("string for is" reads awkwardly and I'm not sure what you're trying to say). Can you rephrase it?

# else is '?' ("\x3F")
REPLACE = "\x3F"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this just be REPLACE = "?"? I find the \x escaped form to be much harder to read (and it winds up requiring a comment to explain it's a question mark).

Also why is this better than the default replacement string of U+FFFD? I don't understand that from your comment.


def initialize(string, encoding=nil)
@encoding = encoding
Expand Down Expand Up @@ -33,21 +45,52 @@ def to_s

private

ENCODING_STRATEGY = {
:bad_bytes => {
:invalid => :replace,
# :undef => :nil,
:replace => REPLACE
},
:cannot_convert => {
# :invalid => :nil,
:undef => :replace,
:replace => REPLACE
},
:no_converter => {
:invalid => :replace,
# :undef => :nil,
:replace => REPLACE
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think of this? I kind of like it but I've never seen it before. I found it while looking through the Ruby encoding / transcoding code and tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's with the commented out entries in these hashes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is there any real benefit to sticking all these strategies in a big named hash? Why not just make them separate constants and refer to them separately? That would remove an unnecessary method call (Hash#[]) and it seems like you're just using the hash as an alternate form of named constants.


# Raised by Encoding and String methods:
# Encoding::UndefinedConversionError:
# when a transcoding operation fails
# e.g. "\x80".encode('utf-8','ASCII-8BIT')
# Encoding::InvalidByteSequenceError:
# when the string being transcoded contains a byte invalid for the either
# the source or target encoding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "...for the either the source or target..." be "...for either the source or target..."?

# e.g. "\x80".encode('utf-8','US-ASCII')
# Raised by transcoding methods:
# Encoding::ConverterNotFoundError:
# when a named encoding does not correspond with a known converter
# e.g. 'abc'.force_encoding('utf-8').encode('foo')
# Encoding::CompatibilityError
#
def matching_encoding(string)
string.encode(@encoding)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
normalize_missing(string.encode(@encoding, :invalid => :replace, :undef => :replace))
encoding = EncodedString.pick_encoding(source_encoding, @encoding)
# Converting it to a higher character set (UTF-16) and then back (to UTF-8)
# ensures that we strip away invalid or undefined byte sequences
# => no need to rescue Encoding::InvalidByteSequenceError, ArgumentError
string.encode(::Encoding::UTF_16LE, ENCODING_STRATEGY[:bad_bytes]).
encode(encoding)
rescue Encoding::UndefinedConversionError, Encoding::CompatibilityError
string.encode(encoding, ENCODING_STRATEGY[:cannot_convert])
# Begin: Needed for 1.9.2
rescue Encoding::ConverterNotFoundError
normalize_missing(string.force_encoding(@encoding).encode(:invalid => :replace))
end

def normalize_missing(string)
if @encoding.to_s == "UTF-8"
string.gsub(MRI_UNICODE_UNKOWN_CHARACTER.force_encoding(@encoding), "?")
else
string
end
string.force_encoding(encoding).encode(ENCODING_STRATEGY[:no_converter])
end
# End: Needed for 1.9.2

def detect_source_encoding(string)
string.encoding
Expand Down
2 changes: 2 additions & 0 deletions lib/rspec/support/spec.rb
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
require 'rspec/support'
RSpec::Support.require_rspec_support "spec/deprecation_helpers"
RSpec::Support.require_rspec_support "spec/encoding_helpers"
RSpec::Support.require_rspec_support "spec/with_isolated_stderr"
RSpec::Support.require_rspec_support "spec/stderr_splitter"
RSpec::Support.require_rspec_support "spec/formatting_support"
Expand All @@ -12,6 +13,7 @@
c.include RSpecHelpers
c.include RSpec::Support::WithIsolatedStdErr
c.include RSpec::Support::FormattingSupport
c.include RSpec::Support::EncodingHelpers

unless defined?(Debugger) # debugger causes warnings when used
c.before do
Expand Down
61 changes: 61 additions & 0 deletions lib/rspec/support/spec/encoding_helpers.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
module RSpec
module Support
module EncodingHelpers
module_function

# For undefined conversions, replace as "U+<codepoint>"
# e.g. '\xa0' becomes 'U+00A0'
# see https://github.com/ruby/ruby/blob/34fbf57aaa/test/ruby/test_transcode.rb#L2050
def safe_chr
# rubocop:disable Style/RescueModifier
@safe_chr ||= Hash.new { |h, x| h[x] = x.chr rescue ("U+%.4X" % [x]) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider trailing rescue to be an anti-pattern since it's such a wide rescue. Can this rescue the specific exception class you are handling here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking these helpers need their own tests :)

# rubocop:enable Style/RescueModifier
end

if String.method_defined?(:encoding)

def safe_codepoints(str)
str.each_codepoint.map { |codepoint| safe_chr[codepoint] }
rescue ArgumentError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What part of str.each_codepoint.map { |codepoint| safe_chr[codepoint] } is capable of raising an ArgumentError?

str.each_byte.map { |byte| safe_chr[byte] }
end

# rubocop:disable MethodLength
def expect_identical_string(str1, str2, expected_encoding=str1.encoding)
expect(str1.encoding).to eq(expected_encoding)
str1_bytes = safe_codepoints(str1)
str2_bytes = safe_codepoints(str2)
return unless str1_bytes != str2_bytes
str1_differences = []
str2_differences = []
# rubocop:disable Style/Next
str2_bytes.each_with_index do |str2_byte, index|
str1_byte = str1_bytes.fetch(index) do
str2_differences.concat str2_bytes[index..-1]
return
end
if str1_byte != str2_byte
str1_differences << str1_byte
str2_differences << str2_byte
end
end
# rubocop:enable Style/Next
expect(str1_differences.join).to eq(str2_differences.join)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mentioned before that expect(str1 == str2).to eq true would bypass the differ (and you're right) so why is that insufficient? What does this far more complex way of comparing two strings gain you?

# rubocop:enable Style/MethodLength

else

def safe_codepoints(str)
str.split(//)
end

def expect_identical_string(str1, str2)
str1_bytes = safe_codepoints(str1)
str2_bytes = safe_codepoints(str2)
expect(str1_bytes).to eq(str2_bytes)
end
end
end
end
end
4 changes: 2 additions & 2 deletions lib/rspec/support/spec/in_sub_process.rb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
module RSpec
module Support
module InSubProcess
if Process.respond_to?(:fork) && !(RUBY_PLATFORM == 'java' && RUBY_VERSION == '1.8.7')
if Process.respond_to?(:fork) && !(Ruby.jruby? && RUBY_VERSION == '1.8.7')
# Useful as a way to isolate a global change to a subprocess.

# rubocop:disable MethodLength
Expand Down Expand Up @@ -35,7 +35,7 @@ def in_sub_process(prevent_warnings=true)
raise exception if exception
end
else
def in_sub_process
def in_sub_process(*)
skip "This spec requires forking to work properly, " \
"and your platform does not support forking"
end
Expand Down
6 changes: 5 additions & 1 deletion script/functions.sh
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,10 @@ source $SCRIPT_DIR/predicate_functions.sh

# idea taken from: http://blog.headius.com/2010/03/jruby-startup-time-tips.html
export JRUBY_OPTS="${JRUBY_OPTS} -X-C" # disable JIT since these processes are so short lived
# Set the external encoding to UTF-8 in a 1.8.7-compatible way
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed now when it wasn't before? I thought the external encoding was already set to UTF-8?


SPECS_HAVE_RUN_FILE=specs.out
MAINTENANCE_BRANCH=`cat maintenance-branch`

Expand Down Expand Up @@ -112,7 +116,7 @@ function check_documentation_coverage {
}

function check_style_and_lint {
echo "bin/rubucop lib"
echo "bin/rubocop lib"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, good catch!

bin/rubocop lib
}

Expand Down
Loading