Don't crash on invalid UTF-8 byte sequences.

If a line of code contains an invalid UTF-8 byte sequence, count it as a relevant line rather than crashing. This is useful for projects that use `track_files "**/*.rb"` and have the builder gem in a subdirectory such as vendor, because Builder contains a line of code with an invalid UTF-8 byte sequence.
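
As a rough illustration of the behaviour described above, here is a minimal sketch, not SimpleCov's actual code. The `classify_line` helper and its regular expression are hypothetical stand-ins for the classifier and its comment/whitespace pattern; the sketch assumes the crash comes from matching a regular expression against a string that is not valid in its encoding, which is what the `rescue ArgumentError` guards against.

```ruby
# Hypothetical helper: classify one source line, treating a line that
# cannot be matched (invalid byte sequence) as relevant.
def classify_line(line)
  # Illustrative pattern only; SimpleCov's real pattern may differ.
  blank_or_comment = /\A\s*(#.*)?\z/
  line =~ blank_or_comment ? :not_relevant : :relevant
rescue ArgumentError
  # E.g. the line contains an invalid byte sequence in UTF-8.
  :relevant
end

# Same kind of input as the spec in this commit: Latin-1 style bytes
# inside a string that Ruby tags as UTF-8.
bad_line = "bytes = \"\xF1t\xEBrn\xE2ti\xF4n\""

puts bad_line.valid_encoding?
puts classify_line(bad_line)
puts classify_line("  # a comment")
```

Counting such a line as relevant is the conservative choice: the classifier cannot tell whether the line is a comment, so it does not hide it from the coverage report.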
simplecov-ruby · Mar 10, 2018 · 67bd66a
1 parent 1e00add
commit 67bd66a
Showing 3 changed files with 21 additions and 6 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,7 @@ unreleased
 
 * (breaking) Stop handling string filters as regular expressions, use the dedicated regex filter if you need that behaviour. See [#616](https://github.com/colszowka/simplecov/pull/616) (thanks @yujinakayama)
 * Avoid overwriting the last coverage results on unsuccessful test runs. See [#625](https://github.com/colszowka/simplecov/pull/625) (thanks @thomas07vt)
+* Don't crash on invalid UTF-8 byte sequences.
 
 0.15.1 (2017-09-11) ([changes](https://github.com/colszowka/simplecov/compare/v0.15.0...v0.15.1))
 =======

diff --git a/lib/simplecov/lines_classifier.rb b/lib/simplecov/lines_classifier.rb
@@ -20,12 +20,17 @@ def classify(lines)
       skipping = false
 
       lines.map do |line|
-        if line =~ self.class.no_cov_line
-          skipping = !skipping
-          NOT_RELEVANT
-        elsif skipping || line =~ WHITESPACE_OR_COMMENT_LINE
-          NOT_RELEVANT
-        else
+        begin
+          if line =~ self.class.no_cov_line
+            skipping = !skipping
+            NOT_RELEVANT
+          elsif skipping || line =~ WHITESPACE_OR_COMMENT_LINE
+            NOT_RELEVANT
+          else
+            RELEVANT
+          end
+        rescue ArgumentError
+          # E.g., line contains an invalid byte sequence in UTF-8
           RELEVANT
         end
       end

diff --git a/spec/lines_classifier_spec.rb b/spec/lines_classifier_spec.rb
@@ -20,6 +20,15 @@
         expect(classified_lines.length).to eq 7
         expect(classified_lines).to all be_relevant
       end
+
+      it "determines invalid UTF-8 byte sequences as relevant" do
+        classified_lines = subject.classify [
+          "bytes = \"\xF1t\xEBrn\xE2ti\xF4n\xE0liz\xE6ti\xF8n\"",
+        ]
+
+        expect(classified_lines.length).to eq 1
+        expect(classified_lines).to all be_relevant
+      end
     end
 
     describe "not-relevant lines" do