File: Remove `\r` in `get_as_text()` to keep standardized Unix format #63717

akien-mga · 2022-07-31T11:12:24Z

This fixes a compatibility breakage from #63481.
See discussion in #63434.

This fixes a compatibility breakage from godotengine#63481. See discussion in godotengine#63434.

m4gr3d

Can we instead augment the String::parse_utf8(...) with an additional (optional) parameter that allows us to remove that character from the generated string.

This will avoid performing another traversal and copy of the same string, which may not be an issue for smaller strings, but may affect performance for larger strings.

Also as discussed in the conversation, we want this change to apply only for 3.x in order to avoid breaking compatibility.
As I mentioned in the conversation, for master and onward, we should return the raw file as is.

akien-mga · 2022-07-31T14:34:00Z

Also as discussed in the conversation, we want this change to apply only for 3.x in order to avoid breaking compatibility.
As I mentioned in the conversation, for master and onward, we should return the raw file as is.

Well I prefer fixing the compatibility breakage in master too because it was not an intentional design decision.

The File API for bindings was likely intentionally designed to provide users with Unix-formatted files, as evidenced by the explicit skip of \r in get_line(). This design decision can be challenged but that should be in its own proposal/PR, it shouldn't be done as a side effect of solving an Android-specific I/O performance regression.

For the same reason, I wouldn't change an API as core as String::parse_utf8() as part of this regression fix. First we fix the compat breakage (which is local to File::get_as_text(), and then we can discuss more changing to File and String APIs in follow-up proposals/PRs.

I don't think the cost of replace("\n", "") is too bad and in any case this should stay faster than the previous code, and functionally equivalent. So IMO that's good enough as a regression fix.

akien-mga · 2022-07-31T14:50:21Z

I don't think the cost of replace("\n", "") is too bad and in any case this should stay faster than the previous code, and functionally equivalent. So IMO that's good enough as a regression fix.

I did a test on a very heavy file with a debug build (so not the most telling as not optimized), and it does seem to have a significant impact on such a pathological case:

$ file editor_translations.gen.h*
editor_translations.gen.h:     C source, ASCII text
editor_translations.gen.h_dos: C source, ASCII text, with CRLF line terminators

$ ls -lh editor_translations.gen.h*
-rw-r--r-- 1 akien akien 16M Jul 31 16:36 editor_translations.gen.h
-rw-r--r-- 1 akien akien 19M Jul 31 16:37 editor_translations.gen.h_dos

$ cat editor_translations.gen.h | wc -l
2952162
$ cat editor_translations.gen.h_dos | wc -l
2952162

Measuring conversion time with this script:

var f1 = File.new()
f1.open("editor_translations.gen.h", File.READ)

var start1 = Time.get_ticks_usec()
f1.get_as_text()
var end1 = Time.get_ticks_usec()

print("Unix file: %f s" % ((end1 - start1) / 1e6))


var f2 = File.new()
f2.open("editor_translations.gen.h_dos", File.READ)

var start2 = Time.get_ticks_usec()
f2.get_as_text()
var end2 = Time.get_ticks_usec()

print("DOS file:  %f s" % ((end2 - start2) / 1e6))

Original File::get_as_text() from before Address remaining scoped storage regressions #63481

Unix file: 3.107707 s
DOS file:  3.214315 s

Current master after Address remaining scoped storage regressions #63481

Unix file: 0.136303 s
DOS file:  0.162586 s

This PR with replace("\r", "")

Unix file: 0.180824 s
DOS file:  1.624483 s

So it does make a significant difference on this extreme case. It's still much better than before #63481 especially when no replace is actually needed.

I'll give a try to implementing this in parse_utf8 to compare.

Edit: Gave it a (hacky) try and that's indeed way faster:

diff --git a/core/string/ustring.cpp b/core/string/ustring.cpp
index beefe54faf..252460af14 100644
--- a/core/string/ustring.cpp
+++ b/core/string/ustring.cpp
@@ -1753,6 +1753,11 @@ Error String::parse_utf8(const char *p_utf8, int p_len) {
 		uint8_t c = *p_utf8 >= 0 ? *p_utf8 : uint8_t(256 + *p_utf8);
 
 		if (skip == 0) {
+			if (c == '\r') {
+				cstr_size--;
+				p_utf8++;
+				continue;
+			}
 			/* Determine the number of characters in sequence */
 			if ((c & 0x80) == 0) {
 				*(dst++) = c;

Unix file: 0.138117 s
DOS file:  0.164486 s

But I don't know if we want to do this in something as low level as parse_utf8(). CC @bruvzg @reduz

m4gr3d · 2022-07-31T15:00:45Z

But I don't know if we want to do this in something as low level as parse_utf8()

This should be fine if we put the logic behind an optional parameter which defaults to the current behavior.

bruvzg · 2022-07-31T15:02:24Z

But I don't know if we want to do this in something as low level as parse_utf8()

If it's an optional parameter, why not, it should be fine. But it's probably should be done in two places in parse_utf8, it's doing first pass to calculate the size of target string, and second pass to actually decode it.

akien-mga · 2022-07-31T16:52:25Z

Superseded by #63733.

This was removed in godotengine#63481, and we confirmed that it's better like this, but we add back the possibility to strip CR as an option, to optionally restore the previous behavior. For performance this is done directly in `String::parse_utf8`. Also fixes Android `FileAccess::get_line()` as this one _should_ strip CR. Supersedes godotengine#63717.

This was removed in godotengine#63481, and we confirmed that it's better like this, but we add back the possibility to strip CR as an option, to optionally restore the previous behavior. For performance this is done directly in `String::parse_utf8`. Also fixes Android `FileAccess::get_line()` as this one _should_ strip CR. Supersedes godotengine#63717. (cherry picked from commit 1418f97)

File: Remove \r in get_as_text() to keep standardized Unix format

d83f41f

This fixes a compatibility breakage from godotengine#63481. See discussion in godotengine#63434.

akien-mga added bug topic:core regression cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Jul 31, 2022

akien-mga added this to the 4.0 milestone Jul 31, 2022

akien-mga requested a review from a team as a code owner July 31, 2022 11:12

akien-mga requested a review from m4gr3d July 31, 2022 11:13

m4gr3d requested changes Jul 31, 2022

View reviewed changes

akien-mga mentioned this pull request Jul 31, 2022

File: Re-add support to skip CR (\r) in File::get_as_text #63733

Merged

akien-mga closed this Jul 31, 2022

akien-mga added archived and removed cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Jul 31, 2022

akien-mga deleted the file-get_as_text-remove-CR branch July 31, 2022 16:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File: Remove `\r` in `get_as_text()` to keep standardized Unix format #63717

File: Remove `\r` in `get_as_text()` to keep standardized Unix format #63717

akien-mga commented Jul 31, 2022

m4gr3d left a comment

akien-mga commented Jul 31, 2022

akien-mga commented Jul 31, 2022 •

edited

Loading

m4gr3d commented Jul 31, 2022

bruvzg commented Jul 31, 2022 •

edited

Loading

akien-mga commented Jul 31, 2022

File: Remove \r in get_as_text() to keep standardized Unix format #63717

File: Remove \r in get_as_text() to keep standardized Unix format #63717

Conversation

akien-mga commented Jul 31, 2022

m4gr3d left a comment

Choose a reason for hiding this comment

akien-mga commented Jul 31, 2022

akien-mga commented Jul 31, 2022 • edited Loading

m4gr3d commented Jul 31, 2022

bruvzg commented Jul 31, 2022 • edited Loading

akien-mga commented Jul 31, 2022

File: Remove `\r` in `get_as_text()` to keep standardized Unix format #63717

File: Remove `\r` in `get_as_text()` to keep standardized Unix format #63717

akien-mga commented Jul 31, 2022 •

edited

Loading

bruvzg commented Jul 31, 2022 •

edited

Loading