Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File: Remove \r in get_as_text() to keep standardized Unix format #63717

Closed

Conversation

akien-mga
Copy link
Member

This fixes a compatibility breakage from #63481.
See discussion in #63434.

@akien-mga akien-mga added bug topic:core regression cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Jul 31, 2022
@akien-mga akien-mga added this to the 4.0 milestone Jul 31, 2022
@akien-mga akien-mga requested a review from a team as a code owner July 31, 2022 11:12
@akien-mga akien-mga requested a review from m4gr3d July 31, 2022 11:13
Copy link
Contributor

@m4gr3d m4gr3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead augment the String::parse_utf8(...) with an additional (optional) parameter that allows us to remove that character from the generated string.

This will avoid performing another traversal and copy of the same string, which may not be an issue for smaller strings, but may affect performance for larger strings.

Also as discussed in the conversation, we want this change to apply only for 3.x in order to avoid breaking compatibility.
As I mentioned in the conversation, for master and onward, we should return the raw file as is.

@akien-mga
Copy link
Member Author

Also as discussed in the conversation, we want this change to apply only for 3.x in order to avoid breaking compatibility.
As I mentioned in the conversation, for master and onward, we should return the raw file as is.

Well I prefer fixing the compatibility breakage in master too because it was not an intentional design decision.

The File API for bindings was likely intentionally designed to provide users with Unix-formatted files, as evidenced by the explicit skip of \r in get_line(). This design decision can be challenged but that should be in its own proposal/PR, it shouldn't be done as a side effect of solving an Android-specific I/O performance regression.

For the same reason, I wouldn't change an API as core as String::parse_utf8() as part of this regression fix. First we fix the compat breakage (which is local to File::get_as_text(), and then we can discuss more changing to File and String APIs in follow-up proposals/PRs.

I don't think the cost of replace("\n", "") is too bad and in any case this should stay faster than the previous code, and functionally equivalent. So IMO that's good enough as a regression fix.

@akien-mga
Copy link
Member Author

akien-mga commented Jul 31, 2022

I don't think the cost of replace("\n", "") is too bad and in any case this should stay faster than the previous code, and functionally equivalent. So IMO that's good enough as a regression fix.

I did a test on a very heavy file with a debug build (so not the most telling as not optimized), and it does seem to have a significant impact on such a pathological case:

$ file editor_translations.gen.h*
editor_translations.gen.h:     C source, ASCII text
editor_translations.gen.h_dos: C source, ASCII text, with CRLF line terminators

$ ls -lh editor_translations.gen.h*
-rw-r--r-- 1 akien akien 16M Jul 31 16:36 editor_translations.gen.h
-rw-r--r-- 1 akien akien 19M Jul 31 16:37 editor_translations.gen.h_dos

$ cat editor_translations.gen.h | wc -l
2952162
$ cat editor_translations.gen.h_dos | wc -l
2952162

Measuring conversion time with this script:

var f1 = File.new()
f1.open("editor_translations.gen.h", File.READ)

var start1 = Time.get_ticks_usec()
f1.get_as_text()
var end1 = Time.get_ticks_usec()

print("Unix file: %f s" % ((end1 - start1) / 1e6))


var f2 = File.new()
f2.open("editor_translations.gen.h_dos", File.READ)

var start2 = Time.get_ticks_usec()
f2.get_as_text()
var end2 = Time.get_ticks_usec()

print("DOS file:  %f s" % ((end2 - start2) / 1e6))
Unix file: 3.107707 s
DOS file:  3.214315 s
Unix file: 0.136303 s
DOS file:  0.162586 s
  • This PR with replace("\r", "")
Unix file: 0.180824 s
DOS file:  1.624483 s

So it does make a significant difference on this extreme case. It's still much better than before #63481 especially when no replace is actually needed.

I'll give a try to implementing this in parse_utf8 to compare.


Edit: Gave it a (hacky) try and that's indeed way faster:

diff --git a/core/string/ustring.cpp b/core/string/ustring.cpp
index beefe54faf..252460af14 100644
--- a/core/string/ustring.cpp
+++ b/core/string/ustring.cpp
@@ -1753,6 +1753,11 @@ Error String::parse_utf8(const char *p_utf8, int p_len) {
 		uint8_t c = *p_utf8 >= 0 ? *p_utf8 : uint8_t(256 + *p_utf8);
 
 		if (skip == 0) {
+			if (c == '\r') {
+				cstr_size--;
+				p_utf8++;
+				continue;
+			}
 			/* Determine the number of characters in sequence */
 			if ((c & 0x80) == 0) {
 				*(dst++) = c;
Unix file: 0.138117 s
DOS file:  0.164486 s

But I don't know if we want to do this in something as low level as parse_utf8(). CC @bruvzg @reduz

@m4gr3d
Copy link
Contributor

m4gr3d commented Jul 31, 2022

But I don't know if we want to do this in something as low level as parse_utf8()

This should be fine if we put the logic behind an optional parameter which defaults to the current behavior.

@bruvzg
Copy link
Member

bruvzg commented Jul 31, 2022

But I don't know if we want to do this in something as low level as parse_utf8()

If it's an optional parameter, why not, it should be fine. But it's probably should be done in two places in parse_utf8, it's doing first pass to calculate the size of target string, and second pass to actually decode it.

@akien-mga
Copy link
Member Author

Superseded by #63733.

@akien-mga akien-mga closed this Jul 31, 2022
@akien-mga akien-mga added archived and removed cherrypick:3.x Considered for cherry-picking into a future 3.x release labels Jul 31, 2022
@akien-mga akien-mga deleted the file-get_as_text-remove-CR branch July 31, 2022 16:52
akien-mga added a commit to akien-mga/godot that referenced this pull request Jul 31, 2022
This was removed in godotengine#63481, and we confirmed that it's better like this,
but we add back the possibility to strip CR as an option, to optionally
restore the previous behavior.

For performance this is done directly in `String::parse_utf8`.

Also fixes Android `FileAccess::get_line()` as this one _should_ strip CR.

Supersedes godotengine#63717.
akien-mga added a commit to akien-mga/godot that referenced this pull request Jul 31, 2022
This was removed in godotengine#63481, and we confirmed that it's better like this,
but we add back the possibility to strip CR as an option, to optionally
restore the previous behavior.

For performance this is done directly in `String::parse_utf8`.

Also fixes Android `FileAccess::get_line()` as this one _should_ strip CR.

Supersedes godotengine#63717.

(cherry picked from commit 1418f97)
Riordan-DC pushed a commit to Riordan-DC/godot that referenced this pull request Jan 24, 2023
This was removed in godotengine#63481, and we confirmed that it's better like this,
but we add back the possibility to strip CR as an option, to optionally
restore the previous behavior.

For performance this is done directly in `String::parse_utf8`.

Also fixes Android `FileAccess::get_line()` as this one _should_ strip CR.

Supersedes godotengine#63717.

(cherry picked from commit 1418f97)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants