Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-110289: C API: Add PyUnicode_EqualToUTF8() function #110297
gh-110289: C API: Add PyUnicode_EqualToUTF8() function #110297
Changes from 6 commits
d39945e
4793161
8b24911
c55f9ac
bdf2f1e
7223c14
b271327
6f26ad6
dd124b8
76b9177
ee5781d
1a4eb7b
b124377
029f1a0
be2ffe8
29b26f7
78de49d
fc79d5e
19ad126
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it does not work so.
NULL is defined as None, and
equaltoutf8('abc', None)
is a TypeError.If
equaltoutf8()
is called with only one argument, it sets the second argument forPyUnicode_EqualToUTF8()
to NULL, so we can test it and ensure that it indeed crashes. It is a common approach used in other tests in this file forconst char *
argument. Some functions do not crash, but raise exception or return successfully for NULL, but this function simply crashes in debug build.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok, I thought that they were just pseudo-code as comments. Sure, you can leave
# NULL
if you prefer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I copied this pattern from the test for
PyUnicode_CompareWithASCIIString()
which was one of the first written tests. In newer tests I use "z#" which allows to pass None for NULL. Or perhaps I changed this everywhere except the test forPyUnicode_CompareWithASCIIString()
. So perhaps I can change this too.Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to test the length first, to make the code more readable.
Like:
Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the same in
_PyUnicode_EqualToASCIIString()
.How
is more readable than simple
return a && b;
? It is what the&&
operator for.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, it's easier to reason about a single test per line when I review code.
Keep
a && b
if you prefer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The readability problem as I see it, is that your
&&
use has side effects; it is not a pure logic expression.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fortunately, every condition here is already on a separate line.
It is how
&&
works in C. There is a lot of code likearg != NULL and PyDict_Check(arg) && PyDict_GET_SIZE(arg) > count
. I do not think rewriting it in threeif
s withgoto
s can improve readability.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that if we return true at this point then we know that
str
is the utf8 representation ofunicode
, does it make sense to copy the contents intounicode->utf8
so that future operations can fast-path without needing to encode again?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs a separate research and discussion. The disadvantage is that it increases the consumed memory size, also it consumes some CPU time, so the benefit will be only if the UTF-8 cache is used in future.
If the idea turned out to be good, it can simply be implemented in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes total sense. I guess this also sits in an awkward place where it's likely that the user is best suited to know whether or not they want the utf-8 cache populated, but it's also an implementation detail that we don't really want to expose to users.
For now I'll just mark this comment as resolved.Edit I can't, probably lack permissions I guess.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyUnicode_EqualToUTF8() doesn't raise exception and cannot fail. Trying to allocate memory should not raise memory, but it sounds like a non-trivial side effect.
Worst case: 1 GB string, you call PyUnicode_EqualToUTF8() and suddenly, Python allocates 1 GB more. I would be surprised by this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's worth it to add a comment explaining why we don't cache the UTF-8 encoded string.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.