Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace usage of old unicode API removed in Py3.12/PEP 623 #1860

Merged
merged 4 commits into from
Sep 17, 2022

Conversation

kxrob
Copy link
Collaborator

@kxrob kxrob commented Apr 14, 2022

The "legacy" Unicode object will be removed in Python 3.12
with deprecated APIs. All Unicode objects will be "canonical"
since then. See PEP 623 for more information.

Those old APIs were still used in pywin32:

  • PyUnicode_AsUnicode
  • PyUnicode_GetSize
  • PyUnicode_AS_UNICODE
  • PyUnicode_GET_SIZE, PyUnicode_GET_DATA_SIZE
  • PyUnicode_FromUnicode
  • PyUnicode_EncodeMBCS
  • u u# Z Z# in PyArg_Parse... format strings

@kxrob
Copy link
Collaborator Author

kxrob commented Aug 14, 2022

The first Py3.12 alpha should appear in October on github for testing the deprecation replacements fully: https://peps.python.org/pep-0693/

Copy link
Owner

@mhammond mhammond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I still need to get my head around the new macros - it adds alot of complexity I'm still untangling. I do like the ability it has to provide the PyObject** for PyArg_ParseTuple.

I'm inclined to think that TmpWCHAR is the wrong vehicle for this though. Every current use of TmpWCHAR never calls the constructor that takes a WCHAR - almost every single case is used in a PyArg_ParseTuple("O")/ PyWinObject_AsWCHAR dance. So really, almost every existing use of TmpWCHAR should eventually move to this new mechanism.

ie, I think we should split this functionality - TmpWCHAR (with a new name eventually) should be just "takes ownership of memory allocated by PyMem_New()", probably moving to a void * once PyWinObject_AsWCHAR() has been killed in favor of the new mechanism. IOW, TmpWCHAR shouldn't be touched here other than a note to say it's deprecate/pending a rename/something.

A new object, say, PyWin_WCHAR or similar, which looks like you changes here, but unlike TmpWCHAR, never takes ownership of memory allocated elsewhere. It only supports construction/initialization via a PyObject *. The WCHAR * it holds is always allocated internally (and I suspect there's a future optimization here, where if we can determine the object's PyUnicode_KIND is PyUnicode_4BYTE_KIND we could still borrow the buffer?)

I think that would simplify things significantly. As mentioned though, I'm still getting my head around the new macros, even after staring at this for a while, so (a) I need to think more about them but (b) I really hope there's something that can be done to make that part of this easier to understand.

WDYT?

I'm fine if you don't have the time or inclination to take this on though, in which case I'll have a bit of a poke in a week or 2 - but I really do appreciate your work here!

PyUnicode_AS_UNICODE removed in Py3.12 (PEP 623) / e6f7299
TmpWCHAR does most of the conversions and now required
memory handling.
Replace PyUnicode_GetSize.

> The "legacy" Unicode object (buffered wstr in unicode objects)
> will be removed in Python 3.12 with deprecated APIs. All Unicode
> objects will be "canonical" since then. See PEP 623 for more
> information.

Those deprecated API were still used in pywin32:
* PyUnicode_AsUnicode
* PyUnicode_GetSize
* PyUnicode_AS_UNICODE
* PyUnicode_GET_SIZE, PyUnicode_GET_DATA_SIZE
* PyUnicode_FromUnicode
* PyUnicode_EncodeMBCS
* u u# Z Z# in PyArg_Parse... format strings
and replace PyUnicode_GET_SIZE, PyUnicode_GET_DATA_SIZE.
@kxrob
Copy link
Collaborator Author

kxrob commented Sep 5, 2022

I still need to get my head around the new macros - it adds alot of complexity I'm still untangling. I do like the ability it has to provide the PyObject** for PyArg_ParseTuple. ... hope there's something that can be done to make that part of this easier to understand.

This strange macro mechanism (U2WREC, U2WCONV, u2w ..) and auxiliary union attribute stuff (handling source/target addresses for conversion) probably just should be dropped:
It came only with the last 2 commits (replace "u" and "Z" in PyArg_Parse... ). The only purpose was to not repetitively make extra PyObject* intermediate variables and extra statements doing PyWinObject_AsWCHAR in the PyArg_ParseTuple... use cases (like the cases where TmpWCHAR was used previously ~100x) - but doing it within TmpWCHAR itself and the PyArg_ParseTuple statement and to automatically provide specific help text upon error.
But there are only few new use cases for that. So that strange mechanism does not pay off and its not worth to compress those existing 100 cases. So its probably best to just remove those 2 commits, keep it easy to read, and write out those few new PyArg_ParseTuple use cases using the existing style. Regarding the single big bulk use case in the last commit (bulk "Z") a specific local macro and aux. array(s) can be used for conversions in a loop.
So I'm simply going to drop those 2 commits so far - handling the u and Z arg parse cases in an extra PR later ...

almost every existing use of TmpWCHAR should eventually move to this new mechanism

(well, it doesn't save typing and testing anymore, may not be worth touching and organizing all this in a well readable way... ?)

A new object, say, PyWin_WCHAR or similar, which looks like you changes here, but unlike TmpWCHAR, never takes ownership of memory allocated elsewhere. It only supports construction/initialization via a PyObject *.

Besides the above dropped mechanism TmpWCHAR so far would only gain the function to do auto PyUnicode_AsWideCharString at assignment / construction time (2nd commit). That could become an extra class / name / sub class as well. But so far there is not really a separate purpose (freeing the held temp string).

if we can determine the object's PyUnicode_KIND is PyUnicode_4BYTE_KIND we could still borrow the buffer?

For potentially saving a PyUnicode_AsWideCharString in the (rare?) case of PyUnicode_2BYTE_KIND, it seems the canonical state must be guaranteed first (PyUnicode_READY(), extra cost?, otherwise the string representation could change suddenly), then checked (again). There is also a (non-canonical?) PyUnicode_WCHAR_KIND. Is Py_UCS2* / PyUnicode_2BYTE_KIND always a valid NULL terminated Windows WCHAR string? If this works, is fast and is worth it, there would be an extra PyObject* (inc/decref) to the python string in the holder - being a flag at the same time. Could still happen in the same class.

@mhammond
Copy link
Owner

Sorry for the delay and thanks for persevering!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants