ACE/TAO Wide Strings on Linux #2145

Open
wkbrd opened this issue Oct 17, 2023 Discussed in #2144 · 2 comments

@wkbrd
Contributor

wkbrd commented Oct 17, 2023

Discussed in #2144

Originally posted by wkbrd October 16, 2023
We are attempting to use wide strings (CORBA::WChar*) on Linux, but the encoding does not behave as expected for characters that do not fit into a single UTF-16 code unit.

On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each wchar_t element of the string holds the next 16-bit code unit of the string's UTF-16 representation, rather than the next character in the native UTF-32 layout. Is this intended?

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
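
To make the expected result concrete, here is a minimal sketch of how those surrogate pairs would combine into the UTF-32 code points we expected to see. The helper name utf16_to_utf32 is hypothetical and is not part of ACE/TAO; it only illustrates the standard UTF-16 decoding arithmetic.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ACE/TAO): combine UTF-16 surrogate pairs
// into UTF-32 code points. Unpaired surrogates are passed through unchanged.
std::u32string utf16_to_utf32(const std::u16string& in)
{
  std::u32string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t unit = in[i];
    const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
    if (is_high && i + 1 < in.size()) {
      const char32_t next = in[i + 1];
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // 10 bits come from the high surrogate, 10 bits from the low one.
        unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        ++i; // the low surrogate has been consumed
      }
    }
    out.push_back(unit);
  }
  return out;
}
```

With this arithmetic, the pair 0xd83c 0xdca1 combines into U+1F0A1, the first character of the example string.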

A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. Using -ORBNativeWcharCodeSet UCS-4 produces a string represented as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements holds the low 16 bits of the UTF-32 character followed by its high 16 bits.

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"

(gdb) print /x wstrStreamName[0]
$2 = 0xf0a1
(gdb) print /x wstrStreamName[1]
$3 = 0x1
(gdb) print /x wstrStreamName[2]
$4 = 0xf0ae
(gdb) print /x wstrStreamName[3]
$5 = 0x1
(gdb) print /x wstrStreamName[4]
$6 = 0xf0ad
(gdb) print /x wstrStreamName[5]
$7 = 0x1
(gdb) print /x wstrStreamName[6]
$8 = 0xf0ab
(gdb) print /x wstrStreamName[7]
$9 = 0x1
(gdb) print /x wstrStreamName[8]
$10 = 0xf0aa
(gdb) print /x wstrStreamName[9]
$11 = 0x1
(gdb) print /x wstrStreamName[10]
$12 = 0x0
(gdb) print /x wstrStreamName[11]
$13 = 0x0
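
For illustration only, the layout in the gdb output above can be checked by rejoining each low/high pair. This is a sketch under the assumption that the pattern holds for every element pair; rejoin_halves is a hypothetical diagnostic helper, not an ACE/TAO function and not a proposed workaround.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical diagnostic helper (not part of ACE/TAO): rebuild code points
// from the observed layout, where element 2k holds the low 16 bits and
// element 2k+1 holds the high 16 bits of one UTF-32 character.
std::vector<std::uint32_t> rejoin_halves(const std::vector<std::uint32_t>& halves)
{
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i + 1 < halves.size(); i += 2) {
    const std::uint32_t low = halves[i] & 0xFFFFu;
    const std::uint32_t high = halves[i + 1] & 0xFFFFu;
    if (low == 0 && high == 0) {
      break; // a zero pair is treated as the terminator
    }
    out.push_back((high << 16) | low);
  }
  return out;
}
```

Applied to the first two elements shown above (0xf0a1, 0x1), this yields 0x1f0a1, which matches the code point of the first character in the test string.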

Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode

Following the discussion, we tried to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses those issues.

@saper
Contributor

saper commented Nov 25, 2023

Did you build ACE with uses_wchar?

On a very recent FreeBSD with LLVM 16 I get build failures in ACEXML with zzip, in ACEXML/common/ZipCharStream.cpp: ACEXML_Char is wchar_t, while the zip/zzip libraries return ordinary char values that are not compatible with ACEXML_Char.

(maybe this is totally unrelated to what you are doing)

@mitza-oci
Member

> (maybe this is totally unrelated to what you are doing)

It does seem to be unrelated, please open a new issue/discussion.
