ACE/TAO Wide Strings on Linux #2145

wkbrd · 2023-10-17T21:20:22Z

Discussed in #2144

Originally posted by wkbrd October 16, 2023
We are attempting to use wide strings (CORBA::WChar*) on Linux, but the encoding does not behave as expected for characters whose representation does not fit into a single UTF-16 code unit.

On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the resulting string holds the next 16-bit code unit of the UTF-16 representation, rather than the next character in the native UTF-32 layout. Is this intended?

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
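The values in that example match the standard UTF-16 surrogate-pair encoding of the code points U+1F0A1, U+1F0AE, and so on. The following is a minimal sketch of that arithmetic, not TAO code; the names `Utf16Units` and `encode_utf16` are hypothetical and exist only for illustration.

```cpp
#include <cstdint>

// Result of encoding one code point as UTF-16: one or two 16-bit code units.
// (Hypothetical helper type, not part of ACE/TAO.)
struct Utf16Units {
  std::uint16_t unit[2];
  int count;
};

// Encode a single Unicode code point (its UTF-32 value) as UTF-16.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
constexpr Utf16Units encode_utf16(std::uint32_t cp) {
  if (cp < 0x10000) {
    // Fits in one code unit.
    return Utf16Units{{static_cast<std::uint16_t>(cp), 0}, 1};
  }
  const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
  return Utf16Units{
      {static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
       static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
      2};
}
```

For U+1F0A1 this gives the two units 0xD83C and 0xDCA1, which is the first pair in the marshaled string shown above.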

A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. With -ORBNativeWcharCodeSet UCS-4, the string arrives as an array of wchar_t twice the length of the UTF-32 representation, in which each pair of elements holds the low 16 bits of the UTF-32 character followed by its high 16 bits.

For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"

(gdb) print /x wstrStreamName[0]
$2 = 0xf0a1
(gdb) print /x wstrStreamName[1]
$3 = 0x1
(gdb) print /x wstrStreamName[2]
$4 = 0xf0ae
(gdb) print /x wstrStreamName[3]
$5 = 0x1
(gdb) print /x wstrStreamName[4]
$6 = 0xf0ad
(gdb) print /x wstrStreamName[5]
$7 = 0x1
(gdb) print /x wstrStreamName[6]
$8 = 0xf0ab
(gdb) print /x wstrStreamName[7]
$9 = 0x1
(gdb) print /x wstrStreamName[8]
$10 = 0xf0aa
(gdb) print /x wstrStreamName[9]
$11 = 0x1
(gdb) print /x wstrStreamName[10]
$12 = 0x0
(gdb) print /x wstrStreamName[11]
$13 = 0x0
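The debugger values above are consistent with each 32-bit code point being split into its low and high 16-bit halves and each half being stored in its own wchar_t. The sketch below reproduces that split and shows how the halves would recombine; it only illustrates the observed layout and is not taken from the TAO sources. The names `SplitHalves`, `split_halves`, and `join_halves` are hypothetical.

```cpp
#include <cstdint>

// Low and high 16-bit halves of one UTF-32 code point, in the order
// seen in the debugger output (low half first). Hypothetical helper type.
struct SplitHalves {
  std::uint32_t low;
  std::uint32_t high;
};

// Split a code point the way the observed array appears to be laid out.
constexpr SplitHalves split_halves(std::uint32_t cp) {
  return SplitHalves{cp & 0xFFFFu, cp >> 16};
}

// Recombine two adjacent elements into the original code point.
constexpr std::uint32_t join_halves(std::uint32_t low, std::uint32_t high) {
  return (high << 16) | (low & 0xFFFFu);
}
```

Applied to U+1F0A1, `split_halves` yields 0xF0A1 and 0x1, matching wstrStreamName[0] and wstrStreamName[1]; `join_halves(0xF0A1, 0x1)` gives back 0x1F0A1.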

Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode

Based on the discussion, we tried to use WUCS4_UTF16.cpp but encountered issues. The associated PR addresses them.

saper · 2023-11-25T02:20:04Z

Did you build ACE with uses_wchar?

On a very recent FreeBSD with LLVM 16 I get build failures in ACEXML with zzip, in ACEXML/common/ZipCharStream.cpp: ACEXML_Char is wchar_t, and the zip/zzip libraries return ordinary char values that are not compatible with ACEXML_Char.

(maybe this is totally unrelated to what you are doing)

mitza-oci · 2023-11-28T03:21:38Z

> (maybe this is totally unrelated to what you are doing)

It does seem to be unrelated; please open a new issue/discussion.

wkbrd mentioned this issue Oct 17, 2023

Updated WCS4_UTF16 to allow four octet wchar_t characters to marshal … #2146

Open
