Skip to content
/ m-n Public

testing unicode interaction with cmake/git/etc

Notifications You must be signed in to change notification settings

meshula/m-n

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 

Repository files navigation

héllòmüñ

This project explores the ins and outs of unicode characters in filenames in development workflows

Clone the project into a directory with a unicode character to start the experiment:

git clone https://github.com/meshula/m-n.git mün

On MSVC,

cl héllòmüñ.c
héllòmüñ.exe

various results and diagnostics ensue.

In the following output, we can conclude that only this declaration

unsigned char* umlaut_a = "ü"; is utf8

can be converted by windows to utf16 in a form that the win sdk can understand.

C string literals declared with any adornments are not compatible with the win sdk. Only umlaut_a is convertible by MultiByteToWideChar such that win sdk functions succeed.

Same for the wide char functions such as wcslen, they expect as input, u16 encoded strings.

+--------------------------------------------------+
| Héllø Mün!                                    |
+--------------------------------------------------+

+--------------------------------------------------+
| Size of wchar_t                                  |
+--------------------------------------------------+
  size of wchar_t is 2

+--------------------------------------------------+
| C umlaut U encodings                             |
+--------------------------------------------------+
wchar_t* umlaut_u = L"ü"; is utf8
c3 00 bc 00 00 00  wcslen is 2

unsigned char* umlaut_a = "ü"; is utf8
c3 bc 00 00  wcslen is 8

wchar_t* umlaut_w = u"ü"; is utf8
c3 00 bc 00 00 00  wcslen is 8

unsigned int* umlaut_W = U"ü"; is utf8
wchar_t* umlaut_uu = L"\uc3bc"; is utf8
c3 00 00 00 bc 00 00 00 00 00 00 00  wcslen is 8


+--------------------------------------------------+
| check wcslen(héllòmüñ), should be 8          |
+--------------------------------------------------+
utf16 a required size is 9
hello_u: 12
hello_a: 6
hello_w: 12
hello_W: 1
hello_uu: 8
hello_utf16: 8

+--------------------------------------------------+
| Check uft16 conversions                          |
+--------------------------------------------------+
utf16 a required size is 11
 L"héllòmüñ.c": 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
 as U16:        68
 "héllòmüñ.c":  68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
 as U16:        68 E9 6C 6C F2 6D FC F1 2E 63

+--------------------------------------------------+
| GetFileAttributesW/A                             |
+--------------------------------------------------+
GetFileAttributesW(u8"héllòmüñ.c") failed: 2
GetFileAttributesA(u8"héllòmüñ.c") failed: 2
GetFileAttributesW(u16"héllòmüñ.c") succeeded

+--------------------------------------------------+
| FindFirstFileW/A                                 |
+--------------------------------------------------+
FindFirstFileA("h*.c") succeeded h�ll�m��.c
FindFirstFileW(L"h*.c") succeeded h�ll�m��.c
  GetFileAttributesW on found path succeeded
FindFirstFileW(utf16 converted) succeeded h�ll�m��.c
  GetFileAttributesW on found path succeeded

+--------------------------------------------------+
| Finished                                         |
+--------------------------------------------------+

Code for mac and linux to be added once the Windows side is fully explored.

clang/gcc/msvc

  • 🦋 all compilers can compile the c file, the embedded utf8 string prints correctly
  • 🦋 msvc debugger displays utf8 correctly
  • 🐛 msvc debugger displays utf16 incorrectly
  • 🐛 msvc compile/link messages, compiler error windows: unicode not displayed consistently or properly

cmake

  • 🦋 Unicode cmake project name is fine
  • 🐛 CMake can't use unicode for target names
CMake Error at CMakeLists.txt:3 (add_executable):
  The target name "héllòmün" is reserved or not valid for certain CMake
  features, such as generator expressions, and may result in undefined
  behavior.
  • 🦋 unicode filename for c file is fine in cmake

boost

  • 🐛 boost build fails when the installation target is a path with a unicode character; every copy fails with a message showing a corrupt path. In the follow example, I tried to build to a directory named üsd-install.
 copy /b "C:\�sd-install\src\boost_1_78_0\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp" + this-file-does-not-exist-A698EE7806899E69 "c:\�sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp"

...failed common.copy c:¿½sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp...
common.copy c:¿½sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_headers.hpp
The system cannot find the path specified.
  • 🐛 bjam creates a corrupt directory name during the installation.
ⁿsd-install

git

  • 🐛 Can't add unicode filename to git on mac
git add "h\303\251ll\303\262m\303\274\303\261.c"
fatal: pathspec 'h\303\251ll\303\262m\303\274\303\261.c' did not match any files

solution:

git config --global core.precomposeunicode true
git add héllòmüñ.c
git status

Changes to be committed:
...
	new file:   "h\303\251ll\303\262m\303\274\303\261.c"
  • 🐛 Git branches with characters in their names that are illegal in filenames on Windows lead to problems because branch names are used as filenames in the .git directory: git-for-windows/git#2904 It's possible to create and push such a branch on macOs and cause errors on Windows. (Thanks @Simran)

github

  • 🐛 github create mün - repo created as m-n
  • 🦋 héllòmüñ.c appears on github as héllòmüñ.c
  • 🦋 modify héllòmüñ.c locally and push
  • 🦋 clone on mac
  • 🦋 clone on mac without core.precomposeunicode true flag
  • 🦋 clone on windows
  • 🦋 touch file on windows, commit, push
  • 🦋 touch file on linux, commit, push

perforce

  • 🦋 add héllòmüñ.c to a changelist

notes

About

testing unicode interaction with cmake/git/etc

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published