héllòmüñ

This project explores the ins and outs of Unicode characters in filenames across development workflows.

To start the experiment, clone the project into a directory whose name contains a Unicode character:

git clone https://github.com/meshula/m-n.git mün

With MSVC, compile and run the program:

cl héllòmüñ.c
héllòmüñ.exe

Various results and diagnostics ensue.

From the output below, we can conclude that only this declaration

unsigned char* umlaut_a = "ü"; is utf8

can be converted by Windows to UTF-16 in a form that the Win SDK understands.

In this experiment, C string literals declared with any prefix (L, u, or U) are not compatible with the Win SDK. Only umlaut_a can be converted by MultiByteToWideChar such that Win SDK functions succeed.

The same holds for wide-character functions such as wcslen: they expect UTF-16 encoded strings as input.
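To make the UTF-8 to UTF-16 step concrete, here is a minimal portable sketch of the conversion. On Windows that job belongs to MultiByteToWideChar with CP_UTF8; the function name utf8_to_utf16 below is hypothetical and not part of this project, and it skips overlong-form and surrogate validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the project): decode UTF-8 into UTF-16
 * code units. Returns the number of code units written, excluding the
 * terminating 0, or (size_t)-1 on malformed input or if `cap` is too
 * small. A sketch only: overlong forms and encoded surrogates are not
 * rejected. */
size_t utf8_to_utf16(const unsigned char *in, uint16_t *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    if (cap == 0) return (size_t)-1;
    while (*in) {
        uint32_t cp;
        int extra;
        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return (size_t)-1;        /* stray continuation or bad lead byte */
        in++;
        while (extra-- > 0) {
            if ((*in & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*in++ & 0x3F);
        }
        if (cp > 0x10FFFF) return (size_t)-1;
        if (cp >= 0x10000) {           /* needs a surrogate pair */
            if (n + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;
        }
    }
    out[n] = 0;
    return n;
}
```

Given the 12 UTF-8 bytes of héllòmüñ, this returns 8 code units, the length the wcslen check in the output is looking for.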

+--------------------------------------------------+
| Héllø Mün!                                    |
+--------------------------------------------------+

+--------------------------------------------------+
| Size of wchar_t                                  |
+--------------------------------------------------+
  size of wchar_t is 2

+--------------------------------------------------+
| C umlaut U encodings                             |
+--------------------------------------------------+
wchar_t* umlaut_u = L"ü"; is utf8
c3 00 bc 00 00 00  wcslen is 2

unsigned char* umlaut_a = "ü"; is utf8
c3 bc 00 00  wcslen is 8

wchar_t* umlaut_w = u"ü"; is utf8
c3 00 bc 00 00 00  wcslen is 8

unsigned int* umlaut_W = U"ü"; is utf8
wchar_t* umlaut_uu = L"\uc3bc"; is utf8
c3 00 00 00 bc 00 00 00 00 00 00 00  wcslen is 8


+--------------------------------------------------+
| check wcslen(héllòmüñ), should be 8          |
+--------------------------------------------------+
utf16 a required size is 9
hello_u: 12
hello_a: 6
hello_w: 12
hello_W: 1
hello_uu: 8
hello_utf16: 8

+--------------------------------------------------+
| Check uft16 conversions                          |
+--------------------------------------------------+
utf16 a required size is 11
 L"héllòmüñ.c": 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
 as U16:        68
 "héllòmüñ.c":  68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
 as U16:        68 E9 6C 6C F2 6D FC F1 2E 63

+--------------------------------------------------+
| GetFileAttributesW/A                             |
+--------------------------------------------------+
GetFileAttributesW(u8"héllòmüñ.c") failed: 2
GetFileAttributesA(u8"héllòmüñ.c") failed: 2
GetFileAttributesW(u16"héllòmüñ.c") succeeded

+--------------------------------------------------+
| FindFirstFileW/A                                 |
+--------------------------------------------------+
FindFirstFileA("h*.c") succeeded h�ll�m��.c
FindFirstFileW(L"h*.c") succeeded h�ll�m��.c
  GetFileAttributesW on found path succeeded
FindFirstFileW(utf16 converted) succeeded h�ll�m��.c
  GetFileAttributesW on found path succeeded

+--------------------------------------------------+
| Finished                                         |
+--------------------------------------------------+
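The byte dumps in the output above are the main diagnostic tool here, so a small dump helper is worth sketching. The name hex_dump is hypothetical and this is not necessarily how héllòmüñ.c prints its bytes. For reference, ü (U+00FC) is c3 bc in UTF-8 and fc 00 in little-endian UTF-16. A dump of c3 00 bc 00 for a wide literal therefore suggests the compiler treated each UTF-8 source byte as a separate character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `len` bytes of `data` into `out` as lowercase
 * hex pairs separated by single spaces. Returns the number of characters
 * written (excluding the NUL), or -1 if `out_cap` is too small. */
int hex_dump(const void *data, size_t len, char *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)data;
    size_t pos = 0;
    assert(data != NULL && out != NULL);
    if (out_cap == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* two hex digits per byte, plus a separator after the first byte */
        size_t need = i ? 3 : 2;
        if (pos + need + 1 > out_cap) return -1;
        pos += (size_t)snprintf(out + pos, out_cap - pos,
                                i ? " %02x" : "%02x", (unsigned)p[i]);
    }
    return (int)pos;
}
```

Dumping the two bytes of a UTF-8 "ü" yields the string c3 bc, matching the umlaut_a line in the output.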

Code for Mac and Linux will be added once the Windows side is fully explored.

clang/gcc/msvc

🦋 all compilers can compile the C file, and the embedded UTF-8 string prints correctly
🦋 msvc debugger displays utf8 correctly
🐛 msvc debugger displays utf16 incorrectly
🐛 msvc compile/link messages, compiler error windows: unicode not displayed consistently or properly

cmake

🦋 Unicode cmake project name is fine
🐛 CMake can't use unicode for target names

CMake Error at CMakeLists.txt:3 (add_executable):
  The target name "héllòmün" is reserved or not valid for certain CMake
  features, such as generator expressions, and may result in undefined
  behavior.

🦋 unicode filename for c file is fine in cmake

boost

🐛 boost build fails when the installation target is a path with a unicode character; every copy fails with a message showing a corrupt path. In the following example, I tried to build to a directory named üsd-install.

 copy /b "C:\ï¿½sd-install\src\boost_1_78_0\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp" + this-file-does-not-exist-A698EE7806899E69 "c:\ï¿½sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp"

...failed common.copy c:\ï¿½sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp...
common.copy c:\ï¿½sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_headers.hpp
The system cannot find the path specified.

🐛 bjam creates a corrupt directory name during the installation.

ⁿsd-install

git

🐛 Can't add unicode filename to git on mac

git add "h\303\251ll\303\262m\303\274\303\261.c"
fatal: pathspec 'h\303\251ll\303\262m\303\274\303\261.c' did not match any files

solution:

git config --global core.precomposeunicode true
git add héllòmüñ.c
git status

Changes to be committed:
...
	new file:   "h\303\251ll\303\262m\303\274\303\261.c"
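The quoted form that git prints is a display format: each non-ASCII byte of the UTF-8 filename appears as a three-digit octal escape (git's core.quotePath setting controls this). As a rough sketch, the decoder below turns that form back into raw bytes so they can be compared with the filename. The name git_unquote is hypothetical, and only \ooo, \\ and \" are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode the \ooo octal escapes git prints for
 * non-ASCII bytes in quoted paths. Only three-digit octal escapes, \\ and
 * \" are decoded; anything else is copied through unchanged.
 * Returns the decoded length, or (size_t)-1 if `cap` is too small. */
size_t git_unquote(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    if (cap == 0) return (size_t)-1;
    while (*in) {
        unsigned char byte;
        if (in[0] == '\\' && in[1] >= '0' && in[1] <= '3' &&
            in[2] >= '0' && in[2] <= '7' &&
            in[3] >= '0' && in[3] <= '7') {
            /* \303 -> 3*64 + 0*8 + 3 = 0xC3 */
            byte = (unsigned char)(((in[1] - '0') << 6) |
                                   ((in[2] - '0') << 3) |
                                    (in[3] - '0'));
            in += 4;
        } else if (in[0] == '\\' && (in[1] == '\\' || in[1] == '"')) {
            byte = (unsigned char)in[1];
            in += 2;
        } else {
            byte = (unsigned char)*in++;
        }
        if (n + 1 >= cap) return (size_t)-1;
        out[n++] = (char)byte;
    }
    out[n] = '\0';
    return n;
}
```

Decoding h\303\251ll\303\262m\303\274\303\261.c gives 14 bytes, 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63, the same sequence the program prints for "héllòmüñ.c".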

🐛 Git branches whose names contain characters that are illegal in Windows filenames cause problems, because branch names are used as filenames in the .git directory: git-for-windows/git#2904. Such a branch can be created and pushed on macOS and then cause errors on Windows. (Thanks @Simran)
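One way to guard against this is to check a branch name before pushing it. The sketch below is an assumption about what such a check could look like, not something git or this project provides. It flags the characters Windows rejects in file names (< > : " \ | ? * and control characters) and does not cover reserved device names such as CON or trailing dots and spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical check: true if `name` contains a character that Windows
 * rejects in file names. '/' is allowed because git stores a branch such
 * as "feature/x" as nested directories under .git/refs/heads.
 * Reserved device names and trailing dots/spaces are not checked. */
bool has_windows_illegal_char(const char *name)
{
    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        if (*p < 0x20 || strchr("<>:\"\\|?*", *p) != NULL)
            return true;
    }
    return false;
}
```

Non-ASCII bytes pass this check, so a branch named feature/mün is accepted while one containing a double quote or a pipe is flagged.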

github

🐛 github create mün - repo created as m-n
🦋 héllòmüñ.c appears on github as héllòmüñ.c
🦋 modify héllòmüñ.c locally and push
🦋 clone on mac
🦋 clone on mac without core.precomposeunicode true flag
🦋 clone on windows
🦋 touch file on windows, commit, push
🦋 touch file on linux, commit, push

perforce

🦋 add héllòmüñ.c to a changelist

notes

💻 Some Utf8 -> 16 code that's easy to read: https://gist.github.com/tommai78101/3631ed1f136b78238e85582f08bdc618
📚 The Absolute Minimum https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
📚 UTF8 everywhere https://utf8everywhere.org
📚 OpenAssetIO string encoding https://github.com/OpenAssetIO/OpenAssetIO/blob/main/doc/decisions/DR005-String-encoding.md
📚 Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems, David A. Wheeler, 15 Nov 2020 https://dwheeler.com/essays/fixing-unix-linux-filenames.html
💻 a yikes hack to bridge utf16 and utf8, mostly intended to help with Microsoft wchar APIs: https://simonsapin.github.io/wtf-8/
