Implement subset of arch_prctl syscall to support local-exec TLS of statically linked executables #1137
Comments
Your idea of supporting arch_prctl only for static executables (which can't call any OSv functions) makes sense. However, because the issue you created is specific to golang, before we do this I would ask that we consider why golang went back to static executables, and whether they really made that switch. If I remember the history correctly, golang started with static executables, but then switched to dynamic executables which use glibc only for minimal things. Why did they switch back - do you know the motivation? Did they switch back? The page https://www.arp242.net/static-go.html suggests that maybe both compilation modes are still supported - but the dynamic version is used only if "cgo" is used. Since OSv is, for all intents and purposes, a C library, maybe it would be fine to force Go programs which are built for OSv to declare that they do, in fact, use a C library - and force the go compiler to generate a dynamic executable?
Thanks for the link to that article. Indeed it is still possible to generate dynamically linked Go executables, and very often it is the default if the app uses networking (which non-trivial app does not use networking?). There are some differences between executables generated by the older (~1.12) and newer (~1.15) Go toolchains that affect their support on OSv, which I will open a separate issue to address. The bottom line though is that golang apps built as static are not supported on OSv and one of the reasons is the missing
Over the weekend I have been trying to make some progress on the static ELF support. To that end, I resurrected some of the old commits from the static-elf branch originally written by Pekka and refined some of those to add more functional
Obviously, as was raised in the reviews of Pekka's original patches (8 years ago) and somewhat described in one of my comments above, the
Now the problem is how exactly do we accomplish it? I mean setting the
Now there are at least two problems with this:
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 3678eafc..d1555575 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -81,6 +81,7 @@ void thread::switch_to()
// barriers
barrier();
set_fsbase(reinterpret_cast<u64>(_tcb));
+ processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));
barrier();
auto c = _detached_state->_cpu;
old->_state.exception_stack = c->arch.get_exception_stack();
makes the context switch (as measured by
xorl %ebx,%ebx
testl $3,CS+8(%rsp)
je error_kernelspace
SWAPGS
which amounts to (from that article) - "pick this info off the entry frame on the kernel stack, from the CS of the ptregs area of the kernel stack". From what I understand, this code checks the CPL (Current Privilege Level) based on the last 2 bits in the CS register value placed in the exception/interrupt frame. Now, OSv cannot apply a similar solution (I might be incorrect), as everything (kernel and app) runs in ring 0 and those 2 low bits in CS would never change, would they (see this)? So this is what I propose instead, which is based on the assumption that applications would not be using the GS register or we would not let them:
Any other ideas?
By the way, regarding GS, https://www.kernel.org/doc/html/latest/x86/x86_64/fsgs.html says that "The GS segment has no common use and can be used freely by applications.". I guess this means we can decide not to support the applications (if there are any) that use it - and it's not used by glibc or the ABI or anything like that.
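To make the CPL check discussed above concrete, here is a minimal sketch of what `testl $3,CS+8(%rsp)` decides, written as plain C++. The function name is hypothetical, and the two selector values in the usage note are the usual Linux x86_64 ones, given only as an illustration.

```cpp
#include <cassert>
#include <cstdint>

// The low two bits of a segment selector carry the privilege level.
// For the CS value saved in an exception/interrupt frame, they tell which
// ring was interrupted: 0 means ring 0, anything else means a less
// privileged ring.
// Hypothetical helper name, not part of OSv or Linux.
constexpr bool interrupted_less_privileged_ring(std::uint16_t saved_cs)
{
    return (saved_cs & 3) != 0;
}
```

On Linux x86_64 the kernel code selector is 0x10 and the user code selector is 0x33, so the helper returns false for the first and true for the second. Since OSv runs both kernel and application in ring 0, the saved CS would always yield false, which is why this test cannot tell the two apart there.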
…utables

Even the simplest executables need thread local storage (TLS); a good example is errno, which is a thread local variable. The OSv kernel itself uses many thread local variables, and when running dynamically linked executables it shares the TLS memory block with the application. In this case OSv fully controls the setup of TLS, stores the pointer to the TCB (Thread Control Block) as part of the thread state, and updates the FS register on every context switch. On the other hand, statically linked executables set up their own TLS and register it with the kernel by executing the arch_prctl syscall on x86_64. In order to support this in OSv, we need to implement the arch_prctl syscall and modify some key places in the kernel code - syscall handler, exception handlers and VDSO - to switch from the application TLS to the kernel one and back.

The newly implemented arch_prctl syscall, on ARCH_SET_FS, stores the application TCB address in the new field app_tcb added to the thread_control_block structure. At the same time, we modify the following places to support switching between the application and kernel TCB if necessary (app_tcb != 0):

The exception handler assembly in entry.S is modified to detect whether, on entry, the current FS register points to the kernel TCB, and if not, to switch to the kernel one; on exit from the exception, it switches back to the application TCB. To make this possible we "duplicate" the current thread's kernel TCB address and store it in the new field _current_thread_kernel_tcb of the cpu_arch structure, which is updated on every context switch and can be accessed in assembly as gs:16. The first 8-byte field, self, of the thread control block holds the address of the block itself, so we can easily compare fs:0 with gs:16 to know whether the FS register points to the kernel TCB or not. Please note this scheme is simpler and faster than one of the original versions, which relied on an extra counter and also required interrupts to be disabled.
It also works in nested scenarios - for example a page fault interrupted during a sleep.

Similarly, we also change the syscall handler and VDSO code, where we use a simple RAII utility - tls_switch - to detect whether the current thread has a non-zero application TCB and, if so, to switch to the kernel one before the code in scope and switch back to the application one after. This scheme is a little different from the exception handler because both syscall and VDSO functions are only executed on application threads, which could have the FS register pointing to the application TCB (for example when running statically linked executables), and we do not need to deal with nesting. In addition, the VDSO code is changed to C++ to allow using the C++ RAII utility described above.

In essence, this PR makes it possible to launch simple statically linked executables like "Hello World" on OSv:

  gcc -static -o hello-static apps/native-example/hello.c
  ./scripts/run.py -e /hello-static
  OSv v0.57.0-74-g2a835078
  Booted up in 142.76 ms
  Cmdline: /hello-static
  WARNING: Statically linked executables are only supported to limited extent!
  syscall(): unimplemented system call 218
  syscall(): unimplemented system call 273
  syscall(): unimplemented system call 334
  syscall(): unimplemented system call 302
  Hello from C code

Please note that the code changes touch some critical places of the kernel functionality - context switching, syscall handling, exception handling, and VDSO implementation - and thus may slightly affect their performance. As far as context switching goes, this patch adds only a single memory write operation, which does not seem to affect it in any measurable way based on what misc-ctxsw.cc indicates. On the other hand, one could see the syscall handling cost go up by 2-5 ns (3-5% of the total cost of ~100 ns, based on what misc-syscall-perf.cc measures) when executing statically linked executables, because we need to switch the fsbase from the app TCB to the kernel one and back.
The good news is that the syscall handling does not seem to be affected in any significant way when running dynamically linked executables. The VDSO function calls are affected the most, by 7-10 ns (from 23 to 30 ns), even though the VDSO code uses the exact same tls_switch RAII utility and it seems to get inlined in a similar way as in the syscall handler. Finally, I did not measure the impact of the changes to the exception handling (interrupts, page faults, etc.) but I think it should be similar to syscall handling. Interrupts are in general relatively expensive in a virtualized environment (guest/host), as this email by Avi Kivity explains - https://groups.google.com/g/osv-dev/c/w_fuxsYla-M/m/WxpRZTXQ-twJ. On top of this, the FPU saving/restoring takes ~60 ns, which is far more expensive than switching fsbase.

Fixes cloudius-systems#1137

Signed-off-by: Waldemar Kozaczuk <[email protected]>
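The fs:0 versus gs:16 comparison described in the commit message can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the entry.S code: the FS base is modelled as a plain variable, the struct and function names (tcb_sketch, cpu_arch_sketch, exception_entry, exception_exit) are hypothetical, and saving the previous FS base in a local on entry is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-ins for the OSv structures named above.
struct tcb_sketch {
    std::uint64_t self;      // first 8 bytes hold the TCB's own address (what fs:0 reads)
    std::uint64_t app_tcb;   // application TCB registered via ARCH_SET_FS, 0 if none
};

struct cpu_arch_sketch {
    std::uint64_t filler[2];                  // assumed padding so the next field lands at offset 16
    std::uint64_t current_thread_kernel_tcb;  // refreshed on every context switch (gs:16)
};
static_assert(offsetof(cpu_arch_sketch, current_thread_kernel_tcb) == 16,
              "the field must sit at offset 16 to match gs:16");

// Simulated FS base; the real code manipulates the CPU's FS base instead.
static std::uint64_t simulated_fs_base = 0;

// Equivalent of comparing fs:0 with gs:16.
static bool fs_points_to_kernel_tcb(const cpu_arch_sketch& cpu)
{
    auto self = *reinterpret_cast<const std::uint64_t*>(simulated_fs_base);
    return self == cpu.current_thread_kernel_tcb;
}

// On entry: switch to the kernel TCB only if FS points elsewhere.
// Returns the FS base that was active, so the exit path can restore it.
static std::uint64_t exception_entry(const cpu_arch_sketch& cpu)
{
    std::uint64_t saved = simulated_fs_base;
    if (!fs_points_to_kernel_tcb(cpu)) {
        simulated_fs_base = cpu.current_thread_kernel_tcb;
    }
    return saved;
}

static void exception_exit(std::uint64_t saved)
{
    simulated_fs_base = saved;
}

// Walks through a nested scenario: an exception taken while another one
// is already being handled on the kernel TCB.
static bool demo_nested_exceptions()
{
    tcb_sketch kernel_tcb{}, app_tcb{};
    kernel_tcb.self = reinterpret_cast<std::uint64_t>(&kernel_tcb);
    app_tcb.self = reinterpret_cast<std::uint64_t>(&app_tcb);
    kernel_tcb.app_tcb = app_tcb.self;

    cpu_arch_sketch cpu{};
    cpu.current_thread_kernel_tcb = kernel_tcb.self;

    simulated_fs_base = app_tcb.self;            // application code is running
    std::uint64_t outer = exception_entry(cpu);  // switches to the kernel TCB
    bool ok = simulated_fs_base == kernel_tcb.self;
    std::uint64_t inner = exception_entry(cpu);  // nested: nothing to switch
    ok = ok && inner == kernel_tcb.self;
    exception_exit(inner);                       // still on the kernel TCB
    ok = ok && simulated_fs_base == kernel_tcb.self;
    exception_exit(outer);                       // back to the application TCB
    return ok && simulated_fs_base == app_tcb.self;
}
```

Because the decision is made purely by comparing two addresses, the nested entry needs no counter: it simply sees that FS already points at the kernel TCB and leaves it alone.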
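The tls_switch RAII idea from the commit message could look roughly like the sketch below. It is a simulation rather than the actual OSv class: the FS base is a plain variable, and the names thread_sketch, tls_switch_sketch and handle_syscall_sketch are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Simulated FS base; the real utility would write the CPU's FS base.
static std::uint64_t fake_fs_base = 0;

// Hypothetical per-thread state: the kernel TCB address plus the
// application TCB registered via ARCH_SET_FS (0 when none was registered).
struct thread_sketch {
    std::uint64_t kernel_tcb;
    std::uint64_t app_tcb;
};

// Switches to the kernel TCB for the lifetime of the object, but only for
// threads that registered their own TCB; other threads pay just the check.
class tls_switch_sketch {
public:
    explicit tls_switch_sketch(const thread_sketch& t) : _t(t)
    {
        if (_t.app_tcb) {
            fake_fs_base = _t.kernel_tcb;
        }
    }
    ~tls_switch_sketch()
    {
        if (_t.app_tcb) {
            fake_fs_base = _t.app_tcb;
        }
    }
    tls_switch_sketch(const tls_switch_sketch&) = delete;
    tls_switch_sketch& operator=(const tls_switch_sketch&) = delete;
private:
    const thread_sketch& _t;
};

// Stand-in for a syscall or VDSO entry point: returns the FS base that the
// kernel code inside the guarded scope would observe.
static std::uint64_t handle_syscall_sketch(const thread_sketch& t)
{
    tls_switch_sketch guard(t);
    return fake_fs_base;
}
```

For a thread with a non-zero app_tcb, the body of handle_syscall_sketch sees the kernel TCB and the application TCB is restored when the guard goes out of scope. For a thread with app_tcb equal to 0, the guard changes nothing, which matches the observation that dynamically linked executables are barely affected.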
The Golang apps like our golang-pie-example built with a newer Golang compiler/linker (at least as of 1.15) no longer run correctly on OSv and crash like so:
The unimplemented syscall 158 is __NR_arch_prctl, which can be used to set FS_BASE for local-exec TLS. It looks like something changed in the Golang build chain between 1.12 and 1.15. The Golang apps built with -buildmode=pie used to work just fine on OSv. It seems that in the past the PIEs produced by Golang used two ways of communicating with the kernel - 1) via the SYSCALL instruction and 2) via libc calls for pthreads (OSv still does not support the clone syscall). The newer build chain seems to produce an executable that only communicates via SYSCALL.
Here is some info about the executable:
The info seems to be a bit confusing as it suggests that the executable is position-dependent (the entry point) and is statically linked.
I have also experimented with basic C hello world app linked statically and here is some info:
So in each case, apps make the arch_prctl syscall which OSv would need to support.
It is not the first time the issue of arch_prctl is raised - please see this conversation - https://groups.google.com/g/osv-dev/c/qLfRBRRUIHM and this one related to one of the patches attempting to add the arch_prctl support - https://groups.google.com/g/osv-dev/c/FPkZzUb5uB4/m/npEg9GqDAgAJ.
Here is what @nyh wrote in the latter one:
My sense is that adding support of arch_prctl will be a bit tricky but not impossible if we stick to certain assumptions:
arch_prctl glibc wrapper on purpose
arch_prctl with statically linked executables both position-dependent and independent ones only; the key thing is that the SYSCALL instruction would provide us with necessary trigger points to switch the FS register to point to correct TLS block for a given thread
So here is a concrete list of things to be done:
arch_prctl syscall to handle ARCH_SET_FS and ARCH_GET_FS and store the ARCH_SET_FS parameter in a new field for the current thread
ARCH_SET_FS for current thread
ARCH_SET_FS for the current thread (only if app?)
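The first item of the list, handling ARCH_SET_FS and ARCH_GET_FS and remembering the value per thread, might be sketched as follows. This is an illustration under assumptions rather than OSv code: thread_state_sketch and arch_prctl_sketch are hypothetical names, the constants are given a _CODE suffix to avoid clashing with system headers (their values are those of ARCH_SET_FS and ARCH_GET_FS in the Linux asm/prctl.h header), and the step that actually loads the FS base is left as a comment.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Values of ARCH_SET_FS and ARCH_GET_FS from the Linux <asm/prctl.h> header.
constexpr int ARCH_SET_FS_CODE = 0x1002;
constexpr int ARCH_GET_FS_CODE = 0x1003;

// Hypothetical per-thread state holding the application's TCB address.
struct thread_state_sketch {
    std::uint64_t app_tcb = 0;
};

// Sketch of the syscall handler for the current thread.
// Returns 0 on success or a negative errno value, as raw syscalls do.
static long arch_prctl_sketch(thread_state_sketch& current, int code, unsigned long addr)
{
    switch (code) {
    case ARCH_SET_FS_CODE:
        // Remember the application TCB so later switches can restore it.
        current.app_tcb = addr;
        // A real implementation would also point the FS base at addr here.
        return 0;
    case ARCH_GET_FS_CODE:
        // For the GET variant, addr is a pointer to the output location.
        *reinterpret_cast<unsigned long*>(addr) = current.app_tcb;
        return 0;
    default:
        return -EINVAL;
    }
}
```

A caller would pass the address of its TLS block with the SET code, and later read it back by passing a pointer to an unsigned long with the GET code; any other code is rejected with -EINVAL in this sketch.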