Throw an error when compiling and running on a non-4k page host #1221
Hmm it sounds like you want me to try fex on my 16k kernel? I'll make sure to file bug reports if it breaks spectacularly 😋 |
plz no. Support 4k pages for regular memory. Keep the mmio stuff tied to 16kb :< |
In case you are not aware... that's causing problems for Asahi Linux. Asahi is a port of Linux to Apple M1 machines. The GCC Compile Farm recently got one (an Apple M1 running Linux). Several key projects have run into problems on the M1, including FEX. The Emacs folks said "to heck with it, we'll make allocations on 64K boundaries to satisfy all possible platforms". Also see bug#47125: 28.0.50; pdumper assumes compile time page size remains valid. 16k support would really help the Asahi Linux folks. I hope you revisit things in the future. |
We added then removed the compile-time page check here already. If at runtime the page size is something other than 4k, then any application expecting 4k page alignment will break, and we already know of games that do this. |
Thanks @Sonicadvance1,
Great!
I hate to point this out because I don't want to sound pugilistic and sour you in the future. We really need your help. The project is making a fallacious argument; it's called Appeal to Hypocrisy. Don't worry about what other projects are doing. Worry about FEX. The free software community will focus on other problematic projects. |
What I mean is that a game we are emulating may expect a 4k page size, so changing its hardcoded page size to 16k will break its file streaming code. |
The M1 SoC definitely supports having some processes use 4K pages while everything else uses 16K pages, because that's what Rosetta 2 does. I believe this could be supported with major changes to the Linux virtual memory subsystem. Also, Rosetta on the DTK (which does not support 4K pages) worked, with some hacks, on a very wide range of apps with 16KB pages. And ExaGear worked on systems with 64KB pages too. Finally, how about emulating 4K pages in a 16K environment? |
Yes. I would love Linux to gain support for mixed 4k and 16k processes. But that is outside of the scope of FEX-Emu. We can gain some form of minimal support for running on a 16k environment, which some apps will be fine with, but of course compatibility will never be as good as running 4k page size. |
FWIW, I think it would be useful to provide partial support for page size mismatches. I know it's not going to always work, but at least stuff like working around load segment misalignment by mapping the straddling page as the union of the permissions of the inner 4K pages would probably help out a fair bit? At least for software that doesn't try to map stuff at fixed offsets. |
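The "union of the permissions" workaround above can be sketched roughly like this. This is a hypothetical helper (names, constants, and table layout are made up for illustration, not FEX's code), assuming 16K host pages and a flat per-4K-page protection table:

```c
/* Hypothetical sketch: a 16K host page straddled by several guest 4K
 * regions must be mapped with the union (bitwise OR) of the PROT_* flags
 * of every 4K subpage it contains. */
#include <sys/mman.h>

#define GUEST_PAGE 4096UL
#define HOST_PAGE  16384UL
#define SUBPAGES   (HOST_PAGE / GUEST_PAGE)   /* 4 guest pages per host page */

/* guest_prot[i] holds the PROT_* flags the guest requested for 4K page i.
 * Returns the protection the 16K host page must carry so every guest
 * subpage's access works - at the cost of being too permissive. */
static int host_prot_for(const int *guest_prot, unsigned long host_page_idx)
{
    int prot = PROT_NONE;
    for (unsigned long i = 0; i < SUBPAGES; i++)
        prot |= guest_prot[host_page_idx * SUBPAGES + i];
    return prot;
}
```

The obvious downside: a PROT_NONE guard page sitting inside a straddling host page becomes accessible, so software that relies on faulting there silently breaks.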
I did hack this in a branch when I first installed asahi, however some core library had their ELF sections too closely aligned, so it'd require fudging the loader too much. |
(For the case of m1 and such, I think it'd be best if access to the mmu was provided via some module, and there was some userspace paging API that works around the kernel limitations) |
The kernel limitation is the entire thing assuming a constant page size, all throughout the kernel. You are certainly welcome to try to fix it and support processes in 4K mode on a 16K kernel like macOS/xnu does, and I would love to have such a solution, but I guarantee it's not going to be as easy as "some module" and "some userspace paging API". Keep in mind that you can't mix 4K/16K pages within a single process, and 4K mode implies a different page size for the kernel and for userspace. I'll let you guess how much of the kernel assumes identical page sizes :) (Consider: what happens if a 4K process tries to share memory with a 16K process?) |
Of course. Upstreaming improved vm infra to the kernel is def. more work though.
That's why I'd want to do userspace paging - to work around that
Sure, but a task could have multiple address spaces associated and switch them during runtime.
Why would that need to be supported? |
And we aren't going to be shipping any out-of-tree hacks that have zero chance of being upstreamed and no reasonable solution in sight. That's not the goal of our project, and we've made that very explicit from the very beginning.
What is "userspace paging"? You can't expose the page tables directly to userspace, that grossly violates the security model.
I don't think that's as easy as you think it is.
Shared memory is a thing used all over the place in modern systems. Things like sound servers, video capture through out-of-process APIs, etc. all rely on that. Even pipes fundamentally work, behind the scenes, by sharing memory pages inside the kernel. |
It is impossible to emulate 4k paging in a 16k host with any semblance of speed. Eg, taking
As you can see, the linker script there has an alignment of 4096, and multiple file mappings require 4k boundaries.
In this example, since the mappings are in the same file and contiguous within it, you could just map everything RWE and pray the application doesn't depend on the protections - but there's no guarantee that other SOs (like libc) do this. And that's not counting that any application doing mmap on a file, or mprotect, would get weird results, or things like emulators and databases that exercise the mm API more heavily. |
That is up to you. This is not something that can be worked around sufficiently in userspace though, so ...
Sure you can, you just need to guarantee that the process doesn't have access to physical pages it's not meant to, then add an ioctl to update the hardware tables (and enforce security). Whether the M1 can do that, or whether it's feasible to do in Linux, is another discussion.
Sure, but then don't run a guest sound server and expect it to work with host sound apps. Most shared-memory APIs aren't compatible across architecture boundaries (alignment is different between x86/64 and arm64, and so are memory ordering semantics, and type sizes for 32 bits). One could make
work and call it a day |
That's why I said "partial support".
mmap on files can be made to work perfectly fine as long as the vaddr is not requested as fixed. You just align the mapping to 16K and give the app an offset halfway into the page if needed. munmap would then align back before unmapping the 16K aligned block. mprotect similarly would only fail if the app is trying to do its own page management at page granularity. Many uses of mprotect are used on whatever mmap returned, and since that would be 16K aligned (or the above hack for file maps), it could similarly be made to work. Again, "partial support".
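The file-mmap trick described above can be sketched as follows. This is a minimal, hypothetical sketch (the `emu_mmap_file`/`emu_munmap` names are made up), assuming a 16K host page size and that the host's mmap returns 16K-aligned addresses, as it would on a kernel actually using 16K pages:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define HOST_PAGE 16384UL

/* Map `len` bytes of `fd` starting at a guest file offset that is only
 * 4K-aligned: round the offset down to the host page size, map the
 * larger region, and hand the guest an interior pointer. */
static void *emu_mmap_file(int fd, off_t offset, size_t len, int prot)
{
    off_t aligned = offset & ~(off_t)(HOST_PAGE - 1);  /* round down to 16K */
    size_t slack  = (size_t)(offset - aligned);        /* extra leading bytes */

    void *base = mmap(NULL, len + slack, prot, MAP_PRIVATE, fd, aligned);
    if (base == MAP_FAILED)
        return MAP_FAILED;
    return (char *)base + slack;   /* guest sees its requested offset */
}

/* munmap aligns back down before unmapping the whole 16K-aligned block,
 * matching the allocation above. */
static int emu_munmap(void *addr, size_t len)
{
    uintptr_t p    = (uintptr_t)addr;
    uintptr_t base = p & ~(uintptr_t)(HOST_PAGE - 1);
    return munmap((void *)base, len + (p - base));
}
```

As stated in the comment above, this only covers the case where the guest does not pin the virtual address: a MAP_FIXED request whose vaddr and file offset disagree modulo 16K cannot be satisfied this way.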
So how do you keep track of what 4K pages the process has mapped? How does this interact with copy-on-write? File mappings? What about when things get unmapped? You do realize that the moment you try to have a parallel address space, it ends up having to interact with everything the primary address space interacts with anyway, right? :) mmap() is "userspace paging". There is no magic lower level you can expose to userspace without either being horribly broken, horribly insecure, or both. You're better off actually biting the bullet and trying to implement real support for mixed page sizes.
You certainly can run a host sound server and expect it to work with guest sound apps.
So? FEX is already thunking libraries. You'd use the host library for this. Libraries that work in 16K mode absolutely keep working in 4K mode. You still need to come up with sane kernel semantics for shared memory crossing page sizes (even if it means forbidding some cases) and make the bookkeeping in the kernel work; you can't just handwave that problem away. |
Sure, but "partial support" is unlikely to get far enough as things are now.
vaddr is often fixed, or at a fixed offset, e.g. when loading ELF files. It's not that uncommon, sadly.
This also happens, though hopefully less often.
Those are fairly straightforward and interesting questions to answer.
Yes, I've written paging implementations for MMUs in the past.
No, I wouldn't. Opinions differ here. In my view, only Vulkan and GL need to be thunked; everything else should run emulated. This is in stark contrast to the box86/64 concept, where they are an emulator and a thunked Linux distro in one. |
In another direction though, I think there was someone on discord interested in tackling this. We now have a fairly clean interface - one would need to add support for this in |
The only other issue is that we bundle je_malloc with a fixed page size, but that's just a |
Sure, host libraries would use the kernel's mmap (16kb pages), and guest would use our own mappings. As long as we can map PFNs as we see fit we can mirror things correctly. We already largely manage the address space for the process to control the kernel's mmap ranges, and also keep our VMA lists. |
It's far enough for box64 to run WorldOfGoo, and I don't think they have any magical solution beyond the kinds of things I brought up, if even that.
The ELF case can be special cased, as I explained. It's only a problem when the app is doing custom page/VM management outside loading binaries.
If the shared memory ABI differs between architectures, you have to thunk it. There's no way around that. We certainly aren't going to be having people mutex between arm64 pipewire and emulated x86_64 pipewire for their audio system - if that's the approach FEX is planning on, I'll have to revisit my plans of shipping 4K kernels and FEX in the future and just start telling people to give up on running x86 binaries, because there's no way I'm giving them such a mess. Guest apps need to cleanly interact with the host system at the very least to a passable extent, this isn't a VM. |
So how do you plan on handling syscalls? Is your thunk layer going to move everything to 16K-space before calling back into the rest of the kernel, since it is not designed to expect 4K address spaces? What happens if the 4K pages aren't mapped in groups of 4 such that you could treat them as 16K pages? What about keeping track of page usage counts, does each 4K subpage mapping keep a reference to the 16K page? What happens if the 16K side tries to unmap a page, what is keeping track of the 4K subpages that are still mapped outside the main kernel bookkeeping which is going to be assuming 16K pages? Sorry, I don't think you've really thought this through. |
Maybe it is, maybe it isn't. I spent a few hours on this, and came away thinking it's unlikely to work for complex apps. I could be wrong. I do have to close 50+ tickets by the end of the month though, so I'm not volunteering for this.
Again, maybe it works enough for most apps, or maybe just for a handful. It's all a guess.
"Have" is a strong word there. Maybe upstream fixes their shared memory usage to work fine. Maybe it just happens to work. Maybe people don't mind switching. I don't know. As an emulator developer, I see any thunk as a compatibility liability.
Until there's a working GL/VK driver in a release, all of this is conceptual, so I'm not sure it's the best moment to make firm decisions on anything wrt that.
Maybe they do, or maybe they don't - that depends more on downstream uses of FEX. I don't think this is something that needs to be decided in FEX-Emu. One could take the userspace from box86/64, tie it with FEX's trampolines for thunks, and use that, or even make a whole fork of ubuntu/debian/fedora/arch that uses thunks. Who knows? |
I think that would make sense. One could do it on syscall enter in the kernel side.
The kernel would always allocate and work with 16kb pages. It'd be up to usermode to keep the 4K page view in sync. Any 16kb mapping can be mirrored with 4kb pages. There might be complications there with shadow frame replacement or CoW, though the kernel could just send a signal there and userspace would handle it. Of course the whole situation would be a bit involved.
We can create fake MAP_SHARED maps for used pages to make sure they aren't GC'd.
No, I actually haven't. I've just been thinking kernel mode paging is a relic of the past - independently of this issue here. |
Oh also one thing to note about thunking in general, we actually run as a 64-bit process for 32-bit applications, so 32-bit thunks are way, way, way more involved than box86 that runs x86 as armv7 and x86_64 as aarch64 |
Generally I think it is fine to do some sort of partial support. I just expect most applications to be very angry about it. |
So you copy data around to work around page size mismatches? What about mmap and mmap-like things, or any other syscall related to VM?
I think "a bit" is a bit of an understatement here. If you manage to pull this off sanely you are basically solving most of the problems that need to be solved for hybrid page size support in the kernel in the first place. It's not about where the code lives, it's about the interaction with the rest of the kernel, and you aren't going to solve that by moving some logic to userspace.
How do you stop userspace from unmapping the 16K side and leaving dangling 4K pages? No matter how you slice it, the kernel needs to track the 4K VM space and its mapping to 16K physical pages.
As someone with enough knowledge about paging to know I'm actually clueless about it, I fully expect you're either going to give up if you ever try this, or come up with a horribly insecure and broken contraption, or realize your only choice is to actually solve the problem and make the kernel support mixed 4K pages natively :-). |
The thought was that the page table would be replicated with a similar view but in 4K, not any of the actual page frames.
I'm not saying it's not a hairy problem. It's less complex than some of the tasks for FEX though :p
You don't. Userspace needs to know what it is doing, or its view of the world collapses.
I don't quite agree there - I don't see why there's a need for the kernel itself to implement paging. In a modern, exokernel, microservices-based design, it doesn't quite fit. Just like there's a dynamic loader that takes care of relocations and such outside the kernel, there could be a "memory mapping" sibling process / server that takes memory mapping out of the kernel. POSIX memory management and information management APIs are very limited and lead to inefficient patterns overall, imo, and Linux for the most part is an implementation of that.
Considering how, uhm, unwelcoming the Linux kernel dev team at times is, and how, uhm, peculiar the codebase is, I don't know if I'll actually get around to that. I was thinking more along the lines of toying around in Fuchsia tbh. Then again, til 2023 FEX will keep me busy, so who knows |
to get back to the topic, I think this sums it up. |
FEX (and other fast x86 emulators) are "special" and can't be worked
around like this.
|
Completely untested, expected to break spectacularly. Don't try it.