Throw an error when compiling and running on a non-4k page host #1221

Closed · Sonicadvance1 opened this issue Aug 24, 2021 · 31 comments

@Sonicadvance1 (Member)

Completely untested, expected to break spectacularly. Don't try it.

@alyssarosenzweig (Collaborator)

Hmm it sounds like you want me to try fex on my 16k kernel? I'll make sure to file bug reports if it breaks spectacularly 😋

@Sonicadvance1 (Member Author)

plz no. Support 4k pages for regular memory. Keep the mmio stuff tied to 16kb :<

@noloader commented Mar 12, 2022

@Sonicadvance1,

> plz no. Support 4k pages for regular memory. Keep the mmio stuff tied to 16kb :<

In case you are not aware... that's causing problems for Asahi Linux. Asahi is a port of Linux to Apple M1s. The GCC Compile Farm recently got one (an Apple M1 running Linux).

Several key projects have run into problems on the M1, including FEX. The Emacs folks said "to heck with it, we'll make allocation on 64K boundaries to satisfy all possible platforms". Also see bug#47125: 28.0.50; pdumper assumes compile time page size remains valid.

16k support would really help the Asahi Linux folks. I hope you revisit things in the future.

@Sonicadvance1 (Member Author)

We already added and then removed the compile-time page-size check here.
We hit the issue that Termux CI uses servers with a >4k page size, while any consumer hardware it then runs on uses 4k pages.

If the page size at runtime changes to something other than 4k, then any application expecting 4k page alignment will break, and we already know of games that do this.

@noloader

Thanks @Sonicadvance1,

> We added then removed the compile time page check here already.

Great!

> Hit the issue that Termux CI uses servers with >4k page size but then running on any consumer hardware would be using 4k pages.
>
> If at runtime it changes to something other than 4k then any application expecting 4k page alignments will break, which we already know games that do this.

I hate to point this out because I don't want to sound pugilistic and sour you in the future. We really need your help.

The project is making a fallacious argument. It's called appeal to hypocrisy. Don't worry about what other projects are doing. Worry about FEX. The free software community will focus on other problematic projects.

@Sonicadvance1 (Member Author)

What I mean is that a game we are emulating expects a 4k page size, so changing its hardcoded page size to 16k will break its file-streaming code.
We don't have the luxury of working with a different page size and expecting things to work.
x86 applications make the correct assumption that they are working in an environment with a 4k page size.
FEX-Emu HAS to care about other applications, because we are emulating them. We can't just ask a dev to recompile their closed-source game from the 90s to work with arbitrary page sizes. That won't work.
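The mismatch described here is exactly what the issue title asks to detect. A minimal sketch of such a startup check (hypothetical code, not FEX's implementation; `HostMatchesGuestPageSize` is a made-up name):

```cpp
// Hypothetical sketch: an emulator that presents 4K-page semantics to guest
// x86 code can at least detect at startup that the host kernel's page size
// disagrees, instead of failing mysteriously later.
#include <unistd.h>
#include <cassert>

constexpr long kGuestPageSize = 4096; // what x86 guest code assumes

// Returns true when guest-visible page-granular mmap/mprotect can be
// honoured directly by the host kernel.
bool HostMatchesGuestPageSize() {
    return sysconf(_SC_PAGESIZE) == kGuestPageSize;
}
```

A runtime check like this avoids the Termux CI problem mentioned above, where a compile-time check bakes in the build server's page size rather than the page size of the machine the binary eventually runs on.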

@asdfugil commented Apr 17, 2022

The M1 SoC definitely supports having some processes use 4K pages while everything else uses 16K pages, because that's what Rosetta 2 does. I believe this could be supported with major changes to the Linux virtual memory subsystem.

Also, Rosetta on the DTK (which does not support 4K pages) worked with some hacks on a very wide range of apps with 16KB pages. And ExaGear worked on systems with 64KB pages too.

Finally, how about emulating 4K pages on a 16K environment?

@Sonicadvance1 (Member Author)

Yes. I would love Linux to gain support for mixed 4k and 16k processes. But that is outside of the scope of FEX-Emu.

We can gain some form of minimal support for running on a 16k environment, which some apps will be fine with, but of course compatibility will never be as good as running with a 4k page size.

@marcan commented Aug 14, 2022

FWIW, I think it would be useful to provide partial support for page size mismatches. I know it's not going to always work, but at least stuff like working around load segment misalignment by mapping the straddling page as the union of the permissions of the inner 4K pages would probably help out a fair bit? At least for software that doesn't try to map stuff at fixed offsets.
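The permission-union idea can be sketched in a few lines (illustrative only; `UnionProt16K` and the constants are invented names, not FEX API): each 16K host page straddled by several guest 4K protections is mapped with the OR of those protections, trading strictness for compatibility.

```cpp
// Sketch of the permission-union workaround: a 16K host page covering four
// guest 4K pages is mapped with the union of their requested protections.
// This over-grants rights (e.g. a read-only 4K page becomes writable if a
// neighbour is RW), which is why it is only "partial support".
#include <sys/mman.h> // PROT_READ, PROT_WRITE, PROT_EXEC, PROT_NONE
#include <cassert>
#include <cstddef>
#include <array>

constexpr std::size_t kSubpagesPer16K = 16384 / 4096;

// prot4k[i] is the guest-requested protection of the i-th 4K subpage.
int UnionProt16K(const std::array<int, kSubpagesPer16K>& prot4k) {
    int u = 0;
    for (int p : prot4k) u |= p;
    return u;
}
```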

@skmp (Contributor) commented Aug 14, 2022

> FWIW, I think it would be useful to provide partial support for page size mismatches. I know it's not going to always work, but at least stuff like working around load segment misalignment by mapping the straddling page as the union of the permissions of the inner 4K pages would probably help out a fair bit? At least for software that doesn't try to map stuff at fixed offsets.

I did hack this in a branch when I first installed Asahi, but some core libraries had their ELF sections too closely aligned, so it'd require fudging the loader too much.

@skmp (Contributor) commented Aug 14, 2022

(For the case of m1 and such, I think it'd be best if access to the mmu was provided via some module, and there was some userspace paging API that works around the kernel limitations)

@marcan commented Aug 14, 2022

The kernel limitation is the entire thing assuming a constant page size, all throughout the kernel. You are certainly welcome to try to fix it and support processes in 4K mode on a 16K kernel like macOS/xnu does, and I would love to have such a solution, but I guarantee it's not going to be as easy as "some module" and "some userspace paging API". Keep in mind that you can't mix 4K/16K pages within a single process, and 4K mode implies a different page size for the kernel and for userspace. I'll let you guess how much of the kernel assumes identical page sizes :)

(Consider: what happens if a 4K process tries to share memory with a 16K process?)

@skmp (Contributor) commented Aug 14, 2022

> but I guarantee it's not going to be as easy as "some module" and "some userspace paging API".

Of course. Upstreaming improved vm infra to the kernel is def. more work though.

> The kernel limitation is the entire thing assuming a constant page size, all throughout the kernel

That's why I'd want to do userspace paging - to work around that

> Keep in mind that you can't mix 4K/16K pages within a single process, and 4K mode implies a different page size for the kernel and for userspace.

Sure, but a task could have multiple address spaces associated and switch them during runtime.

> (Consider: what happens if a 4K process tries to share memory with a 16K process?)

Why would that need to be supported?

@marcan commented Aug 14, 2022

> > but I guarantee it's not going to be as easy as "some module" and "some userspace paging API".
>
> Of course. Upstreaming improved vm infra to the kernel is def. more work though.

And we aren't going to be shipping any out-of-tree hacks that have zero chance of being upstreamed and no reasonable solution in sight. That's not the goal of our project, and we've made that very explicit from the very beginning.

> > The kernel limitation is the entire thing assuming a constant page size, all throughout the kernel
>
> That's why I'd want to do userspace paging - to work around that

What is "userspace paging"? You can't expose the page tables directly to userspace; that grossly violates the security model.

> > Keep in mind that you can't mix 4K/16K pages within a single process, and 4K mode implies a different page size for the kernel and for userspace.
>
> Sure, but a task could have multiple address spaces associated and switch them during runtime.

I don't think that's as easy as you think it is.

> > (Consider: what happens if a 4K process tries to share memory with a 16K process?)
>
> Why would that need to be supported?

Shared memory is a thing used all over the place in modern systems. Things like sound servers, video capture through out-of-process APIs, etc. all rely on that. Even pipes fundamentally work, behind the scenes, by sharing memory pages inside the kernel.

@skmp (Contributor) commented Aug 14, 2022

It is impossible to emulate 4k paging in a 16k host with any semblance of speed.

Eg, taking /usr/bin/ls

skmp@mangie:~/projects/FEX/build$ readelf -l /usr/bin/ls

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x6ab0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000003428 0x0000000000003428  R      0x1000
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000013146 0x0000000000013146  R E    0x1000
  LOAD           0x0000000000018000 0x0000000000018000 0x0000000000018000
                 0x0000000000007458 0x0000000000007458  R      0x1000
  LOAD           0x0000000000020000 0x0000000000021000 0x0000000000021000
                 0x0000000000001278 0x0000000000002540  RW     0x1000
  DYNAMIC        0x0000000000020a98 0x0000000000021a98 0x0000000000021a98
                 0x00000000000001c0 0x00000000000001c0  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x0000000000000368 0x0000000000000368 0x0000000000000368
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_EH_FRAME   0x000000000001cdcc 0x000000000001cdcc 0x000000000001cdcc
                 0x000000000000056c 0x000000000000056c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000020000 0x0000000000021000 0x0000000000021000
                 0x0000000000001000 0x0000000000001000  R      0x1

As you can see, the linker script there has an alignment of 4096, and multiple file mappings require 4k boundaries, e.g.:

  LOAD           0x0000000000000000 *0x0000000000000000 0x0000000000000000*
                 0x0000000000003428 0x0000000000003428  R      0x1000
  LOAD           0x0000000000004000 *0x0000000000004000 0x0000000000004000*
                 0x0000000000013146 0x0000000000013146  R E    0x1000
  LOAD           0x0000000000018000 *0x0000000000018000 0x0000000000018000*
                 0x0000000000007458 0x0000000000007458  R      0x1000
  LOAD           0x0000000000020000 *0x0000000000021000 0x0000000000021000*
                 0x0000000000001278 0x0000000000002540  RW     0x1000

In this example, since the mappings are on the same file and contiguous in the file, you could just map everything RWE and pray the application doesn't depend on that - but there's no guarantee that other SOs (like libc) do this.

This is not counting that any application doing mmap on a file, or mprotect, would get weird results, or things like emulators and databases that exercise the mm API more.
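The underlying constraint can be checked mechanically: a file-backed mapping at page size P only works when p_vaddr and p_offset are congruent modulo P. A small sketch (hypothetical helper; the segment values come from the readelf dump above):

```cpp
// A file mmap at page size `page` requires vaddr ≡ offset (mod page),
// because the kernel can only place a file page at a matching alignment.
#include <cassert>
#include <cstdint>

struct Load { uint64_t offset, vaddr; };

bool DirectlyMappable(uint64_t page, Load l) {
    return (l.vaddr % page) == (l.offset % page);
}
```

With the four LOAD segments of /usr/bin/ls above, all four pass at 4096, but the RW segment (offset 0x20000, vaddr 0x21000) fails at 16384: its 4K slide between file offset and virtual address cannot be expressed with 16K pages.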

@skmp (Contributor) commented Aug 14, 2022

> > > but I guarantee it's not going to be as easy as "some module" and "some userspace paging API".
> >
> > Of course. Upstreaming improved vm infra to the kernel is def. more work though.
>
> And we aren't going to be shipping any out-of-tree hacks that have zero chance of being upstreamed and no reasonable solution in sight. That's not the goal of our project, and we've made that very explicit from the very beginning.

That is up to you. This is not something that can be worked around sufficiently in userspace though, so ...

> > > The kernel limitation is the entire thing assuming a constant page size, all throughout the kernel
> >
> > That's why I'd want to do userspace paging - to work around that
>
> What is "userspace paging"? You can't expose the page tables directly to userspace, that grossly violates the security model.

Sure you can, you just need to guarantee that the process doesn't have access to physical pages it's not meant to, then add an ioctl to update the hardware tables (and enforce security). Whether the M1 can do that, or whether it is feasible to do in Linux, is another discussion.

> > > Keep in mind that you can't mix 4K/16K pages within a single process, and 4K mode implies a different page size for the kernel and for userspace.
> >
> > Sure, but a task could have multiple address spaces associated and switch them during runtime.
>
> I don't think that's as easy as you think it is.

> > > (Consider: what happens if a 4K process tries to share memory with a 16K process?)
> >
> > Why would that need to be supported?
>
> Shared memory is a thing used all over the place in modern systems. Things like sound servers, video capture through out-of-process APIs, etc. all rely on that. Even pipes fundamentally work, behind the scenes, by sharing memory pages inside the kernel.

Sure, but then don't run a guest sound server and expect it to work with host sound apps.

Most shared mem APIs aren't compatible across architecture boundaries (alignment is different between x86/64 and arm64, and so are memory ordering semantics, and type sizes for 32 bits).

One could make

- 16 kb PS share -> view from 4kb PS
- 4 kb PS share -> view from 4kb PS

work and call it a day.

@marcan commented Aug 14, 2022

> In this example, since the mappings are on the same file and continuous in the file, you could just map everything RWE and pray the application doesn't depend on that - but there's no guarantee that other SOs (like libc) do this.

That's why I said "partial support".

> This is not counting that any application doing mmap to file, or mprotect would get weird results, or things like emulators and databases that exercise the mm api more.

mmap on files can be made to work perfectly fine as long as the vaddr is not requested as fixed. You just align the mapping to 16K and give the app an offset halfway into the page if needed. munmap would then align back before unmapping the 16K-aligned block.

mprotect similarly would only fail if the app is trying to do its own page management at page granularity. Many uses of mprotect are on whatever mmap returned, and since that would be 16K-aligned (or the above hack for file maps), it could similarly be made to work.

Again, "partial support".

> Sure you can, you just need to guarantee that the process doesn't have access to physical pages it's not meant to, then add an ioctl to update the hardware tables (and enforce security). Whenever M1 can do, or is feasible to do in Linux is another discussion.

So how do you keep track of what 4K pages the process has mapped? How does this interact with copy-on-write? File mappings? What about when things get unmapped? You do realize that the moment you try to have a parallel address space, it ends up having to interact with everything the primary address space interacts with anyway, right? :)

mmap() is "userspace paging". There is no magic lower level you can expose to userspace without being horribly broken, horribly insecure, or both. You're better off actually biting the bullet and trying to implement real support for mixed page sizes.

> Sure, but then don't run an guest sound server and expect it to work with host sound apps.

You certainly run a host sound server and expect it to work with guest sound apps.

> Most shared mem apis aren't compatible across architecture boundaries (alignment is different between x86/64 and arm64, and so are memory ordering semantics, and type sizes for 32 bits).

So? FEX is already thunking libraries. You'd use the host library for this. Libraries that work in 16K mode absolutely keep working in 4K mode. You still need to come up with sane kernel semantics for shared memory crossing page sizes (even if it means forbidding some cases) and make the bookkeeping in the kernel work; you can't just handwave that problem away.
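The file-mmap realignment described above (align to 16K, hand the app a pointer at an offset into the page) could be sketched like this (illustrative, not FEX code; `MapWith4KOffset` and the fixed 16K constant are assumptions, and error handling is elided):

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdio>

static const off_t kHostPage = 16384; // assumed host page size

// Map a file region whose offset is 4K-aligned but not 16K-aligned by
// rounding the offset down to the host page size and returning a pointer
// partway into the first host page. The matching munmap must round the
// address and length back the same way before unmapping.
void* MapWith4KOffset(size_t length, int prot, int flags, int fd, off_t offset) {
    off_t delta = offset % kHostPage;
    void* base = mmap(nullptr, length + delta, prot, flags, fd, offset - delta);
    if (base == MAP_FAILED) return MAP_FAILED;
    return static_cast<char*>(base) + delta; // caller sees the requested offset
}
```

This only works when the caller accepts whatever address comes back; as the thread notes, MAP_FIXED requests at 4K-granular addresses cannot be satisfied this way.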

@skmp (Contributor) commented Aug 14, 2022

Sure, but "partial support" is unlikely to get far enough as things are now.

> mmap on files can be made to work perfectly fine as long as the vaddr is not requested as fixed.

vaddr is often fixed, or at a fixed offset, e.g. when loading elf files. It's not that uncommon, sadly.

> mprotect similarly would only fail if the app is trying to do its own page management at page granularity

This also happens, though hopefully less often.

> So how do you keep track of what 4K pages the process has mapped? How does this interact with copy-on-write? File mappings? What about when things get unmapped?

Those are fairly straightforward and interesting questions to answer.

> You do realize that the moment you try to have a parallel address space, it ends up having to interact with everything the primary address space interacts with anyway, right? :)

Yes, I've written paging implementations for MMUs in the past.

> FEX is already thunking libraries. You'd use the host library for this.

No, I wouldn't. Opinions differ here. In my view, only Vulkan and GL need to be thunked; everything else should run emulated. This is in stark contrast with the box86/64 concept, where they are an emulator and a thunked Linux distro in one.

@skmp (Contributor) commented Aug 14, 2022

In another direction though, I think there was someone on Discord interested in tackling this.

We now have a fairly clean interface - one would need to add support for this in virtual void *GuestMmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset) = 0; and GuestMunmap, and maybe introduce a GuestMprotect, and then it'd all work.
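As a toy illustration of where such support would hook in, here is a minimal sketch built around the quoted signature. Everything here is hedged: the reduced base class and the `Emulated4KOn16K` name are invented for the example, and FEX's real interface carries far more context than this.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Invented stand-in for the interface mentioned above, reduced to the one
// quoted signature plus an unmap counterpart.
struct GuestMemoryOps {
    virtual void* GuestMmap(void* addr, size_t length, int prot,
                            int flags, int fd, off_t offset) = 0;
    virtual int GuestMunmap(void* addr, size_t length) = 0;
    virtual ~GuestMemoryOps() = default;
};

class Emulated4KOn16K : public GuestMemoryOps {
    static constexpr size_t kHostPage = 16384;
    std::map<uintptr_t, size_t> vmas_; // guest-visible 4K-granularity VMA list

    static size_t RoundUpHost(size_t n) {
        return (n + kHostPage - 1) & ~(kHostPage - 1);
    }

public:
    void* GuestMmap(void* addr, size_t length, int prot,
                    int flags, int fd, off_t offset) override {
        // Over-map to host-page granularity; remember the guest's 4K-sized
        // view so later guest munmap/mprotect calls can be translated.
        void* p = mmap(addr, RoundUpHost(length), prot, flags, fd, offset);
        if (p != MAP_FAILED) vmas_[reinterpret_cast<uintptr_t>(p)] = length;
        return p;
    }

    int GuestMunmap(void* addr, size_t length) override {
        vmas_.erase(reinterpret_cast<uintptr_t>(addr));
        return munmap(addr, RoundUpHost(length));
    }
};
```

Even this toy version shows the hard part the thread keeps circling: the guest's 4K-granular VMA list and the host's 16K mappings must be kept consistent, and partial 4K unmaps inside a 16K host page have no direct host equivalent.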

@skmp (Contributor) commented Aug 14, 2022

The only other issue is that we bundle jemalloc with a fixed page size, but that's just a #define that needs to be tweaked.

@skmp (Contributor) commented Aug 14, 2022

> Libraries that work in 16K mode absolutely keep working in 4K mode. You still need to come up with sane kernel semantics for shared memory crossing page sizes (even if it means forbidding some cases) and make the bookkeeping in the kernel work; you can't just handwave that problem away.

Sure, host libraries would use the kernel's mmap (16kb pages), and the guest would use our own mappings. As long as we can map PFNs as we see fit we can mirror things correctly.

We already largely manage the address space of the process to control the kernel's mmap ranges, and we also keep our own VMA lists.

@marcan commented Aug 14, 2022

> Sure, but "partial support" is unlikely to get far enough as things are now.

It's far enough for box64 to run World of Goo, and I don't think they have any magical solution beyond the kinds of things I brought up, if even that.

> > mmap on files can be made to work perfectly fine as long as the vaddr is not requested as fixed.
>
> vaddr is often fixed, or with a fixed offset, eg, when loading elf files. It's not as uncommon, sadly.

The ELF case can be special-cased, as I explained. It's only a problem when the app is doing custom page/VM management outside loading binaries.

> > FEX is already thunking libraries. You'd use the host library for this.
>
> No, I wouldn't. Opinions differ here. In my view, only vulkan and GL need to be thunked, everything else should run emulated. This is in stark difference with the box86/64 concept, where they are an emulator and a thunk linux distro in one.

If the shared memory ABI differs between architectures, you have to thunk it. There's no way around that. We certainly aren't going to have people mutex between arm64 PipeWire and emulated x86_64 PipeWire for their audio system - if that's the approach FEX is planning on, I'll have to revisit my plans of shipping 4K kernels and FEX in the future and just start telling people to give up on running x86 binaries, because there's no way I'm giving them such a mess. Guest apps need to cleanly interact with the host system at least to a passable extent; this isn't a VM.

@marcan commented Aug 14, 2022

> Sure, host libraries would use the kernel's mmap (16kb pages), and guest would use our own mappings. As long as we can map PFNs as we see fit we can mirror things correctly.
>
> We already largely manage the address space for the process to control the kernel's mmap ranges, and also keep our VMA lists.

So how do you plan on handling syscalls? Is your thunk layer going to move everything to 16K-space before calling back into the rest of the kernel, since it is not designed to expect 4K address spaces? What happens if the 4K pages aren't mapped in groups of 4 such that you could treat them as 16K pages? What about keeping track of page usage counts - does each 4K subpage mapping keep a reference to the 16K page? What happens if the 16K side tries to unmap a page? What is keeping track of the 4K subpages that are still mapped, outside the main kernel bookkeeping which is going to be assuming 16K pages?

Sorry, I don't think you've really thought this through.

@skmp (Contributor) commented Aug 14, 2022

> It's far enough for box64 to run WorldOfGoo, and I don't think they have any magical solution beyond the kinds of things I brought up, if even that.

Maybe it is, maybe it isn't. I spent a few hours on this, and I came away with "unlikely to work for complex apps". I could be wrong. I do have to close 50+ tickets by the end of the month though, so I'm not volunteering for this.

> The ELF case can be special cased, as I explained. It's only a problem when the app is doing custom page/VM management outside loading binaries.

Again, maybe it works well enough for most apps, or maybe just for a handful. It's all a guess.

> If the shared memory ABI differs between architectures, you have to thunk it.

"Have" is a strong word there. Maybe upstream fixes their shared memory usage to work fine. Maybe it just happens to work. Maybe people don't care about switching. I don't know.

As an emulator developer, I see any thunk as a compatibility liability.

> There's no way around that. We certainly aren't going to be having people mutex between arm64 pipewire and emulated x86_64 pipewire for their audio system - if that's the approach FEX is planning on, I'll have to revisit my plans of shipping 4K kernels and FEX in the future and just start telling people to give up on running x86 binaries, because there's no way I'm giving them such a mess.

Before there's a working GL/VK driver in a release, all of this is conceptual, so I'm not sure it's the best moment to make firm decisions on anything wrt that.

> Guest apps need to cleanly interact with the host system at the very least to a passable extent, this isn't a VM.

Maybe they do, or maybe they don't - that depends more on downstream uses of FEX. I don't think this is something that needs to be decided in FEX-Emu.

One could take the userspace from box86/64, tie it with FEX's trampolines for thunks, and use that, or even make a whole fork of Ubuntu/Debian/Fedora/Arch that uses thunks. Who knows?

@skmp (Contributor) commented Aug 14, 2022

> So how do you plan on handling syscalls? Is your thunk layer going to move everything to 16K-space before calling back into the rest of the kernel, since it is not designed to expect 4K address spaces?

I think that would make sense. One could do it on syscall entry on the kernel side.

> What happens if the 4K pages aren't mapped in groups of 4 such that you could treat them as 16K pages?

The kernel would always allocate and work with 16kb pages. It'd be up to the usermode side to keep the 4K page view in sync. Any 16kb mapping can be mirrored with 4kb pages. There might be complications there with shadow frame replacement or CoW, though the kernel could just send a signal there and user space would handle it. Of course the whole situation would be a bit involved.

> What about keeping track of page usage counts, does each 4K subpage mapping keep a reference to the 16K page? What happens if the 16K side tries to unmap a page, what is keeping track of the 4K subpages that are still mapped outside the main kernel bookkeeping which is going to be assuming 16K pages?

We can create fake MAP_SHARED maps for used pages to make sure they aren't GC'd.

> Sorry, I don't think you've really thought this through.

No, I actually haven't. I've just been thinking kernel mode paging is a relic of the past - independently of this issue here.

@skmp (Contributor) commented Aug 14, 2022

Oh, also, one thing to note about thunking in general: we actually run as a 64-bit process for 32-bit applications, so 32-bit thunks are way, way, way more involved than in box86, which runs x86 as armv7 and x86_64 as aarch64.

@Sonicadvance1 (Member Author)

Generally I think it is fine to do some sort of partial support. I just expect most applications to be very angry about it.
It's quite low on the priority list right now though, especially since there isn't even a GL 3.x or Vulkan driver available on these devices yet.

@marcan commented Aug 15, 2022

> > So how do you plan on handling syscalls? Is your thunk layer going to move everything to 16K-space before calling back into the rest of the kernel, since it is not designed to expect 4K address spaces?
>
> I think that would make sense. One could do it on syscall enter in the kernel side.

So you copy data around to work around page size mismatches? What about mmap and mmap-like things, or any other syscall related to VM?

> > What happens if the 4K pages aren't mapped in groups of 4 such that you could treat them as 16K pages?
>
> The kernel would always allocate and work with 16kb flags. It'd be up to the usermode to keep the 4K page view in sync. Any 16kb mapping can be mirrored with 4kb pages. There might be complications there with shadow frame replacement or CoW, though kernel could just send a signal there and user space would handle it. Of course the whole situation would be a bit involved.

I think "a bit" is a bit of an understatement here. If you manage to pull this off sanely you are basically solving most of the problems that need to be solved for hybrid page size support in the kernel in the first place. It's not about where the code lives, it's about the interaction with the rest of the kernel, and you aren't going to solve that by moving some logic to userspace.

> > What about keeping track of page usage counts, does each 4K subpage mapping keep a reference to the 16K page? What happens if the 16K side tries to unmap a page, what is keeping track of the 4K subpages that are still mapped outside the main kernel bookkeeping which is going to be assuming 16K pages?
>
> We can create fake MAP_SHARED maps for used pages to make sure they aren't GC.

How do you stop userspace from unmapping the 16K side and leaving dangling 4K pages? No matter how you slice it, the kernel needs to track the 4K VM space and its mapping to 16K physical pages.

> > Sorry, I don't think you've really thought this through.
>
> No, I actually haven't. I've just been thinking kernel mode paging is a relic of the past - independently of this issue here.

As someone with enough knowledge about paging to know I'm actually clueless about it, I fully expect you're either going to give up if you ever try this, or come up with a horribly insecure and broken contraption, or realize your only choice is to actually solve the problem and make the kernel support mixed 4K pages natively :-).

@skmp (Contributor) commented Aug 15, 2022

> > > So how do you plan on handling syscalls? Is your thunk layer going to move everything to 16K-space before calling back into the rest of the kernel, since it is not designed to expect 4K address spaces?
> >
> > I think that would make sense. One could do it on syscall enter in the kernel side.
>
> So you copy data around to work around page size mismatches? What about mmap and mmap-like things, or any other syscall related to VM?

The thought was that the page table would be replicated with a similar view but in 4K, not any of the actual page frames.

> > > What happens if the 4K pages aren't mapped in groups of 4 such that you could treat them as 16K pages?
> >
> > The kernel would always allocate and work with 16kb flags. It'd be up to the usermode to keep the 4K page view in sync. Any 16kb mapping can be mirrored with 4kb pages. There might be complications there with shadow frame replacement or CoW, though kernel could just send a signal there and user space would handle it. Of course the whole situation would be a bit involved.
>
> I think "a bit" is a bit of an understatement here. If you manage to pull this off sanely you are basically solving most of the problems that need to be solved for hybrid page size support in the kernel in the first place. It's not about where the code lives, it's about the interaction with the rest of the kernel, and you aren't going to solve that by moving some logic to userspace.

I'm not saying it's not a hairy problem. It's less complex than some of the tasks for FEX though :p

> > > What about keeping track of page usage counts, does each 4K subpage mapping keep a reference to the 16K page? What happens if the 16K side tries to unmap a page, what is keeping track of the 4K subpages that are still mapped outside the main kernel bookkeeping which is going to be assuming 16K pages?
> >
> > We can create fake MAP_SHARED maps for used pages to make sure they aren't GC.
>
> How do you stop userspace from unmapping the 16K side and leaving dangling 4K pages? No matter how you slice it, the kernel needs to track the 4K VM space and its mapping to 16K physical pages.

You don't. Userspace needs to know what it is doing, or its view of the world collapses.

> > > Sorry, I don't think you've really thought this through.
> >
> > No, I actually haven't. I've just been thinking kernel mode paging is a relic of the past - independently of this issue here.
>
> As someone with enough knowledge about paging to know I'm actually clueless about it, I fully expect you're either going to give up if you ever try this, or come up with a horribly insecure and broken contraption, or realize your only choice is to actually solve the problem and make the kernel support mixed 4K pages natively :-).

I don't quite agree there - I don't see why there's a need for the kernel itself to implement paging. In a modern, exokernel, microservices-based design, it doesn't quite fit.

Just like there's a dynamic loader that takes care of relocations and such outside the kernel, there could be a "memory mapping" sibling process / server that takes memory mapping outside of the kernel.

POSIX memory management and information management APIs are very limited and lead to inefficient patterns overall, imo, and Linux for the most part is an implementation of that.

> going to give up if you ever try this

Considering how, uhm, unwelcoming the Linux kernel dev team is at times, and how, uhm, peculiar the codebase is, I don't know if I'll actually get around to that. I was thinking more along the lines of toying around in Fuchsia, tbh.

Then again, until 2023 FEX will keep me busy, so who knows.

@skmp (Contributor) commented Aug 15, 2022

> Generally I think it is fine to do some sort of partial support. I just expect most applications to be very angry about it. It's quite low on the priority list right now though, especially since there isn't even a GL 3.x or Vulkan driver available on these devices yet.

To get back to the topic, I think this sums it up.

@alyssarosenzweig (Collaborator) commented Oct 11, 2022 via email
