Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmap() on device #1216

Open
FedeParola opened this issue Feb 7, 2023 · 16 comments
Open

mmap() on device #1216

FedeParola opened this issue Feb 7, 2023 · 16 comments

Comments

@FedeParola
Copy link
Contributor

Hello,
I'm trying to add support for the QEMU ivshmem device [1], to share memory between the host and the guest. I wrote a small driver that checks the existence of the device, creates an item on the devfs, and retrieves the address of the shared memory region from the PCI BAR used by ivshmem. At this point I would like to make this memory available to an application, that first open()s the device knowing its path and then performs an mmap() operation.
Any advice on how to proceed on this second point? It seems to me that there is no support to apply mmap operations on devices.

Thanks in advance

[1] https://www.qemu.org/docs/master/system/devices/ivshmem.html

@wkozaczuk
Copy link
Collaborator

Hi and welcome to OSv community,

I have never tried to mmap() a device and based on my research it does not seem to be something commonly done (but I might be wrong). In any case, after researching OSv code, I think it should work but maybe not the way you want it to.

I am assuming the code in your app would look something like this:

int devFd = open("/dev/path/to/my/device", O_RDWR);
void *myDevPtr = mmap(start, length, prot, flags, devFd, offset);

The mmap() itself would start in libc/mman.cc, move on to call mmu::map_file() in core/mmu.cc, then call mmap() on the underlying file ref in fs/vfs/vfs_fops.cc and because the DevFS does not implement the vop_cache function it would end up calling mmu::default_file_mmap() that installs map_file_page_read handler. Now, this means that when a page fault happens, the map_file_page_read:fill() would create a copy of the page and fill with data by calling a read() on the device file descriptor.

This may not be what you want because I am not sure how OSv will handle the case when the data in the populated region of the ivshmem device changes and I am assuming you want to see the change in the app. Maybe we can evict that region when something happens? I am assuming the host has a way to notify guests of data changes maybe via interrupts?

I think that ideally, we would want to have this mmap be handled without the extra copy in between the way we do it with ZFS and recently ROFS (see this commit and this issue for details). And in this case, we would have to create PTEs pointing directly to the physical memory of the device.

BTW. I am very interested to see the implementation of your driver and may be try things myself so I can guide you better.

@FedeParola
Copy link
Contributor Author

Hi @wkozaczuk,
thank you for your comments.
The application should definitely have access to the original physical memory, otherwise that would defeat the use of a shmem device.

I pushed my work in progress here. At the moment I just wrote support for the ivshmem-plain mode of the device, so only shared memory and no interrupts. As a temporary solution I'm exposing the address of the shmem to the application through an ioctl() call (I'm not even checking the ioctl request code), but I guess mmap() would be the correct way of doing that. If you see this implementation of the driver for Linux, the device driver can directly handle the mmap() operation and expose the memory address.

To test my driver:

./scripts/build image=ivshmem-test
sudo ./scripts/run.py -nv --verbose --ivshmem 2M

(BTW I submitted a PR to fix a problem related to BARs size computation, that was causing problems when allocating shm > 4M)

@FedeParola
Copy link
Contributor Author

FedeParola commented Feb 10, 2023

Another possible use case (useful in my project) would be related to exposing virtio rings directly to the application. This concept is mentioned in the OSv paper (Sec. 3), but I could not find any follow up on the topic (did I miss something?). I think that an mmap() operation would be a good fit for this task.

@dorlaor
Copy link
Contributor

dorlaor commented Feb 10, 2023 via email

@FedeParola
Copy link
Contributor Author

FedeParola commented Feb 10, 2023

Thanks @dorlaor, is this the memcached version you are mentioning?

@dorlaor
Copy link
Contributor

dorlaor commented Feb 12, 2023 via email

@wkozaczuk
Copy link
Collaborator

I am guessing this code that implemented the --assigned-net functionality - 65c558c - would need to be re-instated and refactored to fit with the code abstractions in the driver's source tree. But maybe it was not used by my osv-memcached, maybe only Seastar used it.

@wkozaczuk
Copy link
Collaborator

After analyzing the relevant code I think we need to implement the functionality to support mmap-ing devices in a different way.

First, we cannot rely on the current implementation of mmu::map_file() in core/mmu.cc which calls mmap() on the underlying file ref in fs/vfs/vfs_fops.cc because the resulting mapping is a dynamic vma (see here) which is not what we want. Instead as a result of calling a mmap on a device file, we need to create a linear mapping using mmu::linear_map() just like we do mapping configuration bars in PCI devices for example.

To accommodate it, I propose we make the following changes to VFS and DevFS abstractions:

  1. Add a new vnop called vop_mmap to the struct vnops (see include/osv/vnode.hh) in a similar way Linux has one in the file_operations struct.
  2. Re-factor current mmap() implementation in libc/mman.cc and mmu::map_file() and possibly file::mmap() in fs/vfs/vfs_fops.cc to delegate to the new vop_mmap.
  3. Add a new devop called devop_mmap to the struct devops.
  4. Implement this new devop_mmap in the new kvmshmem device which would call mmu::linear_map.

This is obviously just a sketch of the changes but I wonder what others think about my proposal.

@FedeParola I will try to prototype my changes and integrate them with your implementation of the kvmshmem driver and see how a simple example of opening a device and mmap-ing it and reading and writing works.

@wkozaczuk
Copy link
Collaborator

BTW I am also adding the device spec for Inter-VM Shared Memory.

As I understand it is QEMU specific device. Are there any efforts to standardize it as part of VirtIO?

@FedeParola
Copy link
Contributor Author

FedeParola commented Feb 20, 2023

Hi @wkozaczuk,
Thanks for the update. There is this ongoing work to create a virtio-based shared memory device called virtio-ism. The work seems much more promising than the QEMU device, that has been stuck in an experimental status for a long time. However, as far as I understand, the virtio is still in a preliminary stage.

I have an additional question concerning the use of the POSIX system call API. What is the overhead with respect to direct function calls (e.g. the one performed in the memcached example)? Even if all the code in the unikernel runs at the same level, it seems to me that there's still a level of indirection (SYSCALL instruction, that probably just jumps to the syscall handling code without changing CPU ring + lookup in a table of system calls to retrieve the correct function). Is this correct or am I missing something?

@wkozaczuk
Copy link
Collaborator

wkozaczuk commented Feb 24, 2023

@FedeParola You are right that the direct libc function calls have the least overhead. Handling system calls (SYSCALL instruction in x86_64 and SVC on ARMv8) in the OSv world requires pushing and popping the registers onto/from the stack and eventually calling the relevant libc function (see the x86_64 implementation and this).
So it is slightly slower but as you point out there is no mapping table switch nor ring change as in Linux. BTW the golang tends to use a system call API a lot.
There is also the syscall() function as well in linux.cc (see the other link above) which is slower than function call but faster than SYSCALL/SVC call.

@FedeParola
Copy link
Contributor Author

FedeParola commented Feb 24, 2023

I see, thanks for the clarification.
Out of curiosity, do you think it would be possible to rewrite the musl libc implementation, so that instead of relying on the SYSCALL asm instruction (and equivalent ones on non-x86 architectures), it directly calls the syscall() function from linux.cc? Or maybe directly edit every single libc syscall wrapper to map it to its corresponding kernel implementation, to bypass the switch-case in the syscall() function

@nyh
Copy link
Contributor

nyh commented Feb 27, 2023

I see, thanks for the clarification. Out of curiosity, do you think it would be possible to rewrite the musl libc implementation, so that instead of relying on the SYSCALL asm instruction (and equivalent ones on non-x86 architectures), it directly calls the syscall() function from linux.cc? Or maybe directly edit every single libc syscall wrapper to map it to its corresponding kernel implementation, to bypass the switch-case in the syscall() function

We sort of did this already. We use in OSv musl as-is where it doesn't use system calls, but modified it when it does. For example (I just randomly picked a file), musl/src/stdio/remove.c calls the SYS_unlink system call (and a few others), but our version libc/stdio/remove.c was modified (in this case, almost without leaving a single line...) to call the unlink() function.

@wkozaczuk
Copy link
Collaborator

And sometimes we use a preprocessor trick to make the compiler replace SYS_* with a local function call. See syscall_to_function.h and how it is applied. This way in some cases we still use unmodified musl code but compiled to use local functions.

@wkozaczuk
Copy link
Collaborator

@FedeParola are you still interested in implementing this feature? It looks like I finally may have some time to work on it. I am also trying to figure out what in general to work on next. And I have my interests as well :-) So please let me know.

@FedeParola
Copy link
Contributor Author

Hi @wkozaczuk, at the moment the sketchy implementation I made is fine for my work, and unfortunately I don't have time to allocate to a proper implementation :( . Hopefully in the future, but I guess it is worth waiting for the new virtio-ism device to become a thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants