Very inefficient block cache #1318

Open
wkozaczuk opened this issue Jun 23, 2024 · 2 comments

@wkozaczuk (Collaborator)

As Jan Braunwarth eloquently explains in his bachelor thesis, the OSv block cache is very inefficient:

"OSv also has a cache that should increase I/O performance, but it is very inefficient and, as you can see in Figure 4.8, does not lead to an increase but rather a dramatic drop. If you look at how the block cache works, it quickly becomes clear why this is. Each I/O is initially divided by the cache into 512 byte blocks. Then, when a read request is made, each block is checked to see whether it is already in the cache and, if so, copied directly from there to the target address. Since the RAM can answer the request much faster, this administrative effort is worth it. The problem is what happens when the block is not yet in the cache.

For example, if an application wants to read a 1 MiB file that is not yet in the cache, the request is divided into 2048 I/Os, each 512B in size. These 2048 requests are then all processed sequentially and also copied from the block cache to the target address. The measured IOPS are therefore significantly lower than the number of SQEs that were processed by the NVMe."
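
To make the overhead concrete, the read path described above boils down to a per-block loop roughly along these lines. This is an illustrative sketch only, not OSv's actual code; cache_lookup(), cache_insert() and read_block_from_device() are hypothetical placeholders for the block cache and the device driver:

#include <cstddef>
#include <cstdint>
#include <cstring>

static constexpr size_t BLOCK_SIZE = 512;

// Hypothetical helpers standing in for the block cache and the driver
char* cache_lookup(uint64_t blkno);
char* read_block_from_device(uint64_t blkno);
void  cache_insert(uint64_t blkno, char* block);

void cached_read(uint64_t offset, char* dst, size_t len)
{
    for (size_t done = 0; done < len; done += BLOCK_SIZE) {
        uint64_t blkno = (offset + done) / BLOCK_SIZE;
        char* block = cache_lookup(blkno);
        if (!block) {
            // Cache miss: one synchronous 512-byte I/O per block, so a
            // 1 MiB read turns into 1 MiB / 512 B = 2048 sequential requests
            block = read_block_from_device(blkno);
            cache_insert(blkno, block);
        }
        // Hit or miss, the data is always copied from the cache to the target
        memcpy(dst + done, block, BLOCK_SIZE);
    }
}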

It must be noted, however, that most applications will not be affected, as they go through the VFS layer: the filesystem drivers in OSv (ZFS, RoFS, and recently EXT4) bypass the block cache and call devops->strategy() directly.
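
For comparison, the direct path those filesystem drivers take looks roughly like the following. This is a minimal sketch based on the bio API visible in the patch below; the whole range goes to the driver as a single bio, with no per-512-byte splitting or copying:

// Sketch of a cache-bypassing read: one bio for the whole request,
// handed straight to the driver's strategy routine
static int direct_read(struct device* dev, void* buf, off_t offset, size_t len)
{
    struct bio* bio = alloc_bio();
    bio->bio_cmd = BIO_READ;
    bio->bio_dev = dev;
    bio->bio_data = buf;
    bio->bio_offset = offset;
    bio->bio_bcount = len;

    dev->driver->devops->strategy(bio);  // single I/O covering the whole range

    int ret = bio_wait(bio);             // sleep until the driver completes the bio
    destroy_bio(bio);
    return ret;
}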

To reproduce this problem, one can use the fio app set up to read from the disk directly (which bypasses the file system):

/fio --name=fiotest --filename=/dev/nvme1n1 --size 10Mb --rw=read ....

There are at least two options to fix this moderately important issue:

  • improve the block cache and ideally make devops->strategy() use it as well (more difficult)
  • change the block device drivers to replace bread() and bwrite() with code similar to what the strategy functions do, as in this proposed patch (easy):
diff --git a/drivers/virtio-blk.cc b/drivers/virtio-blk.cc
index 48750a01..4f7676e9 100644
--- a/drivers/virtio-blk.cc
+++ b/drivers/virtio-blk.cc
@@ -49,6 +49,9 @@ TRACEPOINT(trace_virtio_blk_req_err, "bio=%p, sector=%lu, len=%lu, type=%x", str
 using namespace memory;
 
 
+int
+bdev_direct_read_write(struct device *dev, struct uio *uio, int ioflags);
+
 namespace virtio {
 
 int blk::_instance = 0;
@@ -71,7 +74,8 @@ blk_strategy(struct bio *bio)
 static int
 blk_read(struct device *dev, struct uio *uio, int ioflags)
 {
-    return bdev_read(dev, uio, ioflags);
+    return bdev_direct_read_write(dev, uio, ioflags);
 }
 
 static int
@@ -82,6 +86,7 @@ blk_write(struct device *dev, struct uio *uio, int ioflags)
     if (prv->drv->is_readonly()) return EROFS;
 
-    return bdev_write(dev, uio, ioflags);
+    return bdev_direct_read_write(dev, uio, ioflags);
 }
 
 static struct devops blk_devops {
diff --git a/fs/vfs/kern_physio.cc b/fs/vfs/kern_physio.cc
index c7c99c72..80c22ccc 100644
--- a/fs/vfs/kern_physio.cc
+++ b/fs/vfs/kern_physio.cc
@@ -138,3 +138,50 @@ void multiplex_strategy(struct bio *bio)
 		len -= req_size;
 	}
 }
+
+int
+bdev_direct_read_write(struct device *dev, struct uio *uio, int ioflags)
+{
+    u8 opcode;
+    switch (uio->uio_rw) {
+    case UIO_READ:
+        opcode = BIO_READ;
+        break;
+    case UIO_WRITE:
+        opcode = BIO_WRITE;
+        break;
+    default:
+        return EINVAL;
+    }
+
+    // Parent bio completed by multiplex_bio_done() once all per-iovec bios finish
+    bio* complete_io = alloc_bio();
+    refcount_init(&complete_io->bio_refcnt, uio->uio_iovcnt);
+
+    // Submit one bio per iovec directly to the driver's strategy routine
+    while (uio->uio_iovcnt > 0) {
+        bio* bio = alloc_bio();
+        bio->bio_cmd = opcode;
+        bio->bio_dev = dev;
+
+        bio->bio_bcount = uio->uio_iov->iov_len;
+        bio->bio_data = uio->uio_iov->iov_base;
+        bio->bio_offset = uio->uio_offset;
+
+        bio->bio_caller1 = complete_io;
+        bio->bio_private = complete_io->bio_private;
+        bio->bio_done = multiplex_bio_done;
+
+        dev->driver->devops->strategy(bio);
+
+        uio->uio_offset += uio->uio_iov->iov_len;
+        uio->uio_resid -= uio->uio_iov->iov_len;
+        uio->uio_iov++;
+        uio->uio_iovcnt--;
+    }
+    assert(uio->uio_resid == 0);
+    int ret = bio_wait(complete_io);
+    destroy_bio(complete_io);
+
+    return ret;
+}
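
In this proposed helper, each iovec becomes one bio submitted straight to the driver; the parent complete_io, whose refcount is initialized to uio_iovcnt, is presumably completed by multiplex_bio_done() once the last child bio finishes, at which point bio_wait() returns. For illustration, a hypothetical caller might look like this (not part of the patch; the uio field names follow the usage above, and the relevant OSv headers and error handling are omitted):

// Read 1 MiB from a block device in one call, bypassing the block cache
static int read_1mib_direct(struct device* dev, void* buf)
{
    struct iovec iov = { buf, 1024 * 1024 };

    struct uio uio = {};
    uio.uio_iov    = &iov;
    uio.uio_iovcnt = 1;       // one bio will be submitted per iovec
    uio.uio_offset = 0;
    uio.uio_resid  = iov.iov_len;
    uio.uio_rw     = UIO_READ;

    // The whole 1 MiB goes to devops->strategy() as a single bio,
    // instead of 2048 sequential 512-byte reads through the block cache
    return bdev_direct_read_write(dev, &uio, 0);
}
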
@nyh (Contributor) commented Jun 24, 2024:

> As Jan Braunwarth eloquently explains in his bachelor thesis, the OSv block cache is very inefficient:

Interesting, can you please post a link to this bachelor thesis here?

@wkozaczuk (Collaborator, Author) commented:

Let me send it to you! It is in German, but you can easily Google-translate it.
