Refactor blob refs #454

dcoutts · 2024-11-06T16:27:59Z

PR designed to be reviewed patch by patch.

Overall, this is a refactoring in preparation for changes to the ref counting API. The ref counting change will be about making things follow a model of there being singular references, rather than manipulating raw reference counts. This changes the BlobRef representation so that it follows this style: it now holds a reference to a BlobFile (which is itself a reference counted object). Previously it was more ad-hoc. Also more cleanly separate three kinds of reference: raw, weak and strong.

Not yet used in this patch. Also move the BlobSpan defintion to this module.

Where before it contained a file handle and a ref counter, it now just contains a BlobFile (which itself is the pair of a handle and counter).

data BlobRef m h previously contained: blobRefFile :: !h which meant allmost all uses of BlobRef had to be BlobRef m (Handle h) Now we make that internal: blobRefFile :: !(FS.Handle h) and so all uses are now simply BlobRef m h This continues with the trend to avoid having to use (Handle h) everywhere. This is a large but simple patch, that just deals with the fallout of this local change.

To better reflect that it does not itself maintain a reference count, and to distinguish it from a new type we will introduce soon: a strong blob ref. Then we will have a clear type distinction: raw, weak and strong.

Originally it contained the file handle and the refcounter of either the Run or write buffer that the blob ref came from (pointed to). In a previous patch we changed the Run and WriteBufferBlobs to themselves contain an independently ref-counted BlobFile, and so at the same time the blob ref refcount refers to the blob file refcount not the refcount of the parent object. In this patch we now change RawBlobRef to directly hold a BlobFile. This makes the representation logically more uniform between the run and write buffer cases: since we now just point to the BlobFile being held by each blob source. This refactoring will make things much clearer once we switch the representation of ref-counted objects. Since we'll maintain a reference to a BlobFile rather than a lower-level combo of a file handle and reference count.

Previously a RawBlobRef served dual purpose for raw and strong. Keep that distinction clear. For the moment, all three have equivalent representations, but this is likely to change in a later refactoring of the reference counting API. Also split the functions for making blob refs from runs or the write buffer into the two variants that we use: raw (internally) and weak (externally).

The only uses of BlobRef.readBlob were in the context of withWeakBlobRef so we can go straight to a helper function that does just that. Also remove now-unused WriteBufferBlobs.readBlob

Instead of open-coding the block IO stuff in the top level Internal module in retrieveBlobs, it is moved next to BlobRef.readWeakBlobRef, so we have the singular and bulk versions next to each other.

We don't actually need to export the StrongBlobRef type since all uses of it are temporary, for doing I/O.

Noticed while refactoring

Currently, pre-snapshots, the Run does _not_ delete its blob files when the run is closed, but the WriteBufferBlobs _does_ delete its blob file. Insert a temporary hack to accomodate this. This commit can be reverted as part of implementing snapshots properly.

jorisdral

Nice, I agree this makes the handling of blob references nicer. Some suggestions:

jorisdral · 2024-11-07T10:44:13Z

src/Database/LSMTree/Internal/BlobFile.hs

+-- | Adopt an existing open file handle to make a 'BlobFile'. The file must at
+-- least be open for reading (but may or may not be open for writing).
+--
+-- The finaliser will close and delete the file.
+--
+newBlobFile ::
+     PrimMonad m
+  => HasFS m h
+  -> FS.Handle h
+  -> m (BlobFile m h)
+newBlobFile fs blobFileHandle = do


I wonder if this function should be in charge of creating the file handle itself. That seems nicer than having BlobFile adopt an existing file handler

I did it this way because the different use cases differ in how they handle the file: open for reading only in runs, but read/write in write buffer. I figured, just have it deal with closing. But yeah, we could change it, it'd need to take a read/write param.

jorisdral · 2024-11-07T10:45:09Z

src/Database/LSMTree/Internal/BlobFile.hs

+readBlobFile ::
+     (MonadThrow m, PrimMonad m)
+  => HasFS m h
+  -> BlobFile m h
+  -> BlobSpan
+  -> m SerialisedBlob
+readBlobFile fs BlobFile {blobFileHandle}


Would it make sense to put the HasFS in the BlobFile type so that we don't have to pass it around?

It seems to be pretty readily available at all the call sites.

Also, when we read multiple blob refs (which each may refer to a different file) we must submit that I/O using a single HasBlockIO.

This is one of those cases where type classes and class coherence are a positive benefit.

jorisdral · 2024-11-07T10:47:51Z

src/Database/LSMTree/Internal/WriteBufferBlobs.hs

@@ -102,17 +103,14 @@ import qualified System.Posix.Types as FS (ByteCount)
 --
 data WriteBufferBlobs m h =
     WriteBufferBlobs {
-       blobFileHandle     :: {-# UNPACK #-} !(FS.Handle h)
+       blobFile        :: !(BlobFile m h)


When we implement singular reference, would this become something along the lines of Ref BlobFile?

Yes, exactly. And the different versions of BlobRef use different kinds of reference to the BlobFile: raw (no Ref), strong Ref, and weak WeakRef.

That's exactly where this refactoring is going. It makes it possible to use the singular reference style.

jorisdral · 2024-11-07T10:48:42Z

src/Database/LSMTree/Internal/WriteBufferBlobs.hs

-addReference WriteBufferBlobs {blobFileRefCounter} =
-    RC.addReference blobFileRefCounter
+addReference WriteBufferBlobs {blobFile} =
+    RC.addReference (blobFileRefCounter blobFile)


Should we add BlobFile.addReference and use that instead?

I was hoping we'd just switch to the Ref API and then those things go away anyway. Then there's no more add reference, just create, duplicate and release.

jorisdral · 2024-11-07T10:49:07Z

src/Database/LSMTree/Internal/BlobFile.hs

This module is missing SPECIALISE pragmas

jorisdral · 2024-11-07T11:37:52Z

src/Database/LSMTree/Internal/BlobFile.hs

+{-# LANGUAGE DeriveAnyClass     #-}
+{-# LANGUAGE DeriveGeneric      #-}
+{-# LANGUAGE DerivingStrategies #-}


DeriveGeneric seems unused, and the others are enabled by default, I think

Copy'n'paste 😄

jorisdral · 2024-11-07T11:38:36Z

src/Database/LSMTree/Internal/BlobFile.hs

+import           System.FS.API (HasFS)
+import qualified System.FS.BlockIO.API as FS
+
+-- | An open handle to a file containing blobs.


Suggested change

-- | An open handle to a file containing blobs.

-- | A handle to a file containing blobs.

jorisdral · 2024-11-07T11:39:42Z

src/Database/LSMTree/Internal/BlobFile.hs

+{-# SPECIALISE readBlobFile :: HasFS IO h -> BlobFile IO h -> BlobSpan -> IO SerialisedBlob #-}
+readBlobFile ::
+     (MonadThrow m, PrimMonad m)
+  => HasFS m h
+  -> BlobFile m h
+  -> BlobSpan
+  -> m SerialisedBlob
+readBlobFile fs BlobFile {blobFileHandle}


Nitpick: readFile typically means "read the whole file". Maybe it should be just readBlob?

read from a blob file. Yes, maybe needs renaming.

jorisdral · 2024-11-07T11:42:10Z

src/Database/LSMTree/Internal/BlobRef.hs

+-- | A raw blob reference is a reference to a blob within a blob file.
 --
-- See 'Database.LSMTree.Common.BlobRef' for more info.
-data BlobRef m h = BlobRef {
-      blobRefFile  :: !h
-    , blobRefCount :: {-# UNPACK #-} !(RefCounter m)
-    , blobRefSpan  :: {-# UNPACK #-} !BlobSpan
+-- The \"raw\" means that it does no reference counting, so does not maintain
+-- ownership of the 'BlobFile'. Thus these are only safe to use in the context
+-- of code that already (directly or indirectly) owns the blob file that the
+-- blob ref uses (such as within run merging).
+--
+-- Thus these cannot be handed out via the API. Use 'WeakBlobRef' for that.
+--
+data RawBlobRef m h = RawBlobRef {
+      rawBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h)
+    , rawBlobRefSpan :: {-# UNPACK #-}   !BlobSpan
    }
  deriving stock (Show)

-instance NFData h => NFData (BlobRef m h) where
-  rnf (BlobRef a b c) = rnf a `seq` rnf b `seq` rnf c
-
-- | Location of a blob inside a blob file.
-data BlobSpan = BlobSpan {
-    blobSpanOffset :: {-# UNPACK #-} !Word64
-  , blobSpanSize   :: {-# UNPACK #-} !Word32
-  }
-  deriving stock (Show, Eq)
-
-instance NFData BlobSpan where
-  rnf (BlobSpan a b) = rnf a `seq` rnf b
-
-blobRefSpanSize :: BlobRef m h -> Int
-blobRefSpanSize = fromIntegral . blobSpanSize . blobRefSpan
+-- | A \"weak\" reference to a blob within a blob file. These are the ones we
+-- can return in the public API and can outlive their parent table.
+--
+-- They are weak references in that they do not keep the file open using a
+-- reference count. So when we want to use our weak reference we have to
+-- dereference them to obtain a normal strong reference while we do the I\/O
+-- to read the blob. This ensures the file is not closed under our feet.
+--
+-- See 'Database.LSMTree.Common.BlobRef' for more info.
+--
+data WeakBlobRef m h = WeakBlobRef {
+      weakBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h)
+    , weakBlobRefSpan :: {-# UNPACK #-}   !BlobSpan
+    }
+  deriving stock (Show)


Is there any difference between raw and weak blob references, other than that the names serve as documentation? Is it necessary to make the distinction if both are a blob reference without ownership?

I'm probably answering my own question here, but yes the dinstinction is necessary, because weak references have to be upgraded before reading the blobs. However, you could make the same distinction by getting rid of RawBlobRef and providing unsafeReadWeakBlobRef and readWeakBlobRef functions

Weak ones will change to use WeakRef once we use the new Ref API. So then we'll get the enforcement from the types rather than convention. In the meantime it's a helpful type distinction.

A raw one is "safe" to read, but you must already own a reference. A weak one cannot be read without upgrading to a strong ref.

jorisdral · 2024-11-07T11:43:24Z

src/Database/LSMTree/Internal/BlobRef.hs

+  -> V.Vector (WeakBlobRef m h)
+  -> m (V.Vector SerialisedBlob)
+readWeakBlobRefs hbio wrefs =
+    bracket (deRefWeakBlobRefs wrefs) (V.mapM_ removeReference) $ \refs -> do


And this is withWeakBlobRefs, which is also removed later. Might be okay to keep that function

Yes, inlined here because it's the only use. We can always add it back if we need it. We'd also export StrongBlobRef, which does not need to be exported.

dcoutts · 2024-11-07T16:57:25Z

Thanks for the detailed review @jorisdral !

dcoutts requested review from jorisdral, mheinzel and recursion-ninja as code owners November 6, 2024 16:28

dcoutts changed the title ~~Refactor blob refscoutts/blob ref blob file refactor~~ Refactor blob refs Nov 6, 2024

dcoutts added 13 commits November 6, 2024 23:03

Introduce a new BlobFile abstraction

f57fa71

Not yet used in this patch. Also move the BlobSpan defintion to this module.

Convert WriteBufferBlobs to use BlobFile

d1c237e

Where before it contained a file handle and a ref counter, it now just contains a BlobFile (which itself is the pair of a handle and counter).

Convert Run to use BlobFile

6261f0c

Where before it contained a file handle and a ref counter, it now just contains a BlobFile (which itself is the pair of a handle and counter).

Rename internal BlobRef type to RawBlobRef

2b496f1

To better reflect that it does not itself maintain a reference count, and to distinguish it from a new type we will introduce soon: a strong blob ref. Then we will have a clear type distinction: raw, weak and strong.

Remove BlobRef.readBlob and add BlobRef.readWeakBlobRef

01ce40b

The only uses of BlobRef.readBlob were in the context of withWeakBlobRef so we can go straight to a helper function that does just that. Also remove now-unused WriteBufferBlobs.readBlob

Move impl of retrieveBlobs to new BlobRef.readWeakBlobRefs

3cdd99d

Instead of open-coding the block IO stuff in the top level Internal module in retrieveBlobs, it is moved next to BlobRef.readWeakBlobRef, so we have the singular and bulk versions next to each other.

Remove now-unused definitions in BlobRef

819353e

We don't actually need to export the StrongBlobRef type since all uses of it are temporary, for doing I/O.

Add a couple of TODOs into related code

b06ec76

Noticed while refactoring

stylish-haskell export list formatting fixes

102597d

dcoutts force-pushed the dcoutts/blob-ref-blob-file-refactor branch from 20c2161 to 102597d Compare November 6, 2024 23:03

jorisdral requested changes Nov 7, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor blob refs #454

Refactor blob refs #454

dcoutts commented Nov 6, 2024

jorisdral left a comment •

edited

Loading

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024 •

edited

Loading

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

jorisdral Nov 7, 2024

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

jorisdral Nov 7, 2024

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024 •

edited

Loading

jorisdral Nov 7, 2024

dcoutts Nov 7, 2024

dcoutts commented Nov 7, 2024

	-- \| An open handle to a file containing blobs.
	-- \| A handle to a file containing blobs.

Refactor blob refs #454

Are you sure you want to change the base?

Refactor blob refs #454

Conversation

dcoutts commented Nov 6, 2024

jorisdral left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcoutts Nov 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcoutts Nov 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dcoutts commented Nov 7, 2024

jorisdral left a comment •

edited

Loading

dcoutts Nov 7, 2024 •

edited

Loading

dcoutts Nov 7, 2024 •

edited

Loading