Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor blob refs #454

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Refactor blob refs #454

wants to merge 13 commits into from

Conversation

dcoutts
Copy link
Collaborator

@dcoutts dcoutts commented Nov 6, 2024

PR designed to be reviewed patch by patch.

Overall, this is a refactoring in preparation for changes to the ref counting API. The ref counting change will be about making things follow a model of there being singular references, rather than manipulating raw reference counts. This changes the BlobRef representation so that it follows this style: it now holds a reference to a BlobFile (which is itself a reference counted object). Previously it was more ad-hoc. Also more cleanly separate three kinds of reference: raw, weak and strong.

@dcoutts dcoutts changed the title Refactor blob refscoutts/blob ref blob file refactor Refactor blob refs Nov 6, 2024
Not yet used in this patch.

Also move the BlobSpan defintion to this module.
Where before it contained a file handle and a ref counter, it now just
contains a BlobFile (which itself is the pair of a handle and counter).
Where before it contained a file handle and a ref counter, it now just
contains a BlobFile (which itself is the pair of a handle and counter).
data BlobRef m h previously contained:
      blobRefFile  :: !h
which meant allmost all uses of BlobRef had to be BlobRef m (Handle h)

Now we make that internal:
      blobRefFile  :: !(FS.Handle h)
and so all uses are now simply BlobRef m h

This continues with the trend to avoid having to use (Handle h)
everywhere.

This is a large but simple patch, that just deals with the fallout of
this local change.
To better reflect that it does not itself maintain a reference count,
and to distinguish it from a new type we will introduce soon: a strong
blob ref.

Then we will have a clear type distinction: raw, weak and strong.
Originally it contained the file handle and the refcounter of either the
Run or write buffer that the blob ref came from (pointed to).

In a previous patch we changed the Run and WriteBufferBlobs to
themselves contain an independently ref-counted BlobFile, and so at the
same time the blob ref refcount refers to the blob file refcount not
the refcount of the parent object.

In this patch we now change RawBlobRef to directly hold a BlobFile. This
makes the representation logically more uniform between the run and
write buffer cases: since we now just point to the BlobFile being held
by each blob source.

This refactoring will make things much clearer once we switch the
representation of ref-counted objects. Since we'll maintain a reference
to a BlobFile rather than a lower-level combo of a file handle and
reference count.
Previously a RawBlobRef served dual purpose for raw and strong. Keep
that distinction clear.

For the moment, all three have equivalent representations, but this is
likely to change in a later refactoring of the reference counting API.

Also split the functions for making blob refs from runs or the write
buffer into the two variants that we use: raw (internally) and weak
(externally).
The only uses of BlobRef.readBlob were in the context of withWeakBlobRef
so we can go straight to a helper function that does just that.

Also remove now-unused WriteBufferBlobs.readBlob
Instead of open-coding the block IO stuff in the top level Internal
module in retrieveBlobs, it is moved next to BlobRef.readWeakBlobRef, so
we have the singular and bulk versions next to each other.
We don't actually need to export the StrongBlobRef type since all uses
of it are temporary, for doing I/O.
Noticed while refactoring
Currently, pre-snapshots, the Run does _not_ delete its blob files when
the run is closed, but the WriteBufferBlobs _does_ delete its blob file.

Insert a temporary hack to accomodate this. This commit can be reverted
as part of implementing snapshots properly.
@dcoutts dcoutts force-pushed the dcoutts/blob-ref-blob-file-refactor branch from 20c2161 to 102597d Compare November 6, 2024 23:03
Copy link
Collaborator

@jorisdral jorisdral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I agree this makes the handling of blob references nicer. Some suggestions:

Comment on lines 59 to 69
-- | Adopt an existing open file handle to make a 'BlobFile'. The file must at
-- least be open for reading (but may or may not be open for writing).
--
-- The finaliser will close and delete the file.
--
newBlobFile ::
PrimMonad m
=> HasFS m h
-> FS.Handle h
-> m (BlobFile m h)
newBlobFile fs blobFileHandle = do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this function should be in charge of creating the file handle itself. That seems nicer than having BlobFile adopt an existing file handler

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way because the different use cases differ in how they handle the file: open for reading only in runs, but read/write in write buffer. I figured, just have it deal with closing. But yeah, we could change it, it'd need to take a read/write param.

Comment on lines +80 to +86
readBlobFile ::
(MonadThrow m, PrimMonad m)
=> HasFS m h
-> BlobFile m h
-> BlobSpan
-> m SerialisedBlob
readBlobFile fs BlobFile {blobFileHandle}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to put the HasFS in the BlobFile type so that we don't have to pass it around?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be pretty readily available at all the call sites.

Also, when we read multiple blob refs (which each may refer to a different file) we must submit that I/O using a single HasBlockIO.

This is one of those cases where type classes and class coherence are a positive benefit.

@@ -102,17 +103,14 @@ import qualified System.Posix.Types as FS (ByteCount)
--
data WriteBufferBlobs m h =
WriteBufferBlobs {
blobFileHandle :: {-# UNPACK #-} !(FS.Handle h)
blobFile :: !(BlobFile m h)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we implement singular reference, would this become something along the lines of Ref BlobFile?

Copy link
Collaborator Author

@dcoutts dcoutts Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. And the different versions of BlobRef use different kinds of reference to the BlobFile: raw (no Ref), strong Ref, and weak WeakRef.

That's exactly where this refactoring is going. It makes it possible to use the singular reference style.

addReference WriteBufferBlobs {blobFileRefCounter} =
RC.addReference blobFileRefCounter
addReference WriteBufferBlobs {blobFile} =
RC.addReference (blobFileRefCounter blobFile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add BlobFile.addReference and use that instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping we'd just switch to the Ref API and then those things go away anyway. Then there's no more add reference, just create, duplicate and release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module is missing SPECIALISE pragmas

Comment on lines +1 to +3
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DerivingStrategies #-}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeriveGeneric seems unused, and the others are enabled by default, I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy'n'paste 😄

import System.FS.API (HasFS)
import qualified System.FS.BlockIO.API as FS

-- | An open handle to a file containing blobs.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-- | An open handle to a file containing blobs.
-- | A handle to a file containing blobs.

Comment on lines +88 to +95
{-# SPECIALISE readBlobFile :: HasFS IO h -> BlobFile IO h -> BlobSpan -> IO SerialisedBlob #-}
readBlobFile ::
(MonadThrow m, PrimMonad m)
=> HasFS m h
-> BlobFile m h
-> BlobSpan
-> m SerialisedBlob
readBlobFile fs BlobFile {blobFileHandle}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: readFile typically means "read the whole file". Maybe it should be just readBlob?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read from a blob file. Yes, maybe needs renaming.

Comment on lines +37 to +66
-- | A raw blob reference is a reference to a blob within a blob file.
--
-- See 'Database.LSMTree.Common.BlobRef' for more info.
data BlobRef m h = BlobRef {
blobRefFile :: !h
, blobRefCount :: {-# UNPACK #-} !(RefCounter m)
, blobRefSpan :: {-# UNPACK #-} !BlobSpan
-- The \"raw\" means that it does no reference counting, so does not maintain
-- ownership of the 'BlobFile'. Thus these are only safe to use in the context
-- of code that already (directly or indirectly) owns the blob file that the
-- blob ref uses (such as within run merging).
--
-- Thus these cannot be handed out via the API. Use 'WeakBlobRef' for that.
--
data RawBlobRef m h = RawBlobRef {
rawBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h)
, rawBlobRefSpan :: {-# UNPACK #-} !BlobSpan
}
deriving stock (Show)

instance NFData h => NFData (BlobRef m h) where
rnf (BlobRef a b c) = rnf a `seq` rnf b `seq` rnf c

-- | Location of a blob inside a blob file.
data BlobSpan = BlobSpan {
blobSpanOffset :: {-# UNPACK #-} !Word64
, blobSpanSize :: {-# UNPACK #-} !Word32
}
deriving stock (Show, Eq)

instance NFData BlobSpan where
rnf (BlobSpan a b) = rnf a `seq` rnf b

blobRefSpanSize :: BlobRef m h -> Int
blobRefSpanSize = fromIntegral . blobSpanSize . blobRefSpan
-- | A \"weak\" reference to a blob within a blob file. These are the ones we
-- can return in the public API and can outlive their parent table.
--
-- They are weak references in that they do not keep the file open using a
-- reference count. So when we want to use our weak reference we have to
-- dereference them to obtain a normal strong reference while we do the I\/O
-- to read the blob. This ensures the file is not closed under our feet.
--
-- See 'Database.LSMTree.Common.BlobRef' for more info.
--
data WeakBlobRef m h = WeakBlobRef {
weakBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h)
, weakBlobRefSpan :: {-# UNPACK #-} !BlobSpan
}
deriving stock (Show)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any difference between raw and weak blob references, other than that the names serve as documentation? Is it necessary to make the distinction if both are a blob reference without ownership?

I'm probably answering my own question here, but yes the dinstinction is necessary, because weak references have to be upgraded before reading the blobs. However, you could make the same distinction by getting rid of RawBlobRef and providing unsafeReadWeakBlobRef and readWeakBlobRef functions

Copy link
Collaborator Author

@dcoutts dcoutts Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weak ones will change to use WeakRef once we use the new Ref API. So then we'll get the enforcement from the types rather than convention. In the meantime it's a helpful type distinction.

A raw one is "safe" to read, but you must already own a reference. A weak one cannot be read without upgrading to a strong ref.

-> V.Vector (WeakBlobRef m h)
-> m (V.Vector SerialisedBlob)
readWeakBlobRefs hbio wrefs =
bracket (deRefWeakBlobRefs wrefs) (V.mapM_ removeReference) $ \refs -> do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is withWeakBlobRefs, which is also removed later. Might be okay to keep that function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, inlined here because it's the only use. We can always add it back if we need it. We'd also export StrongBlobRef, which does not need to be exported.

@dcoutts
Copy link
Collaborator Author

dcoutts commented Nov 7, 2024

Thanks for the detailed review @jorisdral !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants