-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor blob refs #454
base: main
Are you sure you want to change the base?
Refactor blob refs #454
Conversation
Not yet used in this patch. Also move the BlobSpan defintion to this module.
Where before it contained a file handle and a ref counter, it now just contains a BlobFile (which itself is the pair of a handle and counter).
Where before it contained a file handle and a ref counter, it now just contains a BlobFile (which itself is the pair of a handle and counter).
data BlobRef m h previously contained: blobRefFile :: !h which meant allmost all uses of BlobRef had to be BlobRef m (Handle h) Now we make that internal: blobRefFile :: !(FS.Handle h) and so all uses are now simply BlobRef m h This continues with the trend to avoid having to use (Handle h) everywhere. This is a large but simple patch, that just deals with the fallout of this local change.
To better reflect that it does not itself maintain a reference count, and to distinguish it from a new type we will introduce soon: a strong blob ref. Then we will have a clear type distinction: raw, weak and strong.
Originally it contained the file handle and the refcounter of either the Run or write buffer that the blob ref came from (pointed to). In a previous patch we changed the Run and WriteBufferBlobs to themselves contain an independently ref-counted BlobFile, and so at the same time the blob ref refcount refers to the blob file refcount not the refcount of the parent object. In this patch we now change RawBlobRef to directly hold a BlobFile. This makes the representation logically more uniform between the run and write buffer cases: since we now just point to the BlobFile being held by each blob source. This refactoring will make things much clearer once we switch the representation of ref-counted objects. Since we'll maintain a reference to a BlobFile rather than a lower-level combo of a file handle and reference count.
Previously a RawBlobRef served dual purpose for raw and strong. Keep that distinction clear. For the moment, all three have equivalent representations, but this is likely to change in a later refactoring of the reference counting API. Also split the functions for making blob refs from runs or the write buffer into the two variants that we use: raw (internally) and weak (externally).
The only uses of BlobRef.readBlob were in the context of withWeakBlobRef so we can go straight to a helper function that does just that. Also remove now-unused WriteBufferBlobs.readBlob
Instead of open-coding the block IO stuff in the top level Internal module in retrieveBlobs, it is moved next to BlobRef.readWeakBlobRef, so we have the singular and bulk versions next to each other.
We don't actually need to export the StrongBlobRef type since all uses of it are temporary, for doing I/O.
Noticed while refactoring
Currently, pre-snapshots, the Run does _not_ delete its blob files when the run is closed, but the WriteBufferBlobs _does_ delete its blob file. Insert a temporary hack to accomodate this. This commit can be reverted as part of implementing snapshots properly.
20c2161
to
102597d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I agree this makes the handling of blob references nicer. Some suggestions:
-- | Adopt an existing open file handle to make a 'BlobFile'. The file must at | ||
-- least be open for reading (but may or may not be open for writing). | ||
-- | ||
-- The finaliser will close and delete the file. | ||
-- | ||
newBlobFile :: | ||
PrimMonad m | ||
=> HasFS m h | ||
-> FS.Handle h | ||
-> m (BlobFile m h) | ||
newBlobFile fs blobFileHandle = do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this function should be in charge of creating the file handle itself. That seems nicer than having BlobFile
adopt an existing file handler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way because the different use cases differ in how they handle the file: open for reading only in runs, but read/write in write buffer. I figured, just have it deal with closing. But yeah, we could change it, it'd need to take a read/write param.
readBlobFile :: | ||
(MonadThrow m, PrimMonad m) | ||
=> HasFS m h | ||
-> BlobFile m h | ||
-> BlobSpan | ||
-> m SerialisedBlob | ||
readBlobFile fs BlobFile {blobFileHandle} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to put the HasFS
in the BlobFile
type so that we don't have to pass it around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be pretty readily available at all the call sites.
Also, when we read multiple blob refs (which each may refer to a different file) we must submit that I/O using a single HasBlockIO
.
This is one of those cases where type classes and class coherence are a positive benefit.
@@ -102,17 +103,14 @@ import qualified System.Posix.Types as FS (ByteCount) | |||
-- | |||
data WriteBufferBlobs m h = | |||
WriteBufferBlobs { | |||
blobFileHandle :: {-# UNPACK #-} !(FS.Handle h) | |||
blobFile :: !(BlobFile m h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we implement singular reference, would this become something along the lines of Ref BlobFile
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly. And the different versions of BlobRef use different kinds of reference to the BlobFile
: raw (no Ref
), strong Ref
, and weak WeakRef
.
That's exactly where this refactoring is going. It makes it possible to use the singular reference style.
addReference WriteBufferBlobs {blobFileRefCounter} = | ||
RC.addReference blobFileRefCounter | ||
addReference WriteBufferBlobs {blobFile} = | ||
RC.addReference (blobFileRefCounter blobFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add BlobFile.addReference
and use that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping we'd just switch to the Ref
API and then those things go away anyway. Then there's no more add reference, just create, duplicate and release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module is missing SPECIALISE
pragmas
{-# LANGUAGE DeriveAnyClass #-} | ||
{-# LANGUAGE DeriveGeneric #-} | ||
{-# LANGUAGE DerivingStrategies #-} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeriveGeneric
seems unused, and the others are enabled by default, I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy'n'paste 😄
import System.FS.API (HasFS) | ||
import qualified System.FS.BlockIO.API as FS | ||
|
||
-- | An open handle to a file containing blobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- | An open handle to a file containing blobs. | |
-- | A handle to a file containing blobs. |
{-# SPECIALISE readBlobFile :: HasFS IO h -> BlobFile IO h -> BlobSpan -> IO SerialisedBlob #-} | ||
readBlobFile :: | ||
(MonadThrow m, PrimMonad m) | ||
=> HasFS m h | ||
-> BlobFile m h | ||
-> BlobSpan | ||
-> m SerialisedBlob | ||
readBlobFile fs BlobFile {blobFileHandle} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: readFile
typically means "read the whole file". Maybe it should be just readBlob
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read from a blob file. Yes, maybe needs renaming.
-- | A raw blob reference is a reference to a blob within a blob file. | ||
-- | ||
-- See 'Database.LSMTree.Common.BlobRef' for more info. | ||
data BlobRef m h = BlobRef { | ||
blobRefFile :: !h | ||
, blobRefCount :: {-# UNPACK #-} !(RefCounter m) | ||
, blobRefSpan :: {-# UNPACK #-} !BlobSpan | ||
-- The \"raw\" means that it does no reference counting, so does not maintain | ||
-- ownership of the 'BlobFile'. Thus these are only safe to use in the context | ||
-- of code that already (directly or indirectly) owns the blob file that the | ||
-- blob ref uses (such as within run merging). | ||
-- | ||
-- Thus these cannot be handed out via the API. Use 'WeakBlobRef' for that. | ||
-- | ||
data RawBlobRef m h = RawBlobRef { | ||
rawBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h) | ||
, rawBlobRefSpan :: {-# UNPACK #-} !BlobSpan | ||
} | ||
deriving stock (Show) | ||
|
||
instance NFData h => NFData (BlobRef m h) where | ||
rnf (BlobRef a b c) = rnf a `seq` rnf b `seq` rnf c | ||
|
||
-- | Location of a blob inside a blob file. | ||
data BlobSpan = BlobSpan { | ||
blobSpanOffset :: {-# UNPACK #-} !Word64 | ||
, blobSpanSize :: {-# UNPACK #-} !Word32 | ||
} | ||
deriving stock (Show, Eq) | ||
|
||
instance NFData BlobSpan where | ||
rnf (BlobSpan a b) = rnf a `seq` rnf b | ||
|
||
blobRefSpanSize :: BlobRef m h -> Int | ||
blobRefSpanSize = fromIntegral . blobSpanSize . blobRefSpan | ||
-- | A \"weak\" reference to a blob within a blob file. These are the ones we | ||
-- can return in the public API and can outlive their parent table. | ||
-- | ||
-- They are weak references in that they do not keep the file open using a | ||
-- reference count. So when we want to use our weak reference we have to | ||
-- dereference them to obtain a normal strong reference while we do the I\/O | ||
-- to read the blob. This ensures the file is not closed under our feet. | ||
-- | ||
-- See 'Database.LSMTree.Common.BlobRef' for more info. | ||
-- | ||
data WeakBlobRef m h = WeakBlobRef { | ||
weakBlobRefFile :: {-# NOUNPACK #-} !(BlobFile m h) | ||
, weakBlobRefSpan :: {-# UNPACK #-} !BlobSpan | ||
} | ||
deriving stock (Show) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any difference between raw and weak blob references, other than that the names serve as documentation? Is it necessary to make the distinction if both are a blob reference without ownership?
I'm probably answering my own question here, but yes the dinstinction is necessary, because weak references have to be upgraded before reading the blobs. However, you could make the same distinction by getting rid of RawBlobRef
and providing unsafeReadWeakBlobRef
and readWeakBlobRef
functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weak ones will change to use WeakRef
once we use the new Ref
API. So then we'll get the enforcement from the types rather than convention. In the meantime it's a helpful type distinction.
A raw one is "safe" to read, but you must already own a reference. A weak one cannot be read without upgrading to a strong ref.
-> V.Vector (WeakBlobRef m h) | ||
-> m (V.Vector SerialisedBlob) | ||
readWeakBlobRefs hbio wrefs = | ||
bracket (deRefWeakBlobRefs wrefs) (V.mapM_ removeReference) $ \refs -> do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is withWeakBlobRefs
, which is also removed later. Might be okay to keep that function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, inlined here because it's the only use. We can always add it back if we need it. We'd also export StrongBlobRef
, which does not need to be exported.
Thanks for the detailed review @jorisdral ! |
PR designed to be reviewed patch by patch.
Overall, this is a refactoring in preparation for changes to the ref counting API. The ref counting change will be about making things follow a model of there being singular references, rather than manipulating raw reference counts. This changes the BlobRef representation so that it follows this style: it now holds a reference to a BlobFile (which is itself a reference counted object). Previously it was more ad-hoc. Also more cleanly separate three kinds of reference: raw, weak and strong.