This document describes how we will implement server-side copy.
- it should be an atomic operation.
- there are no decrypt/encrypt operations on the object data: we don't change the encryption parameters. We may need to re-encrypt the object's metadata.
- only committed objects can be copied.
- we want to avoid downloading and re-uploading the data.
- the most common scenario in S3 is server-side move, implemented in two steps:
  - server-side copy of the source object to the destination object
  - deletion of the source object.

  Clients expect server-side move to be a fast operation, so we should make server-side copy as fast as possible too. The impact on operations on the original object should be minimized.
- atomic server-side copy is limited to objects below 5GB in size.
- for objects between 5GB and 5TB (the maximum object size), the Upload Part-Copy API must be used.
- the caller needs read access to the source object and write access to the destination bucket.
- metadata can be preserved or replaced.
- copying an object to the same location is the recommended way of changing its metadata.
At some point, we will implement upload-part-copy, so we need to take this into account in the design. Upload-part-copy allows the user to define byte ranges.
Source and destination objects must have the same expiration date, as they share the same pieces on the storage nodes. Storage nodes are aware of a piece's expiration date and routinely delete expired pieces.
In our database, we keep the following information about an object (`objects` table):

- `ProjectID`, `BucketName`, `ObjectKey`, `Version`, `StreamID` (ObjectStream)
- `CreatedAt`, `ExpiresAt`
- `Status` (Pending/Committed), `SegmentCount`
- `EncryptedMetadataNonce`, `EncryptedMetadata`, `EncryptedMetadataEncryptedKey`
- `TotalPlainSize`, `TotalEncryptedSize`, `FixedSegmentSize`, `Encryption`, `ZombieDeletionDeadline`
And the following information about its segments (`segments` table):

- `StreamID`, `Position` (`SegmentPosition` = part + index in part)
- `CreatedAt`, `RepairedAt`, `ExpiresAt`
- `RootPieceID`, `EncryptedKeyNonce`, `EncryptedKey`
- `EncryptedSize`, `PlainSize`, `PlainOffset`, `EncryptedETag`
- `Redundancy`
- `InlineData`, `Pieces`

Each `Piece` is a reference to a piece on a storage node. It contains the piece number and the node ID of the storage node holding the data.
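For illustration, here is a minimal SQL sketch of these two tables. Column types, defaults, and keys are simplified assumptions for the examples that follow, not the actual schema:

```sql
-- Simplified sketch of the metabase tables described above.
-- Types and constraints are illustrative, not the real schema.
CREATE TABLE objects (
    project_id   BYTEA NOT NULL,
    bucket_name  BYTEA NOT NULL,
    object_key   BYTEA NOT NULL,
    version      INT4  NOT NULL,
    stream_id    BYTEA NOT NULL,

    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    expires_at   TIMESTAMPTZ,

    status        INT2 NOT NULL, -- Pending / Committed
    segment_count INT4 NOT NULL DEFAULT 0,

    encrypted_metadata_nonce         BYTEA,
    encrypted_metadata               BYTEA,
    encrypted_metadata_encrypted_key BYTEA,

    total_plain_size     INT8 NOT NULL DEFAULT 0,
    total_encrypted_size INT8 NOT NULL DEFAULT 0,
    fixed_segment_size   INT4 NOT NULL DEFAULT 0,
    encryption           INT8 NOT NULL DEFAULT 0,
    zombie_deletion_deadline TIMESTAMPTZ,

    PRIMARY KEY (project_id, bucket_name, object_key, version)
);

CREATE TABLE segments (
    stream_id BYTEA NOT NULL,
    position  INT8  NOT NULL, -- part number + index within the part

    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    repaired_at TIMESTAMPTZ,
    expires_at  TIMESTAMPTZ,

    root_piece_id       BYTEA,
    encrypted_key_nonce BYTEA,
    encrypted_key       BYTEA,

    encrypted_size INT4 NOT NULL DEFAULT 0,
    plain_size     INT4 NOT NULL DEFAULT 0,
    plain_offset   INT8 NOT NULL DEFAULT 0,
    encrypted_etag BYTEA,

    redundancy INT8 NOT NULL DEFAULT 0,

    inline_data         BYTEA,
    remote_alias_pieces BYTEA, -- the Pieces: (piece number, node) references

    PRIMARY KEY (stream_id, position)
);
```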
The general idea is that the satellite copies the row corresponding to the source object and adapts it to fit the newly created object. It needs to adjust:

- `BucketName`, `ObjectKey`, `StreamID`
- `CreatedAt`
- `EncryptedMetadataNonce`, `EncryptedMetadata`, `EncryptedMetadataEncryptedKey`

(`ExpiresAt` cannot be changed, as the storage nodes holding expired pieces will delete them.)
It also needs to copy the corresponding segments (the same operation as the one done in upload-part-copy). It needs to adjust:

- `StreamID`
- `CreatedAt`
- `EncryptedKeyNonce`, `EncryptedKey`

If it copies an inline segment, it just copies `InlineData`.
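As a sketch, using the simplified schema above (the `:new_*` placeholders stand for values derived from the copy request), the object-row copy is essentially an `INSERT ... SELECT` with a few columns overridden:

```sql
-- Sketch: duplicate the object row, overriding only the fields that change.
INSERT INTO objects (
    project_id, bucket_name, object_key, version, stream_id,
    created_at, expires_at, status, segment_count,
    encrypted_metadata_nonce, encrypted_metadata, encrypted_metadata_encrypted_key,
    total_plain_size, total_encrypted_size, fixed_segment_size, encryption
)
SELECT
    project_id,
    :new_bucket_name,            -- adjusted: destination bucket
    :new_object_key,             -- adjusted: destination key
    version,
    :new_stream_id,              -- adjusted: freshly generated stream id
    now(),                       -- adjusted: new CreatedAt
    expires_at,                  -- unchanged: must match the source
    status,
    segment_count,
    :new_metadata_nonce,         -- adjusted: metadata re-encrypted by the uplink
    :new_metadata,
    :new_metadata_encrypted_key,
    total_plain_size, total_encrypted_size, fixed_segment_size, encryption
FROM objects
WHERE project_id  = :project_id
  AND bucket_name = :bucket_name
  AND object_key  = :object_key
  AND version     = :version
  AND status      = :committed;  -- only committed objects can be copied
```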
The remote pieces of a segment are not copied, but we keep track of the number of segments referencing them. As a consequence, the amount of data actually stored on the storage nodes will be less than the sum of the sizes advertised in the segments table.
`ExpiresAt` cannot be changed, as the expiration date of the original segment is used by the storage node when deciding whether it should delete pieces. `RepairedAt` should be retrieved from the original segment.
If the source object is deleted between the calls to `BeginCopyObject` and `FinishCopyObject`, a `FailedPrecondition` error is returned.
The `segments` table contains an `ancestor_stream_id` column. It is empty for original segments; for copied segments, it contains the `stream_id` of the original segment. The `segments` table also contains a `reference_counter` integer counting the number of segments having this segment's `stream_id` as their `ancestor_stream_id`.
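A minimal sketch of this variant, on top of the simplified schema above:

```sql
-- Variant: track copies directly in the segments table.
ALTER TABLE segments
    ADD COLUMN ancestor_stream_id BYTEA;                  -- NULL for original segments
ALTER TABLE segments
    ADD COLUMN reference_counter INT8 NOT NULL DEFAULT 0; -- copies pointing at this segment
```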
In this solution, `ancestor_stream_id` has to be read at each download, which could cause performance issues if this information was kept in another table.
Consider 3 segments `A`, `B`, and `C` with the corresponding stream IDs `A.stream_id`, `B.stream_id`, and `C.stream_id`. Segment `B` is a copy of segment `A`. Segment `C` is a copy of segment `B`.
`A` is uploaded:

| segment | counter | ancestor_stream_id |
|---------|---------|--------------------|
| A       | 0       | -                  |
`B` is uploaded:

| segment | counter | ancestor_stream_id |
|---------|---------|--------------------|
| A       | 1       | -                  |
| B       | 0       | A.stream_id        |
If we use a reference list, we would have to add `B.stream_id` to `A.reference_list`.
`C` is uploaded. We have two choices: we can mark `C` as a copy of `B`, or resolve links and mark `C` as a copy of `A`. The table below reflects the second choice:

| segment | counter | ancestor_stream_id |
|---------|---------|--------------------|
| A       | 2       | -                  |
| B       | 0       | A.stream_id        |
| C       | 0       | A.stream_id        |
If we use a reference list, we would have to add `C.stream_id` to `A.reference_list`.
When we want to download a segment, we first check `ancestor_stream_id`. If it is not set, `remote_alias_pieces` is used. Otherwise, the `remote_alias_pieces` of the segment referenced by `ancestor_stream_id` is used.
For a segment that is not a copy, only the standard query is used. A second query for retrieving `remote_alias_pieces` is run only if `ancestor_stream_id` is set.
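The download path in this variant could look like the following sketch; the second query is issued only for copies:

```sql
-- Standard lookup; ancestor_stream_id comes back in the same round trip.
SELECT remote_alias_pieces, inline_data, encrypted_size, ancestor_stream_id
FROM segments
WHERE stream_id = :stream_id AND position = :position;

-- Only if ancestor_stream_id was set: fetch the original's pieces.
SELECT remote_alias_pieces
FROM segments
WHERE stream_id = :ancestor_stream_id AND position = :position;
```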
Let's consider the deletion of each segment at this point:

- `B` or `C` is deleted:
  - with a reference counter: `A.counter` (`A` being the `ancestor_stream_id` of `B` and `C`) is decreased.
  - with a reference list: the `stream_id` of `B` or `C` is removed from `A.reference_list`.
- `A` is deleted (original segment):
  - with a reference counter: as `A`'s counter is not zero, the first segment with `ancestor_stream_id = A` becomes the original segment. Its reference counter is set to `A.counter - 1`, and `remote_alias_pieces` is copied over.
  - with a reference list: as `A.reference_list` is not empty, the first `stream_id` in `A.reference_list` becomes the original segment. Its reference list is set to `A.reference_list` minus its own `(stream_id, position)`.
In the most common case we foresee (i.e. copy `A` to `B` and, right after, delete `A`):

- copy: create a row for segment `B`; the reference counters become:

  | segment | counter | ancestor_stream_id |
  |---------|---------|--------------------|
  | A       | 1       | -                  |
  | B       | 0       | A                  |
- deleting `A`:
  - with a reference counter: look for the single copy of `A` to update its `ancestor_stream_id` (we want to avoid having an index on `ancestor_stream_id`, which could make this query slow).
  - with a reference list: we don't have to make another query; we just use the `stream_id` stored in `A.reference_list`.
Having a reference list speeds up the delete operation in our predicted most common use case. It could also ease debugging in case something goes wrong.
We have a similar problem with the reference counter and the reference list in the `segments` table: both need an update of all children's `ancestor_stream_id` when the original segment is deleted.
This scan can be avoided by adding a `segment_copies` table that keeps track of copied segments' ancestors. It contains `(stream_id, ancestor_stream_id)` with a primary key on `stream_id`.
When creating a copy `B` of segment `A`, an entry `(B.stream_id, A.stream_id)` is created. When creating a copy `C` of segment `B`, an entry `(C.stream_id, A.stream_id)` is created.
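A sketch of this table and of the entries created in the example above (types are assumptions; the index anticipates the ancestor look-up discussed below):

```sql
CREATE TABLE segment_copies (
    stream_id          BYTEA NOT NULL PRIMARY KEY, -- the copy
    ancestor_stream_id BYTEA NOT NULL              -- the original it points at
);

-- Supports the look-up by ancestor performed when deleting an original
-- (see "We need at least an index on ancestor_stream_id" below).
CREATE INDEX segment_copies_ancestor_idx ON segment_copies (ancestor_stream_id);

-- Copy B of A, then copy C of B: links are resolved to the original A.
INSERT INTO segment_copies (stream_id, ancestor_stream_id)
VALUES (:b_stream_id, :a_stream_id);
INSERT INTO segment_copies (stream_id, ancestor_stream_id)
VALUES (:c_stream_id, :a_stream_id);
```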
When downloading a segment, after querying the `segments` table:

- if the segment is not a copy of a remote segment (`remote_alias_pieces` not null, or `inline_data` not null, or zero `encrypted_size`), it is processed as it currently is.
- if it is a copy of a remote segment (no `remote_alias_pieces`, no `inline_data`, and non-zero `encrypted_size`), the `ancestor_stream_id` is read from the reference table, and the corresponding `remote_alias_pieces` are retrieved from the `segments` table.
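A sketch of the corresponding queries; only the second one touches the reference table, and only for rows that look like copies:

```sql
-- 1) Regular segment lookup (always executed).
SELECT remote_alias_pieces, inline_data, encrypted_size
FROM segments
WHERE stream_id = :stream_id AND position = :position;

-- 2) Executed only when the row has no remote_alias_pieces, no inline_data,
--    and a non-zero encrypted_size: resolve the ancestor and fetch its pieces.
SELECT s.remote_alias_pieces
FROM segment_copies sc
JOIN segments s
  ON s.stream_id = sc.ancestor_stream_id
 AND s.position  = :position
WHERE sc.stream_id = :stream_id;
```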
When deleting the original segment `A`, the following is done:

- a look-up for elements with `ancestor_stream_id = A.stream_id` is performed.
- if such elements exist (they do in our example):
  - the first entry with `ancestor_stream_id = A.stream_id` gives the `stream_id` that will replace `A.stream_id` as `ancestor_stream_id` in the `segment_copies` table. In our example, it would be `B.stream_id`.
  - the necessary fields for the new original segment (`B`) are updated in the `segments` table (size information, `RepairedAt`, `ExpiresAt`, `remote_alias_pieces`).
  - `A.stream_id` is replaced by `B.stream_id` in the `ancestor_stream_id` column of the `segment_copies` table.
- segment `A` is deleted from the `segments` table.
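Putting it together, here is a sketch of deleting an original `A` that has at least one copy, as one transaction. The exact statement ordering, and the removal of `B`'s own reference row, are assumptions of this sketch:

```sql
BEGIN;

-- Pick one copy of A to promote (B in our example); application code
-- binds the result to :b_stream_id. No row means A was never copied.
SELECT stream_id FROM segment_copies
WHERE ancestor_stream_id = :a_stream_id
LIMIT 1;

-- B becomes an original: remove its own reference row...
DELETE FROM segment_copies WHERE stream_id = :b_stream_id;

-- ...and re-point all remaining copies of A at B.
UPDATE segment_copies
SET ancestor_stream_id = :b_stream_id
WHERE ancestor_stream_id = :a_stream_id;

-- Give B's rows the data that so far only lived on A's rows.
UPDATE segments AS b
SET remote_alias_pieces = a.remote_alias_pieces,
    repaired_at         = a.repaired_at,
    expires_at          = a.expires_at
FROM segments AS a
WHERE a.stream_id = :a_stream_id
  AND b.stream_id = :b_stream_id
  AND b.position  = a.position;

-- Finally, delete A's rows.
DELETE FROM segments WHERE stream_id = :a_stream_id;

COMMIT;
```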
Without an index, this look-up would require a full scan of the `segment_copies` table, which could slow down deletes, especially for segments that have never been copied. We need at least an index on `ancestor_stream_id`. Another solution would be to add a column to the `segments` table indicating whether the segment has been copied.
In the most common case we foresee (copy/delete), this means we will have:

- one lookup with an index and one insert in the `segments` table (create copy)
- one insert in the `segment_copies` table (create copy)
- one lookup (on an indexed column) in the `segment_copies` table, to find that the only existing copy becomes the original (delete original)
- one update in the `segments` table (delete original)
- one delete in the `segments` table (delete original)
We choose to implement the reference table. This way, we don't need to modify the `segments` table, and we keep track of all copy references, so that debugging, should it be needed, is simplified. This solution causes no overhead for downloading an original object (whether it has been copied or not). It has some overhead for deleting an original segment: even if the segment has never been copied, a scan of the reference table is necessary, as we have to check that it has no copies.
We limit to 100 the number of objects for which an object can be the `ancestor_stream_id`, as a way to limit the growth of the `segment_copies` table.
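A sketch of the check enforcing this limit before a new reference is inserted:

```sql
-- Reject the copy when the source already has 100 copies.
SELECT count(*) >= 100 AS limit_reached
FROM segment_copies
WHERE ancestor_stream_id = :a_stream_id;
```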
The copy operation behaves a lot like the move operation, except that we don't modify the source object and its segments.
- Uplink sends an `ObjectBeginCopyRequest`.
- Satellite responds with an `ObjectBeginCopyResponse` containing the encrypted keys of all segments.
- Uplink re-encrypts the keys with the new object key and sends an `ObjectFinishCopyRequest`.
- Satellite creates the new segments with the new encrypted keys.
- Satellite updates the reference table to keep track of the link between the source `stream_id` and the destination `stream_id`.
`ObjectBeginCopyRequest`

- contains `Bucket`, `EncryptedObjectKey`, `NewBucket`, `NewEncryptedObjectKey`.

`ObjectBeginCopyResponse`

- contains `StreamId`, `EncryptedMetadataKeyNonce`, `EncryptedMetadataKey`, an array of `(Position, EncryptedKeyNonce, EncryptedKey)` for all segments, and `EncryptionParameters`.

`ObjectFinishCopyRequest`

- contains `StreamId`, `NewBucket`, `NewEncryptedObjectKey`, `NewEncryptedMetadataKeyNonce`, `NewEncryptedMetadataKey`, `NewEncryptedMetadata`, and an array of new `(Position, EncryptedKeyNonce, EncryptedKey)` for all segments.

`ObjectFinishCopyResponse`

- contains the object's metadata.
- Garbage Collection:
  - No impact. Copied segments have no `remote_alias_pieces`, so there are no piece references to add to the bloom filter.
- Audit/Reputation:
  - Audit should ignore segments with a set `ancestor_stream_id`.
- Data Repair:
  - No impact. The checker, responsible for identifying injured segments, does not consider segments with no `remote_alias_pieces`.
- Graceful Exit:
  - No impact. Graceful exit is done at the piece level. As copied segments hold no pieces, this does not impact the service.
- Metainfo > Expired Deletion:
  - No impact. Copies have the same expiration time as originals. Expired objects cannot be accessed by the user.
- Accounting/Tally:
  - No impact. We bill copies as original objects.
- Metrics:
  - We should add the metric `req_copy_object` in the metainfo endpoint to measure the use of this functionality.
The first implementation will only handle objects whose number of segments fits in a single request (e.g. 10000). Copying objects with a larger number of segments should be rejected.
As a next step, we can handle copying objects with a large number of segments by keeping the partial results of the copy in a separate database table. The final copy operation will be applied once all new encrypted keys have been collected. For such an operation, the `BeginCopyObject` method should return a unique operation identifier to be used with `FinishCopyObject`. This gives us the ability to safely link begin and finish requests together. Keeping partial results should be limited in time, as the user may start a copy operation but never finish it (e.g. cancellation, crash during the operation).
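One possible shape for such a holding table, as a sketch; the table name, columns, and operation-identifier mechanism are assumptions, not a settled design:

```sql
-- Hypothetical holding table for partial copy state of large objects.
CREATE TABLE pending_copy_segments (
    operation_id            BYTEA NOT NULL, -- returned by BeginCopyObject
    position                INT8  NOT NULL,
    new_encrypted_key_nonce BYTEA NOT NULL, -- re-encrypted by the uplink
    new_encrypted_key       BYTEA NOT NULL,
    created_at              TIMESTAMPTZ NOT NULL DEFAULT now(), -- for time-based cleanup

    PRIMARY KEY (operation_id, position)
);
```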
- migration step for adding the `segment_copies` table with columns `stream_id, ancestor_stream_id`
- protobuf for `BeginCopyObject` and `FinishCopyObject` in common
- add endpoint `BeginCopyObject` to metainfo; the corresponding tests should check that:
  - an error is returned if the object does not exist or is in pending state
  - an error is returned if the number of segments is above the maximum number of segments
  - the correct info is returned
- add endpoint `FinishCopyObject` to metainfo; the corresponding tests should check that:
  - an error is returned if called on a deleted object
  - an error is returned if there is a mismatch between the number of segments in `FinishCopyObject` and the number of segments of the source object
  - an error is returned if there is an overflow of the reference counter of the source object
- add method `CopyObject(oldBucket, newBucket, oldName, newName)` to the uplink public API
- adapt `ListSegments`
Later we can:

- assess the impact on satellite performance
- derive the destination `stream_id` from the source `stream_id` to get better performance
As stated earlier, we are going to benchmark different solutions. When doing so, we will have the following considerations:
- the solution should have as little impact as possible on operations on a segment that has not been copied.
- we want to minimize the impact on downloads first, then on uploads.
- we think the most common use case will be the copy of an object followed by the deletion of the original object. If this is the case, the percentage of copies should be small, and we should not have copies of copies most of the time.
The test plan will be a separate document (see storj/storj#4317) and will be available under `docs/testplan/server-side-copy-testplan.md`.
- do we add a column to the `segments` table, or query the `segment_copies` table each time we need to know whether a segment is an original (i.e. whether it is referenced by other segments)?