-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: split lance-core into lance-core, lance-io, lance-file, and lance-table #1852
refactor: split lance-core into lance-core, lance-io, lance-file, and lance-table #1852
Conversation
8acfd80
to
d3c202f
Compare
/// The file format currently includes a "manifest" where it stores the schema for | ||
/// self-describing files. Historically this has been a table format manifest that | ||
/// is empty except for the schema field. | ||
/// | ||
/// Since this crate is not aware of the table format we need this to be provided | ||
/// externally. You should always use lance_table::io::manifest::ManifestDescribing | ||
/// for this today. | ||
#[async_trait] | ||
pub trait ManifestProvider { | ||
/// Store the schema in the file | ||
/// | ||
/// This should just require writing the schema (or a manifest wrapper) as a proto struct | ||
/// | ||
/// Note: the dictionaries have already been written by this point and the schema should | ||
/// be populated with the dictionary lengths/offsets | ||
async fn store_schema( | ||
object_writer: &mut ObjectWriter, | ||
schema: &Schema, | ||
) -> Result<Option<usize>>; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is new
/// | ||
/// The offsets and lengths of the written buffers are stored in the given | ||
/// schema so that the dictionaries can be loaded in the future. | ||
async fn write_dictionaries(writer: &mut ObjectWriter, schema: &mut Schema) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No change here, just refactored this into its own method.
Self::write_dictionaries(&mut self.object_writer, &mut self.schema).await?; | ||
let pos = M::store_schema(&mut self.object_writer, &self.schema).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the new ManifestProvider
gets used.
rust/lance-table/protos/table.proto
Outdated
@@ -231,156 +232,3 @@ message DeletionFile { | |||
uint64 num_deleted_rows = 4; | |||
} // DeletionFile | |||
|
|||
// Metadata of one Lance file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old format.proto
split into file.proto
and table.proto
. In the process I changed the package
field. This should be a non-breaking change for protobuf as long as we aren't using these messages is a protobuf Any
field (we aren't)
pub const MAJOR_VERSION: i16 = 0; | ||
pub const MINOR_VERSION: i16 = 1; | ||
pub const MAGIC: &[u8; 4] = b"LANC"; | ||
pub const INDEX_MAGIC: &[u8; 8] = b"LANC_IDX"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this. It was never used anywhere. I think it's a good idea to a different magic at the end of index files. But we aren't doing that today.
} | ||
} | ||
|
||
pub struct Fields(pub Vec<pb::Field>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor introduction here of Fields
and (later) FieldsWithMeta
. This is because Field
and Schema
are in lance-core
but the to/from protobuf methods are in lance-file
. You can't do impl From<X> for Vec<Y>
if X
doesn't belong to your crate (even if Y
does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it makes some sense to put this here as it is the deserialization part, but part of me still thinks it's easier to just keep this associated with the Schema
definition. We could consider having a separate schema.proto
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making a schema.proto
is fine (can do later). I agree it is a little inconsistent.
rust/lance-core/Cargo.toml
Outdated
@@ -23,8 +23,6 @@ arrow-select.workspace = true | |||
async-recursion.workspace = true | |||
async-trait.workspace = true | |||
lance-arrow.workspace = true | |||
aws-config.workspace = true | |||
aws-credential-types.workspace = true | |||
aws-sdk-dynamodb = { workspace = true, optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect there are more dependencies that can be removed. I want to take a pass at doing this later.
d3c202f
to
b0e2708
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YOLO
rust/lance-table/protos/table.proto
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually liked that the protos had a top-level folder. This separated the specification of the format from our particular implementation.
Also, portions of our format docs make direct references to these files, so if we split them like this we need to fix those references. Example:
Lines 61 to 65 in f3d2ae3
.. literalinclude:: ../protos/format.proto | |
:language: protobuf | |
:linenos: | |
:start-at: message Metadata { | |
:end-at: } // Metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've moved the docs back to the central location (I hadn't realized/remembered it was a symlink) but I kept the split of format.proto
into file.proto
and table.proto
. I've fixed the relevant rst files (and done a search for any other references to format.proto
and didn't find any)
rust/lance-io/README.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to, there is a nice directive you can use to include this readme as the module-level docs:
#![doc = include_str!("path/to/docs.md")]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll save this for a follow-up as we aren't doing it anywhere else and I probably wouldn't be able to resist the temptation to clean the module-level docs up some if I'm going to expose them.
However, this is a neat trick!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems fine, except I think we should keep the proto files in that consolidated folder.
There is very little code change in this PR. It is primarily moving files around.
The only substantial change should be a change to the
FileWriter
. It is now eitherFileWriter<NotSelfDescribing>
(for testing inlance-file
) andFileWriter<ManifestDescribing>
for use everywhere else. This is because a lance file writes the schema as a manifest and manifest is part of the table format. In the future we could fix this by creating a schema message in the protobuf in lance-file and changing theFileWriter
andFileReader
to read this. However, this would be a breaking change and would require a bump in the lance file version (or feature flags). I'd prefer to save that change for later.