Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: split lance-core into lance-core, lance-io, lance-file, and lance-table #1852

Merged
merged 10 commits into from
Jan 19, 2024

Conversation

westonpace
Copy link
Contributor

There is very little code change in this PR. It is primarily moving files around.

The only substantial change should be a change to the FileWriter. It is now either FileWriter<NotSelfDescribing> (for testing in lance-file) and FileWriter<ManifestDescribing> for use everywhere else. This is because a lance file writes the schema as a manifest and manifest is part of the table format. In the future we could fix this by creating a schema message in the protobuf in lance-file and changing the FileWriter and FileReader to read this. However, this would be a breaking change and would require a bump in the lance file version (or feature flags). I'd prefer to save that change for later.

@westonpace
Copy link
Contributor Author

Putting this in draft because it depends on #1842 and #1848

Comment on lines +43 to +62
/// The file format currently includes a "manifest" where it stores the schema for
/// self-describing files. Historically this has been a table format manifest that
/// is empty except for the schema field.
///
/// Since this crate is not aware of the table format we need this to be provided
/// externally. You should always use lance_table::io::manifest::ManifestDescribing
/// for this today.
#[async_trait]
pub trait ManifestProvider {
/// Store the schema in the file
///
/// This should just require writing the schema (or a manifest wrapper) as a proto struct
///
/// Note: the dictionaries have already been written by this point and the schema should
/// be populated with the dictionary lengths/offsets
async fn store_schema(
object_writer: &mut ObjectWriter,
schema: &Schema,
) -> Result<Option<usize>>;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new

///
/// The offsets and lengths of the written buffers are stored in the given
/// schema so that the dictionaries can be loaded in the future.
async fn write_dictionaries(writer: &mut ObjectWriter, schema: &mut Schema) -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change here, just refactored this into its own method.

Comment on lines +638 to +639
Self::write_dictionaries(&mut self.object_writer, &mut self.schema).await?;
let pos = M::store_schema(&mut self.object_writer, &self.schema).await?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the new ManifestProvider gets used.

@@ -231,156 +232,3 @@ message DeletionFile {
uint64 num_deleted_rows = 4;
} // DeletionFile

// Metadata of one Lance file.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old format.proto split into file.proto and table.proto. In the process I changed the package field. This should be a non-breaking change for protobuf as long as we aren't using these messages is a protobuf Any field (we aren't)

pub const MAJOR_VERSION: i16 = 0;
pub const MINOR_VERSION: i16 = 1;
pub const MAGIC: &[u8; 4] = b"LANC";
pub const INDEX_MAGIC: &[u8; 8] = b"LANC_IDX";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this. It was never used anywhere. I think it's a good idea to a different magic at the end of index files. But we aren't doing that today.

}
}

pub struct Fields(pub Vec<pb::Field>);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor introduction here of Fields and (later) FieldsWithMeta. This is because Field and Schema are in lance-core but the to/from protobuf methods are in lance-file. You can't do impl From<X> for Vec<Y> if X doesn't belong to your crate (even if Y does).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it makes some sense to put this here as it is the deserialization part, but part of me still thinks it's easier to just keep this associated with the Schema definition. We could consider having a separate schema.proto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making a schema.proto is fine (can do later). I agree it is a little inconsistent.

@@ -23,8 +23,6 @@ arrow-select.workspace = true
async-recursion.workspace = true
async-trait.workspace = true
lance-arrow.workspace = true
aws-config.workspace = true
aws-credential-types.workspace = true
aws-sdk-dynamodb = { workspace = true, optional = true }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect there are more dependencies that can be removed. I want to take a pass at doing this later.

@westonpace westonpace marked this pull request as ready for review January 19, 2024 20:15
Copy link
Contributor

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YOLO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually liked that the protos had a top-level folder. This separated the specification of the format from our particular implementation.

Also, portions of our format docs make direct references to these files, so if we split them like this we need to fix those references. Example:

lance/docs/format.rst

Lines 61 to 65 in f3d2ae3

.. literalinclude:: ../protos/format.proto
:language: protobuf
:linenos:
:start-at: message Metadata {
:end-at: } // Metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've moved the docs back to the central location (I hadn't realized/remembered it was a symlink) but I kept the split of format.proto into file.proto and table.proto. I've fixed the relevant rst files (and done a search for any other references to format.proto and didn't find any)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to, there is a nice directive you can use to include this readme as the module-level docs:

#![doc = include_str!("path/to/docs.md")]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll save this for a follow-up as we aren't doing it anywhere else and I probably wouldn't be able to resist the temptation to clean the module-level docs up some if I'm going to expose them.

However, this is a neat trick!

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine, except I think we should keep the proto files in that consolidated folder.

@westonpace westonpace merged commit a8def71 into lancedb:main Jan 19, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants