Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] [CC Refactor #1] Add ConfigurationProvider, TableIdentifier, TableBuilder APIs #3781

Conversation

scottsand-db
Copy link
Collaborator

@scottsand-db scottsand-db commented Oct 18, 2024

NOTE: I won't be merging any of these PRs until the entire e2e implementation is complete and all PRs are approved

This is part of some stacked PRs:

  1. [Kernel] [CC Refactor #1] Add ConfigurationProvider, TableIdentifier, TableBuilder APIs #3781
  2. [Kernel] [CC Refactor #2] Add new Kernel CommitCoordinatorClient interface #3782

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

New Interfaces/Classes

  • adds a new ConfigurationProvider interface, that will let kernel get per-session configurations in the future
  • adds a new TableIdentifier class, that kernel will pass on to Commit Coordinator services
  • adds a new TableBuilder interface, that that constructs kernel Table instances, with optional config and tableId params

Cleanup

  • Remove unecessary test references to code like TableImpl.forPath(engine, tablePath, () => System.currentTimeMillis) --> the System.currentTimeMilis is the default clock, anyways. No need to point to an internal API for that, too.

How was this patch tested?

Existing UTs.

Does this PR introduce any user-facing changes?

Yes. New ConfigurationProvider interface, new TableIdentifier class, new TableBuilder interface

@scottsand-db scottsand-db self-assigned this Oct 18, 2024
@scottsand-db scottsand-db changed the title [Kernel] [Commit Coordinator Refactor #1] Add ConfigurationProvider, TableIdentifier, TableBuilder APIs [Kernel] [CC Refactor #1] Add ConfigurationProvider, TableIdentifier, TableBuilder APIs Oct 18, 2024
@@ -181,14 +197,6 @@ public CloseableIterator<ColumnarBatch> getChanges(
});
}

protected Path getDataPath() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move protected methods to be below public methods


package io.delta.kernel;

/** Identifier for a table. e.g. <catalog> / <schema> / <tableName> */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@since 3.3.0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Evolving


package io.delta.kernel;

/** Identifier for a table. e.g. <catalog> / <schema> / <tableName> */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[error] /home/runner/work/delta/delta/kernel/kernel-api/src/main/java/io/delta/kernel/TableIdentifier.java:19:1:  error: unknown tag: catalog
[error] /** Identifier for a table. e.g. <catalog> / <schema> / <tableName> */

Copy link
Contributor

@raveeram-db raveeram-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some qns (not blocking)

/** Returns the name of the table. */
public String getName() {
return name;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toString (for printing into logs). equals, hashCode (may add if needed later, but I do see this is useful in near future).

Comment on lines +26 to +27
* <p>These values are not Delta table properties. Delta table properties are stored in the Delta
* log metadata and apply globally to all readers and writers interacting with the table.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give an example of where this gets used or what sort of configuration can it contain?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First example usage

The implementation of

  CommitCoordinatorClient Engine::getCommitCoordinatorClient(
      String commitCoordinatorName,
      ConfigurationProvider sessionConfig,
      Map<String, String> commitCoordinatorConf);

will use it to lookup which builder to use for the given commitCoordinatorName.

It will lookup a value like io.delta.kernel.commitCoordinatorBuilder.unity-catalog.impl

Second example usage

Then, the actual builder referenced above will lookup various keys / configurations needed to instantiate the client.

* @param key the configuration key
* @return {@code true} if the key exists, else {@code false}
*/
boolean contains(String key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contains is tricky to define for conf. There is always some default conf and what does it mean when call contains.

Is it better to just have Optional<String> get(String key) and avoid the multiple APIs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Thanks!

* configuration is provided or expected.
*/
public class EmptyConfigurationProvider implements ConfigurationProvider {
@Override
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can define just one static instance and use it everywhere rather than creating a new instance everytime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that could be even in ConfigurationProvider (avoids a new file).

import java.util.Optional;

public class TableBuilderImpl implements TableBuilder {
private final Engine engine;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we avoid this kind of pattern where Engine is stored in class member variables.

Copy link
Collaborator Author

@scottsand-db scottsand-db Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. We will have to think about how to work around this, then

// Table.java
static TableBuilder builder() { return new TableBuilderImpl(); }

// TableBuilder.java
public Table build(Engine engine, String path)

requireNonNull(engine, "Expected non-null value for 'engine'");
requireNonNull(path, "Expected non-null value for 'path'");

this.engine = engine;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.engine = engine;
this.engine = requireNonNull(engine, "blah blah");

* @since 3.3.0
*/
@Evolving
public interface ConfigurationProvider {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the lifecycle of this object? Can Kernel maintain a reference to it? expected to release it after some event? can Engine update it while Kernel is using it (i.e return different values at different times for the same key)?

Comment on lines +21 to +25
/** The name of the table. */
private final String name;

/** The namespace of the table. e.g. <catalog> / <schema> */
private final String[] namespace;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The order of constructor arguments & getters is namespacename. I would recommend using the same order in fields.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Will do! I'm still working on some more PRs to fully flesh out this design. Will get to this later.

Comment on lines +28 to +29
this.namespace = namespace;
this.name = name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireNonNull seems missing. Also, it would be nice to ensure non-empty in my opinion.

Comment on lines +40 to +41
requireNonNull(engine, "Expected non-null value for 'engine'");
requireNonNull(path, "Expected non-null value for 'path'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing message pattern of requireNonNull is {variable} is null. I would recommend using the same pattern for consistency.

Comment on lines +33 to +37
private Optional<ConfigurationProvider> configProviderOpt = Optional.empty();
private Optional<TableIdentifier> tableIdOpt = Optional.empty();

/** Note: This is not for a public API. It's for internal testing use only. */
private Optional<Clock> clockOpt = Optional.empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to add opt suffix in my opinion. It's obvious from the type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear that. This is certainly one of my personal preferences. I like variables being fully self-describing. The value isn't a Clock, but distinctly a Clock Optional. With this current naming convention, if we did a .map, I could refer to the inner value as clock, but if I had the outer variable also named clock, this would get confusing.

* @since 3.3.0
*/
@Evolving
public interface ConfigurationProvider {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UC Commit Coordinator Client needs to be able to iterate through all configs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants