Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(format): add additional features to 1.1.0 spec #765

Merged
merged 13 commits into from
Jun 27, 2023

Conversation

@github-actions
Copy link

⚠️ Please follow the Conventional Commits format in CONTRIBUTING.md for PR titles.

@lidavidm lidavidm changed the title format: add additional features to 1.1.0 spec feat(format): add additional features to 1.1.0 spec Jun 12, 2023
@lidavidm lidavidm force-pushed the spec-1.1.0-clarifications branch 2 times, most recently from c0486ed to f91c0db Compare June 20, 2023 16:14
@lidavidm
Copy link
Member Author

@zeroshade this should be the rest of the requested APIs for 1.1.0, if this looks reasonable I'll port to Java and Go

adbc.h Outdated Show resolved Hide resolved
adbc.h Outdated
Comment on lines 1263 to 1306
/// \brief Get a hierarchical view of system and user-defined functions.
///
/// The result is an Arrow dataset with the following schema:
///
/// | Field Name | Field Type |
/// |--------------------------|-------------------------|
/// | catalog_name | utf8 |
/// | catalog_db_schemas | list<DB_SCHEMA_SCHEMA> |
///
/// DB_SCHEMA_SCHEMA is a Struct with fields:
///
/// | Field Name | Field Type |
/// |--------------------------|-------------------------|
/// | db_schema_name | utf8 |
/// | db_schema_functions | list<FUNCTION_SCHEMA> |
///
/// FUNCTION_SCHEMA is a Struct with fields:
///
/// | Field Name | Field Type | Comments |
/// |--------------------------|-------------------------|----------|
/// | function_name | utf8 not null | |
/// | remarks | utf8 | (1) |
/// | function_type | int16 | (2) |
/// | specific_name | utf8 | (3) |
/// | function_columns | list<COLUMN_SCHEMA> | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the difficulty and complexity we've seen with GetObjects, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opted to keep it for consistency. I think this being difficult is a reflection of the Arrow APIs still being immature, since all things considered this is a fairly straightforward nested schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said you could say GetTableStatistics doesn't follow it; I can adjust that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the difficulty and complexity we've seen with GetObjects, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?

Can you elaborate on the difficulty? Is it about easily accessing nested data from C++? If so, perhaps some separate helpers could be added for ease of navigating struct arrays and scalars...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building and traversing these structures is a bit verbose and easy to mess up. Also this is all via Nanoarrow which of course is a more minimal API. I've been prototyping a code generator (generating iterators/builders for fixed schemas, sort of like Protobuf) as a possible solution but it isn't anywhere near ready yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping some of the helpers defined in utils.h via https://github.com/apache/arrow-adbc/pull/769/files#diff-3a15518d6cfd7263838e981957bf0280c06324e662700a6e8409969e3c5db6ed could help here. These are still probably too strict for a lot of use cases today but wondering if we can't iterate on those to help?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's my idea with the code generator, to be able to just generate the code for helpers along those lines to build/access these structures and check that the schema is the expected schema

adbc.h Outdated Show resolved Hide resolved
@lidavidm
Copy link
Member Author

Updated to include some additional common statistics after reviewing Hive.

It may also be good to include the 'histogram' statistic (Oracle, Hive, PostgreSQL) but the encoding for this gets very messy with Arrow: either you need a union[list[int], list[double], ...], or you have to devise some way to pack it into a binary column. (Or possibly we can just include list[binary] and declare that things need to be packed there.)

Regardless, the statistic schema is now fairly complicated with nesting, unions, and dictionary-encoding.

adbc.h Outdated
Comment on lines 1773 to 1777
/// 2. A dictionary-encoded statistic name. Values in [0, 1024) are
/// reserved for ADBC use. Other values are free for
/// implementation-specific statistics. For the names and
/// definitions of predefined statistic types, see \ref
/// adbc-table-statistics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit weird. Does it mean the dictionary must have more than 1024 entries if impl-specific stats are desired?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah...right, dictionaries aren't sparse. I will probably abandon this idea then and just have it contain strings, and take the potential space hit (likely won't be a big deal).

Copy link
Member

@pitrou pitrou Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think you want to make lookups easy (and reasonably efficient?) as well. As is "I want the min and max of column a.b.c".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just make it an int column with a separate call to get the "dictionary" (without making it a formal Arrow dictionary). Non-standardized statistic names require special handling by callers anyways.

We could also make the schema wider; this is often the actual implementation underneath. But that makes it less extensible (unless we have a generic list[struct[string, union[...]]] column at the end)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to just int16 with a separate call to get driver-specific statistic values.

adbc.h Outdated Show resolved Hide resolved
adbc.h Outdated
@@ -907,6 +1008,34 @@ AdbcStatusCode AdbcDatabaseNew(struct AdbcDatabase* database, struct AdbcError*
AdbcStatusCode AdbcDatabaseGetOption(struct AdbcDatabase* database, const char* key,
const char** value, struct AdbcError* error);

/// \brief Get a bytestring option of the database.
///
/// This must always be thread-safe (other operations are not).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... but if AdbcDatabaseGetOptionBytes is called from multiple threads, then it is undefined if the return value is still valid when the calling thread consumes it, right?

(same problem as with POSIX getenv, actually: https://pubs.opengroup.org/onlinepubs/9699919799/functions/getenv.html)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah. Possibly "thread-safe with respect to other functions"?

The only reason why I wanted it to be thread-safe was to allow for a progress indicator (on GetDouble). Maybe we can split that out into a dedicated function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you switch to malloc-allocated output, or you let the caller pass a suitable buffer area...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far things have avoided assuming libc malloc. We could return a struct with a release callback?

Caller-allocated buffers get annoying because you have to somehow tell the caller the size...

Copy link
Member

@pitrou pitrou Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caller-allocated buffers get annoying because you have to somehow tell the caller the size...

But at least the lifetime is safe and it makes things simpler for the producer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair. I'm also not sure why functions like vsnprintf choose to do that in the first place; maybe some POSIX convention

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getenv is slightly different as global state where the application may not know what libraries it uses are also doing with the global state; here the operation is scoped to a specific handle that presumably an application can manage itself

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, how about something like this?

LGTM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I suppose we should also update the base GetOption (returns char*) though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated that too.

adbc.h Outdated Show resolved Hide resolved
adbc.h Outdated Show resolved Hide resolved
adbc.h Outdated Show resolved Hide resolved
adbc.h Outdated Show resolved Hide resolved
@lidavidm
Copy link
Member Author

There's been no movement here so I'll update this PR with Go and Java definitions

@lidavidm
Copy link
Member Author

Ok, Java and Go definitions are now present.

@lidavidm
Copy link
Member Author

I'm going to merge this for now so I can start drafting #319; we can come back and revisit the Go/Java APIs if needed (I know zeroshade is out for a bit so I don't want this held up)

@lidavidm lidavidm merged commit 9c9975a into apache:spec-1.1.0 Jun 27, 2023
@lidavidm lidavidm deleted the spec-1.1.0-clarifications branch June 27, 2023 17:29
lidavidm added a commit that referenced this pull request Jun 27, 2023
lidavidm added a commit that referenced this pull request Jul 11, 2023
@lidavidm lidavidm added this to the ADBC API Specification 1.1.0 milestone Jul 20, 2023
lidavidm added a commit that referenced this pull request Jul 20, 2023
lidavidm added a commit that referenced this pull request Jul 21, 2023
lidavidm added a commit that referenced this pull request Aug 3, 2023
lidavidm added a commit that referenced this pull request Aug 10, 2023
lidavidm added a commit that referenced this pull request Aug 10, 2023
lidavidm added a commit that referenced this pull request Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants