feat(format): add additional features to 1.1.0 spec #765

lidavidm · 2023-06-12T17:33:29Z

Intends to tackle:

github-actions · 2023-06-12T17:33:41Z

⚠️ Please follow the Conventional Commits format in CONTRIBUTING.md for PR titles.

lidavidm · 2023-06-20T16:17:14Z

@zeroshade this should be the rest of the requested APIs for 1.1.0, if this looks reasonable I'll port to Java and Go

adbc.h

zeroshade · 2023-06-20T16:57:07Z

adbc.h

+/// \brief Get a hierarchical view of system and user-defined functions.
+///
+/// The result is an Arrow dataset with the following schema:
+///
+/// | Field Name               | Field Type              |
+/// |--------------------------|-------------------------|
+/// | catalog_name             | utf8                    |
+/// | catalog_db_schemas       | list<DB_SCHEMA_SCHEMA>  |
+///
+/// DB_SCHEMA_SCHEMA is a Struct with fields:
+///
+/// | Field Name               | Field Type              |
+/// |--------------------------|-------------------------|
+/// | db_schema_name           | utf8                    |
+/// | db_schema_functions      | list<FUNCTION_SCHEMA>   |
+///
+/// FUNCTION_SCHEMA is a Struct with fields:
+///
+/// | Field Name               | Field Type              | Comments |
+/// |--------------------------|-------------------------|----------|
+/// | function_name            | utf8 not null           |          |
+/// | remarks                  | utf8                    | (1)      |
+/// | function_type            | int16                   | (2)      |
+/// | specific_name            | utf8                    | (3)      |
+/// | function_columns         | list<COLUMN_SCHEMA>     |          |


Given the difficulty and complexity we've seen with GetObjects, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?

I've opted to keep it for consistency. I think this being difficult is a reflection of the Arrow APIs still being immature, since all things considered this is a fairly straightforward nested schema.

That said you could say GetTableStatistics doesn't follow it; I can adjust that.

Given the difficulty and complexity we've seen with GetObjects, do we want to follow the same path here for UDFs and Stored procedures? Or do we want to try to find a less complex / less nested route for this to avoid the same problems / complaints?

Can you elaborate on the difficulty? Is it about easily accessing nested data from C++? If so, perhaps some separate helpers could be added for ease of navigating struct arrays and scalars...

Building and traversing these structures is a bit verbose and easy to mess up. Also this is all via Nanoarrow which of course is a more minimal API. I've been prototyping a code generator (generating iterators/builders for fixed schemas, sort of like Protobuf) as a possible solution but it isn't anywhere near ready yet.

I was hoping some of the helpers defined in utils.h via https://github.com/apache/arrow-adbc/pull/769/files#diff-3a15518d6cfd7263838e981957bf0280c06324e662700a6e8409969e3c5db6ed could help here. These are still probably too strict for a lot of use cases today but wondering if we can't iterate on those to help?

That's my idea with the code generator, to be able to just generate the code for helpers along those lines to build/access these structures and check that the schema is the expected schema

adbc.h

lidavidm · 2023-06-21T13:44:13Z

Updated to include some additional common statistics after reviewing Hive.

It may also be good to include the 'histogram' statistic (Oracle, Hive, PostgreSQL) but the encoding for this gets very messy with Arrow: either you need a union[list[int], list[double], ...], or you have to devise some way to pack it into a binary column. (Or possibly we can just include list[binary] and declare that things need to be packed there.)

Regardless, the statistic schema is now fairly complicated with nesting, unions, and dictionary-encoding.

pitrou · 2023-06-22T13:58:50Z

adbc.h

+/// 2. A dictionary-encoded statistic name. Values in [0, 1024) are
+///    reserved for ADBC use.  Other values are free for
+///    implementation-specific statistics.  For the names and
+///    definitions of predefined statistic types, see \ref
+///    adbc-table-statistics.


This is a bit weird. Does it mean the dictionary must have more than 1024 entries if impl-specific stats are desired?

Ah...right, dictionaries aren't sparse. I will probably abandon this idea then and just have it contain strings, and take the potential space hit (likely won't be a big deal).

Well, I think you want to make lookups easy (and reasonably efficient?) as well. As is "I want the min and max of column a.b.c".

We could just make it an int column with a separate call to get the "dictionary" (without making it a formal Arrow dictionary). Non-standardized statistic names require special handling by callers anyways.

We could also make the schema wider; this is often the actual implementation underneath. But that makes it less extensible (unless we have a generic list[struct[string, union[...]]] column at the end)

Switched to just int16 with a separate call to get driver-specific statistic values.

adbc.h

pitrou · 2023-06-22T14:09:01Z

adbc.h

@@ -907,6 +1008,34 @@ AdbcStatusCode AdbcDatabaseNew(struct AdbcDatabase* database, struct AdbcError*
 AdbcStatusCode AdbcDatabaseGetOption(struct AdbcDatabase* database, const char* key,
                                     const char** value, struct AdbcError* error);

+/// \brief Get a bytestring option of the database.
+///
+/// This must always be thread-safe (other operations are not).


Hmm... but if AdbcDatabaseGetOptionBytes is called from multiple threads, then it is undefined if the return value is still valid when the calling thread consumes it, right?

(same problem as with POSIX getenv, actually: https://pubs.opengroup.org/onlinepubs/9699919799/functions/getenv.html)

Gah. Possibly "thread-safe with respect to other functions"?

The only reason why I wanted it to be thread-safe was to allow for a progress indicator (on GetDouble). Maybe we can split that out into a dedicated function.

Or you switch to malloc-allocated output, or you let the caller pass a suitable buffer area...

So far things have avoided assuming libc malloc. We could return a struct with a release callback?

Caller-allocated buffers get annoying because you have to somehow tell the caller the size...

Caller-allocated buffers get annoying because you have to somehow tell the caller the size...

But at least the lifetime is safe and it makes things simpler for the producer.

Fair. I'm also not sure why functions like vsnprintf choose to do that in the first place; maybe some POSIX convention

getenv is slightly different as global state where the application may not know what libraries it uses are also doing with the global state; here the operation is scoped to a specific handle that presumably an application can manage itself

Ok, how about something like this?

LGTM

Updated, I suppose we should also update the base GetOption (returns char*) though

Updated that too.

adbc.h

Intends to tackle: - apache#621 - apache#685 - apache#736 - apache#755

lidavidm · 2023-06-26T16:30:08Z

There's been no movement here so I'll update this PR with Go and Java definitions

lidavidm · 2023-06-26T18:26:36Z

Ok, Java and Go definitions are now present.

lidavidm · 2023-06-27T17:29:44Z

I'm going to merge this for now so I can start drafting #319; we can come back and revisit the Go/Java APIs if needed (I know zeroshade is out for a bit so I don't want this held up)

Intends to tackle: - #621 - #685 - #736 - #755

lidavidm changed the title ~~format: add additional features to 1.1.0 spec~~ feat(format): add additional features to 1.1.0 spec Jun 12, 2023

lidavidm force-pushed the spec-1.1.0-clarifications branch from df64b7c to 04ecc45 Compare June 15, 2023 20:01

lidavidm force-pushed the spec-1.1.0 branch from 0692bf4 to f18f1cc Compare June 19, 2023 16:46

lidavidm force-pushed the spec-1.1.0-clarifications branch 2 times, most recently from c0486ed to f91c0db Compare June 20, 2023 16:14

zeroshade reviewed Jun 20, 2023

View reviewed changes