Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Presto Adapter (#1106, #1229, #1230) #1245

Merged
merged 8 commits into from
Jan 23, 2019

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Jan 17, 2019

Resolves #1106, #1229, #1230

Implement a presto adapter. Presto required some massaging of the SQL adapter class as it has some novel properties - the local client does its own transaction+session management, the cursor doesn't have a status available, and queries do not appear to execute until results are fetched.

Working:

  • Views/tables/seeds
  • error handling
  • configuration
  • Linux tests, including integration tests against a local docker-compose setup on Linux.
  • Transactions, apparently, though I suspect there are bugs here
  • The various required macros
  • A nice little script that generates a bunch of boilerplate for you

Not Working/incomplete:

  • archives probably don't work and also don't raise an exception about it (see Presto: raise not implemented exceptions for unsupported functionality #1228)
  • incremental models don't work and raise an exception about it
  • I don't think the integration tests on Windows will work, and I think getting them to work on appveyor will be unpleasant, at best
  • I can't come up with a nice way to test query cancellation and I'm not sure I trust it. ctrl+c does work locally but that doesn't mean queries are actually canceled.
  • never tested against a real database
  • only "no auth required" has been tested. I don't even know how to test kerberos!
  • never tested with s3 backing instead of hdfs

Things to consider that might make plugin development easier:

  • Should schema.py's column definitions get rolled into adapters and subclasses like connection management/adapter implementations? It seems wrong to have to add database-specific type information to core in support of a plugin!
  • commit/rollback/begin transaction are not all defined in the same place. We canonically do rollback via a method but begin/commit transactions via queries. That seems odd and caught me by surprise when I was implementing this.

Jacob Beck added 8 commits January 16, 2019 16:46
 - fix sql docstring, which is just wrong.
 - extract the default seed materialization to one that takes a chunk size param
 - add NUMBERS constant for all number types dbt should support inserting
 - update Column definition to allow for "varchar"
 - seeds, views, tables, and ephemeral models all implemented
 - ConnectionManager methods and credentials
 - Adapter.date_function
 - macros
 - give presto a small chunk size for seeds
 - manually cascade drop_schema
 - stub out some stuff that does not exist in presto
 - stub out incremental materializations for now (delete is not useful)
 - stub out query cancellation for now
@beckjake beckjake changed the title Presto Adapter (#1106, #1229, #1230) [WIP] Presto Adapter (#1106, #1229, #1230) Jan 17, 2019
@drewbanin drewbanin self-requested a review January 17, 2019 16:20
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tiny comments here, will review more completely after poking around with this branch for a little bit :)

queries = [q.rstrip(';') for q in sqlparse.split(sql)]

for individual_query in queries:
# hack -- after the last ';', remove comments and don't run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were issues with this hack. For one, I think this splits on semicolons found inside of sql comments. Passable for the moment I think, but let's consider if it makes sense to handle this differently

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of pulling the logic out into the SQLAdapter. We need to do this basically everywhere

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to test this out on EMR Presto with both the Hive metastore, as well as with AWS Glue. In both cases, this code worked basically as expected! There is one caveat though -- you must set the following configs when launching your Presto cluster:

hive.metastore-cache-ttl=0s
hive.metastore-refresh-interval = 5s
hive.allow-drop-table=true
hive.allow-rename-table=true

Let's be sure to document these configs for our eventual release.

I do think that we need to do more stress testing here to feel really comfortable about deploying this code, but I feel pretty good about getting this merged and then making small tweaks as needed.

Between the quality of this code (it looks great!) and some initial testing in a realistic prod environment, I am happy to say LGTM, :shipit:, 🚢, 🎉

@beckjake beckjake merged commit 477699a into dev/stephen-girard Jan 23, 2019
@beckjake beckjake deleted the feature/presto-adapter branch February 22, 2019 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants