Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Apache Drill #6610

Merged
merged 7 commits into from
May 29, 2019
Merged

Conversation

cgivre
Copy link
Contributor

@cgivre cgivre commented Jan 8, 2019

This is the first attempt at support for Apache Drill in Superset. This uses John Omernik's SQLAlchemy dialect (https://github.com/JohnOmernik/sqlalchemy-drill).

Drill has a unique aspect in its namespace which proved problematic. Drill uses the concepts of:

  • Storage Plugin
  • Workspace
  • Table (or file)

When connecting to Drill to a non-file based system, the behavior is pretty much as a traditional database. However, when Drill is querying a file based system, you can have additional . in the namespace that are not namespace related. For example:

dfs.test.`data.json`

In this case, the storage plugin is dfs, the workspace which is optional is test, and the table, which in this case is a file, is data.json. In this case however, the last period is not part of the namespace and was consistently misinterpreted by Superset.

The solution I found, which seemed like complete hackery, was to replace the last period with @@@ in the dropdown in SQLlab. Then I modified view/core.py to replace the @@@ with a period in the various functions that generate tables.

UPDATE: All Hackery Removed

@mistercrunch mistercrunch changed the title Initial Commit Add support to Apache Drill Jan 8, 2019
superset/views/core.py Outdated Show resolved Hide resolved
@kristw kristw added enhancement:request Enhancement request submitted by anyone from the community change:backend Requires changing the backend labels Jan 8, 2019
@cgivre
Copy link
Contributor Author

cgivre commented Jan 25, 2019

There is one small issue with this PR that I would really like to resolve but I'm a little stuck on.
I have Superset connected to Drill and Drill connected to a MySQL database. Superset is correctly getting the tables, and schemata!
(Yay!)
When you try to access a file-based table via Drill, there’s one small problem.
As I mentioned in the PR, the issue is that filenames have dots in them, so when the schema was being passed to Superset you might have something like this
dfs.test.somefile.csv
The problem being that when Superset parses that, it parses that fourth period as a part of the namespace rather than a file extension
So… the workaround I found was, in the dropdown in sqllab, replace the value (not the label) with @@@ so that it gets parsed correctly.
This works as shown below:

screen shot 2019-01-21 at 16 46 51

Here's where I'm stuck.

In all the other functions where I could find this, I added a hook to clean up the file name, specifically, to look for @@@ and replace it with a ..
I can’t seem to find the code that generates the heading for the table preview and the heading for the column listing. Once I can figure this out, I think Drill will pretty much work perfectly with Superset!! Any suggestions would be greatly appreciated. @mistercrunch @kristw

screen shot 2019-01-21 at 16 48 10

Thanks!!

@cgivre
Copy link
Contributor Author

cgivre commented Jan 28, 2019

@mistercrunch @kristw
Hello all,
I managed to get everything working properly! I think this is ready for review.

@codecov-io
Copy link

codecov-io commented Jan 29, 2019

Codecov Report

Merging #6610 into master will decrease coverage by 0.02%.
The diff coverage is 38.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6610      +/-   ##
==========================================
- Coverage   65.26%   65.24%   -0.03%     
==========================================
  Files         435      435              
  Lines       21485    21503      +18     
  Branches     2379     2379              
==========================================
+ Hits        14023    14030       +7     
- Misses       7342     7353      +11     
  Partials      120      120
Impacted Files Coverage Δ
superset/db_engine_specs.py 61.89% <38.88%> (-0.46%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7d3413...7209c9d. Read the comment docs.

@cgivre
Copy link
Contributor Author

cgivre commented Jan 29, 2019

@mistercrunch All unit tests passed. Superset + Drill is a really awesome combination. Do you want me to squash all commits?

@mistercrunch
Copy link
Member

I don't think it's ok to have references to @@@ scattered across the code base. It really should be limited to the db_engine_spec entry for drill. Looks like there are few entries on the frontend and one on the backend outside of db_engine_spec to clean up.

@JohnOmernik
Copy link

I am watching this PR very closely. I think this is a great addition to Superset and look forward to seeing it merged.

@cgivre
Copy link
Contributor Author

cgivre commented Feb 26, 2019

Hi @mistercrunch ,
I'm a little stuck and wondering if perhaps you could point me in the right direction.
I've removed all the static references to @@@ in the code base, instead calling the function in db_engine_specs with the exception of two that I can't seem to figure out.

They are: superset/assets/src/SqlLab/components/SouthPane.jsx and superset/assets/src/SqlLab/components/SqlEditorLeftBar.jsx.

I can't seem to figure out how these functions are populated or called from the Python/Flask side.

onTableChange(tableName, schemaName) {
	    this.props.actions.addTable(this.props.queryEditor, tableName, schemaName);
	    // Mod for Apache Drill
	    const cleanTableName = tableName.replace('@@@', '.');
	    this.props.actions.addTable(this.props.queryEditor, cleanTableName, schemaName);
	  }

If you could point me in the right direction there, and show me where in the python code these functions are getting the data, I can wrap this up.
Thanks in advance.

@mistercrunch
Copy link
Member

@cgivre I'm happy to help push that through and providing pointers. So here's the view:
https://github.com/apache/incubator-superset/blob/master/superset/views/core.py#L2303 and more specifically around here?
https://github.com/apache/incubator-superset/blob/master/superset/views/core.py#L2349

I didn't dig super deep as I don't have time to dig right this moment, but wanted to give some pointers.
Let me know if the pointers aren't helpful, I'll take the time to dig in deeper.

@cgivre
Copy link
Contributor Author

cgivre commented Feb 28, 2019

@mistercrunch Thanks for your response. The PR already had a hook there to clean up the table name (see below)

https://github.com/apache/incubator-superset/blob/a05fc76633a93ccc6a33ac0badd9f3969cca35bf/superset/views/core.py#L2311

I've went through most of the flles looking for anything like table_name or tmp_table_name and run it through the clean up hook. Any help is greatly appreciated.

@cgivre
Copy link
Contributor Author

cgivre commented Mar 28, 2019

@mistercrunch @kristw
I'm wondering if we could mark this as a Work In Progress and integrate this PR. The latest version has two places in the code with the @@@ hard coded.

https://github.com/apache/incubator-superset/blob/184c6c625090da91dedff151f12ed560d83f2385/superset/assets/src/SqlLab/components/SqlEditorLeftBar.jsx#L64

and

https://github.com/apache/incubator-superset/blob/184c6c625090da91dedff151f12ed560d83f2385/superset/assets/src/SqlLab/components/SouthPane.jsx#L107

There are some things which I'd like to contribute but really can't until this is committed. I'll keep working on it and hopefully get some more help as more people start using Drill+Superset.

@JohnOmernik

@kristw
Copy link
Contributor

kristw commented Mar 28, 2019

@cgivre I am not familiar with the backend code so I cannot fully review the code.

However, regarding your request to merge as Work in Progress, Superset is used in production environment for many organizations, making master stability very important. Therefore, I will have a hard time merging incompleted code into master. PR are usually labelled WIP (Work in Progress) when the work is not completed so reviewers will not even review the PR.

However, if you feel that there are parts of this PR that are solid already and ready to be merged, please keep only that part and get rid of everything else. You can also help the reviewers understand what are the remaining issues that make you say this is still WIP.

If you can clean out the incompleted parts, after receiving approval from @mistercrunch or other backend experts, we can merge.

@cgivre
Copy link
Contributor Author

cgivre commented Mar 28, 2019

Hi @kristw
I do not believe that this PR will introduce any instability or is not production ready. As I indicated, it passes all the unit tests and works perfectly when I've used it myself.

The remaining issue is that @mistercrunch prefers to have code that is specific to one database in the db_engine_specs.py file. (Which makes perfect sense and something I completely understand) However, there are two places in this PR that I could not get to work without putting database specific code into another file. If someone can show me how I can fix that, I will gladly do so, but I'm really stuck at this point.

As I said, I do feel this PR is solid and ready to be merged. I'm willing to keep working on it to figure out how to move this code to the db_engine_specs.py file.

@cgivre
Copy link
Contributor Author

cgivre commented Apr 17, 2019

Bump... Any update?

@villebro
Copy link
Member

@cgivre maybe I can help; @yciabaud recently ran into something similar with Presto+Pulsar which was addressed in #7297 . While slightly different there seem to be parallels. Can you summarize where you are stuck right now to help me get up to speed on your issue? By the looks of it this is quite close to an elegant solution but still needs that last finishing touch.

@cgivre
Copy link
Contributor Author

cgivre commented Apr 19, 2019

Hi @villebro
Thanks for responding! Basically, I added a function to the db_engine_specs file that replaces the file extension with @@@. Then I wrote another function in the db_engine_specs that removes it.

What was happening was that Superset was working great with Drill when it was querying a database, but if something had a file, the dot for the file extension would mess up the URLs.

My approach works, however, there were two places that I couldn't figure out how to call the database specific function in db_engine_specs so I just added the string replacement in the code. The two locations are:
https://github.com/apache/incubator-superset/blob/ce3f9c30a925330763b394903d1a16ce6a90a495/superset/assets/src/SqlLab/components/SouthPane.jsx#L107

and

https://github.com/apache/incubator-superset/blob/ce3f9c30a925330763b394903d1a16ce6a90a495/superset/assets/src/SqlLab/components/SqlEditorLeftBar.jsx#L64

As you can see, these are not python files. I couldn't figure out how to insert the Python function to clean up the table name.

Another approach which I tried but couldn't figure out was to add a function in db_engine_specs to do all the URL transformations.

Any assistance would be greatly appreciated, as I'd really like to get this integrated into Superset.
Thanks!!

@villebro
Copy link
Member

villebro commented Apr 19, 2019

It looks like the culprit is here:

https://github.com/apache/incubator-superset/blob/ce3f9c30a925330763b394903d1a16ce6a90a495/superset/assets/src/components/TableSelector.jsx#L173-L179
This seems to have been copied over from (now deprecated?) code that was introduced in a PR that's two years old: #1466
https://github.com/apache/incubator-superset/blob/5f28027ce745dfda67e18fe7df8468184400be50/superset/assets/src/SqlLab/components/SqlEditorLeftBar.jsx#L86-L96
It seems the table selector is assuming that a table name with a period is schema.table, hence omitting everything before the period.

@mistercrunch do you have any idea or context on what's going on here? It seems like this functionality has been deprecated (choosing a table without having first chosen a schema), and the above logic should be removed. I tested creating a table with a period in the name on Postgres and got a similar error. Removing the namePieces logic fixed the problem and seems the appropriate solution.

@villebro
Copy link
Member

villebro commented Apr 20, 2019

@kristw can you check out my last comment; is the above code in SqlEditorLeftBar.jsx deprecated, and is my interpretation correct that a table can't be chosen unless a schema is first chosen, making the namePieces logic obsolete?

@coderfender
Copy link
Contributor

@villebro , I was also working on this issue this weekend and figured that this is an issue on JS side.

@cgivre
Copy link
Contributor Author

cgivre commented Apr 22, 2019

Hi @villebro,
You are correct in that a table can't be chosen unless a schema is first selected.
@coderfender
If what you're saying is correct, is there any way to fix this on the Python side?

@cgivre
Copy link
Contributor Author

cgivre commented May 2, 2019

Thanks @villebro! I appreciate your help on this.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments to help you test this PR with #7453 and clean up unnecessary changes. If and when #7453 gets merged, this PR should only have to modify docs/installation.rst and superset/db_engine_specs.py, as the javascript parts are already tackled by #7453.

superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Show resolved Hide resolved
@cgivre cgivre changed the title Add support to Apache Drill [WIP] Add support for Apache Drill May 13, 2019
@cgivre cgivre force-pushed the apache_drill branch 2 times, most recently from e7691c2 to b7e8b3e Compare May 22, 2019 03:32
Copy link
Contributor Author

@cgivre cgivre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made all the requested changes. At this point, the PR really is only overwriting the functions relating the time-grain functionality for Drill. Everything seems to work without any mods outside of the db_engine_specs.py file.

superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Outdated Show resolved Hide resolved
superset/db_engine_specs.py Show resolved Hide resolved
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, when you feel this is complete please remove WIP from the title.

@cgivre cgivre changed the title [WIP] Add support for Apache Drill Add support for Apache Drill May 28, 2019
@pull-request-size pull-request-size bot added size/M and removed size/L labels May 28, 2019
@cgivre
Copy link
Contributor Author

cgivre commented May 28, 2019

@villebro I updated the title. Thank you very much for your assistance getting this PR completed. From my perspective we are ready to go.

@villebro
Copy link
Member

Let's leave this open for a few days to see if anyone has something to add. But IMO this is ready to be merged.

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mistercrunch mistercrunch merged commit fc3b043 into apache:master May 29, 2019
@SamuelMarks
Copy link

Awesome to see. Should the README.md#database-support be updated with Apache Drill?

@mistercrunch
Copy link
Member

@cgivre ^^^

@cgivre
Copy link
Contributor Author

cgivre commented Jun 2, 2019

Thanks @mistercrunch and @SamuelMarks
I addressed this and one small bug in #7635, which @villebro just approved.

@cgivre cgivre deleted the apache_drill branch June 2, 2019 12:26
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.34.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels change:backend Requires changing the backend enhancement:request Enhancement request submitted by anyone from the community size/M 🚢 0.34.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants