Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latest import csv #3533

Closed
wants to merge 108 commits into from
Closed

Conversation

timifasubaa
Copy link
Contributor

@timifasubaa timifasubaa commented Sep 26, 2017

This PR allows users to import CSV files to superset. It extends the work that had already been done in (#2381) and also adds the new functionality of uploading to a hive datasource. All feedback to make the code and user experience better is welcome.

@mistercrunch

Wood and others added 30 commits November 14, 2016 13:00
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.6%) to 69.481% when pulling 77158dcea97038685bf190904dfbf423003ae3f7 on timifasubaa:latest_import_csv into 6f1351f on apache:master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.6%) to 69.481% when pulling 77158dcea97038685bf190904dfbf423003ae3f7 on timifasubaa:latest_import_csv into 6f1351f on apache:master.

@coveralls
Copy link

coveralls commented Oct 6, 2017

Coverage Status

Coverage decreased (-0.6%) to 69.481% when pulling 77158dcea97038685bf190904dfbf423003ae3f7 on timifasubaa:latest_import_csv into 6f1351f on apache:master.

@timifasubaa timifasubaa force-pushed the latest_import_csv branch 3 times, most recently from ea84375 to e35f3d2 Compare October 7, 2017 05:03
@coveralls
Copy link

coveralls commented Oct 7, 2017

Coverage Status

Coverage decreased (-0.6%) to 69.481% when pulling e35f3d21485127adb851928c92bdc85067446301 on timifasubaa:latest_import_csv into 6f1351f on apache:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.6%) to 69.481% when pulling e35f3d21485127adb851928c92bdc85067446301 on timifasubaa:latest_import_csv into 6f1351f on apache:master.

'if_exists': form.if_exists.data,
'index': form.index.data,
'index_label': form.index_label.data,
'chunksize': 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: we like trailing commas, we should see if pylint can require it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(maybe pylint does, and flake8 doesn't)

@@ -686,6 +771,58 @@ def fetch_result_sets(cls, db, datasource_type, force=False):
return BaseEngineSpec.fetch_result_sets(
db, datasource_type, force=force)

@staticmethod
def upload_csv(form, table):
"uploads a csv file and creates a superset table in the right databse"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

form.csv_file.data.filename)
column_names = get_column_names(upload_path)
schema_definition =\
", ".join([col_name + " STRING " for col_name in column_names])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: s/col_name/s/g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here.

upload_path,
'airbnb-superset',
os.path.join(upload_prefix, table_name, filename))
sql = "CREATE EXTERNAL TABLE {table_name} ( {schema_definition} ) "\
Copy link
Member

@mistercrunch mistercrunch Oct 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: 3rd comment about triple-quoted strings

@@ -7,7 +7,7 @@ msgid ""
msgstr ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

damn dude. these .po files are still in your PR

import pickle
import re
import time
import traceback

import os
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: when there are many imports in a module, standard lib import should be together in a group (empty line between import groups), and alpha-sorted within that group preferably

organizing imports becomes more important as there are more imports in the module

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

flash(message, 'info')
return redirect('/tablemodelview/list/')


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: one empty line too many

import sqlalchemy as sqla

from flask import (
g, request, redirect, flash, Response, render_template, Markup,
url_for)
from flask_appbuilder import expose
url_for, send_from_directory)
Copy link
Member

@mistercrunch mistercrunch Oct 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused? pylint should trigger on unused imports

form.if_exists.data = 'append'
all_datasources = []
all_datasources = db.session.query(
models.Database.sqlalchemy_uri, models.Database.database_name)\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: line break would be much better after the coma

csv_file.save(os.path.join(config['UPLOAD_FOLDER'], filename))
return filename

form.names.data = form.names.data.split(",") if form.names.data else 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like form.names.data should receive consistent types, not list or Int depending... Maybe [] or None?


form.names.data = form.names.data.split(",") if form.names.data else 0

database = db.session.query(models.Database)\
Copy link
Member

@mistercrunch mistercrunch Oct 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a comment on this PR about how we like the method chaining calls to be wrap across lines before.

database = (
    db.session.query(models.Database)
    .filter_by(sqlalchemy_uri=form.data.get('con'))
    .one()
)


from superset import app, db
import superset.models.core as models

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: uneeded blank line

@coveralls
Copy link

coveralls commented Oct 9, 2017

Coverage Status

Coverage decreased (-0.6%) to 69.451% when pulling 4685ec8f0dd1aaed62bb6ad77b19698af44b19ec on timifasubaa:latest_import_csv into 64ef8b1 on apache:master.

@timifasubaa timifasubaa force-pushed the latest_import_csv branch 3 times, most recently from 6df5191 to 47c0e4b Compare October 9, 2017 18:59
@coveralls
Copy link

coveralls commented Oct 9, 2017

Coverage Status

Coverage decreased (-0.6%) to 69.461% when pulling ae6a4bb on timifasubaa:latest_import_csv into 64ef8b1 on apache:master.

@coveralls
Copy link

coveralls commented Oct 9, 2017

Coverage Status

Coverage increased (+13.5%) to 83.6% when pulling 60624b6 on timifasubaa:latest_import_csv into 64ef8b1 on apache:master.

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're starting to get somewhere, should we add some unit tests?

if not allowed_file(filename):
flash("Invalid file type selected.", 'alert')
return False
kwargs = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I was super clear about the kwargs comment earlier, and the verbose boilerplate isn't that bad either (others may consider the verbose, metaprogramming-free option better).

Knowing the form object has a data dict:
http://wtforms.simplecodes.com/docs/0.6.1/forms.html#wtforms.form.Form.data

You can do something like:

form_data = form.data
csv_to_df_fields = ['names', 'filepath_or_buffer', ...]
kwargs = {s: form_data.get(s) for s in csv_to_df_fields}
kwargs['chunksize'] = 10000
# other alterations as needed
df = BaseEngineSpec.csv_to_df(**kwargs)

return df

@staticmethod
def df_to_db(**kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you change the function signature to def df_to_db(df, table, con, schema, **kwargs): then you don't need the 5 following lines and avoid juggling with kwargs

del kwargs['df']
del kwargs['table']
kwargs['con'] = create_engine(kwargs['con'], echo=False)
df.to_sql(**kwargs)
Copy link
Member

@mistercrunch mistercrunch Oct 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having taken stuff out of kwargs in the signature, now here you may to pass those things explicitely as in df.to_sql(con=create_engine(con, schema=schema, echo=False), **kwargs).

It's worth taking the time to read how those unpack operators work.

add_columns = ['database', 'schema', 'table_name']

def form_get(self, form):
form.sep.data = ','
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the normal way of setting defaults for wtforms is to do it when creating the form objects. As in:
BooleanField('name', default=True, ...)

form.decimal.data = '.'
form.error_bad_lines.data = False
form.if_exists.data = 'append'
all_datasources = db.session.query(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one may have to stay here since you don't want to do it in module scope...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


if not bucket_path:
logging.info("No upload bucket specified")
flash(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i18n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@mistercrunch mistercrunch mentioned this pull request Oct 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.