Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL Lab] Async query results serialization with MessagePack and PyArrow #8069

Merged
merged 9 commits into from
Aug 27, 2019

Conversation

robdiciuccio
Copy link
Member

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

Async query performance in SQL Lab, particularly with large result sets, is fairly poor due to how data is serialized and stored. This PR introduces a RESULTS_BACKEND_USE_MSGPACK config option to use PyArrow to serialize the pandas DataFrame directly, and MessagePack for serializing the results payload. Compared to the existing JSON serialization, Arrow and MessagePack provide improved performance and result in much smaller payloads sent to S3 or other cache backends.

Benchmarks with 100K rows of birth_names examples data (multiple runs)

JSON

avg serialization: 573ms
avg deserialization: 200ms
total: 773ms
compressed payload size: 816718
peak memory usage: 281.1 MiB

Arrow/msgpack

avg serialization: 70ms
avg deserialization: 40ms
total: 110ms
compressed payload size: 452634
peak memory usage: 266.2 MiB

Benchmarks were performed on a Macbook Pro 2.6 GHz i7, 32GB running macOS 10.14.5 and Python 3.6.8. memory-profiler was used for memory usage stats.

TEST PLAN

Testing thus far has been limited to Postgres, mainly on Superset examples data. For full compatibility testing:

  • Enable RESULTS_BACKEND_USE_MSGPACK = True in superset_config.py
  • Run queries in SQL Lab with various DB backends containing multiple data types
  • Ensure displayed results and CSV downloads contain correctly formatted data

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

REVIEWERS

@mistercrunch

@codecov-io
Copy link

codecov-io commented Aug 19, 2019

Codecov Report

Merging #8069 into master will increase coverage by 0.04%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8069      +/-   ##
==========================================
+ Coverage   65.89%   65.94%   +0.04%     
==========================================
  Files         485      485              
  Lines       22917    22961      +44     
  Branches     2537     2537              
==========================================
+ Hits        15102    15142      +40     
- Misses       7683     7687       +4     
  Partials      132      132
Impacted Files Coverage Δ
superset/__init__.py 74.1% <100%> (+0.18%) ⬆️
superset/utils/core.py 87.83% <100%> (ø) ⬆️
superset/dataframe.py 94.82% <100%> (+0.18%) ⬆️
superset/config.py 88.76% <100%> (+0.06%) ⬆️
superset/sql_lab.py 77.35% <79.31%> (+0.51%) ⬆️
superset/views/core.py 71.44% <81.81%> (+0.33%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ac1a29...0a9f213. Read the comment docs.

@robdiciuccio
Copy link
Member Author

cc @etr2460 on SQL Lab backend performance work

@@ -214,6 +214,7 @@ def index(self):
security_manager = appbuilder.sm

results_backend = app.config.get("RESULTS_BACKEND")
results_backend_use_msgpack = app.config.get("RESULTS_BACKEND_USE_MSGPACK")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I think we like to add the config var to the config.py file so that it's easier to see all the configurations available. That said, I'm wondering if this should be a config var at all if it has such good performance improvements. If you're worried about it being globally enabled without testing, maybe make this a feature flag that we can default off to begin with, but then remove and enable everywhere once we're sure everything is good

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, all configs should be set as default in superset/config.py and commented/documented there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature flag feels a bit heavy for this case, since it's a global setting and not needed on the frontend. My main concern about pushing this change without a config flag is the lack of testing with data sources other than Postgres, but I do think we should push that testing forward by perhaps defaulting RESULTS_BACKEND_USE_MSGPACK = True, and providing an escape hatch to disable should problems crop up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe I have a misunderstanding of the semantics around making something a feature flag. I understood it as gating a new feature that we would want to roll out to 100% of users in the future. So while testing people can enable/disable the feature, but with the expectation of in the future it being enabled by default. Feature flags are then necessary to implement on the frontend to support that goal, but aren't required to be both frontend and backend changes. Maybe @mistercrunch can clarify which understanding is correct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the feature flag framework is limited, so open to feedback here. Since the change is more middleware than feature, and adding the flag to the JS payload isn't necessary, it feels more like config, but curious what others think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config keys are static, environment-wide scoped.

Feature flags can be dynamic and currently they all flow to the frontend. Being dynamic, they can be used to do progressive rollouts or A/B testing.

This current flag in this PR seems more like the former to me

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall as noted inline I'm thinking msgpack + pyarrow is superior and should be the new default. The feature flag rolls with a sub-optimal default and two code paths to maintain. I vote for pushing this forward.

setup.py Show resolved Hide resolved
setup.py Show resolved Hide resolved
superset/sql_lab.py Outdated Show resolved Hide resolved
superset/sql_lab.py Show resolved Hide resolved
superset/utils/core.py Show resolved Hide resolved
superset/utils/core.py Outdated Show resolved Hide resolved
superset/sql_lab.py Show resolved Hide resolved
@robdiciuccio
Copy link
Member Author

Closing & reopening to re-trigger Travis build

@mistercrunch
Copy link
Member

Rebasing might solve the JS build issue, been discussed on Superset Slack #airbnb-lyft-sprint channel

@mistercrunch mistercrunch merged commit 7595d9e into apache:master Aug 27, 2019
@mistercrunch mistercrunch deleted the rd/results-backend-msgpack branch August 27, 2019 21:23
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.35.0 labels Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 0.35.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants