[SQL Lab] Async query results serialization with MessagePack and PyArrow #8069

robdiciuccio · 2019-08-19T22:06:56Z

SUMMARY

Async query performance in SQL Lab is fairly poor, particularly with large result sets, because of how data is serialized and stored. This PR introduces a `RESULTS_BACKEND_USE_MSGPACK` config option that serializes the pandas DataFrame directly with PyArrow and serializes the results payload with MessagePack. Compared to the existing JSON serialization, Arrow and MessagePack are faster and produce much smaller payloads to send to S3 or other cache backends.

Benchmarks with 100K rows of `birth_names` examples data (multiple runs)

JSON

avg serialization: 573ms
avg deserialization: 200ms
total: 773ms
compressed payload size: 816718
peak memory usage: 281.1 MiB

Arrow/msgpack

avg serialization: 70ms
avg deserialization: 40ms
total: 110ms
compressed payload size: 452634
peak memory usage: 266.2 MiB

Benchmarks were performed on a MacBook Pro (2.6 GHz i7, 32 GB RAM) running macOS 10.14.5 and Python 3.6.8. Memory usage was measured with memory-profiler.
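The relative gains are easier to read as ratios. The short script below only does arithmetic on the figures reported above; the dictionary keys and helper names are hypothetical, and no new measurements are involved.

```python
# Benchmark figures copied from the listing above (100K rows of birth_names).
JSON_RUN = {"ser_ms": 573, "deser_ms": 200, "payload": 816718, "peak_mib": 281.1}
ARROW_RUN = {"ser_ms": 70, "deser_ms": 40, "payload": 452634, "peak_mib": 266.2}


def total_ms(run: dict) -> int:
    """Serialization plus deserialization time for one run."""
    return run["ser_ms"] + run["deser_ms"]


def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is than 'before'."""
    return before / after


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which 'after' is smaller than 'before'."""
    return (before - after) / before * 100


print(f"total speedup: {speedup(total_ms(JSON_RUN), total_ms(ARROW_RUN)):.2f}x")
print(f"payload reduction: {reduction_pct(JSON_RUN['payload'], ARROW_RUN['payload']):.1f}%")
print(f"peak memory reduction: {reduction_pct(JSON_RUN['peak_mib'], ARROW_RUN['peak_mib']):.1f}%")
```

On these numbers the round trip is roughly seven times faster and the compressed payload a little under half smaller, while peak memory changes only slightly.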

TEST PLAN

Testing thus far has been limited to Postgres, mainly on Superset examples data. For full compatibility testing:

1. Enable `RESULTS_BACKEND_USE_MSGPACK = True` in `superset_config.py`
2. Run queries in SQL Lab with various DB backends containing multiple data types
3. Ensure displayed results and CSV downloads contain correctly formatted data
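The first step amounts to a one-line override. A minimal sketch of the relevant fragment of `superset_config.py`, using the option name from this PR; every other setting in the file is omitted here:

```python
# superset_config.py (fragment)

# Serialize async SQL Lab results with PyArrow + MessagePack instead of JSON.
RESULTS_BACKEND_USE_MSGPACK = True
```

A later commit in this PR enables the option by default, in which case setting it to `False` would serve as the escape hatch discussed in the review.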

ADDITIONAL INFORMATION

REVIEWERS

@mistercrunch

codecov-io · 2019-08-19T22:20:00Z

Codecov Report

Merging #8069 into master will increase coverage by 0.04%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master    #8069      +/-   ##
==========================================
+ Coverage   65.89%   65.94%   +0.04%     
==========================================
  Files         485      485              
  Lines       22917    22961      +44     
  Branches     2537     2537              
==========================================
+ Hits        15102    15142      +40     
- Misses       7683     7687       +4     
  Partials      132      132

Impacted Files	Coverage Δ
superset/__init__.py	`74.1% <100%> (+0.18%)`	⬆️
superset/utils/core.py	`87.83% <100%> (ø)`	⬆️
superset/dataframe.py	`94.82% <100%> (+0.18%)`	⬆️
superset/config.py	`88.76% <100%> (+0.06%)`	⬆️
superset/sql_lab.py	`77.35% <79.31%> (+0.51%)`	⬆️
superset/views/core.py	`71.44% <81.81%> (+0.33%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7ac1a29...0a9f213.

robdiciuccio · 2019-08-19T22:31:14Z

cc @etr2460 on SQL Lab backend performance work

etr2460 · 2019-08-19T23:40:39Z

superset/__init__.py

@@ -214,6 +214,7 @@ def index(self):
 security_manager = appbuilder.sm

 results_backend = app.config.get("RESULTS_BACKEND")
+results_backend_use_msgpack = app.config.get("RESULTS_BACKEND_USE_MSGPACK")


Generally, I think we like to add the config var to the config.py file so that it's easier to see all the configurations available. That said, I'm wondering if this should be a config var at all if it has such good performance improvements. If you're worried about it being globally enabled without testing, maybe make this a feature flag that we can default off to begin with, then remove and enable everywhere once we're sure everything is good.

mistercrunch · 2019-08-20

Agreed, all configs should be set as default in superset/config.py and commented/documented there.

robdiciuccio · 2019-08-20

Feature flag feels a bit heavy for this case, since it's a global setting and not needed on the frontend. My main concern about pushing this change without a config flag is the lack of testing with data sources other than Postgres, but I do think we should push that testing forward by perhaps defaulting RESULTS_BACKEND_USE_MSGPACK = True, and providing an escape hatch to disable should problems crop up.

etr2460 · 2019-08-22

Hmm, maybe I have a misunderstanding of the semantics around making something a feature flag. I understood it as gating a new feature that we would want to roll out to 100% of users in the future. So while testing, people can enable/disable the feature, with the expectation that it will be enabled by default in the future. Feature flags are then necessary to implement on the frontend to support that goal, but aren't required to be both frontend and backend changes. Maybe @mistercrunch can clarify which understanding is correct.

robdiciuccio · 2019-08-22

My understanding of the feature flag framework is limited, so open to feedback here. Since the change is more middleware than feature, and adding the flag to the JS payload isn't necessary, it feels more like config, but curious what others think.

mistercrunch · 2019-08-23

Config keys are static, environment-wide scoped.

Feature flags can be dynamic and currently they all flow to the frontend. Being dynamic, they can be used to do progressive rollouts or A/B testing.

This current flag in this PR seems more like the former to me.
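The distinction drawn in this thread can be illustrated with a small, generic sketch. It does not use Superset's feature flag framework; `CONFIG`, `rollout_flag`, and the user-ID bucketing rule are all hypothetical and exist only to contrast a static, environment-wide value with a flag that is resolved per call.

```python
from typing import Callable

# Static config: set once, identical for every request in the environment.
CONFIG = {"RESULTS_BACKEND_USE_MSGPACK": True}


def rollout_flag(percent: int) -> Callable[[int], bool]:
    """Build a dynamic flag that is on for roughly `percent`% of user IDs."""

    def is_enabled(user_id: int) -> bool:
        # Bucket users by ID so the same user always gets the same answer.
        return user_id % 100 < percent

    return is_enabled


# A progressive rollout to 25% of users: the answer depends on who is asking.
new_serializer_enabled = rollout_flag(25)
for uid in (10, 80):
    print(uid, new_serializer_enabled(uid), CONFIG["RESULTS_BACKEND_USE_MSGPACK"])
```

A serialization format used for cached results has to be the same for whoever writes and whoever reads a given entry, which is one reason a static, environment-wide setting fits this change better than a per-user flag.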

mistercrunch

Overall, as noted inline, I'm thinking msgpack + pyarrow is superior and should be the new default. The feature flag rolls with a sub-optimal default and two code paths to maintain. I vote for pushing this forward.

robdiciuccio · 2019-08-23T22:53:48Z

Closing & reopening to re-trigger Travis build

mistercrunch · 2019-08-23T23:57:49Z

Rebasing might solve the JS build issue; it's been discussed in the #airbnb-lyft-sprint channel on Superset Slack.

robdiciuccio added 5 commits August 13, 2019 12:07

Add support for msgpack results_backend serialization

7de63b9

Serialize DataFrame with PyArrow rather than JSON

16d13d9

Adjust dependencies, de-lint

1763df3

Add tests for (de)serialization methods

d763102

Add MessagePack config info to Installation docs

d550f0f

pull-request-size bot added the size/L label Aug 19, 2019

etr2460 reviewed Aug 19, 2019

mistercrunch approved these changes Aug 20, 2019

robdiciuccio added 3 commits August 21, 2019 12:23

Enable msgpack/arrow serialization by default

ecc6b88

[Fix] Prevent msgpack serialization on synchronous queries

2607276

Add type annotations

8e9a085

robdiciuccio closed this Aug 23, 2019

robdiciuccio reopened this Aug 23, 2019

Merge branch 'master' into rd/results-backend-msgpack

0a9f213

mistercrunch merged commit 7595d9e into apache:master Aug 27, 2019

mistercrunch deleted the rd/results-backend-msgpack branch August 27, 2019 21:23

bkyryliuk mentioned this pull request Sep 12, 2019

[WIP] Make pyarrow and msgpack optional #8218

Closed

6 tasks

dpgaspar added the preset label Nov 13, 2019

mistercrunch added preset-io and removed preset labels Nov 26, 2019

robdiciuccio mentioned this pull request Feb 10, 2020

[sqllab] fix: return pandas records in execute_sql_statements #9102

Merged

14 tasks

wjones127 mentioned this pull request May 11, 2023

sql_lab.py uses PyArrow API that was removed in pyarrow 12.0.0 #24030

Closed

mistercrunch added the 🏷️ bot and 🚢 0.35.0 labels Feb 28, 2024
