Optimize storage (serialization and de-serialization) of very large dictionaries inside MongoDB #4846

Merged
merged 107 commits into from
Mar 20, 2021
Changes from 4 commits
Commits
a93f9c2
Add new JSONDictField which allows us to more efficiently store,
Kami Dec 20, 2019
a89d658
Add new JSONDictField which allows us to more efficiently store,
Kami Dec 20, 2019
fe5e33d
Add a feature flag for using new json dict field, set it to false
Kami Jan 17, 2020
f0919c9
Use new JSON dict field for dictionaries which can be very large where
Kami Jan 17, 2020
59e87f9
Fix JSONDictEscapedFieldCompatibilityField to get it to work
m4dcoder Feb 22, 2020
2024a1e
Merge branch 'master' into optimize_escaped_dict_fields
Kami Feb 18, 2021
2f969f8
Add a micro-benchmark which compares execution save + read times for
Kami Feb 18, 2021
5971302
Add another micro benchmark fixture which represents a dictionary with a
Kami Feb 18, 2021
f19f0fd
Add micro-benchmark for escape_chars() and unescape_chars() and update
Kami Feb 18, 2021
81672f2
Add unit tests for the new fields.
Kami Feb 18, 2021
2a15e8b
Merge branch 'optimize_escaped_dict_fields' of github.com:StackStorm/…
Kami Feb 18, 2021
44bdbad
Handle subclass hack in the micro benchmark itself.
Kami Feb 18, 2021
052fde7
Add some more tests.
Kami Feb 18, 2021
dbc5f3d
Fix invalid / broken test - result is a dict field and not a string.
Kami Feb 18, 2021
424f3d7
Update more affected code to make sure we correctly handle new result
Kami Feb 19, 2021
ff485ca
Also benchmark JSONDict fields with compression.
Kami Feb 19, 2021
6b1abf0
Update docstring.
Kami Feb 19, 2021
89617ff
Add new "finalized_timestamp" field to the Execution and LiveAction
Kami Feb 19, 2021
fa03d2f
Add changelog entry.
Kami Feb 19, 2021
68feb47
Update more affected and broken tests to correctly specify a dict value
Kami Feb 19, 2021
4cb7a1d
For now, exclude finalized_timestamp attr from the default CLI output.
Kami Feb 19, 2021
e1d085d
Fix lint.
Kami Feb 19, 2021
d9ad62b
Add python runner action which can be used for testing and timing large
Kami Feb 19, 2021
e20242f
Update changelog.
Kami Feb 19, 2021
95de8ff
Add TODO comment.
Kami Feb 19, 2021
42f70e7
Update the field and implement another approach which uses additional
Kami Feb 19, 2021
8017e2a
Fix lint.
Kami Feb 19, 2021
cbb4cb1
Make tests more robust and less reliant on a specific global state.
Kami Feb 19, 2021
49b4033
Re-generate requirements files.
Kami Feb 20, 2021
d93bd9d
Add -x flag to the st2 execution list command.
Kami Feb 20, 2021
c8f4022
Also apply the same field optimizations changes to all the workflows
Kami Feb 21, 2021
88151da
Update changelog.
Kami Feb 21, 2021
a239061
For now, only utilize JSONDictField for fields which are for all
Kami Feb 21, 2021
7cd3ec4
Implement dict value change tracking for our custom JSONDictField.
Kami Feb 21, 2021
82965ef
Update more fields to use the new more efficient dict field.
Kami Feb 21, 2021
eaccea2
Add orquesta workflow action which can be used to test passing large
Kami Feb 21, 2021
9682fac
Update API models to call public field method instead of calling orjson
Kami Feb 21, 2021
bc9e9c2
Update comments, add some tests for the public method.
Kami Feb 21, 2021
2ac3fda
Apply same optimizations to trigger_instance.payload field.
Kami Feb 21, 2021
49d6134
Add correct file.
Kami Feb 21, 2021
242c676
Update affected tests and API model.
Kami Feb 21, 2021
75ab254
Also add benchmark for model with multiple fields of the same type and
Kami Feb 23, 2021
4933573
Hook micro benchmarks to CI.
Kami Feb 23, 2021
e8745d8
Update the new field type and make sure we also correctly track changes
Kami Feb 23, 2021
560d616
Use consistent action name.
Kami Feb 23, 2021
147a02b
Simplify the code - instead of having another finalized_timestamp
Kami Feb 24, 2021
405e039
Update st2 execution get command to also display log attribute by
Kami Feb 24, 2021
78f89ab
Update affected tests.
Kami Feb 24, 2021
7810aa8
Also display log attribute on workflow executions.
Kami Feb 25, 2021
b31c006
Update affected tests - live action and action execution timestamp may
Kami Feb 25, 2021
5988b5c
Throw more user-friendly error.
Kami Feb 25, 2021
71791b3
micro-benchmarks task is very slow on CI so for now, only run it on
Kami Feb 26, 2021
1d178df
Fix failing test, remove Python 2 code.
Kami Feb 26, 2021
053bd93
Include the following changes which makes action registration 15-20%
Kami Feb 27, 2021
824c2ea
Merge branch 'master' of github.com:StackStorm/st2 into optimize_esca…
Kami Feb 27, 2021
1a99eee
Fix failing test.
Kami Feb 27, 2021
ca49b10
Fix rst syntax.
Kami Feb 27, 2021
4289b9e
Pin mail-parser test dep to the latest version so tests work correctly.
Kami Feb 27, 2021
9f0a6ba
Update more places in the code where we only work with simple / native
Kami Feb 27, 2021
4793ba3
Update nose tests target to exclude resource registrar debug log
Kami Feb 27, 2021
dbc1460
Use correct path for pip cache dir.
Kami Feb 27, 2021
af961fb
Merge branch 'master' of github.com:StackStorm/st2 into optimize_esca…
Kami Mar 6, 2021
64dbe5a
Use lazy import since right now zstandard is only used for tests and
Kami Mar 6, 2021
167ca3f
Add a comment to custom yaml_safe_load() method.
Kami Mar 6, 2021
8902d06
Better handle scenario when log attribute is already formatted.
Kami Mar 7, 2021
6eabd5b
Add workaround for issue i've seen very seldomly on ci with trigger
Kami Mar 7, 2021
2c2cb74
Make sure we don't call unescape_chars() on the JSONDictField field
Kami Mar 7, 2021
93d859c
Update changelog.
Kami Mar 7, 2021
2ea37db
Remove unused options.
Kami Mar 7, 2021
0f293ee
Add additional timer metrics to the action runner which will provide
Kami Mar 12, 2021
a46831e
Merge branch 'master' of github.com:StackStorm/st2 into optimize_esca…
Kami Mar 12, 2021
c8c3b91
Merge branch 'master' of github.com:StackStorm/st2 into optimize_esca…
Kami Mar 14, 2021
b2ed03b
Remove incorrect log message which was causing unnecessary log churn in
Kami Mar 14, 2021
9feb81e
Also use json instead of orjson so action can also be used with older
Kami Mar 15, 2021
9f4f523
Store "result_size" field on the ActionExecutionDB.
Kami Mar 15, 2021
d0f0d78
Add new WIP API endpoint for returning / downloading raw action
Kami Mar 15, 2021
b0dea78
Also add support for compressing and pretty printing the raw response.
Kami Mar 16, 2021
756b916
Update URL path, add tests.
Kami Mar 16, 2021
a47461b
Update "result_size" field for action execution and live action DB model
Kami Mar 16, 2021
2005126
Move calculation and setting of the result_size field to the
Kami Mar 16, 2021
086be02
Add changelog entry.
Kami Mar 16, 2021
8e0c312
Re-generate api spec.
Kami Mar 16, 2021
cd9eba7
Fix typo.
Kami Mar 16, 2021
1a932ca
Fix failing test.
Kami Mar 16, 2021
e72215f
Merge branch 'master' of github.com:StackStorm/st2 into optimize_esca…
Kami Mar 16, 2021
9e336d8
Fix merge conflicts.
Kami Mar 16, 2021
d373cf5
Fix test method name.
Kami Mar 16, 2021
224dfba
Merge branch 'master' into optimize_escaped_dict_fields
Kami Mar 18, 2021
3cc71ef
Add micro benchmark which times saving and reading large string value
Kami Mar 18, 2021
ac4efbd
Merge branch 'optimize_escaped_dict_fields' of github.com:StackStorm/…
Kami Mar 18, 2021
051a691
Fix --with-schema flag which didn't work and threw an exception under
Kami Mar 19, 2021
94b6298
Update CLI to use C version of the YAML safe dumper when pretty
Kami Mar 19, 2021
d1df1cd
Clarify the comment.
Kami Mar 19, 2021
b13c195
Add workaround for weird failure on CI which should not be fatal.
Kami Mar 19, 2021
51f811c
Log a warning message if pyyaml C bindings are not available since it
Kami Mar 19, 2021
4672495
Upgrade orjson to latest stable version.
Kami Mar 19, 2021
a25efa6
Update out of date st2client setup.py metadata.
Kami Mar 19, 2021
bdd8e3c
Add a comment on libyaml availability.
Kami Mar 19, 2021
a098315
Update more code to use orjson and C versions of yaml load/dump
Kami Mar 19, 2021
e818158
Use fast dict copy.
Kami Mar 19, 2021
48d612d
For performance reasons, use udatetime library for parsing rfc3339 /
Kami Mar 19, 2021
71ffb1a
ujson is not only used for tests / benchmarks so move it to
Kami Mar 19, 2021
46ba2c9
Fix typo.
Kami Mar 19, 2021
a27245f
Add TODO comment.
Kami Mar 19, 2021
1a91394
Fix affected test.
Kami Mar 19, 2021
cbd0259
Apply suggestions from code review
Kami Mar 20, 2021
3b47856
Fix syntax, add comments.
Kami Mar 20, 2021
4 changes: 4 additions & 0 deletions conf/st2.conf.sample
@@ -115,6 +115,8 @@ username = None
connection_retry_max_delay_m = 3
# ca_certs file contains a set of concatenated CA certificates, which are used to validate certificates passed from MongoDB.
ssl_ca_certs = None
# True to use a special implementation of escaped dict field which saves value as JSON to avoid expensive escaping. NOTE: Experimental.
use_json_dict_field = False
# Certificate file used to identify the localconnection
ssl_certfile = None
# Connection retry backoff max (seconds).
@@ -141,6 +143,8 @@ connection_timeout = 3000
password = None
# port of db server
port = 27017
# The backend to use for marshalling JSON to the JSONDictField.
json_dict_field_backend = ujson

[exporter]
# location of the logging.exporter.conf file
2 changes: 2 additions & 0 deletions conf/st2.dev.conf
@@ -1,6 +1,8 @@
# Config used by local development environment (tools/launch.dev.sh)
[database]
host = 127.0.0.1
use_json_dict_field = True
json_dict_field_backend = ujson

[api]
# Host and port to bind the API server.
9 changes: 8 additions & 1 deletion st2common/st2common/config.py
@@ -212,7 +212,14 @@ def register_opts(ignore_errors=False):
            'authentication_mechanism', default=None,
            help='Specifies database authentication mechanisms. '
                 'By default, it use SCRAM-SHA-1 with MongoDB 3.0 and later, '
-                'MONGODB-CR (MongoDB Challenge Response protocol) for older servers.')
+                'MONGODB-CR (MongoDB Challenge Response protocol) for older servers.'),
        cfg.BoolOpt(
            'use_json_dict_field', default=False,
            help='True to use a special implementation of escaped dict field which saves '
                 'value as JSON to avoid expensive escaping. NOTE: Experimental.'),
        cfg.StrOpt(
            'json_dict_field_backend', default='ujson', choices=['cjson', 'ujson'],
            help='The backend to use for marshalling JSON to the JSONDictField.'),
    ]

do_register_opts(db_opts, 'database', ignore_errors)
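
The json_dict_field_backend option above restricts the backend to a small set of names. The following is a minimal sketch, in plain Python rather than oslo.config itself, of why such a set is safer held as a list of names than as one comma-separated string: `in` on a string is a substring test. The helper `is_allowed` and the two constants are hypothetical and exist only for this illustration.

```python
# Hypothetical helper for illustration; oslo.config performs its own validation.
def is_allowed(value, choices):
    """Return True if value passes a plain `in` membership check."""
    return value in choices


# One comma-separated string: `in` performs a substring search.
AS_STRING = 'cjson, ujson'
# A list of names: `in` compares against whole elements.
AS_LIST = ['cjson', 'ujson']

if __name__ == '__main__':
    for candidate in ('ujson', 'json', 'son, u'):
        print(candidate, is_allowed(candidate, AS_STRING), is_allowed(candidate, AS_LIST))
```

With the string form, fragments such as 'json' are accepted even though they are not valid backend names; with the list form only the exact names match.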
101 changes: 101 additions & 0 deletions st2common/st2common/fields.py
@@ -13,19 +13,28 @@
# limitations under the License.

from __future__ import absolute_import

import datetime
import calendar

import six

from mongoengine import LongField
from mongoengine import BinaryField
from oslo_config import cfg

from st2common.models.db import stormbase
from st2common.util import date as date_utils
from st2common.util import mongoescape

__all__ = [
'ComplexDateTimeField'
]

SECOND_TO_MICROSECONDS = 1000000

from st2common import log as logging
LOG = logging.getLogger(__name__)

class ComplexDateTimeField(LongField):
    """
@@ -114,3 +123,95 @@ def to_mongo(self, value):

    def prepare_query_value(self, op, value):
        return self._convert_from_datetime(value)


class JSONDictField(BinaryField):
    """
    Custom field type which stores dictionaries as JSON serialized strings.

    This is done because storing large objects as JSON serialized strings is much more efficient
    on the serialize and unserialize paths compared to using EscapedDictField which needs to
    escape all the special values ($, .).

    The only downside is that to MongoDB those values are plain raw strings which means you can't
    query on actual dictionary field values. That's not an issue for us, because in places where
    we use it, we already treat those values more or less as opaque strings.

    NOTE(Tomaz): I've done benchmarking of ujson and cjson and cjson is more performant on large
    objects and ujson on smaller ones.
    """
    def __init__(self, *args, **kwargs):
        json_backend = kwargs.pop('json_backend', None)

        if not json_backend:
            json_backend = cfg.CONF.database.json_dict_field_backend

        if json_backend not in ['ujson', 'cjson']:
            raise ValueError('Unsupported backend "%s" specified for JSONDictField.' % json_backend)

        super(JSONDictField, self).__init__(*args, **kwargs)

        if json_backend == 'ujson':
            import ujson
            self.json_loads = ujson.loads
            self.json_dumps = ujson.dumps
        elif json_backend == 'cjson':
            import cjson
            self.json_loads = cjson.decode
            self.json_dumps = cjson.encode

    def to_mongo(self, value):
        if not isinstance(value, dict):
            message = 'The value argument must be a dictionary. Type: %s Content: %s'
            raise ValueError(message % (type(value), str(value)))

        return self.json_dumps(value)

    def to_python(self, value):
        if isinstance(value, six.text_type) or isinstance(value, six.binary_type):
            return self.json_loads(value)

        return value

    def validate(self, value):
        if isinstance(value, dict):
            value = self.to_mongo(value)

        return super(JSONDictField, self).validate(value)
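
The trade-off described in the JSONDictField docstring can be sketched with the standard library alone. This is an illustration under stated assumptions, not the st2 implementation: `escape_keys` is a hypothetical stand-in for mongoescape.escape_chars(), the replacement characters in `ESCAPES` are assumed for the sketch, and stdlib `json` stands in for ujson/cjson. The escaping path has to visit and rewrite every key, while the JSON path serializes the dictionary once and leaves keys untouched.

```python
import json

# Assumed replacement characters, chosen only for this sketch.
ESCAPES = {'.': '\uff0e', '$': '\uff04'}


def escape_keys(value):
    """Recursively rewrite every dict key so it contains no '.' or '$'."""
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            for char, replacement in ESCAPES.items():
                key = key.replace(char, replacement)
            result[key] = escape_keys(item)
        return result
    if isinstance(value, list):
        return [escape_keys(item) for item in value]
    return value


def to_blob(value):
    """Serialize the whole dict in one pass; keys are stored as-is."""
    return json.dumps(value).encode('utf-8')


def from_blob(blob):
    """Inverse of to_blob()."""
    return json.loads(blob.decode('utf-8'))


result = {'stdout': 'ok', 'hosts': {'db.example.com': {'$cost': 1}}}
escaped = escape_keys(result)          # every key visited and possibly rewritten
restored = from_blob(to_blob(result))  # opaque blob, no per-key work
```

The blob is opaque to the database, which matches the downside named in the docstring: values inside it cannot be queried.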


class JSONDictEscapedFieldCompatibilityField(JSONDictField):
    """
    Special version of JSONDictField which takes care of compatibility between old EscapedDictField
    and EscapedDynamicField format and the new one.

    On retrieval, if an old format is detected it's correctly un-serialized and on insertion, we
    always insert data in a new format.
    """

    def to_mongo(self, value, use_db_field=True, fields=None):
        if not cfg.CONF.database.use_json_dict_field:
            value = mongoescape.escape_chars(value)
            return super(stormbase.EscapedDynamicField, self).to_mongo(

Contributor commented:

This bugs out because JSONDictEscapedFieldCompatibilityField is incompatible with EscapedDynamicField.

                value=value, use_db_field=use_db_field, fields=fields)

        if not isinstance(value, dict):
            message = 'The value argument must be a dictionary. Type: %s Content: %s'
            raise ValueError(message % (type(value), str(value)))

        return self.json_dumps(value)

    def to_python(self, value):
        if not cfg.CONF.database.use_json_dict_field:
            value = super(stormbase.EscapedDynamicField, self).to_python(value)
            return mongoescape.unescape_chars(value)

        if isinstance(value, dict):
            # Old format which used a native dict with escaped special characters
            value = mongoescape.unescape_chars(value)
            return value

        if isinstance(value, six.text_type) or isinstance(value, six.binary_type):
            return self.json_loads(value)

        return value
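
The incompatibility raised in the review comment on to_mongo() is consistent with how Python's two-argument super() behaves: the instance must be an instance of the class passed as the first argument. A minimal sketch follows, using hypothetical stand-in classes rather than the real mongoengine and stormbase types.

```python
class EscapedDynamicField(object):
    """Hypothetical stand-in for stormbase.EscapedDynamicField."""

    def to_mongo(self, value):
        return value


class JSONDictField(object):
    """Hypothetical stand-in for the JSONDictField defined in this diff."""

    def to_mongo(self, value):
        return value


class CompatibilityField(JSONDictField):
    """Inherits from JSONDictField only, like the compatibility field."""

    def to_mongo(self, value):
        # EscapedDynamicField is not a base class of this instance, so
        # super() refuses the combination before to_mongo() is looked up.
        return super(EscapedDynamicField, self).to_mongo(value)


def try_to_mongo():
    """Return the exception raised by the call, or None if it succeeds."""
    try:
        CompatibilityField().to_mongo({'a': 1})
    except TypeError as exc:
        return exc
    return None


error = try_to_mongo()
```

One way around this, assuming the escaped code path is still needed, is to delegate to an EscapedDynamicField instance instead of reaching for it through super(); whether the later fix commit took that route is not shown in this diff.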
4 changes: 4 additions & 0 deletions st2common/st2common/models/api/execution.py
@@ -152,6 +152,10 @@ class ActionExecutionAPI(BaseAPI):
    @classmethod
    def from_model(cls, model, mask_secrets=False):
        doc = cls._from_model(model, mask_secrets=mask_secrets)

        import json
        doc['result'] = json.loads(doc['result'])

Contributor commented:

I have to add this to get the result in JSONDictEscapedFieldCompatibilityField with ujson/cjson backend to work properly. This is because _from_model will call to_mongo which dumps the result in JSON to string. The CLI output will be displayed as a string.


        start_timestamp = model.start_timestamp
        start_timestamp_iso = isotime.format(start_timestamp, offset=False)
        doc['start_timestamp'] = start_timestamp_iso
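
The behaviour described in the review comment on from_model() can be sketched as follows. This is an illustration only: `to_mongo` and `build_doc` are hypothetical stand-ins for the field method and for `_from_model`, and stdlib `json` replaces the ujson/cjson backends.

```python
import json


def to_mongo(value):
    """Hypothetical stand-in for the field's to_mongo(): dict in, JSON string out."""
    return json.dumps(value)


def build_doc(result, decode):
    """Build an API document; optionally decode the serialized result again."""
    doc = {'result': to_mongo(result)}
    if decode:
        # Without this step, consumers such as the CLI receive a string.
        doc['result'] = json.loads(doc['result'])
    return doc


raw_doc = build_doc({'stdout': 'ok'}, decode=False)
decoded_doc = build_doc({'stdout': 'ok'}, decode=True)
```

Here `raw_doc['result']` is a JSON string while `decoded_doc['result']` is a dictionary again, which is the difference the extra `json.loads` call in the diff addresses.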
3 changes: 2 additions & 1 deletion st2common/st2common/models/db/execution.py
@@ -19,6 +19,7 @@

from st2common import log as logging
from st2common.models.db import stormbase
from st2common.fields import JSONDictEscapedFieldCompatibilityField
from st2common.fields import ComplexDateTimeField
from st2common.util import date as date_utils
from st2common.util.secrets import get_secret_parameters
@@ -61,7 +62,7 @@ class ActionExecutionDB(stormbase.StormFoundationDB):
    parameters = stormbase.EscapedDynamicField(
        default={},
        help_text='The key-value pairs passed as to the action runner & action.')
-   result = stormbase.EscapedDynamicField(
+   result = JSONDictEscapedFieldCompatibilityField(
        default={},
        help_text='Action defined result.')
    context = me.DictField(
3 changes: 2 additions & 1 deletion st2common/st2common/models/db/liveaction.py
@@ -22,6 +22,7 @@
from st2common.models.db import stormbase
from st2common.models.db.notification import NotificationSchema
from st2common.fields import ComplexDateTimeField
from st2common.fields import JSONDictEscapedFieldCompatibilityField
from st2common.util import date as date_utils
from st2common.util.secrets import get_secret_parameters
from st2common.util.secrets import mask_secret_parameters
@@ -56,7 +57,7 @@ class LiveActionDB(stormbase.StormFoundationDB):
    parameters = stormbase.EscapedDynamicField(
        default={},
        help_text='The key-value pairs passed as to the action runner & execution.')
-   result = stormbase.EscapedDynamicField(
+   result = JSONDictEscapedFieldCompatibilityField(
        default={},
        help_text='Action defined result.')
    context = me.DictField(