
Es item #111

Merged: 58 commits, Feb 3, 2020
Conversation

@carlvitzthum (Contributor) commented Nov 7, 2019

Large change that refactors a lot of storage-related code and adds a new feature: creating items with ES-based properties. See this document for an overview. There are some other misc. changes as well.

Version is currently set to 1.3.8 but may need to change if subsequent releases are made before this is merged.

Summary

  • Added documentation, including a number of placeholders, and refactored some existing docs
  • Removed linkFrom code (unused)
  • Refactored PickStorage and datastore (on request) slightly and moved to storage.py
  • Added more resource views using ICachedItem (for ES-based items)
  • Refactored ESStorage and added create + update methods (for ES-based items)
  • Refactored resource.py and connection.py for ES-based items
  • create_mapping will now skip indices that are not empty for collections with properties_datastore='elasticsearch'
  • Added some tests to test_indexing.py and test_storage.py

Please note that links between documents will not work until the RTD build runs.
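The create_mapping change in the summary could be sketched roughly like this. This is a hypothetical helper, not the actual snovault implementation; the client is duck-typed so the check reads the same against a real Elasticsearch client:

```python
def should_skip_index(es, index_name, collection):
    """Return True if this collection stores item properties in ES and its
    index already contains documents, so recreating the index would destroy
    data. Hypothetical sketch, not the actual snovault code."""
    if getattr(collection, 'properties_datastore', None) != 'elasticsearch':
        return False  # properties live in Postgres; safe to recreate
    if not es.indices.exists(index=index_name):
        return False  # index doesn't exist yet; nothing to lose
    return es.count(index=index_name)['count'] > 0
```

The key design point is that for ES-backed collections the index is the system of record, so a non-empty index must never be dropped and rebuilt the way a derived search index can be.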

@willronchetti (Member) left a comment:

Generally looks good, mostly just small things I've commented on

src/snovault/cache.py (resolved)
return None

return find_it(our_dict)
used_datastore = 'elasticsearch' # datastore used by this model
Member:

Is there a reason this is necessary? May be a little confusing since used_datastore is an attribute on Item, and while CachedModel is related it isn't exactly the same thing. Don't you always know in this case implicitly that the data comes from elasticsearch?

document = existing_doc.source

index_name = get_namespaced_index(self.registry, document['item_type'])
# use `refresh='wait_for'` so that the ES model is immediately available
Member:

Doesn't this mean under heavy load this could block? May want to make this configurable
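A minimal sketch of the configurability suggested here, assuming a hypothetical registry setting (the setting name and helper are illustrative, not actual snovault API):

```python
def index_document(es, index_name, uuid, document, settings=None):
    """Index an ES-based item document. refresh='wait_for' blocks until the
    document is searchable, which can stall under heavy write load, so let
    deployments override it (e.g. set it to False)."""
    settings = settings or {}
    refresh = settings.get('snovault.es_model_refresh', 'wait_for')
    return es.index(index=index_name, id=uuid, body=document, refresh=refresh)
```

Elasticsearch's `refresh` parameter accepts `true`, `false`, and `'wait_for'`; defaulting to `'wait_for'` keeps the current read-your-writes behavior while giving heavily loaded deployments an escape hatch.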

@@ -255,6 +258,9 @@ class Item(Resource):
embedded_list = []
filtered_rev_statuses = ()
schema = None
# `used_datastore` determines where the properties of the item are stored:
# None for a traditional Postgres setup, or "elasticsearch" to use the ES document
used_datastore = None
Member:

I see that this is set, yet I suspect in some cases it must not be, given the errors I saw when running this branch on CGAP. To fix it I had to put a try-except where this field is accessed (for static sections, if I recall?).
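If the failures were missing-attribute errors, the try-except workaround described here could be replaced by a defensive accessor; `model_datastore` is a hypothetical name, not part of snovault:

```python
def model_datastore(model):
    """Return the model's datastore tag, or None when the attribute is
    absent (e.g. the static-section models that raised errors on CGAP)."""
    return getattr(model, 'used_datastore', None)
```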

for operations like getting current sid/max_sid, rev_links, and
updating. Leverage `model.used_datastore` to determine source
"""
if self.model.used_datastore != 'database':
Member:

Isn't the convention that used_datastore='elasticsearch' and anything else implies database? In which case if the default None were here you would grab the model twice when you didn't need to?

Author (carlvitzthum):

This was a bit confusing because the models themselves (storage.Resource or esstorage.CachedModel) had used_datastore parameters which are always set, as well as the used_datastore parameters on Item. I've refactored it a bit.
I think this is correct, though; if you added a third type of model, say used_datastore='redis', then we would still want to use the write model here
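The convention in this reply (anything other than 'database' means the model must be refetched from the write storage) could be sketched as follows; the function and storage names are illustrative, not the actual snovault code:

```python
def select_write_model(model, write_storage, uuid):
    """Return a model backed by the write (Postgres) storage. Any
    non-database datastore (e.g. 'elasticsearch', or a hypothetical
    'redis') means the current model cannot be written through directly,
    so refetch it from the write storage."""
    if getattr(model, 'used_datastore', None) != 'database':
        return write_storage.get_by_uuid(str(uuid))
    return model
```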

# properties and sheets, as those are exclusively stored in ES
self.write.update(model, {}, None, unique_keys, links)
# update contents of the ES documents
return storage.update(model, properties, sheets, unique_keys, links)
Member:

Don't think this return is necessary since the one below is identical

Author (carlvitzthum):

True!

@@ -1260,3 +1259,151 @@ def test_validators_on_indexing(app, testapp, indexer_testapp):
val_err_view = testapp.get(ppp_id + '@@validation-errors', status=200).json
assert val_err_view['@id'] == ppp_id
assert val_err_view['validation_errors'] == es_res['_source']['validation_errors']


def test_elasticsearch_item(app, testapp, indexer_testapp):
Member:

Would maybe consider splitting into 2 (even 3) separate tests if you can logically split the behavior. Your call though

Author (carlvitzthum):

Probably a good call. This test grew organically, hence its massive size. I'll split it up


url = '/testing-link-sources-sno/'
for item in sources:
testapp.post_json(url, item, status=201)
res = testapp.post_json(url, item)
@willronchetti (Member) commented Dec 5, 2019:

This and the above change was for testing, right? Should probably be changed back since I believe this one will fail silently (and you don't need res?)

Author (carlvitzthum):

Yep, definitely. Reverted.

@netsettler (Collaborator) left a comment:

I tried to read every line of this code, although reviewing the actual algorithmic choices is beyond the scope of what I had time to do. I did reason through things that are locally understandable and didn't see flaws in most of that, but I'm glad Will had eyes on it, too.

Some of the bigger sections of code look like they were just moving code around. I pulled up sources and tried to do a side-by-side of blocks of code that seemed to have moved a lot, though that's error-prone in its own ways. In general, for big changes like this, reading the commit-by-commit changes can be helpful, and maybe I should do that as well, though it has issues with changes getting superseded.

Anyway, reviewing the things I actually remarked on, I see they were largely cosmetic. I'm going to ponder whether the level of work I've done on it constitutes an approval; I'll likely talk to Will before signing off. But I'll just say in words here that I don't have any reason not to approve this, presuming it's passing tests. It's good not to let a change this big linger, and we have other big things, like PR #68, that may be affected and also need to get in.

Comment on lines -29 to +28
if not hasattr(context.model, 'source'):
if context.model.used_datastore != 'elasticsearch':
Collaborator:

I assume we can rely on every model to have a .used_datastore so we don't have to be so timid about the access?

src/snovault/cache.py (resolved)

Step 1: Verify that homebrew is working properly::

$ sudo brew doctor
Collaborator:

I did

~$ sudo brew doctor
Error: Running Homebrew as root is extremely dangerous and no longer supported.
As Homebrew does not drop privileges on installation you would be giving all
build scripts full access to your system.
~$ brew doctor
Your system is ready to brew.

I think we should change this line to say just

$ brew doctor


$ brew install libevent libmagic libxml2 libxslt openssl postgresql graphviz
$ brew install freetype libjpeg libtiff littlecms webp # Required by Pillow
$ brew tap homebrew/versions
Collaborator:

~$ brew tap homebrew/versions
Error: homebrew/versions was deprecated. This tap is now empty as all its formulae were migrated.

I recommend that we just remove this line. I don't actually think it's needed.

@@ -0,0 +1,45 @@
Local Installation
Collaborator:

I bet most of these instructions would be better if we borrowed the instructions I just created for forefront. (We could do that in a separate PR sometime. Not a blocker here, though.)

registry[STORAGE] = storage
# expect existing values to be used
register_storage(registry)
assert registry[STORAGE].write == 'dummy_db'
Collaborator:

It hurts my head that these are called simply read and write, since I keep assuming their value will be used like you'd use

sys.stdout.write("Foo")

Definitely not something to change on this PR, which is already very sprawling and hard to review. But one day I hope we fix this to have a more intuitive name like .storage_for_read, etc.

Comment on lines +465 to +466
return [request.resource_path(conn[uuid]) for uuid in
self.get_filtered_rev_links(request, rev_name)]
Collaborator:

I would rather see the line break before the for so that the fact of an iteration is more prominent. And this is a place I deviate from PEP. I tend to like to use a space after the bracket/brace when the text inside is computed, whether a literal or a comprehension. The space tells you something interesting is afoot and draws your attention. The line break in the proper place makes it like an upside-down for loop, and so a more familiar notation the eye can more easily scan. e.g.,

        return [ request.resource_path(conn[uuid]) 
                 for uuid in self.get_filtered_rev_links(request, rev_name) ]

@@ -1046,7 +1065,7 @@ def test_indexing_esstorage(app, testapp, indexer_testapp):
# test the following methods:
es_res_by_uuid = esstorage.get_by_uuid(test_uuid)
es_res_by_json = esstorage.get_by_json('required', 'some_value', TEST_TYPE)
es_res_direct = esstorage.get_by_uuid_direct(test_uuid, namespaced_test_type, TEST_TYPE)
Collaborator:

Use keyword calling here, please.

# purge_uuid fxn ensures that all links to the item are removed
namespaced_index = get_namespaced_index(request, item_type)
request.registry[STORAGE].purge_uuid(item_uuid, namespaced_index, item_type)
request.registry[STORAGE].purge_uuid(item_uuid, item_type)
Collaborator:

Oh, I thought I'd committed this comment, but apparently hadn't: please, whenever changing argument conventions like this, use keyword calling so that you're sure it's lining up consistently. In fact, I use keyword calling in nearly all cases when there are arbitrarily ordered arguments like these. (By arbitrarily, I mean that there's something natural about putting the uuid first in a function named purge_uuid, so I'm OK with that not being a keyword, though I also don't mind a keyword even then.) But in this PR there is a `def purge_uuid(self, rid, item_type=None):` in src/snovault/storage.py that takes one set of argument conventions, and there's also been a `max_sid` argument that's been used. I recommend writing at least `purge_uuid(item_uuid, item_type=item_type)` and optionally even `purge_uuid(rid=item_uuid, item_type=item_type)`. This makes sure the right method with the right args is being called while in transition.
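To illustrate the recommendation, here is a sketch with two hypothetical storage classes whose purge_uuid signatures differ in the way described; keyword calling lines the arguments up against either one:

```python
class PostgresStorage:
    """Hypothetical stand-in for the RDB storage, using the signature
    quoted above from src/snovault/storage.py."""
    def purge_uuid(self, rid, item_type=None):
        return ('pg', rid, item_type)

class ESStorage:
    """Hypothetical ES storage whose purge_uuid grew an extra parameter."""
    def purge_uuid(self, rid, item_type=None, max_sid=None):
        return ('es', rid, item_type, max_sid)

def purge(storage, item_uuid, item_type):
    # Keyword calling keeps the arguments aligned with either signature,
    # even if positional parameters are reordered or inserted later.
    return storage.purge_uuid(rid=item_uuid, item_type=item_type)
```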

Collaborator:

Some of my other comments, which intended to allude back to that one, may have been confusing in its absence. :)

@@ -1262,6 +1280,207 @@ def test_validators_on_indexing(app, testapp, indexer_testapp):
assert val_err_view['validation_errors'] == es_res['_source']['validation_errors']


def test_elasticsearch_item_basic(app, testapp, indexer_testapp, es_based_target):
Collaborator:

These tests look plausible, though I'm still not up on enough detail to review some aspects of them. I wouldn't mind sitting with Will and going over them in detail to make sure I get the things that are being tested. No action item for the code due to that, though. :)

Collaborator:

One question I can ask here, though: These tests are longish. Is that because there is state that runs through the whole thing and changes? Would they not work if carved up into more and smaller tests? Would that require repeating a constant bit of boilerplate setup, or would there be a (pardon pun) pyramidal cost in setup as the series of tests progressed? Or is the reason for lumping them together that the fixture setup is costly and we just don't want too many of such tests? I feel like this is not just a matter of ease of review, but of more concise noticing of particular problems in case of error, of having one error not mask another, and of modularity in test setup so that tests are easier to change or less state-dependent.

@netsettler (Collaborator) left a comment:

This all looks good.

@willronchetti merged commit b163e5a into master Feb 3, 2020
@willronchetti deleted the es_item branch February 3, 2020 18:31