Skip to content
This repository has been archived by the owner on Mar 13, 2022. It is now read-only.

fix watching with a specified resource version #109

Merged
merged 1 commit into from
Jan 23, 2019

Conversation

juliantaylor
Copy link
Contributor

The watch code reset the version to the last found in the
response.
When you first list existing objects and then start watching from that
resource version the existing versions are older than the version you
wanted and the watch starts from the wrong version after the first
restart.
This leads to for example already deleted objects ending in the stream
again.

Fix this by not resetting to an older version than the one specified in
the watch.
It does not handle overflows of the resource version but they are 64 bit
integers so they should not realistically overflow even in the most loaded
clusters.

Closes kubernetes-client/python#700

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 11, 2018
@juliantaylor juliantaylor force-pushed the fix-watch-reset branch 3 times, most recently from 5cab3de to 5eb6633 Compare December 12, 2018 17:51
@max-rocket-internet
Copy link

/assign @yliaog

@max-rocket-internet
Copy link

Any update @yliaog ? Thanks!

watch/watch.py Outdated
@@ -83,13 +83,14 @@ def unmarshal_event(self, data, return_type):
obj = SimpleNamespace(data=json.dumps(js['raw_object']))
js['object'] = self._api_client.deserialize(obj, return_type)
if hasattr(js['object'], 'metadata'):
self.resource_version = js['object'].metadata.resource_version
self.resource_version = int(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resource version is an opaque string, it cannot be assumed to be an int

watch/watch.py Outdated
# For custom objects that we don't have model defined, json
# deserialization results in dictionary
elif (isinstance(js['object'], dict) and 'metadata' in js['object']
and 'resourceVersion' in js['object']['metadata']):
self.resource_version = js['object']['metadata'][
'resourceVersion']
self.resource_version = int(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

watch/watch.py Outdated
@@ -122,6 +123,7 @@ def stream(self, func, *args, **kwargs):
return_type = self.get_return_type(func)
kwargs['watch'] = True
kwargs['_preload_content'] = False
min_resource_version = int(kwargs.get('resource_version', 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resource version is an string, the resource version string "0" is kind of treated special on the server side. but please don't assume "0" is the minimum resource version

watch/watch.py Outdated
# continue to watch from the requested resource version
# does not handle overflow though that should take a few
# hundred years
kwargs['resource_version'] = max(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly, no max or min can be done on resource version. Instead, there is only special "0" versus non-special other opaque strings. so what can be done is to test for equal to "0" or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know where I can find the exact definition of the resource version?
There has to be be some kind of ordering possible, what else is the resource version for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes-client/python#693 (comment)

has some useful references

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a suggestion how to fix this issue then?
my best idea is to ditch the while loop that caused the issue in the first place if a resource version is passed in by the caller

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the while loop is still needed, i think probably check if kwargs['resource_version'] is present at the beginning, if it does, then start watching with kwargs['resource_version'], instead of starting with "0" (which is set in Init to self.resource_version)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm assuming receive all events in order from the first watch it would work, you'd restart at a lower resource version than inputted but if you have received them all you should still be fine. Makes the tests a bit more tricky to implement but I'll post an update soon

@max-rocket-internet
Copy link

Any progress @juliantaylor ?

@juliantaylor
Copy link
Contributor Author

finally figured out a way to write a test and updated treating the rv as opaque, please have a look again.

@max-rocket-internet
Copy link

@juliantaylor you need to lint the files as some of the are failing the style checks:
https://api.travis-ci.org/v3/job/481731414/log.txt

@juliantaylor
Copy link
Contributor Author

should be fixed

@@ -62,6 +64,69 @@ def test_watch_with_decode(self):
fake_resp.close.assert_called_once()
fake_resp.release_conn.assert_called_once()

def test_watch_resource_version_set(self):
# gh-700 ensure watching from a resource version does reset to resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is gh-700 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github issue 700, the convention I'm used to. How do you reference issues in this project?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use a direct link to the issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


from .watch import Watch

callcount = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to avoid using global

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a way to avoid it, any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move it to class WatchTests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, just putting it the class works ... I had just tested a function local variable ...
updated

fake_resp.close = Mock()
fake_resp.release_conn = Mock()
values = [
'{"type": "ADDED", "object": {"metadata": {"name": "test1",'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you use a real k8s object, like a pod? that could make the test easier to understand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one could though it does not matter for the test
none of the existing tests use real objects, should all be updated?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine to leave it as is

fake_api.get_namespaces.__doc__ = ':return: V1NamespaceList'

w = Watch()
count = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does count do? the number of the for loop iterations? usually it starts with 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes its the loop count, again a copy paste from other tests, using enumerate would be a bit nicer

if count == len(values) * 3:
w.stop()

fake_api.get_namespaces.assert_has_calls(calls)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do you simulate the case the 'connection got reset'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

via the global variable changing the return value of the mock read_chunked

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    if 'resource_version' in kwargs:
        self.resource_version = kwargs['resource_version']

===================================================
In the above test, the code in the function stream() from its beginning to "while True" is executed only once, i.e. the two lines of fix code you added taken above is executed only once.
when the above two lines are executed, kwargs is None, hence essentially the two lines you added did not make any effect. I'm wondering how the test code tests the fix? am I missing anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the testcode does run with kwargs['resource_version] set which sets the rv member of the watch object.
The function then calls the get_namespaces mock which would direct k8s to provide a watch from the provided resource version, in our mock it is just ignored and returns nothing simulating the k8s api not returning anything and resetting the connection.
Now instead of resetting the resource version member variable back to zero and issuing a watch from zero in the next iteration it will reuse the input resource version which is what we want.
The mock in the second call returns something but what doesn't really matter we just want to register that the second mocked call was with the input resource version and not with zero which causes the linked issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here the failure when you remove the two lines I added, it matches what happens in the real testcase I posted in the linked issue:

E   AssertionError: Calls not found.
E   Expected: [call(_preload_content=False, resource_version='5', watch=True),
E    call(_preload_content=False, resource_version='3', watch=True),
E    call(_preload_content=False, resource_version='3', watch=True)]
E   Actual: [call(_preload_content=False, resource_version='5', watch=True),
E    call(_preload_content=False, resource_version=0, watch=True), <<<<<<<<< ERROR HERE
E    call(_preload_content=False, resource_version='3', watch=True),
E    call(_preload_content=False, resource_version='3', watch=True)]

calls.append(call(_preload_content=False, watch=True,
resource_version=rv))
# returned
if count == len(values) * 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you expect the loop above iterates 3*3 times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the mock needs a way to identify when it is done, this and other tests in the file use the loop iteration count

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so for each iteration of the for loop, it what's returned from w.stream(...) is:

calls: call<rv=5> | call<rv=5> | call<rv=3> | call<rv=3>
result: nothing | test1 test2 test3 | test1 test2 test3 | test1 test2
count: 1 | 2 3 4 | 5 6 7 | 8 9
it looks that there should be 4 calls as illustrated above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes one could reduce it to 4, the amount of iterations doesn't matter as long as we iterate at least twice in the watch.
going a bit beyond what is needed is currently unnecessary but may make it more robust to future changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean there are calls, call<rv=5> | call<rv=5> | call<rv=3> | call<rv=3>

currently, the test expects 3 calls, (call<rv=5> | call<rv=3> | call<rv=3>)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, assert_has_calls does not care about calls before and after the wanted ones, I'll update the test to be more strict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, changed it a bit to be hopefully more clear what result we expect

@juliantaylor juliantaylor force-pushed the fix-watch-reset branch 3 times, most recently from 97174be to 82c710c Compare January 23, 2019 18:37
The watch code reset the version to the last found in the
response.
When you first list existing objects and then start watching from that
resource version the existing versions are older than the version you
wanted and the watch starts from the wrong version after the first
restart.
This leads to for example already deleted objects ending in the stream
again.

Fix this by setting the minimum resource version to reset from to the
input resource version. As long as k8s returns all objects in order in
the watch this should work.
We cannot use the integer value of the resource version to order it as
one should be treat the value as opaque.

Closes kubernetes-client/python#700
@codecov-io
Copy link

Codecov Report

Merging #109 into master will increase coverage by 0.2%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #109     +/-   ##
=========================================
+ Coverage   91.74%   91.94%   +0.2%     
=========================================
  Files          13       13             
  Lines        1187     1217     +30     
=========================================
+ Hits         1089     1119     +30     
  Misses         98       98
Impacted Files Coverage Δ
watch/watch_test.py 98.51% <100%> (+0.38%) ⬆️
watch/watch.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8497dfb...3c30a30. Read the comment docs.

@yliaog
Copy link
Contributor

yliaog commented Jan 23, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2019
@yliaog
Copy link
Contributor

yliaog commented Jan 23, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliantaylor, yliaog

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 23, 2019
@k8s-ci-robot k8s-ci-robot merged commit 2d69e89 into kubernetes-client:master Jan 23, 2019
@max-rocket-internet
Copy link

This PR won't stop the 410 error, right? It will just stop the out of order processing?

@juliantaylor
Copy link
Contributor Author

the 410 resource version to old error is not handled by this.
This still needs to be handled by the application e.g. by restarting the watch at the last known resource version (which this PR does fix).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants