Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adpatation: plugin registration/sync race, unit test failures. #99

Open
klihub opened this issue Jul 11, 2024 · 2 comments
Open

adpatation: plugin registration/sync race, unit test failures. #99

klihub opened this issue Jul 11, 2024 · 2 comments
Assignees
Milestone

Comments

@klihub
Copy link
Member

klihub commented Jul 11, 2024

Adaptation unit tests fail, ever since c4893c7. The reason is that while that commit fixed a locking inversion bug, it now opened a time window for a plugin to lose events during registration after it has received the initial set of pods and containers but before it is has been fully registered for event processing.

Synchronizing the plugin and adding it to the slice of registered plugins both used to happen with Adaptation.Mutex held. Now that mutex is held only after synchronization is done while updating the slice of plugins. As a consequence to the adaptation test suite while Suite.WaitForPluginsToSync() used to return only once the plugin was fully registered (IOW, added to slice of plugins), now it returns before that. Therefore, it became possible for a pod/container event to come when a plugin is halfway through its registration: already synchronized with existing pods/containers, but not added yet to the slice of plugins. When that happens, the plugin loses any event that comes before it has been added to the slice of plugins.

The effect is not limited to unit tests, but it becomes apparent as unit tests banging full steam registering plugins and generating pod/container events now trigger this race condition reliably. Similar NRI pod/container event loss is also possible now in the runtimes.

@klihub klihub self-assigned this Jul 11, 2024
@klihub klihub changed the title Adaptation unit test failure adpatation: unit test failures. Jul 11, 2024
@klihub
Copy link
Member Author

klihub commented Jul 11, 2024

Here is a proposed fix. It requires additional co-operation from the NRI integration code on the runtime side. The runtime should request blocking of plugin registration during event processing.

@klihub klihub changed the title adpatation: unit test failures. adpatation: plugin registration/sync race, unit test failures. Jul 19, 2024
@samuelkarp
Copy link
Member

after it has received the initial set of pods and containers but before it is has been fully registered for event processing.

It seems to me that the order should really be the opposite: a plugin should be fully registered for events (and start receiving immediately), and then synchronize to see any pods/containers that existed prior to subscription to the event stream. This would require some additional logic to handle deduplication (since some pods/containers may be received both event-based during the initial registration and during that sync), but would not require any locking or pausing of events or registration.

@mikebrow mikebrow added this to the 1.0 milestone Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants