Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync sandboxes and containers after starting the pre-installed plugins #43

Merged
merged 2 commits into from
Sep 24, 2024

Conversation

Iceber
Copy link
Member

@Iceber Iceber commented May 18, 2023

For plugins registered with nri.sock, NRI will send the full set of pods and containers after starting.

conn, err := l.Accept()
if err != nil {
log.Infof(ctx, "stopped accepting plugin connections (%v)", err)
return
}
p, err := r.newExternalPlugin(conn)
if err != nil {
log.Errorf(ctx, "failed to create external plugin: %v", err)
continue
}
if err := p.start(r.name, r.version); err != nil {
log.Errorf(ctx, "failed to start external plugin: %v", err)
continue
}
r.Lock()
err = r.syncFn(ctx, p.synchronize)
if err != nil {
log.Infof(ctx, "failed to synchronize plugin: %v", err)
} else {
r.plugins = append(r.plugins, p)
r.sortPlugins()
}

For pre-installed plugins, it is still necessary to synchronize the sandboxes and containers data to the plugin after launch,otherwise the plugin will never be able to get information about existing resources

@Iceber Iceber requested a review from klihub May 18, 2023 06:23
@codecov-commenter
Copy link

codecov-commenter commented May 18, 2023

Codecov Report

Patch coverage: 27.27% and project coverage change: -0.08 ⚠️

Comparison is base (2a8b655) 63.83% compared to head (8fa7ae9) 63.75%.

❗ Current head 8fa7ae9 differs from pull request most recent head 75ad210. Consider uploading reports for the commit 75ad210 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
- Coverage   63.83%   63.75%   -0.08%     
==========================================
  Files           9        9              
  Lines        1800     1810      +10     
==========================================
+ Hits         1149     1154       +5     
- Misses        500      503       +3     
- Partials      151      153       +2     
Impacted Files Coverage Δ
pkg/adaptation/adaptation.go 70.27% <27.27%> (-1.62%) ⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@Iceber Iceber changed the title sync sandboxs and containers after starting the pre-installed plugin sync sandboxes and containers after starting the pre-installed plugin May 18, 2023
Copy link
Member

@klihub klihub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix ! LGTM.

@Iceber Iceber force-pushed the pre_installed_plugin branch 2 times, most recently from 6f7af44 to 75ad210 Compare May 18, 2023 10:17
@Iceber
Copy link
Member Author

Iceber commented May 18, 2023

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins
https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

@Iceber Iceber requested a review from klihub May 18, 2023 10:21
@Iceber Iceber changed the title sync sandboxes and containers after starting the pre-installed plugin sync sandboxes and containers after starting the pre-installed plugins May 18, 2023
@klihub
Copy link
Member

klihub commented May 18, 2023

@klihub I changed the call to syncFn after each pre-install plugin launch to only call syncFn once after all plugins are launched, which avoids unnecessary multiple calls to syncFn in the case of a large number of plugins
https://github.com/containerd/nri/compare/104c8e844981be230d53ffac4f9eee5d1db0634e..6f7af44d11899f94a6f619bab0bba6b012bd8e4e

PTAL, Thanks

True, better to do it that way.
LGTM.

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see questions

for _, plugin := range plugins {
us, err := plugin.synchronize(ctx, pods, containers)
if err != nil {
return nil, fmt.Errorf("failed to sync NRI Plugin %q: %w", plugin.name(), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a warning instead and continue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible approaches to consider: skipping the synchronization if it fails or returning an error.

I chose to return an error for the following reason: pre-install plugins and external plugins are different in some ways.
The pre-install plugin binary is launched by the NRI adaptation, so it is difficult for users to know whether the binary plugin has been launched successfully, apart from the logs.
Therefore, users may overlook the fact that the plugin binary is not actually working. That's why I require that all pre-install plugins requested by the user must be launched(started, synced) successfully.
This is similar to starting and returning an error if it fails.

if err := p.start(r.name, r.version); err != nil {
return err
}

On the other hand, external plugins are usually registered by external binaries, and we could perceive more clearly whether the NRI registration and synchronization are successful or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose two other patterns would be:
3) to introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch
4) attempt to start and synch all and report all errors

that said should we stop any that we started if there's an error..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. to introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch

Could you introduce more details? thanks

Copy link
Member

@mikebrow mikebrow Jun 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current containerd plugins have a requires section for specifying which plugins must be init'd.. before initing a plugin

we probably need an additional state beyond init for the containerd plugins that would cover ready for work.. for example in the cri case ready may mean all containers/pods restarted ready for grpc calls

nri might need a requires list for the nri plugins (unless we go with external management via systemd? or whatnot) and then within that you could have the concept of ready which might at least cover init and sync ^

additionally we might want to have a flag in the plugin config regarding if sync is required for "ready"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmcgowan thoughts?

pkg/adaptation/adaptation.go Show resolved Hide resolved
@champtar
Copy link
Contributor

Can we merge this as is ? It does fix a blocking bug

@champtar
Copy link
Contributor

ping @mikebrow @dmcgowan this PR is almost 1 year old

@mikebrow
Copy link
Member

@klihub let's chat with @Iceber if still interested .. maybe at the next or subsequent containerd call

@Iceber
Copy link
Member Author

Iceber commented Jun 4, 2024

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

Of course if still think that skipping and printing warnings is a good way to go, I'll adopt it

Also for more complex configurations my thoughts are:

  1. pre-installed binaries are required by default, I don't think it's necessary for the user to place a bunch of plugin binaries in the directory that don't care if they are successfully launched or not
  2. you can enable/disable some pre-installed plugins by configuring the plugin name, so that only some plugins can be started, or some plugins won't be started at all
    I think this feature can be placed in another pr

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

@klihub
Copy link
Member

klihub commented Jun 4, 2024

ci fails from errors.Join, It comes from 1.20, we might consider upgrading go.mod to 1.20

I filed PR #88 to bump the minimum golang requirement to 1.20. It should be fine for all of our downstream packages. You can then rebase this PR if #88 gets merged.

@klihub
Copy link
Member

klihub commented Jun 4, 2024

Hi @mikebrow @klihub Do you have any new ideas for this pr?

I think synchronizing all and reporting back all errors is a good way which gives a more complete picture of the errors that may occur

I think synchronizing pre-launched plugins (pre-installed plugins in your terminology) would be essential to make sure that we have an identical handshake/initial message flow, regardless of whether a plugin is pre-launched or not. I made a mistake and that is why I completely overlooked that pre-launched plugins do not get properly synchronized during registration. It was not intentional. It would be better to send an empty sync message than no message at all. There are packages out there which do expect the synchronization message to come before they consider themselves ready to start processing other NRI messages.

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

@Iceber
Copy link
Member Author

Iceber commented Jun 6, 2024

That said, I think synchronizing pre-launched plugins is only/really relevant for the case when an already active runtime (with existing containers) gets restarted (after all, on initial startup the set of pods and containers to sync with the plugins is empty). So a related question is whether the runtime itself is able to properly determine/rediscover the state of pods and containers by the time pre-launched plugins are being started and what to do if it is not. And this needs to be checked/tested both with containerd and cri-o because they might behave differently. I have a vague recollection that containerd itself would behave slightly differently for 2.0 than 1.7, but I haven't had time right no to check/test this in practice.

yeah, this is very important, and can be very bad if the synchronization is incomplete with pods and containers.

In containerd 2.0, the recover function initializes the data before the nri is initialized.
https://github.com/containerd/containerd/blob/45bc430dd12539ee93f908750ce24e52e1122e9a/internal/cri/server/service.go#L257-L300

@mikebrow mikebrow added this to the 1.0 milestone Aug 1, 2024
@klihub klihub requested a review from mikebrow August 19, 2024 07:14
Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change just some comments to increase logging to help the plugin admin, and also continue vs return early for failure to new/start a plugin.

** In 1.1 let's consider a more formal process with a manager plugin or 3) introduce config/management for requiring/ordering (add requires logic), auto restarting etc.. for plugin start && synch

pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved
pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved
pkg/adaptation/adaptation.go Outdated Show resolved Hide resolved
@mikebrow
Copy link
Member

@Iceber wdyt about the logging changes..

@Iceber
Copy link
Member Author

Iceber commented Sep 19, 2024

@Iceber wdyt about the logging changes..

Sorry I missed the email on this pr, I'll take care of it soon

@Iceber
Copy link
Member Author

Iceber commented Sep 19, 2024

@mikebrow I've added a new commit for better viewing of the changes

I've returned plugins that failed to start to syncFn via syncPlugins. after all, those plugins that didn't start are also sync failures/unsyncable plugins

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks!

@mikebrow
Copy link
Member

@klihub did you want to take one more look with the additional changes?

@klihub
Copy link
Member

klihub commented Sep 20, 2024

Yes, I'll do that.

@klihub
Copy link
Member

klihub commented Sep 20, 2024

Yes, I'll do that.

It'll take me a bit more time though. I'm not at the keyboard ATM.

Copy link
Member

@klihub klihub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.

Then I have a question to @mikebrow about what we should do if a pre-installed plugin failed to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going ?

If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.

But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.

p, err := r.newLaunchedPlugin(r.pluginPath, id, name, configs[i])
if err != nil {
return fmt.Errorf("failed to start NRI plugin %q: %w", name, err)
errs = append(errs, fmt.Errorf("[%s] %w", name, err))
log.Warnf(noCtx, "failed to initialize pre-installed NRI plugin %q: %w", name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With log.Warnf() we should use %v (value in default format) instead of %w (wrap error, but only understood by fmt.Errorf()).

}

if err := p.start(r.name, r.version); err != nil {
return err
errs = append(errs, fmt.Errorf("[%s] %w", name, err))
log.Warnf(noCtx, "failed to start pre-installed NRI plugin %q: %w", name, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for this one.

if err != nil {
plugin.stop()
errs = append(errs, fmt.Errorf("[%s] %w", plugin.name(), err))
log.Warnf(noCtx, "failed to synchronize pre-installed NRI plugin %q: %w", plugin.name(), err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, log.*f() should use %v, not %w.

updates = append(updates, us...)
log.Infof(noCtx, "pre-installed NRI plugin %q synchronization success", plugin.name())
}
return updates, errors.Join(errs...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner/clearer if we'd always returned either updates, nil, or nil, errors.Join(non_nil_errors). IOW, if errors.Join() returns non-nil, I think we should return nil for updates.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand @mikebrow correctly, syncPlugins returns successful updates and failed plugins&errors, and it's up to syncFn to decide whether or not to ignore a particular wrong plugin

return updates, errors.Join(errs...)
}
if err := r.syncFn(noCtx, syncPlugins); err != nil {
return fmt.Errorf("failed to synchronize pre-installed NRI Plugins: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another thing I now start to have 2nd thoughts about. Do we really want to prevent the runtime from starting up, if any pre-installed plugin fails to start or synchronize ? Or would it be better to just log the errors, ignore failed plugins and continue... Any thoughts on that @mikebrow ?

@klihub
Copy link
Member

klihub commented Sep 21, 2024

ping @Iceber Apart from a few small nits it looks very good.

@mikebrow
Copy link
Member

It looks pretty good. I spotted a few wrongly formatted log messages (error wrapping vs. default value format %verb), and one oddly returned (value, error) combo.

Then I have a question to @mikebrow about what we should do if a pre-installed plugin failed to start up or get synchronized. The original code errors out and this PR keeps that behavior intact. But now I started to have second thoughts about that. Do we want to prevent the runtime from starting up in such a case, or just log errors, ignore failing plugins and keep going ?

If we get the few log formatting problems fixed, maybe also change that odd return value combo to the standard convention of 'return a nil-value with a non-nil error', then I think this is good to go in.

But if @mikebrow is of the opinion that we should err on the side of caution and ignore failing plugins instead of preventing the runtime from starting up, then we should also change that behavior.

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information
that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Would be easy enough to add the MUST run flag now if you like. Dependency ordering stuff would be harder.

@klihub
Copy link
Member

klihub commented Sep 23, 2024

In r.next 1.1 I'd like us to consider adding plugin config to describe ordering(after:plugin)/dependencies(requires:plugin)/runtime required(MUST) type information that we would use to order the plugin bring up and whether or not a plugin is required to run a container/pod which would give us a good reason, for example, to not report CRI plugin success and thus block pods on the node. Until then I'd prefer to log errors and continue for failing plugins vs bring the runtime down when there is an unknown / unhandled issue with a plugin.

We simply can't know if the plugin is required always or if it is only required for certain nodes or if the error is related to some unknown resource device dependency that is only met on some of the worker nodes. This gives the admin the ability to test out deployment of new plugins on a running machine, remove the failing ones report logs etc.

Sounds like a good plan to me.

@Iceber Could you update the PR to still log errors from plugin startup or synchronization failures, but return no error so we don't prevent the runtime from starting up.

@Iceber
Copy link
Member Author

Iceber commented Sep 24, 2024

@mikebrow @klihub updated to print logs only, PTAL, Thanks

@klihub klihub self-requested a review September 24, 2024 07:01
Copy link
Member

@klihub klihub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @Iceber !

@mikebrow mikebrow merged commit dda9682 into containerd:main Sep 24, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants