The error type is missing when stub.Run fails #83

kangclzjc · 2024-05-13T07:56:24Z

I want to implement NRI reconnect logic in onClose() which is when the NRI plugin loses connection to the NRI server, the NRI plugin attempts to reconnect. However, during the reconnection attempt, we need to detect the possible error types of the connection and handle them differently based on the error type. However, the current error return of the NRI plugin's Start function does not distinguish between error categories. For example, whether the connection was lost due to containerd restarting or NRI being disabled on the containerd side. If NRI is disabled on the containerd side, the NRI plugin does not need to reconnect anymore. If containerd simply restarts, the NRI plugin needs to keep trying to reconnect until the connection is successful. The corresponding code in the nri repository is https://github.com/containerd/nri/blob/main/pkg/stub/stub.go#L315. I paste below.

In the Start function, we can return errors because of several reasons:

stub already started
stub.connect() return error
failed to listen multiplex.PluginServiceConn
failed to multiplex ttrpc client connection
failed to register plugin
...

Can we give more info for us to detect error types not an error string?

// Start event processing, register to NRI and wait for getting configured.
func (stub *stub) Start(ctx context.Context) (retErr error) {
	stub.Lock()
	defer stub.Unlock()

	if stub.started {
		return fmt.Errorf("stub already started")
	}
	stub.started = true

	err := stub.connect()
	if err != nil {
		return err
	}

	rpcm := multiplex.Multiplex(stub.conn)
	defer func() {
		if retErr != nil {
			rpcm.Close()
			stub.rpcm = nil
		}
	}()

	rpcl, err := rpcm.Listen(multiplex.PluginServiceConn)
	if err != nil {
		return err
	}
	defer func() {
		if retErr != nil {
			rpcl.Close()
			stub.rpcl = nil
		}
	}()

	rpcs, err := ttrpc.NewServer(stub.serverOpts...)
	if err != nil {
		return fmt.Errorf("failed to create ttrpc server: %w", err)
	}
	defer func() {
		if retErr != nil {
			rpcs.Close()
			stub.rpcs = nil
		}
	}()

	api.RegisterPluginService(rpcs, stub)

	conn, err := rpcm.Open(multiplex.RuntimeServiceConn)
	if err != nil {
		return fmt.Errorf("failed to multiplex ttrpc client connection: %w", err)
	}

	clientOpts := []ttrpc.ClientOpts{
		ttrpc.WithOnClose(func() {
			stub.connClosed()
		}),
	}
	rpcc := ttrpc.NewClient(conn, append(clientOpts, stub.clientOpts...)...)
	defer func() {
		if retErr != nil {
			rpcc.Close()
			stub.rpcc = nil
		}
	}()

	stub.srvErrC = make(chan error, 1)
	stub.cfgErrC = make(chan error, 1)
	go func() {
		stub.srvErrC <- rpcs.Serve(ctx, rpcl)
		close(stub.doneC)
	}()

	stub.rpcm = rpcm
	stub.rpcl = rpcl
	stub.rpcs = rpcs
	stub.rpcc = rpcc

	stub.runtime = api.NewRuntimeClient(rpcc)

	if err = stub.register(ctx); err != nil {
		stub.close()
		return err
	}

	if err = <-stub.cfgErrC; err != nil {
		return err
	}
}

The text was updated successfully, but these errors were encountered:

klihub · 2024-05-13T09:03:06Z

If it is enough to tell apart whether NRI is disabled in the runtime (an ENOENT or ECONNREFUSED, I guess) or something else, then that might be already possible using stdlib errors.Is, errors.As etc. on the returned error. You could try something similar to this simple hack I briefly tested:

	err = p.stub.Run(context.Background())
	if err != nil {
		log.Errorf("plugin exited with error %v", err)
		if errors.Is(err, syscall.ECONNREFUSED) || errors.Is(err, syscall.ENOENT) {
			fmt.Printf("*** NRI disabled in runtime ?\n")
		}
		os.Exit(1)
	}

If that is not enough, then maybe what we should do is:

define and export errors for classes of failures
use those errors in returned errors, so you could test the error class with errors.As, errors.Is and friends.

So maybe something like this below, but probably with a more carefully thought through and more minimal set of errors...

...
var (
    ErrAlreadyStarted = fmt.Errorf("already started")
    ErrConn           = fmt.Errorf("failed to connect")
    ErrMuxListen      = fmt.Errorf("failed to listen on multiplexed connection")
    ErrMuxOpen        = fmt.Errorf("failed to open multiplexed connection")
    ErrTtrpcServer    = fmt.Errorf("failed to create ttrpc server")
    ErrRegister       = fmt.Errorf("failed to register to NRI server")
    ErrPuginConf      = fmt.Errorf("failed to configure plugin")
    ...
)

...
if stub.started {
    return ErrAlreadyStarted
}
...
if err := stub.connect(); err != nil {
    return fmt.Errorf("%w: %w", ErrConn, err)
}
...
rpcl, err := rpcm.Listen(multiplex.PluginServiceConn)
if err != nil {
    return fmt.Errorf("%w: %w", ErrMuxListen, err)
}
...
rpcs, err := ttrpc.NewServer(stub.serverOpts...)
if err != nil {
    return fmt.Errorf("%w: %w", ErrTtrpcServer, err)
}
...

kangclzjc · 2024-05-15T07:38:48Z

Thansk @klihub. I believe define and export errors for classes of failures may be useful for users. But I still have one question, it's difficult to distinguish between the two situations:

in normal containerd restart period while NRI is enable
disable NRI, and then containerd restart

In these two situations, stub.connect() will return the same error. How do I tell apart these two situations. And do you think is it necessary to distinguish? If yes, can you give me an example and probably I can add this and reconnection code into the nri sample plugin.

klihub · 2024-05-15T08:05:49Z

I think the core of this question is

"The CRI runtime (containerd in this case) is not running right now, how do I know if this is just a transient state and it will be running again soon?"

And I think the answer is that without additional externally provided information there is no way of knowing this. And in this case your only option is to keep trying to reconnect.

Now (assuming that you are in full control of your cluster environment and plugin deployment), there are at least two possible extra sources of information I can think of:

If the runtime is (known and guaranteed to be) a systemd service on the host (with a known service name), you can query systemd and check the status of the runtime service. You can start reconnection attempts once the service is active and running, and give up after a while if you still can't reconnect. This requires access from your plugin to the host dbus socket for systemd access.
Another variant of this is using the CRI socket to check if the runtime is up. This requires access from your plugin to the host runtime socket (which is different for containerd and CRI-O).

That said, I think it is generally not very common that the runtime is reconfigured/restarted on a cluster worker node without draining and then restarting the whole node. However, if you have some tooling of your own which nevertheless does this then another and perhaps more proper solution would be to have that tooling provide the necessary extra information for your plugin to know whether to attempt reconnection or not. For instance, it could label the node as 'runtime being restarted' before it does it's deed then remove that label once the runtime has been fully reconfigured/updated/etc. and restarted. Or to label the node as NRI disabled, if that tooling decides to disable NRI in the runtime. Then your plugin can watch whichever related labels your tooling provides and act accordingly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The error type is missing when stub.Run fails #83

The error type is missing when stub.Run fails #83

kangclzjc commented May 13, 2024

klihub commented May 13, 2024 •

edited

Loading

kangclzjc commented May 15, 2024 •

edited

Loading

klihub commented May 15, 2024 •

edited

Loading

The error type is missing when stub.Run fails #83

The error type is missing when stub.Run fails #83

Comments

kangclzjc commented May 13, 2024

klihub commented May 13, 2024 • edited Loading

kangclzjc commented May 15, 2024 • edited Loading

klihub commented May 15, 2024 • edited Loading

klihub commented May 13, 2024 •

edited

Loading

kangclzjc commented May 15, 2024 •

edited

Loading

klihub commented May 15, 2024 •

edited

Loading