Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Status plugin delays start-up when falling back to persisted bundles #6343

Closed
charlieegan3 opened this issue Oct 25, 2023 · 6 comments · Fixed by #6345
Closed

Status plugin delays start-up when falling back to persisted bundles #6343

charlieegan3 opened this issue Oct 25, 2023 · 6 comments · Fixed by #6345
Labels

Comments

@charlieegan3
Copy link
Contributor

When persisting discovery bundles, and having configured a status service, when opa restarts during downtime in the status service, startup is delayed while attempting to run status plugin.

To reproduce the issue:

Create a simple go server:

package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"os"

	"github.com/open-policy-agent/opa/bundle"
)

func main() {

	port := 8080

	if len(os.Args) > 1 {
		http.HandleFunc("/discovery", discoveryHandler)
		http.HandleFunc("/bundle", bundleHandler)

		fmt.Printf("Starting server on localhost:%d...\n", port)
		err := http.ListenAndServe(fmt.Sprintf("localhost:%d", port), nil)
		if err != nil {
			fmt.Printf("Error starting server: %s\n", err)
		}

	}

	// start a timeouting server
	listener, err := net.Listen("tcp", fmt.Sprintf("localhost:%d", port))
	if err != nil {
		fmt.Println("Error starting server:", err)
		return
	}
	defer listener.Close()

	// This server will run indefinitely without accepting incoming connections.
	select {}

}

func discoveryHandler(w http.ResponseWriter, r *http.Request) {

	b := bundle.Bundle{
		Data: map[string]interface{}{
			"discovery": map[string]interface{}{
				"services": []interface{}{
					map[string]interface{}{
						"name": "localhost",
						"url":  "http://localhost:8080",
					},
				},
				"status": map[string]interface{}{
					"service": "localhost",
				},
				"bundles": map[string]interface{}{
					"main": map[string]interface{}{
						"service":  "localhost",
						"resource": "/bundle",
						"persist":  true,
					},
				},
			},
		},
	}

	w.Header().Set("content-type", "application/vnd.openpolicyagent.bundles")
	err := bundle.NewWriter(w).Write(b)
	if err != nil {
		log.Printf("error creating bundle: %v\n", err)
		w.WriteHeader(http.StatusInternalServerError)
	}
}

func bundleHandler(w http.ResponseWriter, r *http.Request) {

	b := bundle.Bundle{
		Data: map[string]interface{}{"foo": "bar"},
	}

	w.Header().Set("content-type", "application/vnd.openpolicyagent.bundles")
	err := bundle.NewWriter(w).Write(b)
	if err != nil {
		log.Printf("error creating bundle: %v\n", err)
		w.WriteHeader(http.StatusInternalServerError)
	}
}

Run the server:

go run main.go server

Create some opa config:

services:
  - name: localhost
    url: http://localhost:8080
discovery:
  name: localhost
  resource: /discovery
  decision: discovery
  persist: true
  polling:
    min_delay_seconds: 10
    max_delay_seconds: 10

Run OPA, clearing the local state:

rm -rf .opa && go run main.go run -s --addr=127.0.0.1:8181 --config-file=config.yaml

OPA starts ok, and the bundles are persisted to disk.

Now stop the go server and OPA, and run the server in 'timeout mode' where a dial timeout is simulated:

go run main.go

Now run OPA again, using the local state:

go run main.go run -s --addr=127.0.0.1 --config-file=config.yaml

We see OPA waits trying to report the status before starting:

$ go run main.go run -s --addr=127.0.0.1:8181 --config-file=config.yaml
{"addrs":["127.0.0.1:8181"],"diagnostic-addrs":[],"level":"info","msg":"Initializing server.","time":"2023-10-25T12:08:59+01:00"}
{"level":"info","msg":"Starting bundle loader.","name":"main","plugin":"bundle","time":"2023-10-25T12:08:59+01:00"}
{"level":"info","msg":"Starting status reporter.","plugin":"status","time":"2023-10-25T12:08:59+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:19+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:19+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:29+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:39+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:39+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:09:49+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:09:59+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:09:59+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:10:09+01:00"}
{"level":"error","msg":"Bundle load failed: request failed: Get \"http://localhost:8080/bundle\": net/http: timeout awaiting response headers","name":"main","plugin":"bundle","time":"2023-10-25T12:10:09+01:00"}
{"level":"error","msg":"Status update failed: Post \"http://localhost:8080/status\": net/http: timeout awaiting response headers.","plugin":"status","time":"2023-10-25T12:10:19+01:00"}
{"level":"error","msg":"Discovery download failed: request failed: Get \"http://localhost:8080/discovery\": net/http: timeout awaiting response headers","plugin":"discovery","time":"2023-10-25T12:10:19+01:00"}
{"client_addr":"127.0.0.1:62971","level":"info","msg":"Received request.","req_id":1,"req_method":"GET","req_path":"/health","time":"2023-10-25T12:10:26+01:00"}
{"client_addr":"127.0.0.1:62971","level":"info","msg":"Sent response.","req_id":1,"req_method":"GET","req_path":"/health","resp_bytes":3,"resp_duration":0.788292,"resp_status":200,"time":"2023-10-25T12:10:26+01:00"}

Note the timestamps, and that all this time the agent isn't serving. The instance does start, but it is delayed.

This appears to be related to the loop activating plugins here: https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L240. It'd be good to avoid having the status plugin delaying boot when using persisted bundles.

@charlieegan3 charlieegan3 changed the title Status plugin delays start-up during discovery bundle server downtime Status plugin delays start-up when falling back to persisted bundles Oct 25, 2023
@bdytgj
Copy link

bdytgj commented Oct 25, 2023

Also notice that the for retry := 0; retry < maxActivationRetry; retry++ loop in https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L259 seems to lack a break statement at the end, so even when successfully activating the bundle in first attempt it will still run the loop 10 times (further slowing down the startup time).
I have confirmed this with logs (seeing 10 instances of the "Discovery bundle loaded from disk and activated successfully." debug log message before the server starts).

charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 25, 2023
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 25, 2023
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 25, 2023
@ashutosh-narkar
Copy link
Member

Also notice that the for retry := 0; retry < maxActivationRetry; retry++ loop in https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L259 seems to lack a break statement at the end, so even when successfully activating the bundle in first attempt it will still run the loop 10 times (further slowing down the startup time). I have confirmed this with logs (seeing 10 instances of the "Discovery bundle loaded from disk and activated successfully." debug log message before the server starts).

Good catch. This is a bug.

@ashutosh-narkar
Copy link
Member

Note the timestamps, and that all this time the agent isn't serving. The instance does start, but it is delayed.

This appears to be related to the loop activating plugins here: https://github.com/open-policy-agent/opa/blob/main/plugins/discovery/discovery.go#L240. It'd be good to avoid having the status plugin delaying boot when using persisted bundles.

@charlieegan3 I haven't tried this but have we looked into using buffered channels?

@ashutosh-narkar
Copy link
Member

Alternatively since the manager has the plugin status on it, we could do the status upload async.

@charlieegan3
Copy link
Contributor Author

Hey, I've been having a play with using a buffered channel for the updates this morning. It works, but our tests are quite brittle around some of this stuff and so they need to be reworked. I'll see where I can get to this afternoon.

@charlieegan3
Copy link
Contributor Author

#6345 is all I had time for today. I've tried to make some careful updates to tests to maintain the behaviour, but remove the expectation that the status plugin will be in State:OK immediately. Take a look, I'm out tomorrow so feel free to take over and make edits as needed.

charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 30, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 30, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 31, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Oct 31, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Nov 1, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Nov 2, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Nov 6, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
ashutosh-narkar pushed a commit to charlieegan3/opa that referenced this issue Nov 6, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit to charlieegan3/opa that referenced this issue Nov 6, 2023
This is all that's needed to fix open-policy-agent#6343

Signed-off-by: Charlie Egan <[email protected]>
charlieegan3 added a commit that referenced this issue Nov 6, 2023
Fixes #6343

A status endpoint that times out, will no longer delay boots.

Signed-off-by: Charlie Egan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants