
Feature request: Add no-op support for collector lambda layer #1181

Open
jerrytfleung opened this issue Mar 6, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@jerrytfleung
Contributor

Is your feature request related to a problem? Please describe.
If Config.Validate() of a component returns an error, the collector lambda layer cannot start in AWS Lambda. As a result, the user's lambda function is broken.
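
For context, a minimal sketch of the failure mode, with a hypothetical component config (the collector calls Validate during startup, and any error aborts it):

```go
package examplereceiver

import "errors"

// Config is a hypothetical component config illustrating the problem.
type Config struct {
	Endpoint string `mapstructure:"endpoint"`
}

// Validate is called by the collector during startup. In Lambda, a
// returned error propagates up, so the whole collector layer (and with
// it the function's init phase) fails.
func (c *Config) Validate() error {
	if c.Endpoint == "" {
		return errors.New("endpoint must be set")
	}
	return nil
}
```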

Describe the solution you'd like
Depending on the component, an invalid component configuration may not need to fail the whole collector lambda layer. We could instead let that component run in no-op mode.

Describe alternatives you've considered
I tried removing all the config validation logic from the component and moving it into the Start function, so that an invalid config only prints a message instead of failing startup (see the sketch below). However, an opentelemetry-collector-contrib code reviewer would like to check whether there is another way to go.
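
A rough sketch of that alternative, reusing the hypothetical Config from the problem description above: Validate always passes, and Start re-checks the config itself, degrading to a warning and no-op mode instead of returning an error.

```go
package examplereceiver

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.uber.org/zap"
)

type exampleReceiver struct {
	cfg    *Config
	logger *zap.Logger
	noop   bool
}

// Start performs the checks that previously lived in Config.Validate.
// On an invalid config it logs a warning and switches the component to
// no-op mode rather than failing the whole collector.
func (r *exampleReceiver) Start(_ context.Context, _ component.Host) error {
	if r.cfg.Endpoint == "" {
		r.logger.Warn("invalid config: endpoint must be set; running in no-op mode")
		r.noop = true
		return nil
	}
	// ...normal startup path...
	return nil
}
```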

Additional context
PR review comment
The component PR

@jerrytfleung added the "enhancement" label Mar 6, 2024
@jerrytfleung changed the title from "Add no-op support for collector lambda layer" to "Feature request: Add no-op support for collector lambda layer" Mar 6, 2024
@serkan-ozal
Contributor

I am not sure whether switching to no-op mode when the configuration is invalid is the correct approach, because it might be confusing for users and, as far as I know, it doesn't align with how OTel configurations are handled.

Instead of no-op, default values could be used for the invalid configs, failing fast only if there is no default value for the invalid config.
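
A sketch of that idea, extending the hypothetical Config from above with an interval field: apply a documented default where one exists, and fail fast only where none does.

```go
package examplereceiver

import (
	"errors"
	"time"
)

const defaultInterval = 30 * time.Second

type Config struct {
	Endpoint string        `mapstructure:"endpoint"`
	Interval time.Duration `mapstructure:"interval"`
}

func (c *Config) Validate() error {
	if c.Interval <= 0 {
		// Recoverable mistake: fall back to a documented default.
		c.Interval = defaultInterval
	}
	if c.Endpoint == "" {
		// No sensible default exists: fail fast.
		return errors.New("endpoint must be set and has no default")
	}
	return nil
}
```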

WDYT @tylerbenson?

@cheempz

cheempz commented Sep 3, 2024

> I am not sure whether switching to no-op mode when the configuration is invalid is the correct approach, because it might be confusing for users and, as far as I know, it doesn't align with how OTel configurations are handled.
>
> Instead of no-op, default values could be used for the invalid configs, failing fast only if there is no default value for the invalid config.
>
> WDYT @tylerbenson?

Adding some more context: it's reasonable for otelcol outside of Lambda to fail fast on invalid config, since the only consequence is that the collector doesn't run; it doesn't bring down the entire host. But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, which is like crashing the entire VM because otelcol didn't start. To me this is a pretty terrible user experience.

@tylerbenson
Member

I can see both arguments here, though I'm leaning towards fail fast being the better option. Might be worth discussing in the SIG meeting.

Lambda versions are generally immutable, so it's nice to know immediately if you configured something wrong. If a deployment is urgent, the rollback can be as easy as removing the collector layer and redeploying.

@serkan-ozal
Contributor

> But in Lambda, the otelcol extension failing means the entire Lambda runtime crashes, which is like crashing the entire VM because otelcol didn't start.

BTW, I am really not sure whether the entire Lambda environment crashes if/when an extension fails gracefully (by calling the /extension/init/error endpoint).

@serkan-ozal
Contributor

Also, AWS Lambda encourages extensions to fail fast: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html#runtimes-extensions-init-error
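
For reference, a minimal sketch of what failing gracefully looks like against the documented Extensions API (the extensionID is assumed to come from the earlier /extension/register call):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// reportInitError tells the Lambda Extensions API that extension
// initialization failed, instead of simply exiting. The endpoint and
// headers follow the AWS docs linked above; extensionID is the value
// returned by the /extension/register call.
func reportInitError(extensionID string, cause error) error {
	url := fmt.Sprintf("http://%s/2020-01-01/extension/init/error",
		os.Getenv("AWS_LAMBDA_RUNTIME_API"))
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(cause.Error()))
	if err != nil {
		return err
	}
	req.Header.Set("Lambda-Extension-Identifier", extensionID)
	req.Header.Set("Lambda-Extension-Function-Error-Type", "Extension.InitError")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}
```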

@cheempz

cheempz commented Sep 4, 2024

That's good to know re: the /extension/init/error endpoint; it seems the otelcol extension is already using it. From a quick test of an otelcol extension with a misconfigured pipeline, the failure doesn't result in a crash but in an Extension.InitError:

Test Event Name
(unsaved) test event

Response
{
  "errorType": "Extension.InitError",
  "errorMessage": "RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0"
}

Function Logs
TELEMETRY	Name: collector	State: Subscribed	Types: [Platform]
{"level":"warn","ts":1725481081.1169627,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"invalid configuration: service::pipelines::logs: references receiver \"telemetryapi\" which is not configured"}
EXTENSION	Name: collector	State: InitError	Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 439.08 ms	Phase: init	Status: error	Error Type: Extension.InitError
TELEMETRY	Name: collector	State: Already subscribed	Types: [Platform]
{"level":"warn","ts":1725481086.9609792,"logger":"lifecycle.manager","msg":"Failed to start the extension","error":"unable to start, otelcol state is Closed"}
EXTENSION	Name: collector	State: InitError	Events: [INVOKE, SHUTDOWN]
INIT_REPORT Init Duration: 5971.95 ms	Phase: invoke	Status: error	Error Type: Extension.InitError
START RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Version: $LATEST
RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7 Error: exit code 0
Extension.InitError
END RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7
REPORT RequestId: 32e01a48-1b65-4559-9ba6-ec4620f689d7	Duration: 6031.77 ms	Billed Duration: 6032 ms	Memory Size: 128 MB	Max Memory Used: 75 MB

Request ID
32e01a48-1b65-4559-9ba6-ec4620f689d7
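
For reference, a hypothetical minimal collector config that would reproduce the pipeline error in the logs above (the receiver name is taken from the error message; the other components are illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  debug: {}

service:
  pipelines:
    logs:
      # "telemetryapi" is not defined under receivers, so startup
      # validation fails with "references receiver ... which is not configured"
      receivers: [telemetryapi]
      exporters: [debug]
```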

Still, the end result is that the application is unavailable, and I do think it's pretty disruptive even given the recourse available. It goes against the expectation that observability tools strive to cause as little disruption to the application as possible.

@serkan-ozal
Contributor

serkan-ozal commented Sep 4, 2024

I still prefer failing fast when something is not configured properly. Pre-prod environments are there to catch such cases before they happen in production. If the problem is silently ignored (even though there are error logs), I am pretty sure that most people and companies will not notice it until they find out they have been missing traces for some time.

I agree that both approaches have their pros and cons, but IMO, being aware of issues earlier is more important than suppressing them.
