Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OTel]: enable persistence by default in our OTel distribution #5347

Closed
wants to merge 5 commits into from

Conversation

VihasMakwana
Copy link
Contributor

@VihasMakwana VihasMakwana commented Aug 23, 2024

  • Enhancement

What does this PR do?

  • By default, the filelog receiver stores offsets in memory rather than on disk. Since users might forget to enable on-disk storage themselves, this PR enables persistence by default in our sample configurations.

Why is it important?

  • If the collector is restarted, there could either be data duplication (if it starts re-reading files from the beginning) or data loss (if it starts reading files from the end and new data has been written to these files during the restart).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

  • Pull from this branch.
  • Run elastic-agent otel
  • Try ingesting some data
  • Shutdown the otel command
  • Rerun the elastic-agent otel
  • Write new data
  • You shouldn't see duplicated data, as offsets will be stored on disk.

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@VihasMakwana VihasMakwana added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Aug 23, 2024
@VihasMakwana VihasMakwana self-assigned this Aug 23, 2024
Copy link
Contributor

mergify bot commented Aug 23, 2024

This pull request does not have a backport label. Could you fix it @VihasMakwana? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@VihasMakwana VihasMakwana changed the title chore: enable persistence by default in our OTeL distrubtion [Otel]: enable persistence by default in our OTeL distrubtion Aug 23, 2024
@VihasMakwana VihasMakwana marked this pull request as ready for review August 23, 2024 10:01
@VihasMakwana VihasMakwana requested a review from a team as a code owner August 23, 2024 10:01
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Contributor

mergify bot commented Aug 23, 2024

This pull request does not have a backport label. Could you fix it @VihasMakwana? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@VihasMakwana
Copy link
Contributor Author

@elastic/elastic-agent-control-plane can someone take a look here?

@strawgate
Copy link
Contributor

Do we also need to update the sample yaml configurations that onboarding uses?

@strawgate strawgate changed the title [Otel]: enable persistence by default in our OTeL distrubtion [OTel]: enable persistence by default in our OTeL distribution Aug 29, 2024
@strawgate strawgate changed the title [OTel]: enable persistence by default in our OTeL distribution [OTel]: enable persistence by default in our OTel distribution Aug 29, 2024
@ycombinator
Copy link
Contributor

Do we also need to update the sample yaml configurations that onboarding uses?

Good catch. I think we should. They live here: https://github.com/elastic/elastic-agent/tree/main/internal/pkg/otel/samples

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Aug 29, 2024

@strawgate @ycombinator those configs already use the filestorage extension.
For eg.

receivers:
# Receiver for platform specific log files
filelog/platformlogs:
include: [ /var/log/*.log ]
retry_on_failure:
enabled: true
start_at: end
storage: file_storage
# start_at: beginning

@strawgate
Copy link
Contributor

strawgate commented Aug 29, 2024

It seems like STORAGE_DIR is probably too general to be using as an environment variable here?

If this PR changes the default then can these changes be removed from the onboarding yaml?

@VihasMakwana
Copy link
Contributor Author

@strawgate

If this PR changes the default then can these changes be removed from the onboarding yaml?

Not really,

If you take a look here

storageDir := os.Getenv("STORAGE_DIR")
if storageDir == "" {
// by default use "${path.data}/registry/otelcol" to store offsets
storageDir = filepath.Join(paths.Data(), "registry", "otelcol")
// set the STORAGE_DIR env. This will be used by otel collectore to get the registry directory
os.Setenv("STORAGE_DIR", storageDir)
}

I only override STORAGE_DIR if it's not set by the user (by default I use elastic-agent/data/registry/otelcol, but we can change the default).

If a user has set the STORAGE_DIR env, then we will not override the env.

@strawgate
Copy link
Contributor

strawgate commented Aug 29, 2024

For the first part of my question though -- It seems like STORAGE_DIR is probably too general to be using as an environment variable here?

@VihasMakwana
Copy link
Contributor Author

It seems like STORAGE_DIR is probably too general to be using as an environment variable here?

Sorry, I forgot to add it in above comment.

I agree with you. It's too general. I believe we should rename it in context of agent and otel.

@strawgate
Copy link
Contributor

I only override STORAGE_DIR if it's not set by the user (by default I use elastic-agent/data/registry/otelcol, but we can change the default).

So doesn't this mean that we no longer need to define this directory in the onboarding yaml? Do we want to create this folder if it doesn't exist?

@VihasMakwana
Copy link
Contributor Author

So doesn't this mean that we no longer need to define this directory in the onboarding yaml? Do we want to create this folder if it doesn't exist?

@strawgate In the onboarding yaml, we use the env variable STORAGE_DIR (quite a general name though). We don't specify any directory explicitly.
Example,

extensions:
file_storage:
directory: ${env:STORAGE_DIR}

Do we want to create this folder if it doesn't exist?

Yes. I've added that step in this PR.
The OTeL collector requires that the directory should exist or else it will not start up and will throw an error

@strawgate
Copy link
Contributor

strawgate commented Aug 29, 2024

@strawgate In the onboarding yaml, we use the env variable STORAGE_DIR (quite a general name though). We don't specify any directory explicitly.

This is no longer necessary though after this PR though because if it's not set, it defaults to env:storage_dir, so explicitly setting it to env:storage_dir is redundant and is not necessary anymore?

The onboarding flow uses this yaml and replaces the environment variables with actual values -- so it actually replaces env:STORAGE_DIR with ./data/otelcol which also shouldn't be needed anymore?

@VihasMakwana
Copy link
Contributor Author

This is no longer necessary though after this PR though because if it's not set, it defaults to env:storage_dir, so explicitly setting it to env:storage_dir is redundant and is not necessary anymore?

I believe it's necessary to set it in the onboarding YAML.

If you review the filestorage extension documentation, it specifies that the default directory used is /var/lib/otelcol/file_storage.

If we omit the env:storage_dir in onboarding.yaml, the OpenTelemetry Collector will default to storing offsets in /var/lib/otelcol/file_storage, as per the documentation. This may not be the intended behavior if we are expecting it to use env:storage_dir. Therefore, it's important to set it explicitly to avoid this issue.


Let me give an example to make this clear (hopefully ;)),

if a user has set storage_dir to /path/data/otelcol (because he wants to store offsets in this directory) and he runs the elastic-agent with the otel mode, then this PR will enusre that the /path/data/otelcol directory exists before booting up the otel collector.

However, if we remove the env:storage_dir setting from the onboarding YAML, the otel collector will revert to its default behavior and use the /var/lib/otelcol/file_storage directory for storage.

Therefore, omitting that setting means that the otel will not use the user-specified directory (/path/data/otelcol) and will instead use the default path, which may not align with the user's intended configuration.

Does that make sense?

@VihasMakwana
Copy link
Contributor Author

That being said,

If the OpenTelemetry Collector used env:storage_dir by default, your point would be valid.

@VihasMakwana
Copy link
Contributor Author

@strawgate your understanding is correct.
But we also need to explicitly set env:storage_dir here

extensions:
file_storage:
directory: ${env:STORAGE_DIR}

@strawgate
Copy link
Contributor

strawgate commented Aug 29, 2024

I see, so the remaining issue (from my perspective) is just us using the storage_dir environment variable -- though it probably makes more sense for the directory creation to be handled by the extension as in most actual cases this environment var will be unused anyway (as the onboarding flow removes it and a user is unlikely to populate this themselves).

@VihasMakwana
Copy link
Contributor Author

so the remaining issue (from my perspective) is just us using the storage_dir environment variable

yes.

though it probably makes more sense for the directory creation to be handled by the extension as in most cases this environment var will be unused anyway

I agree. But as of now, the extension doesn't handle that part.
We can discuss this with upstream by opening an issue.

@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Aug 29, 2024

@strawgate regarding the name, I believe we should switch to OTEL_STORAGE_DIR or OTEL_REGISTRY_DIR for the env variable. Let me know if you have any suggestions.

@strawgate
Copy link
Contributor

strawgate commented Aug 29, 2024

extensions:
file_storage:
directory

otel_ext_file_storage_dir?

Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we are ensuring the existence of a storage directory here but I am not sure that it's enough to enable persistence by default, without some default receiver/exporter/extensions/service configuration being injected.

Obviously the choice depends on what kind of compatibility we want to keep with upstream OTel collector we want to keep but in my view enabling persistence by default means that I can omit storage extensions settings in my config file and then still benefit from persistence by default but maybe I misunderstood the purpose of the change

@@ -80,3 +86,23 @@ func newSettings(version string, configPaths []string) (*otelcol.CollectorSettin
DisableGracefulShutdown: true,
}, nil
}

func ensureRegistryExists() error {
storageDir := os.Getenv("STORAGE_DIR")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reference or define a constant for the env variable name STORAGE_DIR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this should have a more specific name, we should also discuss if we should just reuse the existing STATE_PATH variable that controls the same thing.

Comment on lines +100 to +107
if _, err := os.Stat(storageDir); err == nil {
// directory exists
return nil
} else if os.IsNotExist(err) {
return os.MkdirAll(storageDir, 0755)
} else {
return fmt.Errorf("error stating %s: %w", storageDir, err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe restructure this block a bit

Suggested change
if _, err := os.Stat(storageDir); err == nil {
// directory exists
return nil
} else if os.IsNotExist(err) {
return os.MkdirAll(storageDir, 0755)
} else {
return fmt.Errorf("error stating %s: %w", storageDir, err)
}
err := os.Stat(storageDir);
if errors.Is(err, fs.ErrNotExist) {
// create directory if it doesn't exist
return os.MkdirAll(storageDir, 0755)
}
if err != nil {
// we have a generic error
return fmt.Errorf("error stating %s: %w", storageDir, err)
}
return nil

if err := ensureRegistryExists(); err != nil {
return fmt.Errorf("error while creating registry: %w", err)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check and inject the storage extension settings here? It may be conditional to a cli flag or some env variable or something else but I doubt very man people are gonna run the elastic otel collector with our sample config file and without that this change ensures only that a given dir exists/is created...

@andrzej-stencel
Copy link
Contributor

This pull request is named "enable persistence by default", but it does not do what the description says.

I believe what this change is intending to achieve is "make sure the directory used by the File Storage extension exists", but I it does not actually do that either.

What this change actually does is:

  • look at an arbitrary env var STORAGE_DIR (regardless of whether the env var is used in the collector's config or not),
  • default this env var to a specific path if unset,
  • create a directory at that path if it does not exist.

I see the following problems with this approach:

  1. It creates a directory in an arbitrary location, regardless of whether this location is being used by the collector for persistence. This may be surprising and cause unwanted side effects for users that don't use the STORAGE_DIR env var in their File Storage extension configuration.
  2. The collector fails to start if that arbitrary directory does not exist and cannot be created. This may be surprising to users that don't use File Storage extension and are running e.g. in a container with read-only file system.

A much better way to achieve the intention "make sure the directory used by the File Storage extension exists" would be to add an option to the upstream File Storage extension like create_directory, which, when set to true, would create the directory that is actually used by the extension, for each extension that is defined (and create no directory if no extension is defined). I suggest to create an issue in Contrib about it.

@blakerouse
Copy link
Contributor

Another option is to also ensure that the STORAGE_DIR is created during installation of the Elastic Agent inside of the Elastic Agent directory structure. Use that path as the default and if the user changes the default then its up to the user to create that directory. This would ensure that the path would also have the correct permissions and ownership when being installed.

It would also be nice to add ability to adjust the default in upstream instead of it being hard-coded.

@VihasMakwana
Copy link
Contributor Author

Thank you for your feedback!

I agree that handling directory creation should ideally be managed upstream. I've submitted an enhancement request to address this.

In the meantime, my current PR aimed to ensure the storage directory exists by using the storage_dir environment variable, as it's featured in our sample configs. However, this approach may lead to issues if the user doesn't include this setting in their otel config.

I propose repurposing the PR to focus solely on enabling filestorage in our sample configs. I will then submit a separate PR to address the directory creation flag.

Copy link
Contributor

mergify bot commented Sep 3, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b enable-persistence-elastic upstream/enable-persistence-elastic
git merge upstream/main
git push upstream enable-persistence-elastic

if storageDir == "" {

// by default use "${path.data}/registry/otelcol" to store offsets
storageDir = filepath.Join(paths.Data(), "registry", "otelcol")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably need some discussion on what our directory structure should be, considering https://github.com/elastic/ingest-dev/issues/3671 and #5304 for example.

There are also things besides filelog that may want disk persistent, who don't use a registry concept. If we needed more than this, where would they go in the directory structure? Here? What if we decided we needed more than one storage extension?

@cmacknz
Copy link
Member

cmacknz commented Sep 3, 2024

Taking a higher level look at this, I don't think we need to introduce the .go changes or the env var quite yet. These are not upstream functionality and if we expose them to users they will be hard to get rid of.

Fundamentally, lack of persistence for log collection is sub-optimal but not fatal. As long as we are only considering our pure OTel collector distribution (i.e. not Beats receivers yet) we should solve this with documentation and reference examples while making or advocating for upstream changes to make the configuration changes needed simpler.

+1 to the suggestion in #5347 (comment) to add a create_directory option to the storage extension, and as discussed elsewhere we should try to make it so that receivers that benefit from storage use it automatically if it is available.

For Beats receivers, we will have to store the Filebeat registry somewhere to maintain parity with what we do today, but we already have a place to put that we could reuse without adding anything new. We may opt to change where this state is stored, but until we have to do that let's avoid introducing any new configuration or env vars unless there is a really strong reason. Adding configuration is easy, it is much harder to take away later so let's not do it until we have to and I don't think we have to yet.

Copy link
Contributor

mergify bot commented Sep 10, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.

Copy link
Contributor

mergify bot commented Sep 11, 2024

backport-v8.x has been added to help with the transition to the new branch 8.x.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 11, 2024
@VihasMakwana
Copy link
Contributor Author

Closing this PR.
Once open-telemetry/opentelemetry-collector-contrib#34985 gets merged, I'll update the docs and raise a new PR as per #5347 (comment)

@ycombinator ycombinator deleted the enable-persistence-elastic branch September 11, 2024 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-skip enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants