
Improve container log collection #3007

Open
HyeockJinKim opened this issue Oct 31, 2024 · 1 comment
References

Motivation

Currently, there is a known issue with log collection where excessive log size leads to connection errors when querying session logs in the web UI.
This issue requires urgent attention, especially as the volume of logs can increase rapidly when using model services.
While stopgap measures such as rotating out old log files or paginating log queries to limit the amount of data requested at once can be helpful,
we aim to fundamentally improve the system by collecting logs in a location separate from the agent.
This separation allows for enhanced scalability, as it reduces the load on the agent and prevents potential bottlenecks.
Moreover, it enables better log management and analysis, as external logging servers can provide more robust features for storage, retrieval, and querying of logs, facilitating quicker access to critical information during troubleshooting and analysis.

Main Tasks

I want to adopt a method where collected logs are not stored in the agent but instead forwarded to an external logging server. Ideally, I would like to leverage existing open-source projects rather than implementing this from scratch.

I suggest that we focus on the following tasks:

  1. Implement log rotation to prevent excessive accumulation of logs.
  2. Analyze and compare open-source tools for log collection (such as Fluent Bit, Logstash, Vector, etc.) to determine which tool is most suitable for backend.ai.
    • Ensure that the log collection tool allows for easy configuration changes to the storage location of the logs.
  3. Identify potential changes to the installation guidelines and, if possible, set up the ability to apply new features in a configurable manner.
  4. (Optional) Consider implementing visualization tools for statistical metrics or problem analysis in the future.
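As a concrete starting point for task 1, Docker's built-in `json-file` logging driver already supports size-based rotation via daemon options. A minimal sketch (the size and file-count values below are illustrative, not a recommendation):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}
```

Placed in `/etc/docker/daemon.json`, this caps each container's log at five rotated files of 10 MB each, which bounds the worst-case payload of a session log query even before an external collector is in place.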

Expected Results

  • Effectively manage the size of log files to prevent excessive delays in requests.
  • Achieve the ability to configure the desired output location for collected logs without extensive implementation efforts.
  • (Optional) Establish a foundation for gaining data insights and quickly identifying issues through visualization tools for statistical metrics and problem analysis.
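To illustrate the second expected result: with a collector such as Fluent Bit (one of the candidates listed above), changing where logs are stored is a configuration edit rather than an implementation effort. A hedged sketch, assuming the default Docker JSON log path and a placeholder destination host:

```ini
[INPUT]
    Name    tail
    Path    /var/lib/docker/containers/*/*-json.log
    Parser  docker

# Switching the storage location only requires replacing this
# OUTPUT block (e.g., forward -> es, s3, loki); no agent changes.
[OUTPUT]
    Name    forward
    Match   *
    Host    log-store.internal
    Port    24224
```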
@HyeockJinKim HyeockJinKim added this to the 24.12 milestone Oct 31, 2024
@achimnol achimnol changed the title Log Collection Improvement Tasks Log collection Oct 31, 2024
@achimnol achimnol changed the title Log collection Improve container log collection Oct 31, 2024
achimnol (Member) commented Oct 31, 2024

Technical considerations:

  • Where to store the actual log streams?
    • We could consider a special system folder in a storage volume, but the resource-group-to-storage-volume mappings may not always be available.
      • In that case, we would need to keep the location information in the kernels table to retrieve the persistent log data.
    • The prior .logs vfolder approach is also fragile when a storage volume is inaccessible from specific resource groups, and the .logs vfolder must be created manually by the user.
    • Refs to the old implementation:
  • HA Setup
    • If we adopt an open-source log store and browser, it should conform with the HA setup scenario of the Backend.AI Manager service, with an explicit guide on how to add it to an existing installation.
  • Collecting mechanism
    • We could use an existing Docker logging driver. But we also need to consider how they would continue to work when we migrate to containerd or CRI.
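For the Docker-driver option, forwarding to an external collector is a per-daemon (or per-container) setting. A sketch using the `fluentd` driver, with a placeholder collector address:

```json
{
  "log-driver": "fluentd",
  "log-opts": {
    "fluentd-address": "log-collector.internal:24224",
    "fluentd-async": "true",
    "tag": "backend.ai.{{.ID}}"
  }
}
```

Note that this driver is Docker-specific, which is exactly the migration concern above: under containerd or a CRI runtime, the equivalent would be a node-level collector tailing the runtime's log files instead.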

Ideas:

  • Keep the startup logs (e.g., the first one-hour) of long-running containers even when the log stream is truncated due to the configured size/retention limits. This will ease troubleshooting/debugging of inference runtimes by checking how they are configured and started.
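The startup-log idea above could be sketched as a retention policy that always preserves the first window of a container's log stream and size-limits only the rest. A minimal Python sketch; the record shape, window, and limits are hypothetical, not part of Backend.AI:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LogRecord:
    ts: datetime   # timestamp of the log line
    line: str      # raw log text

def trim_logs(
    records: list[LogRecord],
    startup_window: timedelta = timedelta(hours=1),
    tail_limit: int = 1000,
) -> list[LogRecord]:
    """Drop middle records, keeping the startup window and the newest tail.

    Assumes `records` is sorted by timestamp (oldest first).
    """
    if not records:
        return []
    cutoff = records[0].ts + startup_window
    # Startup records form a contiguous prefix because the list is time-ordered.
    startup = [r for r in records if r.ts <= cutoff]
    tail = records[len(startup):][-tail_limit:]
    return startup + tail
```

The same policy could equally be expressed as collector-side filter rules; the sketch just makes the invariant explicit: truncation never touches the startup window, so inference-runtime configuration output survives any retention limit.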
