
Improve container log collection #3007

Open
HyeockJinKim opened this issue Oct 31, 2024 · 1 comment
References

Motivation

Currently, there is a known issue with log collection where excessive log size leads to connection errors when querying session logs in the web UI.
This issue requires urgent attention, especially as the volume of logs can increase rapidly when using model services.
While stopgap measures such as rotating out old log files or paginating log queries to limit the amount of data requested at once can be helpful,
we aim to fundamentally improve the system by collecting logs in a location separate from the agent.
This separation allows for enhanced scalability, as it reduces the load on the agent and prevents potential bottlenecks.
Moreover, it enables better log management and analysis, as external logging servers can provide more robust features for storage, retrieval, and querying of logs, facilitating quicker access to critical information during troubleshooting and analysis.

Main Tasks

I want to adopt a method where collected logs are not stored in the agent but instead forwarded to an external logging server. Ideally, I would like to leverage existing open-source projects rather than implementing this from scratch.

I suggest that we focus on the following tasks:

  1. Implement log rotation to prevent excessive accumulation of logs.
  2. Analyze and compare open-source tools for log collection (such as Fluent Bit, Logstash, Vector, etc.) to determine which tool is most suitable for backend.ai.
    • Ensure that the log collection tool allows for easy configuration changes to the storage location of the logs.
  3. Identify potential changes to the installation guidelines and, if possible, set up the ability to apply new features in a configurable manner.
  4. (Optional) Consider implementing visualization tools for statistical metrics or problem analysis in the future.
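As a concrete starting point for task 1, Docker's built-in `json-file` logging driver already supports size-based rotation via daemon options. A minimal sketch (the size and file-count values below are illustrative, not a recommendation):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}
```

Placed in `/etc/docker/daemon.json`, this caps each container's log at five rotated files of 10 MB each, which bounds the worst-case payload of a session log query even before an external collector is in place.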

Expected Results

  • Effectively manage the size of log files to prevent excessive delays in requests.
  • Achieve the ability to configure the desired output location for collected logs without extensive implementation efforts.
  • (Optional) Establish a foundation for gaining data insights and quickly identifying issues through visualization tools for statistical metrics and problem analysis.
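To illustrate the second expected result: with a collector such as Fluent Bit (one of the candidates listed above), changing where logs are stored is a configuration edit rather than an implementation effort. A hedged sketch, assuming the default Docker JSON log path and a placeholder destination host:

```ini
[INPUT]
    Name    tail
    Path    /var/lib/docker/containers/*/*-json.log
    Parser  docker

# Switching the storage location only requires replacing this
# OUTPUT block (e.g., forward -> es, s3, loki); no agent changes.
[OUTPUT]
    Name    forward
    Match   *
    Host    log-store.internal
    Port    24224
```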
@HyeockJinKim HyeockJinKim added this to the 24.12 milestone Oct 31, 2024
@achimnol achimnol changed the title Log Collection Improvement Tasks Log collection Oct 31, 2024
@achimnol achimnol changed the title Log collection Improve container log collection Oct 31, 2024
achimnol (Member) commented Oct 31, 2024

Technical considerations:

  • Where to store the actual log streams?
    • We could consider a special system folder in a storage volume, but the resource-group-to-storage-volume mappings may not always be available.
      • In that case, we would need to keep the location information in the kernels table to retrieve the persistent log data.
    • The prior .logs vfolder approach is also fragile when a storage volume is inaccessible from specific resource groups, and the .logs vfolder must be created manually by the user.
    • Refs to the old implementation:
  • HA Setup
    • If we adopt an open-source log store and browser, it should conform with the HA setup scenario of the Backend.AI Manager service, with an explicit guide on how to add it to an existing installation.
  • Collecting mechanism
    • We could use an existing Docker logging driver. But we also need to consider how they would continue to work when we migrate to containerd or CRI.
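For the Docker-driver option, forwarding to an external collector is a per-daemon (or per-container) setting. A sketch using the `fluentd` driver, with a placeholder collector address:

```json
{
  "log-driver": "fluentd",
  "log-opts": {
    "fluentd-address": "log-collector.internal:24224",
    "fluentd-async": "true",
    "tag": "backend.ai.{{.ID}}"
  }
}
```

Note that this driver is Docker-specific, which is exactly the migration concern above: under containerd or a CRI runtime, the equivalent would be a node-level collector tailing the runtime's log files instead.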

Ideas:

  • Keep the startup logs (e.g., the first one-hour) of long-running containers even when the log stream is truncated due to the configured size/retention limits. This will ease troubleshooting/debugging of inference runtimes by checking how they are configured and started.
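The startup-log idea above could be sketched as a retention policy that always preserves the first window of a container's log stream and size-limits only the rest. A minimal Python sketch; the record shape, window, and limits are hypothetical, not part of Backend.AI:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LogRecord:
    ts: datetime   # timestamp of the log line
    line: str      # raw log text

def trim_logs(
    records: list[LogRecord],
    startup_window: timedelta = timedelta(hours=1),
    tail_limit: int = 1000,
) -> list[LogRecord]:
    """Drop middle records, keeping the startup window and the newest tail.

    Assumes `records` is sorted by timestamp (oldest first).
    """
    if not records:
        return []
    cutoff = records[0].ts + startup_window
    # Startup records form a contiguous prefix because the list is time-ordered.
    startup = [r for r in records if r.ts <= cutoff]
    tail = records[len(startup):][-tail_limit:]
    return startup + tail
```

The same policy could equally be expressed as collector-side filter rules; the sketch just makes the invariant explicit: truncation never touches the startup window, so inference-runtime configuration output survives any retention limit.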
