Motivation
Log collection has a known issue: when a session's logs grow too large, querying them in the web UI fails with connection errors.
This needs urgent attention because log volume can grow rapidly when running model services.
Log rotation (removing old log files) and pagination (limiting the amount of data requested at one time) are helpful mitigations,
but we aim to fundamentally improve the system by collecting logs in a location separate from the agent.
This separation improves scalability by reducing the load on the agent and preventing potential bottlenecks.
It also enables better log management and analysis: external logging servers offer more robust storage, retrieval, and querying, which speeds up access to critical information during troubleshooting.
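The pagination mitigation mentioned above can be illustrated with a minimal sketch. It assumes the logs are available as a plain file and that the client passes back a byte offset; `LogPage` and `read_log_page` are hypothetical names for illustration, not part of the existing Backend.AI API.

```python
import os
from dataclasses import dataclass


@dataclass
class LogPage:
    data: bytes        # raw log bytes contained in this page
    next_offset: int   # offset the client should send in its next request
    eof: bool          # True once the end of the file has been reached


def read_log_page(path: str, offset: int = 0, limit: int = 64 * 1024) -> LogPage:
    """Return at most `limit` bytes of the log file starting at `offset`."""
    size = os.path.getsize(path)
    # Clamp the offset so that stale or malformed requests cannot seek out of range.
    offset = max(0, min(offset, size))
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(limit)
    next_offset = offset + len(data)
    return LogPage(data=data, next_offset=next_offset, eof=next_offset >= size)
```

With this shape, a single request never transfers more than `limit` bytes regardless of the total log size, which bounds the response size but does not reduce what the agent has to store.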
Main Tasks
I want to adopt a method where collected logs are not stored in the agent but instead forwarded to an external logging server. Ideally, I would like to leverage existing open-source projects rather than implementing this from scratch.
I suggest that we focus on the following tasks:
Implement log rotation to prevent excessive accumulation of logs.
Analyze and compare open-source tools for log collection (such as Fluent Bit, Logstash, Vector, etc.) to determine which tool is most suitable for Backend.AI.
Ensure that the log collection tool allows for easy configuration changes to the storage location of the logs.
Identify the required changes to the installation guidelines and, if possible, make the new features configurable so that they can be enabled per installation.
(Optional) Consider implementing visualization tools for statistical metrics or problem analysis in the future.
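As a rough illustration of the first task, size-based rotation can be sketched with the standard library's `RotatingFileHandler`. This is only a sketch: the logger name, the default limits, and the helper `make_rotating_logger` are assumptions for the example, and the actual agent may collect container output through a different path.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(
    path: str,
    max_bytes: int = 10 * 1024 * 1024,  # assumed default: 10 MiB per file
    backup_count: int = 3,              # assumed default: keep 3 rotated files
) -> logging.Logger:
    """Create a logger whose file is rotated once it would exceed `max_bytes`."""
    logger = logging.getLogger(f"container-log:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these records out of the root logger
    if not logger.handlers:   # avoid attaching duplicate handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The total disk usage per log is then bounded by roughly `max_bytes * (backup_count + 1)`, at the cost of discarding the oldest records.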
Expected Results
Effectively manage the size of log files to prevent excessive delays in requests.
Make the output location for collected logs configurable without extensive implementation effort.
(Optional) Establish a foundation for gaining data insights and quickly identifying issues through visualization tools for statistical metrics and problem analysis.
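One way to picture the "configurable output location" result is a small sink registry selected by a configuration dictionary. The sketch below is hypothetical throughout: `LogSink`, `FileSink`, `MemorySink`, `make_sink`, and the `type` key are illustrative names, and a real deployment would more likely delegate this to the chosen collector's own output configuration.

```python
import json
from typing import Protocol


class LogSink(Protocol):
    def write(self, record: dict) -> None: ...


class FileSink:
    """Append each record as one JSON line to a local file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


class MemorySink:
    """Keep records in memory; useful for tests."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def write(self, record: dict) -> None:
        self.records.append(record)


# Registry of available sink types; new destinations are added here.
SINKS = {"file": FileSink, "memory": MemorySink}


def make_sink(config: dict) -> LogSink:
    """Build a sink from a config such as {"type": "file", "path": "..."}."""
    kind = config["type"]
    options = {k: v for k, v in config.items() if k != "type"}
    try:
        factory = SINKS[kind]
    except KeyError:
        raise ValueError(f"unknown log sink type: {kind!r}") from None
    return factory(**options)
```

Switching the destination then only requires a configuration change, for example from `{"type": "memory"}` to `{"type": "file", "path": "..."}`.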
We could consider a special system folder in a storage volume, but the resource group and storage volume mappings may not always be available.
In that case, we need to keep the location information in the kernels table so that the persistent log data can be retrieved.
The prior .logs vfolder approach is also error-prone when storage volumes are inaccessible from specific resource groups, and it requires the user to create the .logs vfolder manually.
If we adopt an open-source log store and browser, it should conform with the HA setup scenario of the Backend.AI Manager service, with an explicit guide on how to add it to an existing installation.
Collecting mechanism
We could use an existing Docker logging driver, but we also need to consider how it would continue to work when we migrate to containerd or CRI.
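For reference, a Docker logging driver is selected per container through the `LogConfig` field of `HostConfig` in the Docker Engine API. The sketch below only builds those dictionaries; the helper names and default values are assumptions for illustration, while the option keys (`max-size`, `max-file`, `fluentd-address`, `tag`) are those of Docker's `json-file` and `fluentd` drivers.

```python
def json_file_log_config(max_size: str = "10m", max_file: int = 3) -> dict:
    """LogConfig for the json-file driver with built-in size-based rotation."""
    return {
        "Type": "json-file",
        # Docker expects every option value to be a string.
        "Config": {"max-size": max_size, "max-file": str(max_file)},
    }


def fluentd_log_config(address: str, tag: str = "{{.Name}}") -> dict:
    """LogConfig that forwards container output to a Fluentd-compatible endpoint."""
    return {
        "Type": "fluentd",
        "Config": {"fluentd-address": address, "tag": tag},
    }


# Example: the value to place under HostConfig.LogConfig when creating a container.
host_config = {"LogConfig": fluentd_log_config("localhost:24224")}
```

These settings are specific to Docker Engine, so an equivalent mechanism would have to be found separately for containerd or CRI.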
Ideas:
Keep the startup logs (e.g., the first hour) of long-running containers even when the log stream is truncated due to the configured size/retention limits. This will ease troubleshooting and debugging of inference runtimes, because we can check how they were configured and started.
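The startup-log idea can be sketched as a buffer that pins everything received during an initial window and applies the retention limit only to later lines. This is a minimal in-memory illustration; `HeadTailLogBuffer` and its parameters are hypothetical, and a real implementation would also need to cap the size of the pinned head and persist both parts.

```python
import time
from collections import deque


class HeadTailLogBuffer:
    """Keep all lines from the startup window plus the most recent lines."""

    def __init__(self, head_seconds=3600.0, tail_max_lines=10000, clock=time.monotonic):
        self._clock = clock
        self._started_at = clock()
        self._head_seconds = head_seconds
        self._head = []                            # startup lines, never truncated
        self._tail = deque(maxlen=tail_max_lines)  # ring buffer of recent lines
        self._dropped = 0                          # lines evicted from the tail

    def append(self, line):
        if self._clock() - self._started_at <= self._head_seconds:
            self._head.append(line)
            return
        if len(self._tail) == self._tail.maxlen:
            self._dropped += 1  # the oldest tail line is about to be evicted
        self._tail.append(line)

    def lines(self):
        out = list(self._head)
        if self._dropped:
            out.append(f"... {self._dropped} lines truncated ...")
        out.extend(self._tail)
        return out
```

The `clock` parameter exists only so that the window can be controlled in tests; the explicit truncation marker keeps the gap between startup logs and recent logs visible to whoever reads them.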