Implement metrics API #1827

Merged 17 commits on Oct 15, 2024

@r4victor (Collaborator) commented Oct 14, 2024

Closes #1809

This PR:

  • Implements the runner metrics API. The runner collects CPU and memory metrics from the cgroups filesystem and NVIDIA GPU metrics via nvidia-smi.
  • Adds server background tasks that collect job metrics and delete expired ones. By default, metrics are collected every 10 seconds and retained for 1 hour.
  • Implements the server metrics API.
  • Implements the dstack stats command to view run metrics.
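
As an illustration of the GPU side of the collection described above, here is a minimal sketch. It assumes nvidia-smi is invoked with `--query-gpu=memory.used,utilization.gpu --format=csv,noheader,nounits`; the PR description does not show the runner's exact invocation, and the function names are hypothetical. With `nounits`, nvidia-smi reports memory in MiB, so the sketch converts to bytes to match the `*_bytes` metric names.

```python
import subprocess

MIB = 1024 * 1024

# Assumed query; the runner's actual nvidia-smi arguments may differ.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_metrics(csv_output: str) -> list[dict]:
    """Parse one CSV line per GPU, e.g. "22764, 0", into per-GPU metrics."""
    gpus = []
    for line in csv_output.strip().splitlines():
        memory_mib, util = (field.strip() for field in line.split(","))
        gpus.append(
            {
                "memory_usage_bytes": int(memory_mib) * MIB,
                "util_percent": int(util),
            }
        )
    return gpus


def collect_gpu_metrics() -> list[dict]:
    """Run nvidia-smi; return an empty list if it is missing or fails."""
    try:
        result = subprocess.run(
            NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_metrics(result.stdout)
```

Returning an empty list when nvidia-smi is unavailable keeps the no-GPU case on the same code path as the GPU case. The sketch does not handle fields that nvidia-smi reports as unavailable.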

Backward compatibility:

  • The server still works with old runners that don't provide the metrics API; it emits a warning when metrics collection fails for such a runner.
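
The fallback behaviour can be sketched roughly as follows. This is not the server's code: `collect_job_metrics` and the `fetch` callable are hypothetical stand-ins for whatever the server uses to query a runner. The point is only that a failed request is logged as a warning and skipped rather than raised.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def collect_job_metrics(job_name: str, fetch: Callable[[], dict]) -> Optional[dict]:
    """Return one metrics sample, or None if the runner cannot provide it.

    `fetch` is a hypothetical callable that queries the runner and raises
    on failure, e.g. when an old runner has no metrics endpoint.
    """
    try:
        return fetch()
    except Exception as e:
        # Old runners end up here; warn and carry on with other jobs.
        logger.warning("Failed to collect metrics for job %s: %s", job_name, e)
        return None
```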

Tested:

  • Metrics of a run on a cgroups v1 host.
  • Metrics of a run on a cgroups v2 host.
  • Metrics of a run with no GPUs.
  • Metrics of a run with one GPU.
  • Metrics of a run with multiple GPUs.
  • Metrics of a run executed by an old runner.
  • Multi-replica server metrics collection.
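
The cgroups v1 and v2 cases are tested separately because the two versions expose memory accounting under different file names. A minimal sketch of reading both follows. It assumes the caller passes the directory of the memory controller (v1) or the unified hierarchy (v2), and it computes the working set as usage minus inactive file cache, which is a common convention but not confirmed here as the runner's formula. The function names are hypothetical.

```python
from pathlib import Path


def _read_stat(path: Path) -> dict:
    """Parse a "key value" per line file such as memory.stat."""
    stats = {}
    for line in path.read_text().splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def read_memory_metrics(cgroup_dir: str) -> dict:
    """Read memory usage from a cgroup directory (v1 or v2 layout)."""
    root = Path(cgroup_dir)
    stat = _read_stat(root / "memory.stat")
    if (root / "memory.current").exists():
        # cgroups v2: usage in memory.current, cache under inactive_file.
        usage = int((root / "memory.current").read_text())
        inactive_file = stat.get("inactive_file", 0)
    else:
        # cgroups v1: usage in memory.usage_in_bytes,
        # hierarchical cache under total_inactive_file.
        usage = int((root / "memory.usage_in_bytes").read_text())
        inactive_file = stat.get("total_inactive_file", 0)
    return {
        "memory_usage_bytes": usage,
        # Assumed definition: usage minus reclaimable inactive file cache.
        "memory_working_set_bytes": max(usage - inactive_file, 0),
    }
```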

TODOs:

  • Support AMD GPU metrics.

dstack stats example:

✗ dstack stats hot-frog-1 -w
 NAME        CPU  MEMORY           GPU                        
 hot-frog-1  2%   15307MB/49152MB  #0 22764MB/24576MB 0% Util 
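
The MEMORY and GPU cells above could be produced from byte counts along these lines. This is a sketch under assumptions, not the CLI's code: it treats MB as 1024 * 1024 bytes (the totals 49152MB and 24576MB suggest this), and the helper names are hypothetical.

```python
MB = 1024 * 1024  # assumption: the CLI's "MB" is 1024 * 1024 bytes


def format_memory(used_bytes: int, total_bytes: int) -> str:
    """Render a used/total pair the way the MEMORY column shows it."""
    return f"{used_bytes // MB}MB/{total_bytes // MB}MB"


def format_gpu(index: int, used_bytes: int, total_bytes: int, util_percent: int) -> str:
    """Render one GPU entry: index, memory used/total, utilization."""
    return f"#{index} {format_memory(used_bytes, total_bytes)} {util_percent}% Util"
```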

Metrics API example: GET /api/project/{project_name}/metrics/job/{run_name}

{
	"metrics": [
		{
			"name": "cpu_usage_percent",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				147
			]
		},
		{
			"name": "memory_usage_bytes",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				652775424
			]
		},
		{
			"name": "memory_working_set_bytes",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				167317504
			]
		},
		{
			"name": "gpus_detected_num",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2
			]
		},
		{
			"name": "gpu_memory_usage_bytes_gpu0",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2097152
			]
		},
		{
			"name": "gpu_util_percent_gpu0",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				0
			]
		},
		{
			"name": "gpu_memory_usage_bytes_gpu1",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				2097152
			]
		},
		{
			"name": "gpu_util_percent_gpu1",
			"timestamps": [
				"2024-10-14T08:27:59.721935+00:00"
			],
			"values": [
				0
			]
		}
	]
}
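
A client consuming this response might reduce each series to its latest value and group the per-GPU series by index. The sketch below assumes only the shape shown in the example (parallel `timestamps` and `values` arrays, and a `_gpuN` suffix on per-GPU metric names). The helper names are hypothetical, and the ordering of samples is not assumed, so the latest point is picked by timestamp.

```python
import re
from datetime import datetime

# Matches per-GPU metric names such as "gpu_util_percent_gpu0".
GPU_METRIC_RE = re.compile(r"^(gpu_.+)_gpu(\d+)$")


def latest_values(response: dict) -> dict:
    """Map each metric name to the value with the most recent timestamp."""
    latest = {}
    for metric in response["metrics"]:
        if not metric["values"]:
            continue
        points = zip(
            (datetime.fromisoformat(t) for t in metric["timestamps"]),
            metric["values"],
        )
        _, value = max(points, key=lambda point: point[0])
        latest[metric["name"]] = value
    return latest


def group_by_gpu(latest: dict) -> dict:
    """Group per-GPU metrics by GPU index; other metrics are ignored."""
    gpus = {}
    for name, value in latest.items():
        match = GPU_METRIC_RE.match(name)
        if match:
            gpus.setdefault(int(match.group(2)), {})[match.group(1)] = value
    return gpus
```

Applied to the response above, `group_by_gpu(latest_values(response))` would yield one entry per GPU index, each holding `gpu_memory_usage_bytes` and `gpu_util_percent`.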

@r4victor r4victor marked this pull request as ready for review October 14, 2024 09:32
@r4victor r4victor requested a review from un-def October 14, 2024 09:32
@r4victor r4victor merged commit c4ddf7c into master Oct 15, 2024
23 checks passed
@r4victor r4victor deleted the issue_1809_metrics_api branch October 15, 2024 10:13
Development

Successfully merging this pull request may close these issues.

Add job metrics API