diff --git a/site/content/en/docs/adopters/index.md b/site/content/en/docs/adopters/index.md
index 95182ea2e8..9e79a379c4 100644
--- a/site/content/en/docs/adopters/index.md
+++ b/site/content/en/docs/adopters/index.md
@@ -16,16 +16,16 @@ If you are using Kueue, feel free to open a pull request to add your organizatio
 
 ## Adopters
 
-| Organization | Type | Description | Integrations | Contact |
-|:----------------------------------------------------:|:--------:|:----------------------:|:----------------------------------:|:----------------------------------------:|
-| [CyberAgent, Inc.](https://www.cyberagent.co.jp/en/) | End User | On-premise ML Platform | batch/job<br>kubeflow.org/mpijob | [@tenzen-y](https://github.com/tenzen-y) |
-| [DaoCloud, Inc.](https://www.daocloud.io/en/) | End User | Part of the AI Platform for managing all kinds of Jobs. | batch/job<br>RayJob<br>... | [@kerthcet](https://github.com/kerthcet) |
-| [WattIQ, Inc.](https://wattiq.io) | End User | SaaS/IoT product | batch/job<br>RayJob<br> | [@madsenwattiq](https://github.com/madsenwattiq) |
-| [Horizon, Inc.](https://horizon.cc/) | End User | AI training platform | batch/job<br>... | [@GhangZh](https://github.com/GhangZh) |
-| [FAR AI](https://far.ai/) | End User | AI alignment research nonprofit | batch/job | [@rhaps0dy](https://github.com/rhaps0dy) |
-| [Shopee, Inc.](https://shopee.com/) | End User | Training/batch inference/data processes in AI platform test env | Customized job<br>RayJob<br>... | [@denkensk](https://github.com/denkensk) |
-| [Mondoo, Inc.](https://mondoo.com) | End User | Helps power Mondoo's hosted security scanner | batch/job | [@jaym](https://github.com/jaym) |
-| [Google Cloud](https://cloud.google.com/) | Provider | Part of [kit for training ML workloads on TPUs][gcmldemo] | JobSet | [@mrozacki](https://github.com/mrozacki) |
-| [Onna Technologies, Inc](https://onna.com) | End User | Unstructured Data Management Platform | batch/job<br> | [@gitcarbs](https://github.com/gitcarbs) |
+| Organization | Type | Description | Integrations | Contact |
+|:-------------------------------------------------------:|:--------:|:---------------------------------------------------------------:|:-------------------------------------:|:------------------------------------------------:|
+| [CyberAgent, Inc.](https://www.cyberagent.co.jp/en/) | End User | On-premise ML Platform | batch/job<br>kubeflow.org/mpijob | [@tenzen-y](https://github.com/tenzen-y) |
+| [DaoCloud, Inc.](https://www.daocloud.io/en/) | End User | Part of the AI Platform for managing all kinds of Jobs. | batch/job<br>RayJob<br>... | [@kerthcet](https://github.com/kerthcet) |
+| [WattIQ, Inc.](https://wattiq.io) | End User | SaaS/IoT product | batch/job<br>RayJob<br> | [@madsenwattiq](https://github.com/madsenwattiq) |
+| [Horizon, Inc.](https://horizon.cc/) | End User | AI training platform | batch/job<br>... | [@GhangZh](https://github.com/GhangZh) |
+| [FAR AI](https://far.ai/) | End User | AI alignment research nonprofit | batch/job | [@rhaps0dy](https://github.com/rhaps0dy) |
+| [Shopee, Inc.](https://shopee.com/) | End User | Training/batch inference/data processes in AI platform test env | Customized job<br>RayJob<br>... | [@denkensk](https://github.com/denkensk) |
+| [Mondoo, Inc.](https://mondoo.com) | End User | Helps power Mondoo's hosted security scanner | batch/job | [@jaym](https://github.com/jaym) |
+| [Google Cloud](https://cloud.google.com/) | Provider | Part of [kit for training ML workloads on TPUs][gcmldemo] | JobSet | [@mrozacki](https://github.com/mrozacki) |
+| [Onna Technologies, Inc](https://onna.com) | End User | Unstructured Data Management Platform | batch/job<br> | [@gitcarbs](https://github.com/gitcarbs) |
 
 [gcmldemo]: https://cloud.google.com/blog/products/compute/the-worlds-largest-distributed-llm-training-job-on-tpu-v5e
diff --git a/site/content/en/docs/installation/_index.md b/site/content/en/docs/installation/_index.md
index e4600e880a..1dd553989d 100644
--- a/site/content/en/docs/installation/_index.md
+++ b/site/content/en/docs/installation/_index.md
@@ -243,22 +243,22 @@ spec:
 
 The currently supported features are:
 
-| Feature | Default | Stage | Since | Until |
-|---------|---------|-------|-------|-------|
-| `FlavorFungibility` | `true` | Beta | 0.5 | |
-| `MultiKueue` | `false` | Alpha | 0.6 | |
-| `MultiKueueBatchJobWithManagedBy` | `false` | Alpha | 0.8 | |
-| `PartialAdmission` | `false` | Alpha | 0.4 | 0.4 |
-| `PartialAdmission` | `true` | Beta | 0.5 | |
-| `ProvisioningACC` | `false` | Alpha | 0.5 | 0.6 |
-| `ProvisioningACC` | `true` | Beta | 0.7 | |
-| `QueueVisibility` | `false` | Alpha | 0.5 | |
-| `VisibilityOnDemand` | `false` | Alpha | 0.6 | |
-| `PrioritySortingWithinCohort` | `true` | Beta | 0.6 | |
-| `LendingLimit` | `false` | Alpha | 0.6 | 0.8 |
-| `LendingLimit` | `true` | Beta | 0.9 | |
-| `MultiplePreemptions` | `false` | Alpha | 0.8 | 0.8 |
-| `MultiplePreemptions` | `true` | Beta | 0.9 | |
+| Feature | Default | Stage | Since | Until |
+|-----------------------------------|---------|-------|-------|-------|
+| `FlavorFungibility` | `true` | Beta | 0.5 | |
+| `MultiKueue` | `false` | Alpha | 0.6 | |
+| `MultiKueueBatchJobWithManagedBy` | `false` | Alpha | 0.8 | |
+| `PartialAdmission` | `false` | Alpha | 0.4 | 0.4 |
+| `PartialAdmission` | `true` | Beta | 0.5 | |
+| `ProvisioningACC` | `false` | Alpha | 0.5 | 0.6 |
+| `ProvisioningACC` | `true` | Beta | 0.7 | |
+| `QueueVisibility` | `false` | Alpha | 0.5 | |
+| `VisibilityOnDemand` | `false` | Alpha | 0.6 | |
+| `PrioritySortingWithinCohort` | `true` | Beta | 0.6 | |
+| `LendingLimit` | `false` | Alpha | 0.6 | 0.8 |
+| `LendingLimit` | `true` | Beta | 0.9 | |
+| `MultiplePreemptions` | `false` | Alpha | 0.8 | 0.8 |
+| `MultiplePreemptions` | `true` | Beta | 0.9 | |
 
 ## What's next
 
diff --git a/site/content/en/docs/reference/metrics.md b/site/content/en/docs/reference/metrics.md
index a2408de0e2..58102f4615 100644
--- a/site/content/en/docs/reference/metrics.md
+++ b/site/content/en/docs/reference/metrics.md
@@ -13,34 +13,34 @@ of the system and the status of [ClusterQueues](/docs/concepts/cluster_queue).
 
 Use the following metrics to monitor the health of the kueue controllers:
 
-| Metric name | Type | Description | Labels |
-| ----------- | ---- | ----------- | ------ |
-| `kueue_admission_attempts_total` | Counter | The total number of attempts to [admit](/docs/concepts#admission) workloads. Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` |
-| `kueue_admission_attempt_duration_seconds` | Histogram | The latency of an admission attempt. | `result`: possible values are `success` or `inadmissible` |
+| Metric name | Type | Description | Labels |
+|--------------------------------------------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|
+| `kueue_admission_attempts_total` | Counter | The total number of attempts to [admit](/docs/concepts#admission) workloads. Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` |
+| `kueue_admission_attempt_duration_seconds` | Histogram | The latency of an admission attempt. | `result`: possible values are `success` or `inadmissible` |
 
 ## ClusterQueue status
 
 Use the following metrics to monitor the status of your ClusterQueues:
 
-| Metric name | Type | Description | Labels |
-| ----------- | ---- | ----------- | ------ |
-| `kueue_pending_workloads` | Gauge | The number of pending workloads. | `cluster_queue`: the name of the ClusterQueue<br>`status`: possible values are `active` or `inadmissible` |
-| `kueue_quota_reserved_workloads_total` | Counter | The total number of quota reserved workloads. | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_quota_reserved_wait_time_seconds` | Histogram | The time between a workload was created or requeued until it got quota reservation. | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_admitted_workloads_total` | Counter | The total number of admitted workloads. | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_evicted_workloads_total` | Counter | The total number of evicted workloads. | `cluster_queue`: the name of the ClusterQueue<br>`reason`: Possible values are `Preempted`, `PodsReadyTimeout`, `AdmissionCheck`, `ClusterQueueStopped` or `InactiveWorkload` |
-| `kueue_admission_wait_time_seconds` | Histogram | The time between a workload was created or requeued until admission. | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission. | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished) | `cluster_queue`: the name of the ClusterQueue |
-| `kueue_cluster_queue_status` | Gauge | Reports the status of the ClusterQueue | `cluster_queue`: The name of the ClusterQueue<br>`status`: Possible values are `pending`, `active` or `terminated`. For a ClusterQueue, the metric only reports a value of 1 for one of the statuses. |
+| Metric name | Type | Description | Labels |
+|--------------------------------------------|-----------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `kueue_pending_workloads` | Gauge | The number of pending workloads. | `cluster_queue`: the name of the ClusterQueue<br>`status`: possible values are `active` or `inadmissible` |
+| `kueue_quota_reserved_workloads_total` | Counter | The total number of quota reserved workloads. | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_quota_reserved_wait_time_seconds` | Histogram | The time between a workload was created or requeued until it got quota reservation. | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_admitted_workloads_total` | Counter | The total number of admitted workloads. | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_evicted_workloads_total` | Counter | The total number of evicted workloads. | `cluster_queue`: the name of the ClusterQueue<br>`reason`: Possible values are `Preempted`, `PodsReadyTimeout`, `AdmissionCheck`, `ClusterQueueStopped` or `InactiveWorkload` |
+| `kueue_admission_wait_time_seconds` | Histogram | The time between a workload was created or requeued until admission. | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission. | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished) | `cluster_queue`: the name of the ClusterQueue |
+| `kueue_cluster_queue_status` | Gauge | Reports the status of the ClusterQueue | `cluster_queue`: The name of the ClusterQueue<br>`status`: Possible values are `pending`, `active` or `terminated`. For a ClusterQueue, the metric only reports a value of 1 for one of the statuses. |
 
 ### Optional metrics
 
 The following metrics are available only if `metrics.enableClusterQueueResources` is enabled in the [manager's configuration](/docs/installation/#install-a-custom-configured-released-version).
 
-| Metric name | Type | Description | Labels |
-| ----------- | ---- | ----------- | ------ |
-| `kueue_cluster_queue_resource_usage` | Gauge | Reports the ClusterQueue's total resource usage |`cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name|
-| `kueue_cluster_queue_nominal_quota` | Gauge | Reports the ClusterQueue's resource quota |`cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name|
-| `kueue_cluster_queue_borrowing_limit` | Gauge | Reports the ClusterQueue's resource borrowing limit |`cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name|
-| `kueue_cluster_queue_weighted_share` | Gauge | Reports a value that representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. |`cluster_queue`: The name of the ClusterQueue|
+| Metric name | Type | Description | Labels |
+|---------------------------------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `kueue_cluster_queue_resource_usage` | Gauge | Reports the ClusterQueue's total resource usage | `cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name |
+| `kueue_cluster_queue_nominal_quota` | Gauge | Reports the ClusterQueue's resource quota | `cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name |
+| `kueue_cluster_queue_borrowing_limit` | Gauge | Reports the ClusterQueue's resource borrowing limit | `cohort`: The cohort in which the queue belongs<br>`cluster_queue`: The name of the ClusterQueue<br>`flavor`: referenced flavor<br>`resource`: The resource name |
+| `kueue_cluster_queue_weighted_share` | Gauge | Reports a value that representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | `cluster_queue`: The name of the ClusterQueue |