Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CPU affinity to executed processes #1253

Merged
merged 2 commits into from
Jun 25, 2024

Conversation

kolyshkin
Copy link
Contributor

This allows to set initial and final CPU affinity for a process being run in a container, which is needed to solve the issue described in opencontainers/runc#3922.

This is an alternative to #1252.

config.md Outdated Show resolved Hide resolved
@kolyshkin kolyshkin force-pushed the exec-aff branch 2 times, most recently from 8179d10 to cb918d2 Compare May 19, 2024 22:27
config.md Outdated Show resolved Hide resolved
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

utam0k

This comment was marked as duplicate.

@kolyshkin
Copy link
Contributor Author

@opencontainers/runtime-spec-maintainers PTAL (I'm going to merge this this week if there are no other comments)

Copy link

@bartwensley bartwensley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

config.md Outdated
on after the transition to container's cgroup. The format is either the
same as for `initial`, or a special value `cgroup` which means container's
CPU affinity, as defined by [cpu.cpus property](./config.md#configLinuxCPUs)).
If omitted or empty, the `initial` setting is kept.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it be the opposite? I mean "if ommitted, cpu.cpus will be in effect"?
The initial seems to allow to set affinity to cpu cores that are outside of the set which is specified by cpu.cpus.
This also trigger the question of validation of the values between cpuAffinity and cpu.cpus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two questions here.

  1. What is the default if final is not set. You are probably right that cpu.cpus makes more sense. UPDATED.

  2. Whether to allow affinity that's outside of container's cpu.cpus. While it seems theoretically possible for the initial cpuset, I haven't checked that it works, and I don't want to add any specific constraints to the spec (or to runc/crun).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL @kad

@kolyshkin
Copy link
Contributor Author

@cclerget PTAL

Copy link
Member

@utam0k utam0k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It's interesting.

Copy link

@cclerget cclerget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kad
Copy link
Contributor

kad commented Jun 4, 2024

@kolyshkin please, give us some time, we would like to do few experiments with @klihub, as we also observed corner cases scenarios with DPDK-style apps in the past, but we need some time to replicate that environment and see behaviour in case of invalid arguments between requests via cgroups or setaffinity.

Regarding PR in general:

  1. naming of the fields: initial and final: would it make sense to have more consistent names with other parts of the spec that corresponds to same lifecycle stages of the container? e.g. normalize names with hook names: createRuntime/createContainer (or startContainer)? This also helps spec consumers to understand at which stage this CPU affinity is supposed to be applied in container lifecycle.
  2. initial seems to be mandatory argument right now, even were it is applied before entering container namespace. We seen in the past that bigger problem for execution inside container namespace, not before. I think it would make more sense if both initial/final will be optional. WDYT?
  3. Overall error handling with setaffinity(): it might be good to add clause in the description that incorrect setting of those parameters would lead to container creation failure? because if spec requests some value in those fields, but corresponding function call fails (invalid arguments/permissions/etc), we can't silently ignore such errors.

@kolyshkin
Copy link
Contributor Author

@kad thanks for reviewing this!

  1. naming of the fields: initial and final: would it make sense to have more consistent names with other parts of the spec that corresponds to same lifecycle stages of the container? e.g. normalize names with hook names: createRuntime/createContainer (or startContainer)? This also helps spec consumers to understand at which stage this CPU affinity is supposed to be applied in container lifecycle.

I think you are misunderstanding. These two are for exec, not create/start/run. The problem we're trying to solve with these is laid out at opencontainers/runc#3922.

  1. initial seems to be mandatory argument right now, even were it is applied before entering container namespace. We seen in the past that bigger problem for execution inside container namespace, not before. I think it would make more sense if both initial/final will be optional. WDYT?

See, they (initial and final) are both part of cpuAffinity object, which itself is optional. By saying that initial is mandatory we're just saying that if cpuAffinity is specified, it should not be empty.

  1. Overall error handling with setaffinity(): it might be good to add clause in the description that incorrect setting of those parameters would lead to container creation failure? because if spec requests some value in those fields, but corresponding function call fails (invalid arguments/permissions/etc), we can't silently ignore such errors.

Again, this is about exec not create/start/run. Unfortunately, this distinction (between starting container's PID 1 and executing an arbitrary process in a running container) is not laid out very well in the runtime-spec.

I think it should be up to implementation what to do it case of errors. IOW the spec should not be way too rigid. Maybe some implementation could have a flag saying "ignore this" or "ignore error from that" or "if this won't work, fall back to that" etc. If the spec is too rigid, such hacks render an implementation incompatible, which I guess is not good.

@kolyshkin
Copy link
Contributor Author

I guess I am merging this later today. @kad if you have something, please open a followup PR.

@kolyshkin
Copy link
Contributor Author

We also need to add a flag into runtime features.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Jun 11, 2024

We also need to add a flag into runtime features.

I couldn't find a good place to add such a flag, so maybe the upper-level runtime (cri-o, containerd) can check the features.ociVersionMax field instead? WDYT @haircommander

@haircommander
Copy link

if the feature piece isn't sanctified in the runtime spec that's fine, as long as runtimes that support it agree on the name of the feature

@kad
Copy link
Contributor

kad commented Jun 11, 2024

sorry for delay, missed notification from previous reply :(

@kad thanks for reviewing this!

  1. naming of the fields: initial and final: would it make sense to have more consistent names with other parts of the spec that corresponds to same lifecycle stages of the container? e.g. normalize names with hook names: createRuntime/createContainer (or startContainer)? This also helps spec consumers to understand at which stage this CPU affinity is supposed to be applied in container lifecycle.

I think you are misunderstanding. These two are for exec, not create/start/run. The problem we're trying to solve with these is laid out at opencontainers/runc#3922.

well, then it is even worse with naming, I'd say. Description says "specifies CPU affinity used to execute the process." but doesn't mention that it is not applicable for normal container entry point process.

Furthermore, the "exec" is not really part of the runtime spec since 2016 or so. Yes, runtimes are implementing it, and it can be used, but it is not covered by spec officially. So, any fields in the spec is originally referring to standard operations create/start/stop/status/...

if we want this field only be respected on exec, we have two choices: a) explicitly call it with exec prefix in name or something similar, b) having field still named more generically as now, but inside initial/final rename to something more closely named about exec.

  1. initial seems to be mandatory argument right now, even were it is applied before entering container namespace. We seen in the past that bigger problem for execution inside container namespace, not before. I think it would make more sense if both initial/final will be optional. WDYT?

See, they (initial and final) are both part of cpuAffinity object, which itself is optional. By saying that initial is mandatory we're just saying that if cpuAffinity is specified, it should not be empty.

it is not really a problem to have empty struct, effect should be the same as it is not specified at all.

One thing which I'm asking about 'initial' be optional, is because sched_cpuaffinity is reset once process is attached to cgroup with cpuset. So, practically initial makes sense only if you want to execute something outside container, but once you're entering container namespace, the sched_cpuaffinity() needed to be called again.

So, my line of thinking is like this:

  • if cpuAffinity struct not set at all, don't call sched_setaffinity at all.
  • if set:
    • if initial is set -> set in createRuntime stage. fail on error.
    • if initial not set or empty -> don't call sched_setaffinity()
    • if final is set -> set in createContainer, fail on error. (this will validate disjoint sets with cpu.cpus).
    • if final not set or empty, default to cgroup cpu.cpus, but only if inital was set (probably not needed, as kernel will reset it anyway to match cgroup)

with both of those variants being optional, it would allow to do affinity only for "infrastructure" operations before entering into container, only after entering container, or both.

And if we do it linked with other container lifecycle points like createRuntime, createRuntime, etc... it will give more flexibility to set affinity for all scenarios.

  1. Overall error handling with setaffinity(): it might be good to add clause in the description that incorrect setting of those parameters would lead to container creation failure? because if spec requests some value in those fields, but corresponding function call fails (invalid arguments/permissions/etc), we can't silently ignore such errors.

Again, this is about exec not create/start/run. Unfortunately, this distinction (between starting container's PID 1 and executing an arbitrary process in a running container) is not laid out very well in the runtime-spec.

well, exec was part of the spec at some point, but then it was removed from it. So, as I wrote above, fields present in spec are assumed to be used at all container lifecycle points, unless really explicitly called out.

I think it should be up to implementation what to do it case of errors. IOW the spec should not be way too rigid. Maybe some implementation could have a flag saying "ignore this" or "ignore error from that" or "if this won't work, fall back to that" etc. If the spec is too rigid, such hacks render an implementation incompatible, which I guess is not good.

Regarding implementation: if sched_cpuaffinity() is called with disjoint set of cores compared to cpuset.cpus, it would return EINVAL. for the rest, it is AND masked with cpuset.cpus.... So, in worst case scenarios the requested affinity might not be in place at all (if error is ignored) or failure (if error not ignored), to "best" case where affinity would be set to subset from what was actually requested....

Also, s/in/on/g.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Contributor Author

@kad thanks, changed like this:

diff --git a/config.md b/config.md
index eab6eec..b9b5573 100644
--- a/config.md
+++ b/config.md
@@ -340,9 +340,10 @@ For Linux-based systems, the `process` object supports the following process-spe
 
     * **`class`** (string, REQUIRED) specifies the I/O scheduling class. Possible values are `IOPRIO_CLASS_RT`, `IOPRIO_CLASS_BE`, and `IOPRIO_CLASS_IDLE`.
     * **`priority`** (int, REQUIRED) specifies the priority level within the class. The value should be an integer ranging from 0 (highest) to 7 (lowest).
-* **`cpuAffinity`** (object, OPTIONAL) specifies CPU affinity used to execute the process.
+* **`execCPUAffinity`** (object, OPTIONAL) specifies CPU affinity used to execute the process.
+    This setting is not applicable to the container's init process.
     The following properties are available:

-    * **`initial`** (string, REQUIRED) is a list of CPUs a runtime parent
+    * **`initial`** (string, OPTIONAL) is a list of CPUs a runtime parent
       process to be run on initially, before the transition to container's
       cgroup. This is a a comma-separated list, with dashes to represent
       ranges. For example, `0-3,7` represents CPUs 0,1,2,3, and 7.
@@ -427,7 +427,7 @@ _Note: symbolic name for uid and gid, such as uname and gname respectively, are
             "soft": 1024
         }
     ],
-    "cpuAffinity": {
+    "execCPUAffinity": {
         "initial": "7",
         "final": "0-3,7"
     }
diff --git a/schema/config-schema.json b/schema/config-schema.json
index 39cc71e..cb74342 100644
--- a/schema/config-schema.json
+++ b/schema/config-schema.json
@@ -221,11 +221,8 @@
                         }
                     }
                 },
-                "cpuAffinity": {
+                "execCPUAffinity": {
                     "type": "object",
-                    "required": [
-                        "initial"
-                    ],
                     "properties": {
                         "initial": {
                             "type": "string",
diff --git a/specs-go/config.go b/specs-go/config.go
index 00bf946..290b6dc 100644
--- a/specs-go/config.go
+++ b/specs-go/config.go
@@ -94,8 +94,8 @@ type Process struct {
        SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
        // IOPriority contains the I/O priority settings for the cgroup.
        IOPriority *LinuxIOPriority `json:"ioPriority,omitempty" platform:"linux"`
-       // CPUAffinity specifies CPU affinity for the process.
-       CPUAffinity *CPUAffinity `json:"cpuAffinity,omitempty" platform:"linux"`
+       // ExecCPUAffinity specifies CPU affinity for exec processes.
+       ExecCPUAffinity *CPUAffinity `json:"execCPUAffinity,omitempty" platform:"linux"`
 }
 
 // LinuxCapabilities specifies the list of allowed capabilities that are kept for a process.

Copy link
Contributor

@kad kad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for update. looks better now. I'm still not convinced about names initial and final, maybe something more descriptive can be done like execRuntime / execContainer for consistency with hooks?
But for that I'll let others to chime in.

specs-go/config.go Outdated Show resolved Hide resolved
* **`final`** (string, OPTIONAL) is a list of CPUs the process will be run
on after the transition to container's cgroup. The format is the same as
for `initial`. If omitted or empty, the container's default CPU affinity,
as defined by [cpu.cpus property](./config.md#configLinuxCPUs)), is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the tests, if you don't call sched_setaffinity(), the kernel itself will reset it to cgroups cpuset.cpu, so, maybe here statement can be simplified, that if not specified, after transition into container namespace, default affinity depends on kernel implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is nicer to have a reference to another way of setting cpu affinity here. If removed, it still says the same thing:

       on after the transition to container's cgroup. The format is the same as
-      for `initial`. If omitted or empty, the container's default CPU affinity,
-      as defined by [cpu.cpus property](./config.md#configLinuxCPUs)), is used.
+      for `initial`. If omitted or empty, the container's default CPU affinity
+      applies.
 

so I think it's better to keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, ok for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the tests, if you don't call sched_setaffinity(), the kernel itself will reset it to cgroups cpuset.cpu

@kad It seems your tests are wrong. Mine are showing this only happens when the initial (pre cgroup-move) affinity is not a subset of cgroup's affinity.

So maybe we need to change the spec to say something like

If omitted or empty, no attempt to change the CPU affinity of the process after moving it to container's cgroup is made, and its final affinity is determined by the Linux kernel.

I'd rather not go into more details here, because otherwise I'll be documenting the kernel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my tests, it is not the case. (-p is "pause" where I'm attaching in another terminal process to cgroup with cpuset.cpus=0-10):

r640-2:/home/akanevsk # ~akanevsk/getaffinity -g -s "0-5" -g -p -g
get_affinity: 0-87
set_affinity: affinity set to 0-5
get_affinity: 0-5
get_affinity: 0-10
r640-2:/home/akanevsk # ~akanevsk/getaffinity -g -s "50-60" -g -p -g
get_affinity: 0-87
set_affinity: affinity set to 50-60
get_affinity: 50-60
get_affinity: 0-10

This allows to set initial and final CPU affinity for a process being
run in a container, which is needed to solve the issue described in [1].

[1] opencontainers/runc#3922

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin merged commit 7017384 into opencontainers:main Jun 25, 2024
3 checks passed
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Jul 17, 2024
As per
 - opencontainers/runtime-spec#1253
 - opencontainers/runtime-spec#1261

Add some tests (alas it's impossible to test initial CPU affinity
without adding debug logging).

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this pull request Aug 16, 2024
As per
 - opencontainers/runtime-spec#1253
 - opencontainers/runtime-spec#1261

Add some tests (alas it's impossible to test initial CPU affinity
without adding debug logging).

Signed-off-by: Kir Kolyshkin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants