
Use "requests" for CPU resources instead of limits #5748

Merged
merged 1 commit into from
Oct 20, 2017

Conversation

portante
Contributor

@portante portante commented Oct 13, 2017

We now use a CPU request to ensure logging infrastructure pods are not capped by default for CPU usage. It is still important to ensure we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default, since we want it to be able to use the resources necessary to complete its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
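The distinction can be sketched as a Kubernetes pod resource spec (a hypothetical fragment for illustration, not this PR's actual template; the values are only examples):

```yaml
# Hypothetical pod spec fragment: keep a CPU request (scheduling
# guarantee / minimum) but set no CPU limit (no throttling cap).
resources:
  requests:
    cpu: 100m        # minimum CPU reserved for the infra pod
  limits:
    memory: 512Mi    # memory stays capped; omitting "cpu:" here
                     # lets the pod burst above its CPU request
```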

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2017
@portante portante force-pushed the fix-cpu-limits branch 2 times, most recently from d91cb60 to 6b4114f Compare October 13, 2017 17:15
 openshift_logging_fluentd_journal_source: ""
 openshift_logging_fluentd_journal_read_from_head: ""
 openshift_logging_fluentd_hosts: ['--all']
-openshift_logging_fluentd_buffer_queue_limit: 1024
+openshift_logging_fluentd_buffer_queue_limit: 32
 openshift_logging_fluentd_buffer_size_limit: 1m
Contributor Author

Note that these two changes ensure that the combined buffer_queue_limit/buffer_chunk_size footprint is no more than 256 MB of memory, half of the default 512 MB memory limit on line 81. The prior value was 1 GB, which was too much.
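As a rough sketch of the sizing (the per-output footprint formula is the usual fluentd approximation; the exact multiplier depends on how many buffered outputs are configured):

```yaml
# Approximate fluentd memory-buffer footprint:
#   buffer_queue_limit x buffer_chunk_limit (per buffered output)
# Before: 1024 chunks at 1m each -> about 1 GB, well above a 512Mi pod limit.
# After: the queue limit is cut to 32 so the combined buffers stay at or
# below 256 MB, half of the 512Mi default memory limit.
openshift_logging_fluentd_buffer_queue_limit: 32
openshift_logging_fluentd_buffer_size_limit: 1m
```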

@@ -55,4 +56,4 @@ openshift_logging_fluentd_aggregating_passphrase: none
#fluentd_throttle_contents:
#fluentd_secureforward_contents:

-openshift_logging_fluentd_file_buffer_limit: 1Gi
+openshift_logging_fluentd_file_buffer_limit: 256Mi
Contributor Author

Note that we keep the file buffer limit small, one half of the memory limit by default.

@@ -57,11 +58,11 @@ openshift_logging_mux_file_buffer_storage_type: "emptydir"
openshift_logging_mux_file_buffer_pvc_name: "logging-mux-pvc"

# required if the PVC does not already exist
-openshift_logging_mux_file_buffer_pvc_size: 4Gi
+openshift_logging_mux_file_buffer_pvc_size: 512Mi
Contributor Author

Lowering the PVC size to be on the same order as the lowered memory limit.

@mffiedler mffiedler Oct 16, 2017

Probably not a big deal, but this PVC size will preclude using IOPS-provisioned EBS storage for file buffer storage. io1 storage has a minimum size of 4Gi. But users can change this in the Ansible vars if their default storage class is io1.

Update: the minimum gp2 volume size on EBS is 1Gi. Should we make this the default?
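Operators whose default storage class is io1 could override the default in their inventory, e.g. (a sketch using the variable name from this diff; the value is only an example):

```yaml
# Hypothetical inventory override: EBS io1 volumes have a 4Gi minimum,
# so raise the PVC size back up when using an io1 storage class.
openshift_logging_mux_file_buffer_pvc_size: 4Gi
```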

Contributor

@mffiedler I know @portante arrived at these numbers based upon the mismatch between disk space and the various fluentd queue and buffer settings. I think we need a way for users to set one value that then drives the other related settings, so the resulting set is not mismatched.

Contributor Author

@mffiedler, the default PVC size of 1 Gi seems reasonable to have for gp2 storage class. I'll update the commit.

openshift_logging_mux_buffer_size_limit: 1m
openshift_logging_mux_cpu_limit: null
openshift_logging_mux_cpu_request: 100m
openshift_logging_mux_memory_limit: 512Mi
Contributor Author

Note here that we are lowering the mux memory limit to something more reasonable as a default, 512 MB, and adjusting the buffer_queue_limit/buffer_chunk_size parameters to use half of that memory. It is not clear we needed 2 GB of memory as a default.
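Because the *_cpu_limit variables are retained, the old capped behavior can still be opted into; a sketch (the limit value is hypothetical, the variable names come from this diff):

```yaml
# Sketch: re-enable a CPU cap on the mux pod while keeping the request.
openshift_logging_mux_cpu_request: 100m   # minimum guaranteed CPU
openshift_logging_mux_cpu_limit: 500m     # hypothetical cap; default is null (uncapped)
openshift_logging_mux_memory_limit: 512Mi
```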

openshift_logging_mux_file_buffer_pvc_dynamic: false
openshift_logging_mux_file_buffer_pvc_pv_selector: {}
openshift_logging_mux_file_buffer_pvc_access_modes: ['ReadWriteOnce']
openshift_logging_mux_file_buffer_storage_group: '65534'

openshift_logging_mux_file_buffer_pvc_prefix: "logging-mux"
-openshift_logging_mux_file_buffer_limit: 2Gi
+openshift_logging_mux_file_buffer_limit: 256Mi
Contributor Author

Note as well that we are lowering the file buffer limit to half of the new default memory limit.

@sdodson
Copy link
Member

sdodson commented Oct 13, 2017

/retest

@portante
Contributor Author

@sdodson, yeah, this is not a test flake. I have to modify the origin-aggregated-logging tests first before this will pass.

portante added a commit to portante/origin-aggregated-logging that referenced this pull request Oct 16, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.
@sdodson
Member

sdodson commented Oct 16, 2017

/test upgrade

@sdodson
Member

sdodson commented Oct 16, 2017

/assign jcantrill

openshift-merge-robot added a commit to openshift/origin-aggregated-logging that referenced this pull request Oct 16, 2017
Automatic merge from submit-queue.

Fix references to cpu-limits and test environment

This patch is a required sibling to the PR openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression tests, due to a behavior of the test environment provided by this repo where it attempts to remove a CPU limit resource path, but fails when it is not found.  That failure is now always the case with the changes in PR openshift/openshift-ansible#5748.
@jcantrill
Contributor

/test logging

@jcantrill
Contributor

/test logging
Not sure if this started before or after the dependent logging change

@jcantrill
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2017
jcantrill pushed a commit to jcantrill/origin-aggregated-logging that referenced this pull request Oct 16, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.

(cherry picked from commit 6b6fe9e)
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2017
@sdodson
Member

sdodson commented Oct 17, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2017
@richm
Contributor

richm commented Oct 17, 2017

Same problem as the docker events PR - fluentd could not start after 60 seconds
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/5748/test_pull_request_openshift_ansible_logging/2235/artifacts/scripts/entrypoint/artifacts/logging-fluentd-g0mnx.log
This doesn't say when it started, but the test failure occurred between [INFO] Logging test suite check-EFK-running started at Tue Oct 17 18:28:24 UTC 2017 and [WARNING] Logging test suite check-EFK-running failed at Tue Oct 17 18:29:35 UTC 2017

And this is the closest fluentd log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/5748/test_pull_request_openshift_ansible_logging/2235/artifacts/scripts/entrypoint/artifacts/logging-fluentd-g0mnx.log

2017-10-17 18:29:43 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"

which is after the test

So perhaps there was a problem deploying the fluentd daemonset? Perhaps the 60-second timeout needs to be increased?

@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2017
@portante
Contributor Author

@jcantrill, @sdodson, hmm, the requested GitHub CI test is not green? They all look green to me...

@sdodson
Member

sdodson commented Oct 19, 2017

/lgtm cancel

@sdodson
Member

sdodson commented Oct 19, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
@sdodson
Member

sdodson commented Oct 19, 2017

/kind bug

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 19, 2017
We now use a CPU request to ensure logging infrastructure pods are
not capped by default for CPU usage. It is still important to ensure
we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing
behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default,
since we want to be able to use the necessary resources to complete
its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
@jcantrill
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
portante added a commit to portante/openshift-ansible that referenced this pull request Oct 19, 2017
This is a backport for the release-3.6 branch of:
openshift#5748

We now use a CPU request to ensure logging infrastructure pods are
not capped by default for CPU usage. It is still important to ensure
we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing
behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default,
since we want to be able to use the necessary resources to complete
its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
@mffiedler

mffiedler commented Oct 19, 2017

Logging deploys are broken now (https://bugzilla.redhat.com/show_bug.cgi?id=1504191). When I try to deploy using this branch I hit the same issue. It might be part of what is plaguing the tests for this PR.

Suggest trying a manual install using this branch. Don't think it will work.

@portante
Contributor Author

@sdodson, @jcantrill, @richm, this looks like another test failure from somewhere else in the system:

Using project "default".
[INFO] [CLEANUP] Dumping container logs to _output/scripts/conformance/logs/containers
[INFO] [CLEANUP] Truncating log files over 200M
[INFO] [CLEANUP] Stopping docker containers
[INFO] [CLEANUP] Removing docker containers
Error response from daemon: You cannot remove a running container aba826a0362fd94ffe5fd949050dce1e64b517da5e2f6077d587845e28f35f7a. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 0142657409e6f40819aa487b325b4b5fb0a702983c0bf25b5d07280dbb043e60. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 310aa74c2e5ff6521914cbad91fbcc0d438d37c8e5245d28292bb407b858a47d. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container ba5a9ae59d4fa64ed851c974d1c4af73e0c31069271ae004ccc0c198a10edf30. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 2b4c8205defd31768c6440d3bedc0b438397e7afbe9449098b328286547fc443. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container ebaa6702219ee971d39c486f73178579dbc2b9982bcf365dc59d39867bdf2206. Stop the container before attempting removal or use -f
[INFO] [CLEANUP] Killing child processes
[INFO] [CLEANUP] Pruning etcd data directory
rm: cannot remove ‘/tmp/etcd/member’: Permission denied
[ERROR] test/extended/conformance.sh exited with code 1 after 00h 39m 17s
make: *** [test-extended] Error 1
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace

@ewolinetz
Contributor

@mffiedler That bz doesn't look to be related to what is flaking in this PR

@ewolinetz
Contributor

conformance failure on install, upgrade results seem to be missing entirely

/retest

@jcantrill
Contributor

/test install

@sdodson
Member

sdodson commented Oct 20, 2017

flake on openshift/origin#16929

@sdodson
Member

sdodson commented Oct 20, 2017

Let's be sure to link flakes; this one has been happening all the time, and linking them is the best way to provide feedback on which flakes are highest priority.

@jcantrill
Contributor

/test install

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@sdodson sdodson merged commit ac2af73 into openshift:master Oct 20, 2017
sqtran pushed a commit to sqtran/origin-aggregated-logging that referenced this pull request Nov 10, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.
Labels
affects_3.6 kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.