Use "requests" for CPU resources instead of limits #5748
Conversation
Force-pushed from d91cb60 to 6b4114f.
  openshift_logging_fluentd_journal_source: ""
  openshift_logging_fluentd_journal_read_from_head: ""
  openshift_logging_fluentd_hosts: ['--all']
- openshift_logging_fluentd_buffer_queue_limit: 1024
  openshift_logging_fluentd_buffer_size_limit: 1m
+ openshift_logging_fluentd_buffer_queue_limit: 32
Note that these two changes fix it so that the combined buffer_queue_limit/buffer_chunk_size is no more than 256 MB of memory, half of the default of 512 MB of memory on line 81. The prior value was 1 GB, which was too much.
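If someone does need the older, larger in-memory buffers, these defaults can still be overridden in the Ansible inventory. A minimal sketch, using only the variable names touched in this diff (the values are purely illustrative, not recommended defaults):

# group_vars override -- illustrative values only, not recommended defaults
openshift_logging_fluentd_buffer_queue_limit: 64   # number of buffer chunks fluentd may queue
openshift_logging_fluentd_buffer_size_limit: 1m    # per-chunk size (maps to fluentd's chunk limit per the comment above)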
@@ -55,4 +56,4 @@ openshift_logging_fluentd_aggregating_passphrase: none
  #fluentd_throttle_contents:
  #fluentd_secureforward_contents:

- openshift_logging_fluentd_file_buffer_limit: 1Gi
+ openshift_logging_fluentd_file_buffer_limit: 256Mi
Note that we keep the file buffer limit small, one half of the memory limit by default.
@@ -57,11 +58,11 @@ openshift_logging_mux_file_buffer_storage_type: "emptydir"
  openshift_logging_mux_file_buffer_pvc_name: "logging-mux-pvc"

  # required if the PVC does not already exist
- openshift_logging_mux_file_buffer_pvc_size: 4Gi
+ openshift_logging_mux_file_buffer_pvc_size: 512Mi
Lowering the PVC size to be on the same order as the lowered memory limit.
Probably not a big deal, but a PVC of this size will preclude using IOPS-provisioned EBS storage for file buffer storage: io1 storage has a minimum size of 4Gi. But users can change this in the Ansible vars if their default storage class is io1.
Update: The minimum gp2 volume size on EBS is 1Gi. Should we make this the default?
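For example, a cluster whose default storage class is io1 could keep a volume at the io1 minimum by overriding the PVC size in the inventory. An illustrative one-liner, using the variable touched in this diff:

openshift_logging_mux_file_buffer_pvc_size: 4Gi   # io1 minimum volume size noted above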
@mffiedler I know @portante arrived at these numbers based on the mismatch between disk space and the various fluentd queue and buffer settings. I think we need to figure out a way for users to set one value that is then used to derive the other settings, so that the resulting set is not mismatched.
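Purely as a sketch of that idea (nothing like this exists in the role today, and the derivation shown is an assumption), a single memory limit could drive the buffer-related defaults via Jinja2 expressions:

# Hypothetical derived defaults -- illustrative only
openshift_logging_fluentd_memory_limit: 512Mi
# file buffer defaults to half of the memory limit
openshift_logging_fluentd_file_buffer_limit: "{{ ((openshift_logging_fluentd_memory_limit | regex_replace('Mi$', '') | int) // 2) }}Mi"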
@mffiedler, the default PVC size of 1 Gi seems reasonable to have for gp2 storage class. I'll update the commit.
  openshift_logging_mux_buffer_size_limit: 1m
  openshift_logging_mux_cpu_limit: null
  openshift_logging_mux_cpu_request: 100m
  openshift_logging_mux_memory_limit: 512Mi
Note here that we are lowering the mux memory limit to something more reasonable as a default, 512 MB, and adjusting the buffer_queue_limit/buffer_chunk_size parameters to use half of that memory. It is not clear we needed 2 GB of memory as a default.
  openshift_logging_mux_file_buffer_pvc_dynamic: false
  openshift_logging_mux_file_buffer_pvc_pv_selector: {}
  openshift_logging_mux_file_buffer_pvc_access_modes: ['ReadWriteOnce']
  openshift_logging_mux_file_buffer_storage_group: '65534'

  openshift_logging_mux_file_buffer_pvc_prefix: "logging-mux"
- openshift_logging_mux_file_buffer_limit: 2Gi
+ openshift_logging_mux_file_buffer_limit: 256Mi
Note as well that we are lowering the file buffer limit to half of the new default memory limit.
Force-pushed from 6b4114f to 9259ff1.
/retest
@sdodson, yeah, this is not a test flake. I have to modify the …
Force-pushed from 9259ff1 to 1c6821c.
This patch is a required sibling to the openshift-ansible PR openshift/openshift-ansible#5748. Without this patch landing first, that PR will not pass its regression tests, due to a behavior of the test environment provided by this repo: it attempts to remove a CPU limit resource path, but fails when that path is not found. That failure is now always the case with the changes in PR openshift/openshift-ansible#5748.
/test upgrade
/assign jcantrill
Automatic merge from submit-queue. Fix references to cpu-limits and test environment. This patch is a required sibling to the PR openshift/openshift-ansible#5748. Without this patch landing first, that PR will not pass its regression tests, due to a behavior of the test environment provided by this repo: it attempts to remove a CPU limit resource path, but fails when that path is not found. That failure is now always the case with the changes in PR openshift/openshift-ansible#5748.
/test logging
/test logging
/lgtm
This patch is a required sibling to the openshift-ansible PR openshift/openshift-ansible#5748. Without this patch landing first, that PR will not pass its regression tests, due to a behavior of the test environment provided by this repo: it attempts to remove a CPU limit resource path, but fails when that path is not found. That failure is now always the case with the changes in PR openshift/openshift-ansible#5748. (cherry picked from commit 6b6fe9e)
/lgtm
Same problem as the docker events PR: fluentd could not start after 60 seconds. And this is the closest fluentd log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/5748/test_pull_request_openshift_ansible_logging/2235/artifacts/scripts/entrypoint/artifacts/logging-fluentd-g0mnx.log
which is after the test. So perhaps there was a problem deploying the fluentd daemonset? Or perhaps the 60-second timeout needs to be increased?
Force-pushed from 9b0caa0 to ac6d065.
@jcantrill, @sdodson, hmm, the requested GitHub CI test is not green? They all look green to me...
/lgtm cancel
/lgtm
/kind bug
We now use a CPU request to ensure logging infrastructure pods are not capped by default for CPU usage. It is still important to ensure we have a minimum amount of CPU. We keep the use of the *_cpu_limit variables so that the existing behavior is maintained. Note that we don't want to cap an infra pod's CPU usage by default, since we want it to be able to use the resources necessary to complete its tasks. Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
Force-pushed from ac6d065 to 578ac5b.
/lgtm
This is a backport for the release-3.6 branch of openshift#5748. We now use a CPU request to ensure logging infrastructure pods are not capped by default for CPU usage. It is still important to ensure we have a minimum amount of CPU. We keep the use of the *_cpu_limit variables so that the existing behavior is maintained. Note that we don't want to cap an infra pod's CPU usage by default, since we want it to be able to use the resources necessary to complete its tasks. Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
Logging deploys are broken now (https://bugzilla.redhat.com/show_bug.cgi?id=1504191). When I try to deploy using this branch I hit the same issue, which might be part of what is plaguing the tests for this PR. Suggest trying a manual install using this branch; I don't think it will work.
@sdodson, @jcantrill, @richm, this looks like another test failure from somewhere else in the system:
@mffiedler That BZ doesn't look related to what is flaking in this PR.
Conformance failure on install; upgrade results seem to be missing entirely. /retest
/test install
flake on openshift/origin#16929
Let's be sure to link flakes; this one has been happening all the time, and linking them is the best way to provide feedback on which flakes are highest priority.
/test install
/test all [submit-queue is verifying that this PR is safe to merge]
We now use a CPU request to ensure logging infrastructure pods are not capped by default for CPU usage. It is still important to ensure we have a minimum amount of CPU.
We keep the use of the *_cpu_limit variables so that the existing behavior is maintained. Note that we don't want to cap an infra pod's CPU usage by default, since we want it to be able to use the resources necessary to complete its tasks.
Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
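As a rough illustration of the resulting behavior (not a literal excerpt from the role's templates), a logging container deployed with only a CPU request and a memory limit ends up with a resources stanza along these lines, using the 100m/512Mi defaults from this PR:

# Kubernetes container spec (illustrative)
resources:
  requests:
    cpu: 100m        # scheduler guarantees this much CPU; usage is not capped
  limits:
    memory: 512Mi    # memory is still capped; omitting limits.cpu leaves CPU uncapped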