
Use "requests" for CPU resources instead of limits #5748

Merged
merged 1 commit into from
Oct 20, 2017

Conversation

portante
Contributor

@portante portante commented Oct 13, 2017

We now use a CPU request to ensure logging infrastructure pods are not capped by default for CPU usage. It is still important to ensure we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default, since we want it to be able to use the resources necessary to complete its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
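The distinction can be sketched as a Kubernetes pod resource spec (a hypothetical fragment for illustration, not this PR's actual template; the values are only examples):

```yaml
# Hypothetical pod spec fragment: keep a CPU request (scheduling
# guarantee / minimum) but set no CPU limit (no throttling cap).
resources:
  requests:
    cpu: 100m        # minimum CPU reserved for the infra pod
  limits:
    memory: 512Mi    # memory stays capped; omitting "cpu:" here
                     # lets the pod burst above its CPU request
```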

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 13, 2017
@portante portante force-pushed the fix-cpu-limits branch 2 times, most recently from d91cb60 to 6b4114f Compare October 13, 2017 17:15
 openshift_logging_fluentd_journal_source: ""
 openshift_logging_fluentd_journal_read_from_head: ""
 openshift_logging_fluentd_hosts: ['--all']
-openshift_logging_fluentd_buffer_queue_limit: 1024
+openshift_logging_fluentd_buffer_queue_limit: 32
 openshift_logging_fluentd_buffer_size_limit: 1m
Contributor Author

Note that these two changes ensure that the combined buffer_queue_limit/buffer_chunk_size footprint is no more than 256 MB of memory, half of the default 512 MB memory limit on line 81. The prior value was 1 GB, which was too much.
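As a rough sketch of the sizing (the per-output footprint formula is the usual fluentd approximation; the exact multiplier depends on how many buffered outputs are configured):

```yaml
# Approximate fluentd memory-buffer footprint:
#   buffer_queue_limit x buffer_chunk_limit (per buffered output)
# Before: 1024 chunks at 1m each -> about 1 GB, well above a 512Mi pod limit.
# After: the queue limit is cut to 32 so the combined buffers stay at or
# below 256 MB, half of the 512Mi default memory limit.
openshift_logging_fluentd_buffer_queue_limit: 32
openshift_logging_fluentd_buffer_size_limit: 1m
```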

@@ -55,4 +56,4 @@ openshift_logging_fluentd_aggregating_passphrase: none
#fluentd_throttle_contents:
#fluentd_secureforward_contents:

-openshift_logging_fluentd_file_buffer_limit: 1Gi
+openshift_logging_fluentd_file_buffer_limit: 256Mi
Contributor Author

Note that we keep the file buffer limit small, one half of the memory limit by default.

@@ -57,11 +58,11 @@ openshift_logging_mux_file_buffer_storage_type: "emptydir"
openshift_logging_mux_file_buffer_pvc_name: "logging-mux-pvc"

# required if the PVC does not already exist
-openshift_logging_mux_file_buffer_pvc_size: 4Gi
+openshift_logging_mux_file_buffer_pvc_size: 512Mi
Contributor Author

Lowering the PVC size to be on the same order as the lowered memory limit.

@mffiedler mffiedler Oct 16, 2017

Probably not a big deal, but this PVC size will preclude using IOPS-provisioned EBS storage for file buffer storage. io1 storage has a minimum size of 4Gi. But users can change this in the Ansible vars if their default storage class is io1.

Update: the minimum gp2 volume size on EBS is 1Gi. Should we make this the default?
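Operators whose default storage class is io1 could override the default in their inventory, e.g. (a sketch using the variable name from this diff; the value is only an example):

```yaml
# Hypothetical inventory override: EBS io1 volumes have a 4Gi minimum,
# so raise the PVC size back up when using an io1 storage class.
openshift_logging_mux_file_buffer_pvc_size: 4Gi
```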

Contributor

@mffiedler I know @portante arrived at these numbers based upon the mismatch between disk space and the various fluentd queue and buffer settings. I think we need a way for users to set one value that then drives the other related settings, so the resulting set is not mismatched.

Contributor Author

@mffiedler, the default PVC size of 1 Gi seems reasonable to have for gp2 storage class. I'll update the commit.

openshift_logging_mux_buffer_size_limit: 1m
openshift_logging_mux_cpu_limit: null
openshift_logging_mux_cpu_request: 100m
openshift_logging_mux_memory_limit: 512Mi
Contributor Author

Note here that we are lowering the mux memory limit to something more reasonable as a default, 512 MB, and adjusting the buffer_queue_limit/buffer_chunk_size parameters to use half of that memory. It is not clear we needed 2 GB of memory as a default.
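Because the *_cpu_limit variables are retained, the old capped behavior can still be opted into; a sketch (the limit value is hypothetical, the variable names come from this diff):

```yaml
# Sketch: re-enable a CPU cap on the mux pod while keeping the request.
openshift_logging_mux_cpu_request: 100m   # minimum guaranteed CPU
openshift_logging_mux_cpu_limit: 500m     # hypothetical cap; default is null (uncapped)
openshift_logging_mux_memory_limit: 512Mi
```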

openshift_logging_mux_file_buffer_pvc_dynamic: false
openshift_logging_mux_file_buffer_pvc_pv_selector: {}
openshift_logging_mux_file_buffer_pvc_access_modes: ['ReadWriteOnce']
openshift_logging_mux_file_buffer_storage_group: '65534'

openshift_logging_mux_file_buffer_pvc_prefix: "logging-mux"
-openshift_logging_mux_file_buffer_limit: 2Gi
+openshift_logging_mux_file_buffer_limit: 256Mi
Contributor Author

Note as well that we are lowering the file buffer limit to half of the new default memory limit.

@sdodson
Copy link
Member

sdodson commented Oct 13, 2017

/retest

@portante
Contributor Author

@sdodson, yeah, this is not a test flake. I have to modify the origin-aggregated-logging tests first before this will pass.

portante added a commit to portante/origin-aggregated-logging that referenced this pull request Oct 16, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.
@sdodson
Member

sdodson commented Oct 16, 2017

/test upgrade

@sdodson
Member

sdodson commented Oct 16, 2017

/assign jcantrill

openshift-merge-robot added a commit to openshift/origin-aggregated-logging that referenced this pull request Oct 16, 2017
Automatic merge from submit-queue.

Fix references to cpu-limits and test environment

This patch is a required sibling to the PR openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression tests, due to a behavior of the test environment provided by this repo where it attempts to remove a CPU limit resource path, but fails when it is not found.  That failure is now always the case with the changes in PR openshift/openshift-ansible#5748.
@jcantrill
Contributor

/test logging

@jcantrill
Contributor

/test logging
Not sure if this started before or after the dependent logging change

@jcantrill
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2017
jcantrill pushed a commit to jcantrill/origin-aggregated-logging that referenced this pull request Oct 16, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.

(cherry picked from commit 6b6fe9e)
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2017
@sdodson
Member

sdodson commented Oct 17, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2017
@richm
Contributor

richm commented Oct 17, 2017

Same problem as the docker events PR - fluentd could not start after 60 seconds
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/5748/test_pull_request_openshift_ansible_logging/2235/artifacts/scripts/entrypoint/artifacts/logging-fluentd-g0mnx.log
This doesn't say when it started, but the test failure occurred between [INFO] Logging test suite check-EFK-running started at Tue Oct 17 18:28:24 UTC 2017 and [WARNING] Logging test suite check-EFK-running failed at Tue Oct 17 18:29:35 UTC 2017

And this is the closest fluentd log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_openshift-ansible/5748/test_pull_request_openshift_ansible_logging/2235/artifacts/scripts/entrypoint/artifacts/logging-fluentd-g0mnx.log

2017-10-17 18:29:43 +0000 [info]: reading config file path="/etc/fluent/fluent.conf"

which is after the test

So perhaps there was a problem deploying the fluentd daemonset? Perhaps the 60-second timeout needs to be increased?

@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2017
@portante
Contributor Author

@jcantrill, @sdodson, hmm, the requested GitHub CI test is not green? They all look green to me...

@sdodson
Member

sdodson commented Oct 19, 2017

/lgtm cancel

@sdodson
Member

sdodson commented Oct 19, 2017

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
@sdodson
Member

sdodson commented Oct 19, 2017

/kind bug

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 19, 2017
We now use a CPU request to ensure logging infrastructure pods are
not capped by default for CPU usage. It is still important to ensure
we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing
behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default,
since we want to be able to use the necessary resources to complete
its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
@openshift-merge-robot openshift-merge-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
@jcantrill
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2017
portante added a commit to portante/openshift-ansible that referenced this pull request Oct 19, 2017
This is a backport for the release-3.6 branch of:
openshift#5748

We now use a CPU request to ensure logging infrastructure pods are
not capped by default for CPU usage. It is still important to ensure
we have a minimum amount of CPU.

We keep the use of the variables *_cpu_limit so that the existing
behavior is maintained.

Note that we don't want to cap an infra pod's CPU usage by default,
since we want to be able to use the necessary resources to complete
its tasks.

Bug 1501960 (https://bugzilla.redhat.com/show_bug.cgi?id=1501960)
@mffiedler

mffiedler commented Oct 19, 2017

Logging deploys are broken now (https://bugzilla.redhat.com/show_bug.cgi?id=1504191). When I try to deploy using this branch I hit the same issue. It might be part of what is plaguing the tests for this PR.

Suggest trying a manual install using this branch. Don't think it will work.

@portante
Contributor Author

@sdodson, @jcantrill, @richm, this looks like another test failure from somewhere else in the system:

Using project "default".
[INFO] [CLEANUP] Dumping container logs to _output/scripts/conformance/logs/containers
[INFO] [CLEANUP] Truncating log files over 200M
[INFO] [CLEANUP] Stopping docker containers
[INFO] [CLEANUP] Removing docker containers
Error response from daemon: You cannot remove a running container aba826a0362fd94ffe5fd949050dce1e64b517da5e2f6077d587845e28f35f7a. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 0142657409e6f40819aa487b325b4b5fb0a702983c0bf25b5d07280dbb043e60. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 310aa74c2e5ff6521914cbad91fbcc0d438d37c8e5245d28292bb407b858a47d. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container ba5a9ae59d4fa64ed851c974d1c4af73e0c31069271ae004ccc0c198a10edf30. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container 2b4c8205defd31768c6440d3bedc0b438397e7afbe9449098b328286547fc443. Stop the container before attempting removal or use -f
Error response from daemon: You cannot remove a running container ebaa6702219ee971d39c486f73178579dbc2b9982bcf365dc59d39867bdf2206. Stop the container before attempting removal or use -f
[INFO] [CLEANUP] Killing child processes
[INFO] [CLEANUP] Pruning etcd data directory
rm: cannot remove ‘/tmp/etcd/member’: Permission denied
[ERROR] test/extended/conformance.sh exited with code 1 after 00h 39m 17s
make: *** [test-extended] Error 1
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace

@ewolinetz
Contributor

@mffiedler That bz doesn't look to be related to what is flaking in this PR

@ewolinetz
Contributor

conformance failure on install, upgrade results seem to be missing entirely

/retest

@jcantrill
Contributor

/test install

@sdodson
Member

sdodson commented Oct 20, 2017

flake on openshift/origin#16929

@sdodson
Member

sdodson commented Oct 20, 2017

Let's be sure to link flakes; this one has been happening all the time, and linking them is the best way to provide feedback on which flakes are highest priority.

@jcantrill
Contributor

/test install

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@sdodson sdodson merged commit ac2af73 into openshift:master Oct 20, 2017
sqtran pushed a commit to sqtran/origin-aggregated-logging that referenced this pull request Nov 10, 2017
This patch is a required sibling to the openshift-ansible PR
openshift/openshift-ansible#5748.

Without this patch landing first, that PR will not pass its regression
tests, due to a behavior of the test environment provided by this repo
where it attempts to remove a CPU limit resource path, but fails when it
is not found.  That failure is now always the case with the changes in
PR openshift/openshift-ansible#5748.
Labels
affects_3.6 kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.