-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generate a single debug file for a long duration capture #10279
Conversation
The API allows collecting profiles even debug_enabled=false as long as ACLs are enabled. Remove this check from the CLI so that users do not need to set debug_enabled=true for no reason. Also: - fix the API client to return errors on non-200 status codes for debug endpoints - improve the failure messages when pprof data can not be collected Co-Authored-By: Dhia Ayachi <[email protected]>
parallel runs create a race condition that fail the debug tests
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thank you for working on this!
This is a bit different then what I was imagining. I'm actually not sure if these file formats are safe for concatenation. Maybe they are?
I think this might work, but collecting a single profile might also help us miss fewer things. I was thinking that we could split up captureDynamic
and make a single API call for each of the profiles.
Looking at https://golang.org/pkg/net/http/pprof/ , we can specify the duration using the seconds
parameter. That way we can specify seconds=120
(or whatever duration is set using the command line flag).
I guess that will also require changes to the HTTP client in api/debug
, because right now it looks like it is not accepting any parameters.
Thank you @dnephin, |
This reverts commit c2d0334
I went through the code to understand what is the current behaviour and my findings are the following:
So if we do a capture with duration=200ms and an interval=100ms we will end up with 22 folders ((2s + 200ms)/100ms) each having a file with the data for each type of capture we specified (logs, metrics, pprof) To achieve what is stated in this PR we need to change the logic because capturing everything in the same file defy the logic of having a duration and an interval. one option would be to simplify the usage and have interval = duration if not specified so we capture a single set of files for the total duration specified (including the grace period) @dnephin I will need your insights on this one as I'm not sure what we want In the meantime, I reverted the change I made and did some refactoring to clean up the code a bit |
- timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist
This reverts commit c2d0334
…p/consul into dhia/debug-generate-single-file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good.
Not sure if other reviewers want to take another pass - might be worth a ping to see but I'll approve this 😄 .
There was one integration test failure that looks like a flake and I just re-ran it. If it fails again it might be worth double-checking but I expect it's unrelated as I don't think we even run this command in envoy integration tests anywhere.
Co-authored-by: Paul Banks <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, couple minor things, but nothing blocking
Co-authored-by: Daniel Nephin <[email protected]>
…> v0.0.0-20210601050228-01bbb1931b22
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/381945. |
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/381978. |
🍒✅ Cherry pick of commit 005ad9e onto |
* debug: remove the CLI check for debug_enabled The API allows collecting profiles even debug_enabled=false as long as ACLs are enabled. Remove this check from the CLI so that users do not need to set debug_enabled=true for no reason. Also: - fix the API client to return errors on non-200 status codes for debug endpoints - improve the failure messages when pprof data can not be collected Co-Authored-By: Dhia Ayachi <[email protected]> * remove parallel test runs parallel runs create a race condition that fail the debug tests * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * extract wait group outside the go routine to avoid a race condition * capture pprof in a separate go routine * perform a single capture for pprof data for the whole duration * add missing vendor dependency * add a change log and fix documentation to reflect the change * create function for timestamp dir creation and simplify error handling * use error groups and ticker to simplify interval capture loop * Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval * refactor Logs capture routine and add log capture specific test * improve error reporting when log test fail * change test duration to 1s * make time parsing in log line more robust * refactor log time format in a const * test on log line empty the earliest possible and return Co-authored-by: Freddy <[email protected]> * rename function to captureShortLived * more specific changelog Co-authored-by: Paul Banks <[email protected]> * update documentation to reflect current implementation * add test for behavior when invalid param is passed to the command * fix argument line in test * a more detailed description of the new behaviour Co-authored-by: Paul Banks <[email protected]> * print success right after the capture is done * remove an unnecessary error check Co-authored-by: Daniel Nephin <[email protected]> * upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22 Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Freddy <[email protected]> Co-authored-by: Paul Banks <[email protected]>
🍒✅ Cherry pick of commit 005ad9e onto |
* debug: remove the CLI check for debug_enabled The API allows collecting profiles even debug_enabled=false as long as ACLs are enabled. Remove this check from the CLI so that users do not need to set debug_enabled=true for no reason. Also: - fix the API client to return errors on non-200 status codes for debug endpoints - improve the failure messages when pprof data can not be collected Co-Authored-By: Dhia Ayachi <[email protected]> * remove parallel test runs parallel runs create a race condition that fail the debug tests * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * extract wait group outside the go routine to avoid a race condition * capture pprof in a separate go routine * perform a single capture for pprof data for the whole duration * add missing vendor dependency * add a change log and fix documentation to reflect the change * create function for timestamp dir creation and simplify error handling * use error groups and ticker to simplify interval capture loop * Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval * refactor Logs capture routine and add log capture specific test * improve error reporting when log test fail * change test duration to 1s * make time parsing in log line more robust * refactor log time format in a const * test on log line empty the earliest possible and return Co-authored-by: Freddy <[email protected]> * rename function to captureShortLived * more specific changelog Co-authored-by: Paul Banks <[email protected]> * update documentation to reflect current implementation * add test for behavior when invalid param is passed to the command * fix argument line in test * a more detailed description of the new behaviour Co-authored-by: Paul Banks <[email protected]> * print success right after the capture is done * remove an unnecessary error check Co-authored-by: Daniel Nephin <[email protected]> * upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22 Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Freddy <[email protected]> Co-authored-by: Paul Banks <[email protected]>
* debug: remove the CLI check for debug_enabled The API allows collecting profiles even debug_enabled=false as long as ACLs are enabled. Remove this check from the CLI so that users do not need to set debug_enabled=true for no reason. Also: - fix the API client to return errors on non-200 status codes for debug endpoints - improve the failure messages when pprof data can not be collected Co-Authored-By: Dhia Ayachi <[email protected]> * remove parallel test runs parallel runs create a race condition that fail the debug tests * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * extract wait group outside the go routine to avoid a race condition * capture pprof in a separate go routine * perform a single capture for pprof data for the whole duration * add missing vendor dependency * add a change log and fix documentation to reflect the change * create function for timestamp dir creation and simplify error handling * use error groups and ticker to simplify interval capture loop * Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval * refactor Logs capture routine and add log capture specific test * improve error reporting when log test fail * change test duration to 1s * make time parsing in log line more robust * refactor log time format in a const * test on log line empty the earliest possible and return Co-authored-by: Freddy <[email protected]> * rename function to captureShortLived * more specific changelog Co-authored-by: Paul Banks <[email protected]> * update documentation to reflect current implementation * add test for behavior when invalid param is passed to the command * fix argument line in test * a more detailed description of the new behaviour Co-authored-by: Paul Banks <[email protected]> * print success right after the capture is done * remove an unnecessary error check Co-authored-by: Daniel Nephin <[email protected]> * upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22 Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Freddy <[email protected]> Co-authored-by: Paul Banks <[email protected]>
…0360) * debug: remove the CLI check for debug_enabled The API allows collecting profiles even debug_enabled=false as long as ACLs are enabled. Remove this check from the CLI so that users do not need to set debug_enabled=true for no reason. Also: - fix the API client to return errors on non-200 status codes for debug endpoints - improve the failure messages when pprof data can not be collected Co-Authored-By: Dhia Ayachi <[email protected]> * remove parallel test runs parallel runs create a race condition that fail the debug tests * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * snapshot the timestamp at the beginning of the capture - timestamp used to create the capture sub folder is snapshot only at the beginning of the capture and reused for subsequent captures - capture append to the file if it already exist * Revert "snapshot the timestamp at the beginning of the capture" This reverts commit c2d0334 * Refactor captureDynamic to extract capture logic for each item in a different func * extract wait group outside the go routine to avoid a race condition * capture pprof in a separate go routine * perform a single capture for pprof data for the whole duration * add missing vendor dependency * add a change log and fix documentation to reflect the change * create function for timestamp dir creation and simplify error handling * use error groups and ticker to simplify interval capture loop * Logs, profile and traces are captured for the full duration. Metrics, Heap and Go routines are captured every interval * refactor Logs capture routine and add log capture specific test * improve error reporting when log test fail * change test duration to 1s * make time parsing in log line more robust * refactor log time format in a const * test on log line empty the earliest possible and return Co-authored-by: Freddy <[email protected]> * rename function to captureShortLived * more specific changelog Co-authored-by: Paul Banks <[email protected]> * update documentation to reflect current implementation * add test for behavior when invalid param is passed to the command * fix argument line in test * a more detailed description of the new behaviour Co-authored-by: Paul Banks <[email protected]> * print success right after the capture is done * remove an unnecessary error check Co-authored-by: Daniel Nephin <[email protected]> * upgraded github.com/google/pprof v0.0.0-20181206194817-3ea8567a2e57 => v0.0.0-20210601050228-01bbb1931b22 Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Freddy <[email protected]> Co-authored-by: Paul Banks <[email protected]> Co-authored-by: Daniel Nephin <[email protected]> Co-authored-by: Freddy <[email protected]> Co-authored-by: Paul Banks <[email protected]>
Current behaviour when generating dynamic debug files is to create multiple files in subfolders that represent each capture interval that make it harder to correlate logs and debug
This PR is to: