Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve nsexec logging #2836

Merged
merged 2 commits into from
Apr 13, 2021
Merged

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Mar 6, 2021

Draft as it's blocked by #2835

In order to make runc --debug actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.

Before:

# runc --debug run x123
DEBU[0000] nsexec:601 nsexec started                    
DEBU[0000] child process in init()                      
DEBU[0000] logging has already been configured          
DEBU[0000] log pipe has been closed: EOF                
DEBU[0001] process exited                                pid=1472529 status=0

After:

# runc --debug run x123
DEBU[0000] nsexec[1472380]: => nsexec container setup   
DEBU[0000] nsexec-0[1472380]: ~> nsexec stage-0         
DEBU[0000] nsexec-0[1472380]: spawn stage-1             
DEBU[0000] nsexec-0[1472380]: -> stage-1 synchronisation loop 
DEBU[0000] nsexec-1[1472384]: ~> nsexec stage-1         
DEBU[0000] nsexec-1[1472384]: unshare remaining namespace (except cgroupns) 
DEBU[0000] nsexec-1[1472384]: spawn stage-2             
DEBU[0000] nsexec-1[1472384]: request stage-0 to forward stage-2 pid (1472385) 
DEBU[0000] nsexec-0[1472380]: stage-1 requested pid to be forwarded 
DEBU[0000] nsexec-0[1472380]: forward stage-1 (1472384) and stage-2 (1472385) pids to runc 
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2               
DEBU[0000] nsexec-1[1472384]: signal completion to stage-0 
DEBU[0000] nsexec-1[1472384]: <~ nsexec stage-1         
DEBU[0000] nsexec-0[1472380]: stage-1 complete          
DEBU[0000] nsexec-0[1472380]: <- stage-1 synchronisation loop 
DEBU[0000] nsexec-0[1472380]: -> stage-2 synchronisation loop 
DEBU[0000] nsexec-0[1472380]: signalling stage-2 to run 
DEBU[0000] nsexec-2[1]: unshare cgroup namespace        
DEBU[0000] nsexec-2[1]: signal completion to stage-0    
DEBU[0000] nsexec-0[1472380]: stage-2 complete          
DEBU[0000] nsexec-0[1472380]: <- stage-2 synchronisation loop 
DEBU[0000] nsexec-0[1472380]: <~ nsexec stage-0         
DEBU[0000] nsexec-2[1]: <= nsexec container setup       
DEBU[0000] nsexec-2[1]: booting up go runtime ...       
DEBU[0000] child process in init()                      
DEBU[0000] init: closing the pipe to signal completion  
DEBU[0000] sending signal to process urgent I/O condition 
DEBU[0000] sending signal to process urgent I/O condition 
DEBU[0001] process exited                                pid=1472385 status=0

This is a carry of #2460 with a few fixes on top:

Closes: #2460

(this was originally developed in #2487)

@thaJeztah
Copy link
Member

This still draft?

@kolyshkin kolyshkin marked this pull request as ready for review March 31, 2021 01:36
@kolyshkin
Copy link
Contributor Author

No longer a draft; @cyphar PTAL (this is mostly your code)

In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.

[@kolyshkin: rebasing; fix formatting via indent for make validate to pass]

Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Contributor Author

Rebased again; PTAL @cyphar

@kolyshkin
Copy link
Contributor Author

centos7 fails

=== RUN   TestSystemdFreeze
    utils_test.go:86: exec_test.go:539: unexpected error: unable to freeze
        
--- FAIL: TestSystemdFreeze (0.74s)

This is the second time I see it, not sure what's happening but definitely unrelated to this PR.

CI restarted.

cyphar
cyphar previously approved these changes Apr 9, 2021
Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'll take your word for it that most of this new code is mine. 😬

@kolyshkin
Copy link
Contributor Author

CentOS CI failure is a flake (#2760), restarting.

Since the previous commit, some strings logged by write_log() contain a
literal newline, which leads to errors like this one:

> # time="2020-06-07T15:41:37Z" level=error msg="failed to decode \"{\\\"level\\\":\\\"debug\\\", \\\"msg\\\": \\\"nsexec-0[2265]: update /proc/2266/uid_map to '0 1000 1\\n\" to json: invalid character '\\n' in string literal"

The fix is to escape such characters.

Add a simple (as much as it can be) routine which implements JSON string
escaping as required by RFC4627, section 2.5, plus escaping of DEL (0x7f)
character (not required, but allowed by the standard, and usually done
by tools such as jq).

As much as I hate to code something like this, I was not able to find
a ready to consume and decent C implementation (not using glib).

Added a test case (and some additional asserts in C code, conditionally
enabled by the test case) to make sure the implementation is correct.
The test case have to live in a separate directory so we can use
different C flags to compile the test, and use C from go test.

[v2: try to simplify the code, add more tests]
[v3: don't do exit(1), try returning an error instead]

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin requested a review from cyphar April 12, 2021 23:52
@kolyshkin
Copy link
Contributor Author

@cyphar PTAL; the only change after your review was this: #2836 (comment)

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

free(s);
// As malloc failed, strdup can fail, too, so in the worst case
// scenario NULL will be returned from here.
return strdup("escape_json_string: out of memory");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it makes more sense to always return NULL (and then in the caller provide the fallback error message), but it doesn't really matter imho.

@cyphar cyphar closed this in 14ce8be Apr 13, 2021
@cyphar cyphar merged commit 14ce8be into opencontainers:master Apr 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants