Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-16921: daemon: Make binary writing idempotent #3825

Merged
merged 2 commits into from
Jul 31, 2023

Conversation

sinnykumari
Copy link
Contributor

- What I did

- How to verify it

- Description for the changelog

@sinnykumari sinnykumari changed the title daemon: no need to reexec when target and source rhel version is same OCPBUGS-16921: daemon: no need to reexec when target and source rhel version is same Jul 27, 2023
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 27, 2023
@openshift-ci-robot
Copy link
Contributor

@sinnykumari: This pull request references Jira Issue OCPBUGS-16921, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 27, 2023
@openshift-ci openshift-ci bot requested a review from gpei July 27, 2023 15:52
@yuqi-zhang
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 27, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2023
@@ -466,7 +466,8 @@ func ReexecuteForTargetRoot(target string) error {
// Otherwise, we assume that there's no suffixing needed. Hopefully
// by RHEL10 the MCD will have fundamentally changed and we won't be doing the
// chroot() thing anymore.
klog.Info("not chrooting for source=rhel-%s target=rhel-%s", sourceMajor, targetMajor)
klog.Infof("not chrooting for source=rhel-%s target=rhel-%s", sourceMajor, targetMajor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually sorry, hold on a second...I think I was wrong in our chat. We do need to chroot right? We just don't want to re-exec...

Something more like

diff --git a/pkg/daemon/daemon.go b/pkg/daemon/daemon.go
index 6a9f2079c..2e18bf5da 100644
--- a/pkg/daemon/daemon.go
+++ b/pkg/daemon/daemon.go
@@ -471,35 +471,38 @@ func ReexecuteForTargetRoot(target string) error {
 	} else {
 		klog.Info("assuming we can use container binary chroot() to host")
 	}
-	sourceBinary := "/usr/bin/machine-config-daemon" + sourceBinarySuffix
-	src, err := os.Open(sourceBinary)
-	if err != nil {
-		return fmt.Errorf("opening %s: %w", sourceBinary, err)
-	}
-	defer src.Close()
+	var targetBin string
+	if sourceBinarySuffix != "" {
+		sourceBinary := "/usr/bin/machine-config-daemon" + sourceBinarySuffix
+		src, err := os.Open(sourceBinary)
+		if err != nil {
+			return fmt.Errorf("opening %s: %w", sourceBinary, err)
+		}
+		defer src.Close()
 
-	targetBinBase := "run/bin/machine-config-daemon"
-	targetBin := filepath.Join(target, targetBinBase)
-	targetBinDir := filepath.Dir(targetBin)
-	if _, err := os.Stat(targetBinDir); err != nil {
-		if err := os.Mkdir(targetBinDir, 0o755); err != nil {
-			return fmt.Errorf("mkdir %s: %w", targetBinDir, err)
+		targetBinBase := "run/bin/machine-config-daemon"
+		targetBin = filepath.Join(target, targetBinBase)
+		targetBinDir := filepath.Dir(targetBin)
+		if _, err := os.Stat(targetBinDir); err != nil {
+			if err := os.Mkdir(targetBinDir, 0o755); err != nil {
+				return fmt.Errorf("mkdir %s: %w", targetBinDir, err)
+			}
 		}
-	}
 
-	f, err := os.Create(targetBin)
-	if err != nil {
-		return fmt.Errorf("writing %s: %w", targetBin, err)
-	}
-	if _, err := io.Copy(f, src); err != nil {
+		f, err := os.Create(targetBin)
+		if err != nil {
+			return fmt.Errorf("writing %s: %w", targetBin, err)
+		}
+		if _, err := io.Copy(f, src); err != nil {
+			f.Close()
+			return fmt.Errorf("writing %s: %w", targetBin, err)
+		}
+		if err := f.Chmod(0o755); err != nil {
+			return err
+		}
+		// Must close our writable fd
 		f.Close()
-		return fmt.Errorf("writing %s: %w", targetBin, err)
 	}
-	if err := f.Chmod(0o755); err != nil {
-		return err
-	}
-	// Must close our writable fd
-	f.Close()
 
 	if err := syscall.Chroot(target); err != nil {
 		return fmt.Errorf("failed to chroot to %s: %w", target, err)
@@ -509,6 +512,9 @@ func ReexecuteForTargetRoot(target string) error {
 		return fmt.Errorf("failed to change directory to /: %w", err)
 	}
 
+	if targetBin == "" {
+		return nil
+	}
 	// Now we will see the binary in the target root
 	targetBin = "/" + targetBinBase
 	// We have a "belt and suspenders" approach for detecting the case where

?

But there's still something weird going on here because I don't understand how we can be recursing here...IOW why are we getting that error of having the text file being busy?

Copy link
Contributor Author

@sinnykumari sinnykumari Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, we don't need these new binaries copy stuff but we do need to chroot. This happens when doing things in hurry :/

@cgwalters
Copy link
Member

/lgtm cancel
Per above...

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2023
@sinnykumari sinnykumari changed the title OCPBUGS-16921: daemon: no need to reexec when target and source rhel version is same OCPBUGS-16921: daemon: Make binary writing idempotent Jul 27, 2023
@sinnykumari
Copy link
Contributor Author

/retest-required

@sinnykumari
Copy link
Contributor Author

not sure why tests are in pending state

@sinnykumari sinnykumari marked this pull request as draft July 27, 2023 17:35
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2023
@sinnykumari sinnykumari marked this pull request as ready for review July 27, 2023 17:35
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2023
@openshift-ci openshift-ci bot requested review from jkyros and yuqi-zhang July 27, 2023 17:35
@cgwalters
Copy link
Member

cgwalters commented Jul 27, 2023

For the (async, public) record we had a live chat about this and we figured out that what is likely happening here is that in this scenario the MCD is running as a pod (from the binary in /run) and the playbook is running the MCD in once-from via an ansible playbook.

So then your question is...wait, how can it possibly work to have two instances of the MCD running? Well, I don't know...this goes to #1592

@sinnykumari
Copy link
Contributor Author

Looks good to go in
@yuqi-zhang for lgtm

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall should be good, 2 minor comments inline

if _, err := io.Copy(f, src); err != nil {
targetBinDir := filepath.Dir(targetBin)
if _, err := os.Stat(targetBinDir); err != nil {
if err := os.Mkdir(targetBinDir, 0o755); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we making the assumption that the error of the above is errnotfound? Might be helpful to make it explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, I am not sure adding here additional error check will add much value. Irrespective of what is the error, we are returning the error anyway with no further action needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the error return from os.Stat(), must like my other comment below, not the os.Mkdir

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return fmt.Errorf("mkdir %s: %w", targetBinDir, err)

// Be idempotent
if _, err := os.Stat(targetBin); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so we skip the write if the targetBin exists? Might be good to also make that explicit here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, updated

When we are skipping the mcd binary write when it already
exists. Today this is the case for RHEL worker node during upgrade.
Explicitly make sure that we are skipping because file doesn't exist
@yuqi-zhang
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 31, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6b2fa8d and 2 for PR HEAD edc68be in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 31, 2023

@sinnykumari: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit a623f63 into openshift:master Jul 31, 2023
@openshift-ci-robot
Copy link
Contributor

@sinnykumari: Jira Issue OCPBUGS-16921: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-16921 has been moved to the MODIFIED state.

In response to this:

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants