Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[determine-reboot-cause] Ignore non-hardware reboot cause #6349

Merged
merged 2 commits into from
Jan 5, 2021

Conversation

vaibhavhd
Copy link
Contributor

- Why I did it
Reboot cause prints "Non-Hardware" instead of showing software reboot cause.

admin@vlab-01:~$ show reboot-cause
Non-Hardware (N/A)
admin@vlab-01:~$

- How I did it
Fixed the handling for Non Hardware reboot cause. Ignore if Non-Hardware is present in the hardware_reboot_cause output.

This will make software reboot cause to be taken as the effective reboot-cause.
- How to verify it
With fix, the hardware reboot cause is ignored (if it is non hw):

admin@vlab-01:~$ show reboot-cause 
User issued 'warm-reboot' command [User: admin, Time: Mon 04 Jan 2021 05:20:49 PM UTC]
admin@vlab-01:~$ 

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

sujinmkang
sujinmkang previously approved these changes Jan 4, 2021
Comment on lines 190 to 195
if proc_cmdline_reboot_cause is not None:
previous_reboot_cause = software_reboot_cause
elif hardware_reboot_cause is not REBOOT_CAUSE_NON_HARDWARE:
elif REBOOT_CAUSE_NON_HARDWARE not in hardware_reboot_cause:
previous_reboot_cause = hardware_reboot_cause
else:
previous_reboot_cause = software_reboot_cause
Copy link
Contributor

@jleveque jleveque Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer using .startswith(REBOOT_CAUSE_NON_HARDWARE) rather than REBOOT_CAUSE_NON_HARDWARE not in hardware_reboot_cause` to be more precise.

Maybe we can simplify this to:

    if proc_cmdline_reboot_cause is not None or hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
        previous_reboot_cause = software_reboot_cause
    else:
        previous_reboot_cause = hardware_reboot_cause

We might also want to check if hardware_reboot_cause is None:

    if (proc_cmdline_reboot_cause is not None or
            hardware_reboot_cause is None or
            hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
        previous_reboot_cause = software_reboot_cause
    else:
        previous_reboot_cause = hardware_reboot_cause

@sujinmkang: Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I also would like if we can simplify the handling further to just:

-    hardware_reboot_cause = None
     # This variable is kept for future-use purpose. When proc_cmd_line/vendor/software provides
     # any additional_reboot_info it will be stored as a "comment" in REBOOT_CAUSE_HISTORY_FILE
     additional_reboot_info = "N/A"
 
-    # 1. Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
-    proc_cmdline_reboot_cause = find_proc_cmdline_reboot_cause()
-
-    # 2. Check if the previous reboot was caused by hardware
-    #    If yes, the hardware reboot cause will be treated as the reboot cause
-    hardware_reboot_cause = find_hardware_reboot_cause()
-
-    # 3. If there is a REBOOT_CAUSE_FILE, it will contain any software-related
-    #    reboot info. We will use it as the previous cause.
-    software_reboot_cause = find_software_reboot_cause()
-
-    # The main decision logic of the reboot cause:
-    # If there is a reboot cause indicated by /proc/cmdline, it should be warmreboot/fastreboot
-    #   the software_reboot_cause which is the content of /hosts/reboot-cause/reboot-cause.txt
-    #   will be treated as the reboot cause
-    # Elif there is a reboot cause indicated by platform API,
-    #   the hardware_reboot_cause will be treated as the reboot cause
-    # Else the software_reboot_cause will be treated as the reboot cause
-    if proc_cmdline_reboot_cause is not None:
-        previous_reboot_cause = software_reboot_cause
-    elif REBOOT_CAUSE_NON_HARDWARE not in hardware_reboot_cause:
-        previous_reboot_cause = hardware_reboot_cause
+    # Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
+    previous_reboot_cause = find_proc_cmdline_reboot_cause()
+
+    # If /proc/cmdline does not indicate reboot cause, check if the previous reboot was caused by hardware
+    if previous_reboot_cause is None:
+        previous_reboot_cause = find_hardware_reboot_cause()
+        if previous_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
+            # If the reboot cause if non HW, get the reboot cause from REBOOT_CAUSE_FILE
+            previous_reboot_cause = find_software_reboot_cause()
     else:
-        previous_reboot_cause = software_reboot_cause
+        # Get the reboot cause from REBOOT_CAUSE_FILE
+        previous_reboot_cause = find_software_reboot_cause()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condensed logic will then look like:

    # Remove stale PREVIOUS_REBOOT_CAUSE_FILE if it exists
    if os.path.exists(PREVIOUS_REBOOT_CAUSE_FILE):
        os.remove(PREVIOUS_REBOOT_CAUSE_FILE)

    # This variable is kept for future-use purpose. When proc_cmd_line/vendor/software provides
    # any additional_reboot_info it will be stored as a "comment" in REBOOT_CAUSE_HISTORY_FILE
    additional_reboot_info = "N/A"

    # Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
    previous_reboot_cause = find_proc_cmdline_reboot_cause()

    # If /proc/cmdline does not indicate reboot cause, check if the previous reboot was caused by hardware
    if previous_reboot_cause is None:
        previous_reboot_cause = find_hardware_reboot_cause()
        if previous_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
            # If the reboot cause if non HW, get the reboot cause from REBOOT_CAUSE_FILE
            previous_reboot_cause = find_software_reboot_cause()
    else:
        # Get the reboot cause from REBOOT_CAUSE_FILE
        previous_reboot_cause = find_software_reboot_cause()

    # Current time
    reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))

    # Save the previous cause info into its history file as json format
    reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

    # Create reboot-cause-#time#.json under history directory
    REBOOT_CAUSE_HISTORY_FILE = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))

Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. I also suggest using a different variable to store the /proc/cmdline cause (I also reworded a comment):

    # Remove stale PREVIOUS_REBOOT_CAUSE_FILE if it exists
    if os.path.exists(PREVIOUS_REBOOT_CAUSE_FILE):
        os.remove(PREVIOUS_REBOOT_CAUSE_FILE)

    # This variable is kept for future-use purpose. When proc_cmd_line/vendor/software provides
    # any additional_reboot_info it will be stored as a "comment" in REBOOT_CAUSE_HISTORY_FILE
    additional_reboot_info = "N/A"

    # Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
    proc_cmdline_reboot_cause = find_proc_cmdline_reboot_cause()

    # If /proc/cmdline does not indicate reboot cause, check if the previous reboot was caused by hardware
    if proc_cmdline_reboot_cause is None:
        previous_reboot_cause = find_hardware_reboot_cause()
        if previous_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
            # If the reboot cause is non-hardware, get the reboot cause from REBOOT_CAUSE_FILE
            previous_reboot_cause = find_software_reboot_cause()
    else:
        # Get the reboot cause from REBOOT_CAUSE_FILE
        previous_reboot_cause = find_software_reboot_cause()

    # Current time
    reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))

    # Save the previous cause info into its history file as json format
    reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

    # Create reboot-cause-#time#.json under history directory
    REBOOT_CAUSE_HISTORY_FILE = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))

We could also simplify further, but the logic might not be as clear:

    # Remove stale PREVIOUS_REBOOT_CAUSE_FILE if it exists
    if os.path.exists(PREVIOUS_REBOOT_CAUSE_FILE):
        os.remove(PREVIOUS_REBOOT_CAUSE_FILE)

    # This variable is kept for future-use purpose. When proc_cmd_line/vendor/software provides
    # any additional_reboot_info it will be stored as a "comment" in REBOOT_CAUSE_HISTORY_FILE
    additional_reboot_info = "N/A"

    # Default the reboot cause to the cause found in REBOOT_CAUSE_FILE
    previous_reboot_cause = find_software_reboot_cause()

    # Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
    proc_cmdline_reboot_cause = find_proc_cmdline_reboot_cause()

    # If /proc/cmdline does not indicate reboot cause, check if the previous reboot was caused by hardware
    if proc_cmdline_reboot_cause is None:
        hardware_reboot_cause = find_hardware_reboot_cause()
        if not hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
            previous_reboot_cause = hardware_reboot_cause

    # Current time
    reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))

    # Save the previous cause info into its history file as json format
    reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

    # Create reboot-cause-#time#.json under history directory
    REBOOT_CAUSE_HISTORY_FILE = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Update the logic now. I kept it at the first proposal.

@vaibhavhd vaibhavhd changed the title Ignore NON HW reboot cause [determine-reboot-cause] Ignore non-hardware reboot cause Jan 4, 2021
@lguohan
Copy link
Collaborator

lguohan commented Jan 5, 2021

@sujinmkang , can you sign-off?

@sujinmkang
Copy link
Collaborator

@vaibhavhd It looks good to me but I was also thinking about logic like below.

    # Check if the previous reboot was warm/fast reboot by testing whether there is "fast|fastfast|warm" in /proc/cmdline
    proc_cmdline_reboot_cause = find_proc_cmdline_reboot_cause()

    # Check the hardware_reboot_cause indicated by platform API
    hardware_reboot_cause = find_hardware_reboot_cause()

    # If /proc/cmdline does not indicate reboot cause and hardware_reboot_cause is non_hardware
    if proc_cmdline_reboot_cause is None and not hardware_reboot_cause.startswith(REBOOT_CAUSE_NON_HARDWARE):
        previous_reboot_cause = hardware_reboot_cause
    else:
        # Get the reboot cause from REBOOT_CAUSE_FILE
        previous_reboot_cause = find_software_reboot_cause()

While checking the findhardware_reboot_cause(), if there is no minor cause available, dummy "()" is following the major hardware_reboot_cause all the time. Not all hardware reboot cause has hardware_reboot_cause_minor.
can we remove the dummy if there is no minor hardware reboot cause?

line #125
    hardware_reboot_cause = "{} ({})".format(hardware_reboot_cause_major, hardware_reboot_cause_minor)

@vaibhavhd
Copy link
Contributor Author

@vaibhavhd It looks good to me but I was also thinking about logic like below.

Thanks Sujin, I agree with the suggestion of not including minor dummy cause. As discussed, I will create a new PR for this enhancement along with some other minor refactoring that I have in mind.

For now, let's go ahead with the bug fix for the issue.

@vaibhavhd vaibhavhd merged commit 6fd78b6 into sonic-net:master Jan 5, 2021
@vaibhavhd vaibhavhd deleted the reboot-cause-fix branch January 5, 2021 18:50
lguohan pushed a commit that referenced this pull request Jan 6, 2021
- Why I did it - Reboot cause prints "Non-Hardware (N/A)" instead of showing the software reboot cause.
The issue is mishandling of hardware reboot cause in determine-reboot-cause script.

- How I did it
Fixed the handling for Non Hardware reboot cause. Ignore if Non-Hardware is present in the hardware_reboot_cause output. Added some code refactoring for simplicity.

- How to verify it - With fix, the hardware reboot cause is ignored (if it is non hw):
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants