Create SOS Report to Troubleshoot customer environments #14778

Closed
mfeifer opened this issue Apr 17, 2017 · 22 comments


@mfeifer
Contributor

mfeifer commented Apr 17, 2017

When troubleshooting in large environments, I would like some kind of health report (or another name) that captures the vital signs of an environment. This is not the same as getting all the logs.

For discussion:

  • Number of Appliances
  • ManageIQ version on each appliance
  • Server Roles enabled on each appliance (and its failover if applicable)
  • Number of each type of worker on each appliance and its allocated memory
  • Memory and CPU on Appliance
  • Appliance Platform
  • Database size
  • Replicated Databases
@mfeifer
Contributor Author

mfeifer commented Apr 17, 2017

@gtanzillo @blomquisg @dmetzger57 @Fryguy @agrare

Talk amongst yourselves, discuss.

@mfeifer mfeifer changed the title Create Health Report to Troubleshoot large environments Create SOS Report to Troubleshoot large environments Apr 17, 2017
@mfeifer
Contributor Author

mfeifer commented Apr 17, 2017

Not sure if appropriate, but what about the first occurrence of an error in the log?
Recent environmental changes? (I don't think that this can be automated.)

@mfeifer mfeifer changed the title Create SOS Report to Troubleshoot large environments Create Health Report to Troubleshoot large environments Apr 17, 2017
@jrafanie
Member

jrafanie commented Apr 17, 2017

I think this PR, #14107, answers the appliances, roles, and workers part of the bullet points. If it's helpful, we can add the VERSION and memory information to the existing rake task, or add new ones that we log...

The memory information should be available and updated fairly often in

  • miq_servers:

    • system_memory_free
    • system_memory_used (this + the above is the total RAM)
    • system_swap_free
    • system_swap_used (this + the above is the total swap)
  • miq_workers:

    • percent_cpu
    • cpu_time
    • os_priority
    • memory_usage (RSS)
    • memory_size (VSS)
    • proportional_set_size (PSS)
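
As a rough illustration of how the free/used columns combine into totals, here is a minimal Python sketch. It assumes a miq_servers row has already been fetched into a plain dictionary. Only the column names come from the list above; the `memory_totals` helper, the sample values, and the byte unit are hypothetical.

```python
def memory_totals(server):
    """Derive total RAM and total swap from the free/used columns of one row."""
    total_ram = server["system_memory_free"] + server["system_memory_used"]
    total_swap = server["system_swap_free"] + server["system_swap_used"]
    return {"total_ram": total_ram, "total_swap": total_swap}


# Hypothetical row: values are made up and assumed to be in bytes.
server_row = {
    "system_memory_free": 2 * 1024**3,
    "system_memory_used": 6 * 1024**3,
    "system_swap_free": 3 * 1024**3,
    "system_swap_used": 1 * 1024**3,
}

totals = memory_totals(server_row)
print(totals["total_ram"] // 1024**3, "GiB RAM,",
      totals["total_swap"] // 1024**3, "GiB swap")
```

The same pattern would apply per appliance, so a report could show one total line per server rather than four raw columns.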

@mfeifer
Contributor Author

mfeifer commented Apr 17, 2017

@mfeifer mfeifer changed the title Create Health Report to Troubleshoot large environments Create SOS Report to Troubleshoot large environments Apr 17, 2017
@Fryguy Fryguy changed the title Create SOS Report to Troubleshoot large environments Create SOS Report to Troubleshoot customer environments Apr 17, 2017
@Fryguy
Member

Fryguy commented Apr 17, 2017

We should investigate whether the sosreport tool is worth formally plugging into, or whether it makes more sense to create a simple tools script that gets what we need as a first pass, or to reuse an existing command like evm:status:full as @jrafanie said.

As a further explanation of the request here, it really isn't about large environments, but is for ANY environment where support is requested. In other words, this report would be something that support must run/ask for as a first response in every ticket they respond to. Right now, we always "ask for logs", but that is a very heavy-handed request, since the logs are huge and need to be parsed and reviewed to find what we are looking for. Instead, this report is meant to be a single, simple command with a simple 1-page text output that can be dumped to a file or copied into an email or bug ticket. It gives support and/or developers the information that they almost always need and always ask for (we even ask for this stuff when we already have the logs, which is kind of silly, but the logs are so onerous).

If rake evm:status:full is the answer, that is fine, and if we want to enhance it, that is fine as well. However, in addition to having the script, we must also update our processes to ask for this report in every support case.
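
To make the "simple 1-page text output" idea concrete, here is a minimal Python sketch of a plain-text renderer. It is not how evm:status:full works; the `render_report` function, the section names, and the sample values are all hypothetical, and the data is assumed to have been gathered elsewhere.

```python
def render_report(sections):
    """Render {section title: [(label, value), ...]} as copy-pasteable text."""
    lines = []
    for title, rows in sections.items():
        lines.append(title)
        lines.append("-" * len(title))
        # Pad labels so the values line up within each section.
        width = max((len(label) for label, _ in rows), default=0)
        for label, value in rows:
            lines.append(f"{label.ljust(width)} : {value}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical sample data, for illustration only.
sample = {
    "Appliances": [("count", 2), ("platform", "example-platform")],
    "Database": [("size", "12 GB"), ("replicated", "no")],
}

print(render_report(sample), end="")
```

Keeping the output as plain aligned text, rather than JSON or HTML, is what lets it be pasted straight into an email or bug ticket.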

@blomquisg
Member

I definitely like the idea of evm:status:full. Basically, anything we can pack into a one-liner.

And, if it's a one-liner, we could always have sosreport call that if that ends up being the end goal.

@blomquisg
Member

Oh, something else to add:

Region, Zone, Provider landscape

Region
|
|-- Zone 1
|   |
|   |-- Provider ABC
|   |   |
|   |   |-- Provider ABC Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
|   |
|   |-- Provider 123
|       |
|       |-- Provider 123 Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
|
|-- Zone 2
    |
    |-- Provider XYZ
    |   |
    |   |-- Provider XYZ Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
    |
    |-- Provider 456
        |
        |-- Provider 456 Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
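
A landscape like the one above could be produced from nested data with a small recursive renderer. The following Python sketch assumes the region, zones, and providers are already available as nested `(name, children)` tuples; the `render_tree` function and the sample counts are hypothetical. For simplicity it keeps the vertical bar on every level instead of dropping it under the last child, so the output is slightly less tidy than the hand-drawn tree.

```python
def render_tree(node, depth=0):
    """Yield one text line per node, four columns of indent per level."""
    name, children = node
    connector = "" if depth == 0 else "|   " * (depth - 1) + "|-- "
    yield connector + name
    for child in children:
        yield from render_tree(child, depth + 1)


# Hypothetical landscape; names and counts are made up.
landscape = (
    "Region",
    [
        ("Zone 1", [("Provider ABC", [("3 clusters, 12 hosts, 240 VMs", [])])]),
        ("Zone 2", [("Provider XYZ", [])]),
    ],
)

print("\n".join(render_tree(landscape)))
```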

@mfeifer
Contributor Author

mfeifer commented Apr 24, 2017

@gtanzillo @blomquisg @dmetzger57 @Fryguy @agrare

Where are we with this? I do not want to get to the point where we hit another set of issues and wonder what happened to the SOS report.

@Fryguy
Member

Fryguy commented May 10, 2017

Oh, something else to add:

Region, Zone, Provider landscape

There are already the tools/db_printers that do something similar that we can include into the single report.

@jdeubel
Member

jdeubel commented May 10, 2017

Things that would be useful:

  • The provider landscape for all providers is critical as per @blomquisg
  • Status of connections (netstat, etc.)
  • Current status of the Miq_queue (similar to a count per state)
  • DB health
    I will add more as I think of them.
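
For the queue item, a count per state is cheap to compute once the rows are in hand. This Python sketch assumes queue rows have been fetched into dictionaries; the `state` key, the sample state names, and the `queue_counts_by_state` helper are assumptions for illustration, not a confirmed schema.

```python
from collections import Counter


def queue_counts_by_state(rows):
    """Count queue rows per state, e.g. for a one-line summary in a report."""
    return Counter(row["state"] for row in rows)


# Hypothetical queue rows; state names are illustrative.
queue_rows = [
    {"state": "ready"},
    {"state": "ready"},
    {"state": "dequeue"},
    {"state": "error"},
]

counts = queue_counts_by_state(queue_rows)
for state, count in sorted(counts.items()):
    print(f"{state}: {count}")
```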

@blomquisg blomquisg self-assigned this May 11, 2017
@chessbyte
Member

@bronaghs Can you make traction on a first cut of this? Once in place, we can iterate and improve it.

@bronaghs

@chessbyte - will do.

@bronaghs

@miq-bot assign @juliancheal

@miq-bot miq-bot assigned juliancheal and unassigned blomquisg May 16, 2017
@itamarh

itamarh commented May 17, 2017

Some indication of which features are used, to help understand the complexity and load in the environment, around:

  • smartstate analysis
  • compliance policies - defined/used
  • chargeback reports / any reports (just how many were run / last timestamp)
  • automation - if/how many defined/used
  • service catalog items - count/types

@mfeifer
Contributor Author

mfeifer commented Jun 8, 2017

@dmetzger57 help

@juliancheal
Member

@mfeifer I've done some work on this, but I keep getting delayed.

@dmetzger57
Contributor

@juliancheal I'll reach out to you and set up a time when we can chat. I'm going to be dedicating part of my time to helping @mfeifer's effort to make field engineering faster/better/stronger.

@dmetzger57
Contributor

Perhaps a simple start is implementing a cli tool for gathering high level configuration / health information, initially providing the following information:

Of course taking into account multi-appliance, multi-zone, multi-region fun 😄

An SOS Report contains a massive amount of information. This suggestion aims to provide a lightweight tool to begin gaining a perspective on the environment being supported; it can be added to an SOS Report if desired.

@chessbyte
Member

An SOS Report contains a massive amount of information

I think the whole point of an SOS Report is something that is relatively small that can be cut/paste in an email to get clarity on the user's environment. This would precede the set of ManageIQ logs that tend to be massive and are typically shared via attachment or a link to an available storage location.

As @ohadlevy mentioned via email, perhaps we can borrow some ideas from the Foreman project here and here.

@chessbyte chessbyte assigned dmetzger57 and unassigned juliancheal Nov 8, 2017
@miq-bot
Member

miq-bot commented May 14, 2018

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

@miq-bot miq-bot added the stale label May 14, 2018
@JPrause
Member

JPrause commented Jan 23, 2019

@dmetzger57 is this still a valid issue? If not, can you close it?
If there's no update by next week, I'll be closing this issue.

@JPrause
Member

JPrause commented Jan 29, 2019

Closing issue. If you feel the issue needs to remain open, please let me know and it will be reopened.
@miq-bot close_issue

@miq-bot miq-bot closed this as completed Jan 29, 2019