Create SOS Report to Troubleshoot customer environments #14778

mfeifer · 2017-04-17T18:30:06Z

When troubleshooting large environments, we would like some kind of health report (or another name) that captures the vital signs of an environment. This is not the same as getting all the logs.

For discussion:

Number of Appliances
ManageIQ version on each appliance
Server Roles enabled on each appliance (and its failover if applicable)
Number of each type of worker on each appliance and its allocated memory
Memory and CPU on Appliance
Appliance Platform
Database size
Replicated Databases

mfeifer · 2017-04-17T18:31:36Z

@gtanzillo @blomquisg @dmetzger57 @Fryguy @agrare

Talk amongst yourselves, discuss.

mfeifer · 2017-04-17T18:40:20Z

Not sure if appropriate, but first occurrence of an error in the log?
Recent environmental changes? (I don't think that this can be automated.)

jrafanie · 2017-04-17T18:48:57Z

I think this PR, #14107, answers the appliances, roles, and workers part of the bullet points. If it's helpful, we can add the VERSION and memory information to the existing rake task, or add new ones that we log...

The memory information should be available and updated fairly often in these columns:

miq_servers:
- system_memory_free
- system_memory_used (this + the above is the total RAM)
- system_swap_free
- system_swap_used (this + the above is the total swap)
miq_workers:
- percent_cpu
- cpu_time
- os_priority
- memory_usage (RSS)
- memory_size (VSS)
- proportional_set_size (PSS)
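
As a minimal sketch of the "free + used = total" relationship noted above, assuming the column values have already been read into a plain dictionary: the `memory_totals` helper and the sample numbers are hypothetical, and only the column names come from the list.

```python
def memory_totals(server_row):
    """Derive total RAM and total swap from an miq_servers-style row.

    server_row is a hypothetical dict keyed by the column names listed above;
    values are in whatever unit the columns use.
    """
    total_ram = server_row["system_memory_free"] + server_row["system_memory_used"]
    total_swap = server_row["system_swap_free"] + server_row["system_swap_used"]
    return {"total_ram": total_ram, "total_swap": total_swap}


# Made-up sample values, for illustration only.
sample_row = {
    "system_memory_free": 2_000,
    "system_memory_used": 6_000,
    "system_swap_free": 3_500,
    "system_swap_used": 500,
}
print(memory_totals(sample_row))
```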

mfeifer · 2017-04-17T18:52:19Z

See https://github.com/sosreport/sos/wiki/How-to-Write-a-Plugin

Fryguy · 2017-04-17T19:23:17Z

We should investigate whether that sosreport tool is worth formally plugging into, or whether it makes sense to create a simple tools script to get what we need as a first pass, or to reuse an existing command like evm:status:full as @jrafanie said.

As a further explanation of the request here, it really isn't about large environments, but is for ANY environment where support is requested. In other words, this report would be something that support must run/ask for as a first response in every ticket they respond to. Right now, we always "ask for logs", but that is a very heavy-handed request since the logs are huge and need to be parsed and reviewed to find what we are looking for. Instead, this report is meant to be a single, simple command with a simple one-page text output that can be dumped to a file or copied into an email or bug ticket, and that gives support and/or developers the information that they almost always need and always ask for (we even ask for this stuff when we already have the logs, which is kind of silly, but the logs are so onerous).

If rake evm:status:full is the answer, that is fine, and if we want to enhance that, that is fine as well. However, in addition to having the script, we must also update our processes to ask for this report in every support case.

blomquisg · 2017-04-17T19:33:09Z

I definitely like the idea of evm:status:full. Basically, anything we can pack into a one-liner.

And, if it's a one-liner, we could always have sosreport call that if that ends up being the end goal.

blomquisg · 2017-04-17T19:39:47Z

Oh, something else to add:

Region, Zone, Provider landscape

Region
|
|-- Zone 1
|   |
|   |-- Provider ABC
|   |   |
|   |   |-- Provider ABC Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
|   |
|   |-- Provider 123
|       |
|       |-- Provider 123 Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
|
|-- Zone 2
    |
    |-- Provider XYZ
    |   |
    |   |-- Provider XYZ Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
    |
    |-- Provider 456
        |
        |-- Provider 456 Summary (# of Clusters, Hosts, VMs, Images, Instances, Containers, etc..)
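
A rough sketch of how an outline like the one above could be rendered as plain text, assuming the inventory counts have already been collected into nested dictionaries. The `render_landscape` function, the data shape, and the sample counts are all hypothetical and are not taken from any existing ManageIQ tool.

```python
def render_landscape(region, zones):
    """Return an indented text outline: region, zones, providers, inventory counts.

    zones is a hypothetical mapping of zone name -> provider name -> {kind: count}.
    """
    lines = [region]
    for zone, providers in zones.items():
        lines.append(f"  {zone}")
        for provider, counts in providers.items():
            summary = ", ".join(f"{kind}: {count}" for kind, count in counts.items())
            lines.append(f"    {provider} ({summary})")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
landscape = {
    "Zone 1": {"Provider ABC": {"Clusters": 2, "Hosts": 8, "VMs": 120}},
    "Zone 2": {"Provider XYZ": {"Hosts": 4, "VMs": 35}},
}
print(render_landscape("Region", landscape))
```

Keeping the output to one line per provider is one way to stay within the "one page of text" goal discussed earlier in the thread.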

mfeifer · 2017-04-24T14:32:02Z

@gtanzillo @blomquisg @dmetzger57 @Fryguy @agrare

Where are we with this? I do not want to get to the point where we hit another set of issues and wonder what happened to the SOS report.

Fryguy · 2017-05-10T17:19:16Z

> Oh, something else to add:
>
> Region, Zone, Provider landscape

There are already the tools/db_printers that do something similar that we can include into the single report.

jdeubel · 2017-05-10T17:23:52Z

Things that would be useful:

The provider landscape for all providers is critical as per @blomquisg
Status of connections (netstat, etc.)
Current status of the Miq_queue (similar to count for states)
DB health
I will add more as I think of them.

chessbyte · 2017-05-11T19:21:18Z

@bronaghs Can you make traction on a first cut of this? Once in place, we can iterate and improve it.

bronaghs · 2017-05-11T20:19:36Z

@chessbyte - will do.

bronaghs · 2017-05-16T17:21:42Z

@miq-bot assign @juliancheal

itamarh · 2017-05-17T02:20:26Z

Some indication of which features are in use, to help understand complexity and load in the environment, around:

smartstate analysis
compliance policies - defined/used
chargeback reports / any reports (just how many were run / last timestamp)
automation - if/how many defined/used
service catalog items - count/types

mfeifer · 2017-06-08T15:05:50Z

@dmetzger57 help

juliancheal · 2017-06-08T15:26:34Z

@mfeifer I've done some work on this, but I keep getting delayed.

dmetzger57 · 2017-06-08T15:48:50Z

@juliancheal I'll reach out to you and set up a time when we can chat. I'm going to be dedicating part of my time to helping @mfeifer's effort to make field engineering faster/better/stronger.

dmetzger57 · 2017-09-18T19:10:36Z

Perhaps a simple start is implementing a CLI tool for gathering high-level configuration / health information, initially providing the following:

EVM status
- Fine release and newer: 'rake evm:status_full'
- Pre-Fine release: 'rake evm:status'
Inventory breakdown, see #14778 (comment)
Miq_queue health
- Total number of messages
- Message count by method_name & state
Per-Appliance the output of free(1)
Per-Appliance Config
- Enabled Roles
- Worker counts

Of course taking into account multi-appliance, multi-zone, multi-region fun 😄

An SOS Report contains a massive amount of information. This suggestion looks to provide a lightweight tool to begin gaining a perspective on the environment being supported; it can be added to an SOS Report if desired.
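
To make the Miq_queue health items above concrete, here is a minimal sketch of the two counts (total messages, and messages grouped by method_name and state), assuming the queue rows have already been fetched into a list of dictionaries. The `queue_summary` helper and the sample rows are hypothetical; in a real appliance these values would come from the database rather than from a literal list.

```python
from collections import Counter


def queue_summary(messages):
    """Return (total, counts) where counts maps (method_name, state) -> message count.

    messages is a hypothetical list of dicts with "method_name" and "state" keys.
    """
    total = len(messages)
    by_method_and_state = Counter((m["method_name"], m["state"]) for m in messages)
    return total, by_method_and_state


# Made-up sample rows, for illustration only.
rows = [
    {"method_name": "refresh", "state": "ready"},
    {"method_name": "refresh", "state": "ready"},
    {"method_name": "perf_capture", "state": "dequeue"},
]
total, counts = queue_summary(rows)
print(f"Total messages: {total}")
for (method_name, state), count in sorted(counts.items()):
    print(f"{method_name:<15} {state:<10} {count}")
```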

chessbyte · 2017-09-25T15:11:45Z

> An SOS Report contains a massive amount of information

I think the whole point of an SOS Report is something that is relatively small that can be cut/paste in an email to get clarity on the user's environment. This would precede the set of ManageIQ logs that tend to be massive and are typically shared via attachment or a link to an available storage location.

As @ohadlevy mentioned via email, perhaps we can borrow some ideas from the Foreman project here and here.

miq-bot · 2018-05-14T04:00:39Z

This issue has been automatically marked as stale because it has not been updated for at least 6 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions!

JPrause · 2019-01-23T16:40:24Z

@dmetzger57 is this still a valid issue? If not, can you close it?
If there's no update by next week, I'll be closing this issue.

JPrause · 2019-01-29T19:22:45Z

Closing issue. If you feel the issue needs to remain open, please let me know and it will be reopened.
@miq-bot close_issue

mfeifer changed the title ~~Create Health Report to Troubleshoot large environments~~ Create SOS Report to Troubleshoot large environments Apr 17, 2017

mfeifer changed the title ~~Create SOS Report to Troubleshoot large environments~~ Create Health Report to Troubleshoot large environments Apr 17, 2017

mfeifer changed the title ~~Create Health Report to Troubleshoot large environments~~ Create SOS Report to Troubleshoot large environments Apr 17, 2017

Fryguy changed the title ~~Create SOS Report to Troubleshoot large environments~~ Create SOS Report to Troubleshoot customer environments Apr 17, 2017

blomquisg self-assigned this May 11, 2017

miq-bot assigned juliancheal and unassigned blomquisg May 16, 2017

chessbyte assigned dmetzger57 and unassigned juliancheal Nov 8, 2017

miq-bot added the stale label May 14, 2018

miq-bot closed this as completed Jan 29, 2019
