df stats #2

Closed
thorhs opened this issue Dec 1, 2020 · 14 comments · Fixed by #3

@thorhs
Owner

thorhs commented Dec 1, 2020

The exporter only reports volume group information, not filesystem capacity:

# TYPE aix_disk_free gauge

aix_disk_free{disk="hdisk1",vgname="ngamsoft",machine_serial="21475DV",lpar="GAMAY",group_id="32772"} 49664
aix_disk_free{disk="hdisk0",vgname="rootvg",machine_serial="21475DV",lpar="GAMAY",group_id="32772"} 61824

Is it possible to add "df"-style information, or something similar, to track disk space usage?

Regards
Frederic

Originally posted by @fredtriplefred in #1 (comment)

@thorhs
Owner Author

thorhs commented Dec 1, 2020

@fredtriplefred I'm sure it wouldn't be too difficult. We have traditional Nagios monitoring that has covered that aspect for us, so it never came up. This is definitely something that should be included. I'll try to dig into that soon.

@thorhs
Owner Author

thorhs commented Dec 2, 2020

@fredtriplefred I have released v1.8.0 with support for node_filesystem_X metrics. Unfortunately we are in a change freeze at work so I can't test it much, but what I have tested seems to be fine. I would appreciate it if you could deploy this new version and check that all the numbers add up.
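
For anyone checking the numbers against `df`, here is a minimal Python sketch that computes the percentage used per mountpoint from scraped metrics text. It assumes the exporter follows the upstream node_exporter naming (`node_filesystem_size_bytes`, `node_filesystem_avail_bytes`) and carries a `mountpoint` label; this thread only says "node_filesystem_X", so adjust the names to whatever `/metrics` actually shows. The sample values are made up.

```python
import re

# Assumed metric names (upstream node_exporter convention); verify against /metrics.
SIZE = "node_filesystem_size_bytes"
AVAIL = "node_filesystem_avail_bytes"

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if m:
            yield m.group(1), dict(LABEL_RE.findall(m.group(2))), float(m.group(3))


def percent_used(text):
    """Return {mountpoint: percent used}, computed as 100 * (1 - avail / size)."""
    size, avail = {}, {}
    for name, labels, value in parse_samples(text):
        mp = labels.get("mountpoint")
        if mp is None:
            continue
        if name == SIZE:
            size[mp] = value
        elif name == AVAIL:
            avail[mp] = value
    return {mp: 100.0 * (1 - avail[mp] / size[mp])
            for mp in size if mp in avail and size[mp] > 0}


# Made-up sample, not output from a real LPAR.
sample = '''
node_filesystem_size_bytes{device="/dev/hd4",mountpoint="/"} 1000
node_filesystem_avail_bytes{device="/dev/hd4",mountpoint="/"} 250
'''
print(percent_used(sample))  # {'/': 75.0}
```

Note that `df` may round or account for reserved space differently, so small differences from this calculation are to be expected.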

@fredtriplefred

Thanks Thorhs, it works perfectly 👍

I have integrated it into Grafana and our alerting system.
Concerning the other metrics (CPU, disk, memory, partition, ...): have you already created specific queries that summarize activity for Prometheus or Grafana? There are a lot of metrics, and I'm not enough of a specialist to pick out the most important ones.

Another request, if possible.
Do you have a build compiled for AIX 6? On our systems running that version I get this message:
exec(): 0509-036 Cannot load program ./node_exporter_aix because of the following errors:
0509-130 Symbol resolution failed for node_exporter_aix because:
0509-136 Symbol ___strcmp (number 1) is not exported from
dependent module /usr/lib/libc.a(shr.o).
0509-136 Symbol __get_lc_charmap_ptr (number 20) is not exported from
dependent module /usr/lib/libc.a(shr.o).
0509-136 Symbol cur_locale (number 113) is not exported from
dependent module /usr/lib/libc.a(shr.o).
0509-192 Examine .loader section symbols with the
'dump -Tv' command.

Thanks and regards
Frederic

@thorhs
Owner Author

thorhs commented Dec 2, 2020

@fredtriplefred No, sorry, I don't have access to anything below 7.1. If you install the packages listed in the readme you should be able to build it yourself. It would actually be an interesting experiment to see if there are any issues.

I have some graphs I could share soon. Take the info with a grain of salt, I may have misinterpreted some of the metrics. The cpu stats are especially iffy. But, they do reflect the trends :)

@thorhs
Owner Author

thorhs commented Dec 2, 2020

@fredtriplefred I uploaded the dashboard that I use most frequently. It just went through some changes, so I hope all the calculations match up. Give it a whirl and create issues if you find any problems. This was exported from Grafana v6.5.1.

@fredtriplefred

Hi Thorhs!
Sorry for the delayed feedback, we had an outage on our Exalogic systems...
The dashboard seems to reflect reality correctly compared with another monitoring tool I have (lpar2rrd).
There is just one issue, on an AIX 7.1 host running the exporter:

http://saveprod:9100/metrics DOWN instance="saveprod:9100" job="node_exporter_aix" 50.206s ago 20.28ms invalid UTF-8 label value
This is probably not related to the exporter itself, but if you have any idea ;)
Thanks for your work in any case.
Regards
Frederic

@fredtriplefred

It seems the serial number for this AIX host is badly formatted and is the source of the error:

# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1{machine_serial="/?? ^B#0 ^DZp",lpar="saveprod",group_id="32773"} 2.99446
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5{machine_serial="/?? ^B#0 ^DZp",lpar="saveprod",group_id="32773"} 4.5938
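
To narrow down which samples trigger the "invalid UTF-8 label value" scrape error, a small Python sketch like the following can list the lines of a raw `/metrics` response that do not decode as UTF-8. It works on bytes that were already fetched; the sample bytes below are made up to resemble the broken label and are not real output.

```python
def invalid_utf8_lines(body: bytes):
    """Return (line_number, raw_line) for every line that is not valid UTF-8."""
    bad = []
    for number, raw in enumerate(body.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, raw))
    return bad


# Made-up bytes: the second line carries 0xB7 0xE8, which is not valid UTF-8 here.
body = (
    b'node_load1{machine_serial="21475DV",lpar="GAMAY"} 0.5\n'
    b'node_load1{machine_serial="/\xb7\xe8 \x02#0",lpar="saveprod"} 2.99446\n'
)
for number, raw in invalid_utf8_lines(body):
    print(number, raw)
```

The bytes could come from saving the `/metrics` response to a file and reading it back in binary mode.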

Which command is used to extract it?

Regards
Frederic

@thorhs
Owner Author

thorhs commented Dec 4, 2020

This is coming from the libperfstat library. I've had issues when the system tools are not the ones being used, for example if /opt/freeware/bin is ahead of /usr/bin. See if you can run with a bog-standard PATH and LIBPATH.
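
As a quick way to check the ordering described above, here is a small Python sketch that reports whether /opt/freeware/bin is searched before /usr/bin in a PATH string. The function names are hypothetical and it only inspects PATH; LIBPATH would need the same kind of check.

```python
import os


def first_index(path_value, directory):
    """Index of `directory` in a colon-separated PATH string, or None if absent."""
    entries = [e.rstrip("/") for e in path_value.split(":")]
    target = directory.rstrip("/")
    return entries.index(target) if target in entries else None


def freeware_shadows_system(path_value):
    """True when /opt/freeware/bin is searched before /usr/bin."""
    freeware = first_index(path_value, "/opt/freeware/bin")
    system = first_index(path_value, "/usr/bin")
    if freeware is None:
        return False
    return system is None or freeware < system


# Check the environment of the current process.
print(freeware_shadows_system(os.environ.get("PATH", "")))
```

Running it from the same context as the service (rather than an interactive shell) is what matters, since the two environments may differ.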

@fredtriplefred

Yes, it is surely related to the environment, but it works (with just one error on diskadapter) when run manually rather than as a service:
saveprod.root / => /usr/local/bin/node_exporter_aix
Node exporter for AIX version 1.8.0.0 listening on port 9100
Error calling perfstat_diskadapter: Invalid argument

So maybe the problem is in the service's environment instead?
In the meantime, I am using nohup instead of the service.

Have a good weekend
Frederic

@thorhs
Owner Author

thorhs commented Dec 6, 2020

@fredtriplefred Could you give v1.10.0 a go? I'm trying to set PATH and LIBPATH to some sane values on startup to see if that helps. It works correctly if I try to start it up using a PATH string that had issues previously, so hopefully it just works now.

@fredtriplefred

Hello Thorhs
Sorry, I still get these bad characters in the machine_serial label, and only for this LPAR:
machine_serial="/ท่ �$ภ �Zp"
Regards
Frederic

@thorhs
Owner Author

thorhs commented Dec 7, 2020

@fredtriplefred Hmmm... that is odd.

Unfortunately I don't have any control over libperfstat or how it finds the machine serial number. From running a trace on the process, it seems that the libperfstat library executes this command to get the machine serial number:

lscfg -vpl sysplanar0 2>/dev/null|grep -p "System VPD:" |grep "Machine/Cabinet"

I have set the PATH to system-only directories and emptied the LIBPATH, so there should be no outside influence. What I find most peculiar is that the command works on the command line, but fails under SRC.

If you run the above command, what is the output? On my end, I get:

[REIKNISTOFA\rb747@rba-nim-dev node_exporter_aix]$ lscfg -vpl sysplanar0 2>/dev/null|/usr/bin/grep -p "System VPD:" |grep "Machine/Cabinet"
        Machine/Cabinet Serial No...XXXXXXXX
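
For comparing outputs between hosts, the serial can be pulled out of that line with a short Python sketch like this one. It only parses text in the format shown above; how libperfstat itself parses the output is not known from this thread, and `parse_serial` is a hypothetical helper.

```python
import re

# Matches a line such as "Machine/Cabinet Serial No...XXXXXXXX".
SERIAL_RE = re.compile(r"Machine/Cabinet Serial No\.+(\S+)")


def parse_serial(lscfg_output):
    """Return the serial from lscfg output, or None if the line is missing."""
    m = SERIAL_RE.search(lscfg_output)
    return m.group(1) if m else None


print(parse_serial("        Machine/Cabinet Serial No...XXXXXXXX"))  # XXXXXXXX
```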

What version of AIX is this LPAR running?

I could add a flag to manually set the machine_serial, if that would be an acceptable solution, or even read it from a file in /etc/sysconfig.
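
A rough Python sketch of what that fallback order could look like: an explicit override first, then a file, then the detected value if it looks plausible. Everything here is hypothetical (the function, the plausibility rule, the "unknown" placeholder, the file name), and it is not how the exporter is implemented.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a plausible serial is short and made of ASCII letters and digits.
PLAUSIBLE = re.compile(r"[A-Za-z0-9]{1,32}")


def choose_serial(detected, override=None, override_file=None):
    """Pick the machine serial: explicit override, then file, then detected value."""
    if override:
        return override
    if override_file is not None and Path(override_file).is_file():
        text = Path(override_file).read_text().strip()
        if text:
            return text
    if detected and PLAUSIBLE.fullmatch(detected):
        return detected
    return "unknown"  # hypothetical placeholder so the label stays valid


print(choose_serial("/\ufffd\ufffd \x02#0 \x04Zp"))  # unknown
print(choose_serial("21475DV"))                      # 21475DV
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "machine_serial"  # hypothetical file name
    path.write_text("21475DV\n")
    print(choose_serial("/\ufffd\ufffd", override_file=path))  # 21475DV
```

Falling back to a fixed placeholder keeps the scrape valid even when detection returns garbage, at the cost of losing the real serial for that LPAR.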

@fredtriplefred

Yes, same result as you, it works on the command line:
saveprod.root / => lscfg -vpl sysplanar0 2>/dev/null|grep -p "System VPD:" |grep "Machine/Cabinet"
Machine/Cabinet Serial No...785A0A0
In the end it is not really a problem, since it works when run in the background with nohup, and I only reboot the partition when it is mandatory (last time for the 6 => 7 upgrade):
saveprod.root / => uptime
03:18PM up 453 days, 3:58, 5 users, load average: 1.63, 3.39, 4.10

What does cpu_pool_id refer to?
Is it the ID of the Shared Processor Pool used by the LPAR?

Regards
Frederic

@thorhs
Owner Author

thorhs commented Dec 8, 2020

Ok. The cpupool_id is what is returned from perfstat_partition_config. I have not fully investigated it, but this should be the shared processor pool the LPAR is in. If you are not using shared processor pools, this is probably 0 for all LPARs. I'm hoping I can use this to graph the total CPU used per pool, as well as the free capacity.
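
To illustrate the per-pool idea, here is a minimal Python sketch that sums per-LPAR CPU usage by pool id and derives the free capacity. The input tuples, the pool sizes and the third LPAR name are made up, and which exporter metrics would supply these numbers is left open, since that has not been worked out in this thread.

```python
from collections import defaultdict


def cpu_used_per_pool(samples):
    """Sum per-LPAR CPU usage by pool id.

    `samples` is an iterable of (lpar, cpupool_id, cores_used) tuples.
    """
    totals = defaultdict(float)
    for _lpar, pool_id, cores_used in samples:
        totals[pool_id] += cores_used
    return dict(totals)


def free_capacity(totals, pool_sizes):
    """Free cores per pool, given each pool's size in cores."""
    return {pool: pool_sizes[pool] - used
            for pool, used in totals.items() if pool in pool_sizes}


# Made-up numbers, for illustration only.
samples = [("GAMAY", "0", 1.5), ("saveprod", "0", 2.0), ("other", "1", 0.5)]
totals = cpu_used_per_pool(samples)
print(totals)                                        # {'0': 3.5, '1': 0.5}
print(free_capacity(totals, {"0": 8.0, "1": 4.0}))   # {'0': 4.5, '1': 3.5}
```

In Prometheus itself the same grouping would normally be done with a `sum by (cpupool_id)` aggregation over whichever usage metric applies.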
