Ganglia not enabled with extra_json in config #1322

Closed
cchng opened this issue Sep 18, 2019 · 12 comments · Fixed by #1323 or aws/aws-parallelcluster-cookbook#402
Labels
bug update Update related issue

Comments

@cchng

cchng commented Sep 18, 2019

Environment:

  • AWS ParallelCluster 2.4.1
  • OS: ubuntu 1604
  • Scheduler: sge
  • Master instance type: m5.large
  • Compute instance type: c5.large

Bug description and how to reproduce:
Added extra_json = { "cluster" : { "ganglia_enabled" : "yes" } } under [cluster] section
https://docs.aws.amazon.com/parallelcluster/latest/ug/moving-from-cfncluster-to-aws-parallelcluster.html

The private/public URLs did not show in the output of the status command:

pcluster status ...
Status: UPDATE_COMPLETE
MasterServer: RUNNING
MasterPublicIP: xx.xx.xxx.xxx
ClusterUser: ubuntu
MasterPrivateIP: xx.xx.xxx.xxx

Stack parameter:
ExtraJson | {"cfncluster": {"ganglia_enabled": "yes"}}

Public/Private URL to access Ganglia (disabled by default)
From a browser, the connection times out with the public URL http:///ganglia; the private URL is not accessible.

Additional context:
Master node up and running
Looking into additional instance metrics as part of troubleshooting #1275
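
The config value and the ExtraJson stack parameter above use different top-level keys ("cluster" and "cfncluster"). A minimal Python sketch for checking that the flag itself is present in both; `ganglia_flag` is a hypothetical helper, not part of ParallelCluster, and the two strings are copied from this report:

```python
import json


def ganglia_flag(extra_json_text):
    """Return (top_level_key, value) for the first ganglia_enabled setting, or None."""
    data = json.loads(extra_json_text)
    for key, section in data.items():
        if isinstance(section, dict) and "ganglia_enabled" in section:
            return key, section["ganglia_enabled"]
    return None


# Values as they appear in the config file and in the stack parameters.
config_value = '{ "cluster" : { "ganglia_enabled" : "yes" } }'
stack_value = '{"cfncluster": {"ganglia_enabled": "yes"}}'

print(ganglia_flag(config_value))  # ('cluster', 'yes')
print(ganglia_flag(stack_value))   # ('cfncluster', 'yes')
```

This only confirms that the setting is valid JSON and is set to "yes" under some top-level key; it says nothing about which key the cookbook actually reads.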

@sean-smith
Contributor

sean-smith commented Sep 18, 2019

Did you open up the security group parallelcluster-<CLUSTER_NAME>-MasterSecurityGroup-<xxx> to port 80?

The output should definitely be shown. Let me try and reproduce. Can you share your config?
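
For reference, port 80 can be opened on that security group from the AWS CLI. The sketch below only builds the command as an argument list and does not run it; the group ID and CIDR are placeholders, `open_port_80_command` is a hypothetical helper, and limiting the CIDR to a single address is an assumption about what you would want, not a requirement:

```python
import ipaddress


def open_port_80_command(group_id, cidr):
    """Build (but do not run) the AWS CLI call that allows inbound TCP 80."""
    # Raises ValueError if cidr is not a valid network such as "203.0.113.7/32".
    network = ipaddress.ip_network(cidr, strict=True)
    return [
        "aws", "ec2", "authorize-security-group-ingress",
        "--group-id", group_id,
        "--protocol", "tcp",
        "--port", "80",
        "--cidr", str(network),
    ]


# Placeholder group ID and documentation-range address, for illustration only.
cmd = open_port_80_command("sg-0123456789abcdef0", "203.0.113.7/32")
print(" ".join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it, assuming the AWS CLI is installed and configured with credentials for the cluster's account.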

@sean-smith
Contributor

Can you take a look at the CloudFormation console > Outputs, then try the Ganglia URL shown there? It looks like:

image
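
The same Outputs can also be read without the console, for example by saving the result of `aws cloudformation describe-stacks --stack-name <stack> --query "Stacks[0].Outputs"` and filtering it. A rough sketch: the sample JSON is made up for illustration (the keys and addresses are not taken from a real stack), and `ganglia_outputs` is a hypothetical helper:

```python
import json


def ganglia_outputs(outputs_json):
    """Return {OutputKey: OutputValue} for outputs whose key mentions ganglia."""
    outputs = json.loads(outputs_json)
    return {
        item["OutputKey"]: item["OutputValue"]
        for item in outputs
        if "ganglia" in item["OutputKey"].lower()
    }


# Illustrative data only; a real stack will have different keys and values.
sample = json.dumps([
    {"OutputKey": "MasterPublicIP", "OutputValue": "198.51.100.10"},
    {"OutputKey": "ExampleGangliaURL", "OutputValue": "http://198.51.100.10/ganglia/"},
])
print(ganglia_outputs(sample))
```

An empty result would suggest that the stack does not expose any Ganglia output at all, which is a different problem from a URL that exists but does not respond.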

sean-smith added a commit to sean-smith/aws-parallelcluster that referenced this issue Sep 18, 2019
* Fixes aws#1322

Signed-off-by: Sean Smith <[email protected]>
@cchng
Author

cchng commented Sep 18, 2019

Hi @sean-smith,

Here's the config

[aws]
aws_region_name = us-west-2

[cluster default]
key_name = <key>
vpc_settings = <default>
compute_instance_type = c5.large
master_instance_type = c5.xlarge
initial_queue_size = 1
max_queue_size = 20
maintain_initial_size = true
scheduler = sge
cluster_type = ondemand
placement = compute
placement_group = DYNAMIC
master_root_volume_size = 20
base_os = ubuntu1604
ebs_settings = custom
extra_json = { "cluster" : { "ganglia_enabled" : "yes" } }
tags = {...}


[ebs custom]
volume_type = io1
volume_iops = 1000
volume_size = 250
encrypted = true

[vpc <default>]
vpc_id = <...>
master_subnet_id = <...>

[global]
cluster_template = default
update_check = true
sanity_check = false

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

Yes, I tried the URLs shown in the console outputs.
With the security groups opened up, I got:

Not Found
The requested URL /ganglia/ was not found on this server.

Apache/2.4.18 (Ubuntu) Server at <ip> Port 80
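
The two failures seen so far are different: a timeout usually means the request never reached the web server, while the 404 above means Apache answered but did not serve /ganglia/. A small sketch that tells these cases apart using only the Python standard library; `probe` is a hypothetical helper written for this thread:

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Fetch url and return a short description of the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status such as 404.
        return f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError, refused connections, and timeouts.
        return f"unreachable: {err}"
```

Calling `probe("http://<master-ip>/ganglia/")` with the master's address filled in should return an "HTTP ..." string once the security group allows port 80, and an "unreachable: ..." string otherwise.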

@sean-smith
Contributor

@cchng I solved the issue with the outputs in #1323

Seems like the ganglia service isn't starting. Can you check /var/log/cfn-init.log and grep for "ganglia"?

I was able to create a cluster, open the security group and see ganglia.

@cchng
Author

cchng commented Sep 18, 2019

@sean-smith thanks for #1323
Here's the log

$ grep -i ganglia /var/log/cfn-init.log
Recipe: aws-parallelcluster::_ganglia_install
  * apt_package[ganglia-monitor] action install
    - install version 3.6.0-6ubuntu4 of package ganglia-monitor
  * apt_package[ganglia-webfrontend] action install
    - install version 3.6.1-1ubuntu1.1 of package ganglia-webfrontend
  * execute[copy ganglia apache conf] action run
    - execute cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
  * template[/etc/ganglia/gmetad.conf] action create
    - update content in file /etc/ganglia/gmetad.conf from b8f766 to 3fcb7b
    --- /etc/ganglia/gmetad.conf        2016-02-10 16:11:16.000000000 +0000
    +++ /etc/ganglia/.chef-gmetad20190918-2409-18c7oim.conf     2019-09-18 21:26:13.665899529 +0000
  * template[/etc/ganglia/gmond.conf] action create
    - update content in file /etc/ganglia/gmond.conf from 556740 to a99f7b
    --- /etc/ganglia/gmond.conf 2016-02-10 16:11:16.000000000 +0000
    +++ /etc/ganglia/.chef-gmond20190918-2409-14mdps8.conf      2019-09-18 21:26:13.677899535 +0000
    -  user = ganglia
    -    path = "/usr/lib/ganglia/modcpu.so"
    -    path = "/usr/lib/ganglia/moddisk.so"
    -    path = "/usr/lib/ganglia/modload.so"
    -    path = "/usr/lib/ganglia/modmem.so"
    -    path = "/usr/lib/ganglia/modnet.so"
    -    path = "/usr/lib/ganglia/modproc.so"
    -    path = "/usr/lib/ganglia/modsys.so"
    -include ('/etc/ganglia/conf.d/*.conf')
    +    path = "/usr/lib/ganglia/modcpu.so"
    +    path = "/usr/lib/ganglia/moddisk.so"
    +    path = "/usr/lib/ganglia/modload.so"
    +    path = "/usr/lib/ganglia/modmem.so"
    +    path = "/usr/lib/ganglia/modnet.so"
    +    path = "/usr/lib/ganglia/modproc.so"
    +    path = "/usr/lib/ganglia/modsys.so"
    +include ("/etc/ganglia/conf.d/*.conf")
  * service[ganglia-monitor] action enable (up to date)
  * service[ganglia-monitor] action restart
    - restart service service[ganglia-monitor]
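
To get a quick overview of which Chef resources ran, the grep output above can be reduced to (resource type, name, action) triples. A minimal sketch; `chef_actions` is a hypothetical helper, and the regular expression assumes only the `* type[name] action verb` line shape visible in this log:

```python
import re

# Matches lines such as "  * service[ganglia-monitor] action restart".
ACTION_LINE = re.compile(r"^\s*\*\s+(\w+)\[(.+?)\]\s+action\s+(\w+)")


def chef_actions(log_text):
    """Extract (resource_type, resource_name, action) tuples from log lines."""
    found = []
    for line in log_text.splitlines():
        match = ACTION_LINE.match(line)
        if match:
            found.append(match.groups())
    return found


# A few lines copied from the log above.
sample = """\
  * apt_package[ganglia-webfrontend] action install
    - install version 3.6.1-1ubuntu1.1 of package ganglia-webfrontend
  * execute[copy ganglia apache conf] action run
  * service[ganglia-monitor] action enable (up to date)
  * service[ganglia-monitor] action restart
"""
for entry in chef_actions(sample):
    print(entry)
```

Because the input here is already filtered by `grep -i ganglia`, resources whose names do not contain "ganglia" will not appear; running the helper on the full log would give a more complete picture.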

@sean-smith
Contributor

Can you verify that the ganglia service is running, for example:

$ service gmond status
gmond (pid 16401) is running...

@cchng
Author

cchng commented Sep 19, 2019

$ service gmond status
● gmond.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

$ service ganglia-monitor status
● ganglia-monitor.service
   Loaded: loaded (/etc/init.d/ganglia-monitor; bad; vendor preset: enabled)
   Active: active (running) since Thu 2019-09-19 05:47:39 UTC; 14min ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 2
   Memory: 1.1M
      CPU: 153ms
   CGroup: /system.slice/ganglia-monitor.service
           └─20378 /usr/sbin/gmond --pid-file /var/run/gmond.pid

$ service gmetad status
● gmetad.service
   Loaded: loaded (/etc/init.d/gmetad; bad; vendor preset: enabled)
   Active: active (running) since Thu 2019-09-19 05:47:38 UTC; 18min ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 9
   Memory: 155.5M
      CPU: 433ms
   CGroup: /system.slice/gmetad.service
           └─20325 /usr/sbin/gmetad --pid-file /var/run/gmetad.pid

Sep 19 05:47:38 ip-172-31-47-83 systemd[1]: Starting gmetad.service...
Sep 19 05:47:38 ip-172-31-47-83 gmetad[20321]: Starting Ganglia Monitor Meta-Daemon: gmetad.
Sep 19 05:47:38 ip-172-31-47-83 systemd[1]: Started gmetad.service.

Also, I don't know if it is relevant, but I'm creating the cluster from Windows 10.
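
Since the unit name differs on this OS (`ganglia-monitor` rather than `gmond`), it can help to check each candidate name and compare the Loaded and Active fields. A sketch that parses status text like the output above; `parse_status` is a hypothetical helper and relies only on the two fields shown in that output:

```python
import re


def parse_status(text):
    """Pull the Loaded and Active states out of `service <name> status` output."""
    loaded = re.search(r"^\s*Loaded:\s+(\S+)", text, re.MULTILINE)
    active = re.search(r"^\s*Active:\s+(\S+)(?:\s+\((\w+)\))?", text, re.MULTILINE)
    return {
        "loaded": loaded.group(1) if loaded else None,
        "active": active.group(1) if active else None,
        "sub_state": active.group(2) if active else None,
    }


# Status text copied from the output above.
gmond_status = """\
● gmond.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)
"""
monitor_status = """\
● ganglia-monitor.service
   Loaded: loaded (/etc/init.d/ganglia-monitor; bad; vendor preset: enabled)
   Active: active (running) since Thu 2019-09-19 05:47:39 UTC; 14min ago
"""
print(parse_status(gmond_status))
print(parse_status(monitor_status))
```

With both gmond (as `ganglia-monitor`) and gmetad reported as running, the daemons themselves look healthy, which fits the 404 coming from the web server side.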

@sean-smith
Contributor

Windows 10 doesn't make a difference. I've tested on ubuntu1604 and gotten the same result.

image

Marking this as a bug.

@sean-smith sean-smith added the bug label Sep 19, 2019
@sean-smith
Contributor

I can confirm it's working on Amazon Linux. As a temporary workaround while we work on a fix, you could try that OS.

@cchng
Author

cchng commented Sep 19, 2019

Okay I'll give it a shot with alinux @sean-smith

@cchng
Author

cchng commented Sep 19, 2019

Confirming Ganglia is up and running with alinux.

demartinofra pushed a commit that referenced this issue Sep 19, 2019
* Fixes #1322

Signed-off-by: Sean Smith <[email protected]>
@sean-smith sean-smith reopened this Sep 20, 2019
@sean-smith
Contributor

Issue was closed automatically by #1323 but is not resolved.

lukeseawalker added a commit to lukeseawalker/aws-parallelcluster-cookbook that referenced this issue Oct 2, 2019
lukeseawalker added a commit to lukeseawalker/aws-parallelcluster-cookbook that referenced this issue Oct 2, 2019
lukeseawalker added a commit to lukeseawalker/aws-parallelcluster-cookbook that referenced this issue Oct 2, 2019
lukeseawalker added a commit to aws/aws-parallelcluster-cookbook that referenced this issue Oct 3, 2019
@enrico-usai enrico-usai added the update Update related issue label Dec 10, 2019