Inquiry about Maximum Node Scale of QMaster in SGE #33
Comments
Follow the steps below to change the limit of open files to 1234567. Please keep us updated 😄
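A minimal sketch of the systemd route (the unit name sgemaster.service and the sge_qmaster process name are assumptions; substitute whatever your installation actually registers):

```bash
# Add a drop-in override that raises the open-file limit for qmaster.
sudo systemctl edit sgemaster.service
# In the editor that opens, add:
#   [Service]
#   LimitNOFILE=1234567
sudo systemctl daemon-reload
sudo systemctl restart sgemaster.service

# Verify the running process actually received the new limit:
grep 'Max open files' /proc/$(pgrep -x sge_qmaster)/limits
```

If qmaster is started outside systemd, the equivalent is raising the nofile entry in /etc/security/limits.conf (or adding an ulimit -n in the startup script) before launching it.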
Thanks for the reply. I have set the nofile limits as mentioned before. However, SGE QMaster keeps using only 978 fds. This strange phenomenon is also recorded in aws/aws-parallelcluster#1592. I did some investigation, and it turns out that in source/libs/comm/cl_commlib.c#L1276 the number of fds SGE will use is capped at FD_SETSIZE. I don't see the necessity for this restriction, since SGE uses epoll on Linux instead of select. In my tests, modifying this code made SGE use more fds. Hopefully my feedback will help you locate the issue.
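Roughly, the code in question behaves like this (a paraphrase, not the exact upstream lines; `max_open_connections` follows the identifier quoted later in this thread):

```c
/* Paraphrased sketch, not the literal cl_commlib.c source: even when the
 * process rlimit allows far more open files, the usable connection count
 * is cut back to FD_SETSIZE (typically 1024). That ceiling only matters
 * for select(); an epoll-based event loop has no such limit. */
if (max_open_connections > FD_SETSIZE) {
    max_open_connections = FD_SETSIZE;
}
```

Relaxing that clamp when the epoll path is active is what let qmaster open more connections in my tests.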
Thanks a lot for reporting this, and especially for the links; it was an interesting read. I will keep them in mind if we do need to modify the source code. That being said, on the weekly-test virtual machine I confirmed that a systemd unit file modification works. Here is the session I copied and pasted.
FWIW, the test VM is from
I agree, and I confirm that modifying the systemd unit file does make the Linux kernel grant more fds to SGE.
However, this doesn't mean SGE will actually use those additional fds. You can check /opt/sge/default/spool/qmaster/message after launching SGE with the higher fd limit set; you will probably notice a line complaining that SGE is only using ~970 fds. And this is the real problem: /proc shows plenty of available fds, but SGE just won't use them.
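Concretely, the comparison looks like this (paths and the sge_qmaster process name are from a default install; adjust them for yours):

```bash
QMASTER_PID=$(pgrep -x sge_qmaster)

# Limit the kernel actually grants to the process:
grep 'Max open files' /proc/${QMASTER_PID}/limits

# Descriptors qmaster has opened so far:
ls /proc/${QMASTER_PID}/fd | wc -l

# What qmaster logged at startup about usable fds (look for a line
# reporting roughly 970 file descriptors):
grep -i 'descriptor' /opt/sge/default/spool/qmaster/message | tail
```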
As I mentioned before, it's the code (possibly a bug) in QMaster's startup path that makes SGE decide to use only ~970 fds, and that is what leads to communication failures on large clusters. I hope that explains it clearly. (Sorry for my poor English, and thanks for your patience.)
Got it! You made it crystal clear. The root cause is that the SGE code used to rely on 'select', which is limited to 1024 fds, so line 1274 of "libs/comm/cl_commlib.c" reduces "max_open_connections" to FD_SETSIZE no matter how high we set the FD limit. POSIX provides "poll" for large numbers of connections, and I am even tempted to use "epoll" for better performance on Linux; after all, Linux is dominant nowadays. As this change is fundamental, I hope we can thoroughly test the "poll or epoll" feature before releasing it. Are you willing to collaborate on this potential improvement and test it on your 2000+ node cluster?
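To illustrate what I mean by the select ceiling versus epoll, here is a standalone sketch (not SGE code):

```c
/* Standalone illustration, not SGE code: select() cannot handle fds >=
 * FD_SETSIZE (usually 1024), while epoll is bounded only by the process
 * RLIMIT_NOFILE. */
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <unistd.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("select() ceiling (FD_SETSIZE): %d\n", FD_SETSIZE);
    printf("process fd limit: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }
    /* Register stdin just to show the API shape; a real server would add
     * the listening socket plus every accepted connection here, with no
     * FD_SETSIZE cap on how many can be watched. */
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) != 0) {
        perror("epoll_ctl");
        close(epfd);
        return 1;
    }
    struct epoll_event ready[16];
    int n = epoll_wait(epfd, ready, 16, 0 /* poll once, do not block */);
    printf("epoll_wait reported %d ready fd(s)\n", n);
    close(epfd);
    return 0;
}
```

Either way, poll() would at least remove the FD_SETSIZE ceiling on platforms where epoll is not available.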
I appreciate that transitioning to poll/epoll is a major update that necessitates extensive testing, and I'm open to collaborating on the testing. However, I'd like to clarify that the 2000+ node cluster is not a physical one. It is realized through a specialized testing framework designed for HPC resource management software (e.g., Slurm/SGE/OpenPBS). The framework can virtualize up to 1000 compute nodes on a single physical server (64 cores, 256 GB RAM). It has not been made public yet, as we are writing a paper that includes details about it, and we are working to open source it in the next few months. Once the feature development makes progress, I would be glad to offer the framework for testing.
Hello SGE maintainers,
I am reaching out to discuss an issue I've encountered regarding the scalability of SGE, particularly when expanding the number of nodes in a cluster. My experience has been smooth with clusters up to 1000 nodes. However, when I attempted to scale up to 2000 nodes, I faced significant performance degradation. The system became extremely sluggish, to the point where commands like qhost would hang indefinitely.
In my efforts to troubleshoot this issue, I came across a discussion that pointed to the nofile limit for qmaster, suggesting that it restricts the number of sockets QMaster can utilize for communication. The thread can be found here: Grid Engine Qmaster Hard Descriptor Limit and Soft Descriptor Limit Discussion.
Following the recommendations, I adjusted the nofile limit and the MAX_DYN_EC parameter, hoping to resolve the communication bottleneck. Unfortunately, it appears that QMaster is still limited to using only 978 sockets. This limitation was also highlighted in #1592.
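For reference, the adjustments were made roughly as follows (a sketch only; verify the exact qmaster_params syntax against the sge_conf(5) man page for your SGE version):

```bash
# Show the current global configuration and its qmaster_params line:
qconf -sconf | grep qmaster_params

# Edit the global configuration (opens $EDITOR) and set, for example:
#   qmaster_params   MAX_DYN_EC=2000
qconf -mconf

# Raise the nofile limit for the qmaster process (systemd drop-in or
# limits.conf), then restart qmaster so both changes take effect.
```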
I would appreciate any insights or guidance you can provide on the following:
Thank you for your time and support.