
Inquiry about Maximum Node Scale of QMaster in SGE #33

Open

Nativu5 opened this issue Mar 20, 2024 · 6 comments

@Nativu5 commented Mar 20, 2024

Hello SGE maintainers,

I am reaching out to discuss an issue I've encountered regarding the scalability of SGE, particularly when expanding the number of nodes in a cluster. My experience has been smooth with clusters up to 1000 nodes. However, when I attempted to scale up to 2000 nodes, I faced significant performance degradation. The system became extremely sluggish, to the point where commands like qhost would hang indefinitely.

In my efforts to troubleshoot this issue, I came across a discussion that pointed to the nofile limit for qmaster, suggesting that it restricts the number of sockets QMaster can utilize for communication. The thread can be found here: Grid Engine Qmaster Hard Descriptor Limit and Soft Descriptor Limit Discussion.

Following the recommendations, I adjusted the nofile limit and the MAX_DYN_EC parameter, hoping to resolve the communication bottleneck. Unfortunately, it appears that QMaster is still limited to using only 978 sockets. This limitation was also highlighted in #1592.

I would appreciate any insights or guidance you can provide on the following:

  • What is the maximum node scale that SGE is expected/tested to support efficiently?
  • Is there a recommended configuration or workaround to support larger clusters?

Thank you for your time and support.

@daimh (Owner) commented Mar 23, 2024

Follow the steps below to change the limit of open files to 1234567. Please keep us updated 😄

  1. Add a LimitNOFILE line under [Service] in /etc/systemd/system/sgemaster.service, so that grep -B 1 1234 /etc/systemd/system/sgemaster.service prints:
    [Service]
    LimitNOFILE=1234567

  2. systemctl daemon-reload

  3. systemctl restart sgemaster

  4. grep open /proc/$(pgrep sge_qmaster)/limits
    Max open files 1234567 1234567 files
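
As an aside, the same limit can be checked programmatically from inside a process with getrlimit(2), which reads the values that step 4 greps out of /proc. A minimal standalone C sketch, not part of SGE:

/* rlimit_check.c - print this process's open-file limits,
 * the programmatic equivalent of step 4 above. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("Max open files: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);
    return 0;
}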

@Nativu5 (Author) commented Mar 25, 2024

Thanks for the reply. I had already set the nofile limits as mentioned before. However, SGE QMaster still uses only 978 fds. This strange phenomenon is also recorded in aws/aws-parallelcluster#1592.

I have done some investigation, and it turns out that in source/libs/comm/cl_commlib.c#L1276 the number of fds SGE will use is capped at FD_SETSIZE. I don't see the necessity for this restriction, as SGE uses epoll on Linux instead of select.

In my tests, SGE used more fds once I modified this code. Hopefully my feedback will help you locate the issue.
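
To illustrate the effect, here is a hypothetical, minimal sketch of that kind of clamp (the real code in cl_commlib.c is structured differently):

/* fd_clamp.c - hypothetical sketch of a FD_SETSIZE clamp like the
 * one described above; the actual cl_commlib.c code differs. */
#include <stdio.h>
#include <sys/select.h>   /* FD_SETSIZE, typically 1024 on glibc */

static unsigned long clamp_connections(unsigned long requested)
{
    /* select() cannot handle fds >= FD_SETSIZE, so a cap like this
     * is mandatory for select() but unnecessary for poll()/epoll(). */
    if (requested > FD_SETSIZE)
        requested = FD_SETSIZE;
    return requested;
}

int main(void)
{
    /* Even with nofile raised to 1234567, the usable count stays at 1024. */
    printf("%lu\n", clamp_connections(1234567UL));
    return 0;
}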

@daimh (Owner) commented Mar 30, 2024

Thanks a lot for reporting this, and especially for the links; they were an interesting read. I will keep them in mind if we do need to modify the source code.

That being said, on the weekly test virtual machine I confirmed that a systemd unit file modification works. Here is the session, copied and pasted.

###Before
[root@master ~]# grep open /proc/$(pgrep sge_qmaster)/limits
Max open files            1024                 524288               files
[root@master ~]# grep -A 1 Service /etc/systemd/system/sgemaster.service
[Service]
Type=forking

###Modification
[root@master ~]# sed -i "s/\[Service]/[Service]\nLimitNOFILE=1234567/" /etc/systemd/system/sgemaster.service

###After
[root@master ~]# grep -A 2 Service /etc/systemd/system/sgemaster.service
[Service]
LimitNOFILE=1234567
Type=forking
[root@master ~]# systemctl daemon-reload
[root@master ~]# systemctl restart sgemaster
[root@master ~]# grep open /proc/$(pgrep sge_qmaster)/limits
Max open files            1234567              1234567              files

FWIW, the test VM is from
https://github.com/daimh/sge/tree/master/tests/5-keystrokes-to-setup-a-cluster-without-root-privilege

@Nativu5 (Author) commented Mar 30, 2024 via email

@daimh (Owner) commented Apr 1, 2024

Got it! You made it crystal clear!

The root cause is that the SGE code used to use 'select', which has a limit of 1024 file descriptors. So line 1274 of "libs/comm/cl_commlib.c" reduces "max_open_connections" to FD_SETSIZE no matter how high we set the fd limit. POSIX later introduced "poll" for handling a large number of connections, and I am even tempted to use "epoll" for better performance on Linux. After all, Linux is dominant nowadays.
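
For comparison, a minimal sketch of an epoll-based readiness loop on Linux; the function and file names here are illustrative, not the actual commlib code:

/* epoll_sketch.c - illustrative epoll readiness loop. Unlike select(),
 * epoll has no FD_SETSIZE ceiling on the fd numbers it can watch. */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to one second for any of nfds descriptors to become readable. */
int wait_readable(const int *fds, int nfds)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return -1;
    }

    for (int i = 0; i < nfds; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            perror("epoll_ctl");
            close(epfd);
            return -1;
        }
    }

    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, 1000);
    for (int i = 0; i < n; i++)
        printf("fd %d is readable\n", ready[i].data.fd);

    close(epfd);
    return n;
}

int main(void)
{
    int p[2];
    if (pipe(p) != 0) { perror("pipe"); return 1; }
    if (write(p[1], "x", 1) != 1) { perror("write"); return 1; }
    return wait_readable(&p[0], 1) < 0;   /* reports the pipe read end */
}

The key point is that epoll_ctl accepts any valid fd number, so the usable connection count is bounded only by RLIMIT_NOFILE rather than by FD_SETSIZE.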

As this change is fundamental, I hope we can thoroughly test the "poll or epoll" feature before releasing it. Are you willing to collaborate on this potential improvement and test it on your 2000+ node cluster?

@Nativu5 (Author) commented Apr 1, 2024

I appreciate that transitioning to poll/epoll is a major update that necessitates extensive testing, and I'm open to collaborating on that testing. However, I'd like to clarify that the 2000+ node cluster is not a physical one. Instead, it is realized through a specialized testing framework designed for testing HPC resource management software (e.g., Slurm/SGE/OpenPBS).

This framework can virtualize up to 1000 computing nodes on a single physical server (64 cores and 256 GB of RAM). The framework has not been made public yet, as we are in the process of writing a paper that includes its details, but we are working to open-source it in the next few months. Once the feature development makes progress, I would be glad to offer the framework for testing.
