Inquiry about Maximum Node Scale of QMaster in SGE #33
Comments
Follow the steps below to change the limit of open files to 1234567. Please keep us updated 😄
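A minimal sketch of the systemd route (the unit name sgemaster.service and the sge_qmaster process name are assumptions; substitute whatever your installation actually registers):

```bash
# Add a drop-in override that raises the open-file limit for qmaster.
sudo systemctl edit sgemaster.service
# In the editor that opens, add:
#   [Service]
#   LimitNOFILE=1234567
sudo systemctl daemon-reload
sudo systemctl restart sgemaster.service

# Verify the running process actually received the new limit:
grep 'Max open files' /proc/$(pgrep -x sge_qmaster)/limits
```

If qmaster is started outside systemd, the equivalent is raising the nofile entry in /etc/security/limits.conf (or adding an ulimit -n in the startup script) before launching it.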
Thanks for the reply. I have set the nofile limits as mentioned before. However, SGE QMaster keeps using only 978 fds. This strange phenomenon is also recorded in aws/aws-parallelcluster#1592. I did some investigation, and it turns out that in source/libs/comm/cl_commlib.c#L1276 the number of fds SGE will use is capped at FD_SETSIZE. I don't see the necessity for this restriction, since SGE uses epoll on Linux instead of select. In my tests, modifying this code made SGE use more fds. Hopefully my feedback will help you locate the issue.
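Roughly, the code in question behaves like this (a paraphrase, not the exact upstream lines; `max_open_connections` follows the identifier quoted later in this thread):

```c
/* Paraphrased sketch, not the literal cl_commlib.c source: even when the
 * process rlimit allows far more open files, the usable connection count
 * is cut back to FD_SETSIZE (typically 1024). That ceiling only matters
 * for select(); an epoll-based event loop has no such limit. */
if (max_open_connections > FD_SETSIZE) {
    max_open_connections = FD_SETSIZE;
}
```

Relaxing that clamp when the epoll path is active is what let qmaster open more connections in my tests.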
Thanks a lot for reporting this, and especially for the links; it was an interesting read. I will keep them in mind if we do need to modify the source code. That being said, on the weekly-test virtual machine I confirmed that a systemd unit file modification works. Here is the session I copied and pasted.
FWIW, the test VM is from
I agree, and I confirm that modifying the systemd unit file does make the Linux kernel grant more fds to SGE.
However, this doesn't mean SGE will actually use those additional fds. You can check /opt/sge/default/spool/qmaster/message after launching SGE with the higher fd limit set; you will probably notice a line complaining that SGE is only using ~970 fds. And this is the real problem: /proc shows plenty of available fds, but SGE just won't use them.
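Concretely, the comparison looks like this (paths and the sge_qmaster process name are from a default install; adjust them for yours):

```bash
QMASTER_PID=$(pgrep -x sge_qmaster)

# Limit the kernel actually grants to the process:
grep 'Max open files' /proc/${QMASTER_PID}/limits

# Descriptors qmaster has opened so far:
ls /proc/${QMASTER_PID}/fd | wc -l

# What qmaster logged at startup about usable fds (look for a line
# reporting roughly 970 file descriptors):
grep -i 'descriptor' /opt/sge/default/spool/qmaster/message | tail
```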
As I mentioned before, it's the code (possibly a bug) in QMaster's startup path that makes SGE decide to use only ~970 fds, and that is what leads to communication failures on large clusters. I hope that explains it clearly. (Sorry for my poor English, and thanks for your patience.)
Got it! You made it crystal clear. The root cause is that the SGE code used to rely on 'select', which is limited to 1024 fds, so line 1274 of "libs/comm/cl_commlib.c" reduces "max_open_connections" to FD_SETSIZE no matter how high we set the FD limit. POSIX provides "poll" for large numbers of connections, and I am even tempted to use "epoll" for better performance on Linux; after all, Linux is dominant nowadays. As this change is fundamental, I hope we can thoroughly test the "poll or epoll" feature before releasing it. Are you willing to collaborate on this potential improvement and test it on your 2000+ node cluster?
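To illustrate what I mean by the select ceiling versus epoll, here is a standalone sketch (not SGE code):

```c
/* Standalone illustration, not SGE code: select() cannot handle fds >=
 * FD_SETSIZE (usually 1024), while epoll is bounded only by the process
 * RLIMIT_NOFILE. */
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <unistd.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("select() ceiling (FD_SETSIZE): %d\n", FD_SETSIZE);
    printf("process fd limit: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }
    /* Register stdin just to show the API shape; a real server would add
     * the listening socket plus every accepted connection here, with no
     * FD_SETSIZE cap on how many can be watched. */
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = STDIN_FILENO;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) != 0) {
        perror("epoll_ctl");
        close(epfd);
        return 1;
    }
    struct epoll_event ready[16];
    int n = epoll_wait(epfd, ready, 16, 0 /* poll once, do not block */);
    printf("epoll_wait reported %d ready fd(s)\n", n);
    close(epfd);
    return 0;
}
```

Either way, poll() would at least remove the FD_SETSIZE ceiling on platforms where epoll is not available.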
I appreciate that transitioning to poll/epoll is a major update that necessitates extensive testing, and I'm open to collaborating on the testing. However, I'd like to clarify that the 2000+ node cluster is not a physical one. It is realized through a specialized testing framework designed for HPC resource management software (e.g., Slurm/SGE/OpenPBS). The framework can virtualize up to 1000 compute nodes on a single physical server (64 cores, 256 GB RAM). It has not been made public yet, as we are writing a paper that includes details about it, and we are working to open source it in the next few months. Once the feature development makes progress, I would be glad to offer the framework for testing.
Hello SGE maintainers,
I am reaching out to discuss an issue I've encountered regarding the scalability of SGE, particularly when expanding the number of nodes in a cluster. My experience has been smooth with clusters up to 1000 nodes. However, when I attempted to scale up to 2000 nodes, I faced significant performance degradation. The system became extremely sluggish, to the point where commands like qhost would hang indefinitely.
In my efforts to troubleshoot this issue, I came across a discussion that pointed to the nofile limit for qmaster, suggesting that it restricts the number of sockets QMaster can utilize for communication. The thread can be found here: Grid Engine Qmaster Hard Descriptor Limit and Soft Descriptor Limit Discussion.
Following the recommendations, I adjusted the nofile limit and the MAX_DYN_EC parameter, hoping to resolve the communication bottleneck. Unfortunately, it appears that QMaster is still limited to using only 978 sockets. This limitation was also highlighted in #1592.
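For reference, the adjustments were made roughly as follows (a sketch only; verify the exact qmaster_params syntax against the sge_conf(5) man page for your SGE version):

```bash
# Show the current global configuration and its qmaster_params line:
qconf -sconf | grep qmaster_params

# Edit the global configuration (opens $EDITOR) and set, for example:
#   qmaster_params   MAX_DYN_EC=2000
qconf -mconf

# Raise the nofile limit for the qmaster process (systemd drop-in or
# limits.conf), then restart qmaster so both changes take effect.
```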
I would appreciate any insights or guidance you can provide on the following:
Thank you for your time and support.