-
-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support riscv64 test in docker containers with qemu #4307
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @Accelerator1996 - this looks like a nice option for many tests, there are just a few other considerations:
- what are the configuration requirements for a node with the label 'docker.qemu' (and is that a good label for such a node, so that it follows and/or sets a pattern in our labeling schema).
- If the requirement is simply that docker is installed on a machine, then we have sw.tool.docker label that could be used.
- action on @smlambert to document our current labeling schema
- some tests require to run inside of docker containers, what happens when those tests are sent to these riscv emulators. If they can not be run on these, then we may need to update some other code to prevent it
b5136cc
to
cb7f051
Compare
Hi, @smlambert! I think the label you said is suitable, so I have modified it. The label of riscv64_linux contains sw.os.linux. So if a linux machine with docker installed, it can run in most cases with the help of multiarch/qemu-user-static. In the code, whether it is using this docker or the execution of testBuild(), it is wrapped by try-catch, so I think exceptions can be handled. |
@Accelerator1996 Do you have the Dockerfile which creates the image that this PR is pulling? |
Sorry, we're still sorting it out. |
One other enhancement to do as part of this PR, similar to what we have done to support dynamic agents hinging on the parameter CLOUD_PROVIDER, is to prefer real nodes first if they are available, and if not available, then fire up a dynamic agent or container. (see code here) So, for our use case we would like to have a mix of both real risc-v hardware and these emulators. I think we can take this same approach, where we set CLOUD_PROVIDER=local (or some such value), and if that is set, check first if real nodes are available and if not, fire up these containers. I have asked for feedback/review from @sophia-guo as she added the dynamic agent support originally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will mark this PR so it doesn't get merged before a few other details are discussed. Looking forward to leveraging the concept when it is ready.
Agree with @smlambert if this can be enhanced as prefer real nodes first if they are available, and if not available fire up a dynamic agent or container. Feel like the current PR will only work by explicitly setting the LABEL as 'hw.arch.riscv&&sw.tool.docker&&....'. That is it won't run in docker containers if there is no risc-v machine( labeled with The way Dynamic agents work is doing extra check , trigger a agent and run with the docker container. |
@smlambert @sophia-guo Thank you all very much! I will improve my code according to your opinions recently. |
1cbe8ed
to
45c522c
Compare
Hi, @smlambert @sophia-guo! Does my change meet your expectations? I have added DynamicAgents configuration. If there is no risc-v machine, by setting CLOUD_PROVIDER=local, it can run test on the machine labeled with the 'sw.os.linux&&hw.bits.64&&ci.agent.dynamic&&sw.tool.docker'. I have tested it in our pipeline and it works fine. In addition, one of my changes is the condition of whether to start the vm. I think that the spliced label should not be used only, but the original label should be allowed to be used. Otherwise, the configuration similar to "required docker" will not be able to start the vm. |
Thanks @Accelerator1996, looks good. I will await the Dockerfile for https://hub.docker.com/r/alibabadragonwelljdk/riscv-normal-qemu_6.0.0-rvv-1.0 before running some test jobs to verify this PR. |
I have written a dockerfile, but I haven’t verified it completely yet. I will submit it together after I verify it in the past few days. May I ask what directory should my dockerfile be placed in? |
You do not need to place it in our repository, but it would be good to be able to see it shared somewhere publicly visible before I kick off test runs. |
👌 |
params.LABEL is normally used as the agent name when people want to run the job on specific machine ( for example, debug). If it's not available then no need to run on any other available agent. So I think without the change of ORIG_LABEL part you PR should work in your local pipeline without setting param.LABEL, correct? |
Hi, @smlambert @sxa ! This is the repository where our dockerfile is stored https://github.com/dragonwell-releng/docker-qemu-riscv64. If the local docker version is too old, you need to use "docker buildx" instead of "docker build". We tested the image we made locally and it works fine. |
In the code, the prerequisite for your dynamic agent to start is that there is a machine with specific label. But if user sets the "required docker" to true, this place will never be called unless he manually added "sw.tool.docker" in the MAP["LABEL"]. ORIG_LABEL just lets him meet the condition, so that some backup agents can be enabled for testing when all available nodes are offline. If there are still available nodes, he will not use dynamic agent. Therefore I think ORIG_LABEL is necessary. |
I did not forget about this one, I will get a chance to try some test runs tomorrow. |
45c522c
to
541d1be
Compare
Sorry, I verified it in other repo. When I sorted out the code, there was an extra "+". I have modified it and it is ok now. Sorry for the trouble. |
No worries @Accelerator1996 - here is the next run https://ci.adoptium.net/view/Test_grinder/job/Grinder/6793/ I can take a closer look tomorrow, to offer suggestions. |
params.LABEL is normally used as the agent name when people want to run the job on specific node. @Accelerator1996 the case you mentioned aqa-tests/buildenv/jenkins/openjdk_tests Lines 173 to 180 in 541d1be
The dynamic part one is good in grinder https://ci.adoptium.net/view/Test_grinder/job/Grinder/6807/ for our case with real risc-v hardware available. In this grinder https://ci.adoptium.net/view/Test_grinder/job/Grinder/6807/ I didn't ask for docker required and believe that is the expected case. |
Rerun with CLOUD_PROVIDER set (which I forgot to do on the earlier runs): https://ci.adoptium.net/view/Test_grinder/job/Grinder/6810 |
03e1fda
to
dcd66a5
Compare
0f8cb23
to
1a193dc
Compare
buildenv/jenkins/openjdk_tests
Outdated
@@ -192,6 +193,10 @@ timestamps{ | |||
LABEL += "&&!sw.os.ubuntu.22" | |||
} | |||
|
|||
if (SPEC.equals("linux_riscv64") && params.CLOUD_PROVIDER.equals("local")) { | |||
LABEL = LABEL.minus("&&hw.arch.riscv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused. This settings will make the dynamic agent has high priority than the hardware one if both dynamic and hardware are available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused. This settings will make the dynamic agent has high priority than the hardware one if both dynamic and hardware are available?
The hardware machine will be selected when required_docker is false. The dynamic machine will be selected when required_docker is false, add a value such as 'fyre' in DynamicAgents instead of 'local' and set CLOUD_PROVIDER to be 'fyre'. Docker will be selected when required_docker is true or CLOUD_PROVIDER is 'local'. It will select hardware one when both are available and not all hardware machines are busy if we set parameters as above.
We probably need to clarify two things. One is about the printout message. Some current printout message might not be clear and cause confusion. There is a PR to address that #4407. That means output as following doesn't mean the job will run on dynamic agent - it only means none of labeled machines are online ( not same as avaialbe) The second one is I'm not sure why the change depends on if params.DOCKER_REQUIRED is set or not? I've tried four cases with the PR on our jenkins ( riscv64 static agent available) the result as following:
@Accelerator1996 I'm not sure your environment so does it mean in your environment there is no other linux static agents labeled sw.os.linux? Also I feel like your case is asking for a docker required agent and then using that agent setting up the qemu container and do the test? Yes, agreed might be helpful if we could setup a short meeting to make all those questions clear. |
@sophia-guo - it is likely the guidance I was giving to @Accelerator1996 . I expected our code to work this way:
I leave it to you both to meet and discuss. |
580292b
to
fdae63f
Compare
Thank you very much for your work guidance @smlambert @sophia-guo . Thank you very much! I have improved my code. Since it involves resource scheduling, I would like to explain my thinking and current situation. The current scheduling is like this. When there is no node with the given label, the project will wait for a certain period of time until there is an available node. When there is a machine with a given label and a given params.CLOUD_PROVIDER, then make a judgement. If there is an idle node, use it, otherwise use the dynamic agents. Now the switch of the dynamic proxy is the parameter CLOUD_PROVIDER, but I think that since it is a dynamic agent, should it be used regardless of whether CLOUD_PROVIDER is set? The following is my logic. First, condition 1 is changed to whether there is an online node or non-empty dynamicAgents setting in map. Only when neither exists, start to wait for the given label node to go online, otherwise enter the else branch. Then confirm whether it is set dynamicAgents, if it is empty, it means that it does not support, start to get the idle physical machine. Here I have modified the judgment conditions for the idle machine, because there may be cases where the machine setting concurrency is greater than 1. If all hardware machines are busy and the number of executors is full, then start to enable the dynamic agents. Then if the parameter CLOUD_PROVIDER is set, if it is not empty and in the defined list, use it, otherwise use the first one of the list which defined in list. Of course, if the list is empty, no dynamic agent will be used. The following is my self-test report: 2. CLOUD_PROVIDER == 'azure' 3. CLOUD_PROVIDER == 'fyre' 4. CLOUD_PROVIDER == 'local' 5. CLOUD_PROVIDER == 'xx' |
Thanks for those exhaustive notes and testing @Accelerator1996 ! This design discussion and work has been very beneficial. In my original thinking, CLOUD_PROVIDER == '' and CLOUD_PROVIDER == 'local' equated to the same thing, but I agree with you, they can/should be considered different cases. CLOUD_PROVIDER indicates who will supply the resources being used. CLOUD_PROVIDER == '' resource will be supplied by other appropriately labelled Jenkins nodes attached to the Jenkins server, so when resources are requested, and exceed the number of real hardware, if it is defined in Jenkinsfilebase code send the task off to a differently labeled machine (sw.tool.docker&& CLOUD_PROVIDER == 'azure|fyre|xx' resource is supplied by those cloud providers by using the appropriate Jenkins plugin for each of those providers to spin up dynamic agents CLOUD_PROVIDER == 'local', similar to azure|fyre|xx, resource is supplied by Jenkins server, by using Jenkins Docker plugin to spin up containers on the fly (not supported yet, likely has arch restrictions, need to investigate, related: https://devopscube.com/docker-containers-as-build-slaves-jenkins/). |
fdae63f
to
816b418
Compare
As you said, the startup of docker may be related to arch, including our qemu also needs to run on linux/amd64, so now I limit the use of 'local' to nodes with label |
I have a question about this case. The label |
buildenv/jenkins/openjdk_tests
Outdated
LABEL += '&&ci.agent.dynamic' | ||
if (params.CLOUD_PROVIDER != null && params.CLOUD_PROVIDER in dynamicAgents) { | ||
dynamicLabel = params.CLOUD_PROVIDER | ||
} else if (dynamicAgents.size() >= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the case (1) --> 7 as I mentioned in #4307 (comment), which I would suggest to disregard it.
When there is no hardware machine to meet the requirements, it will try to use dynamic agent. If there is no dynamic, then the test job will wait for available nodes. If it runs here, it actually means that the user does not actually have a hardware machine with the corresponding label. The result is the same. |
The discussion is based on the current PR. |
7e87345
to
94c395a
Compare
Hello, Thank you very much for your work guidance @smlambert @sophia-guo .As you said, in fact, we should distinguish between dynamicAgents and dockerAgents. Because testing with docker is different from testing with dynamicAgent, the former mainly depends on the machine environment, we only need a specific arch machine with docker to run test. But dynamicAgents depends on the plugin of jenkins, which is not available in every jenkins, so I think the latter mechanism should be retained. In view of the fact that we maybe run tests in docker containers more frequently in the future. I think a DockerAgents should be added in map, which will be a common design. The following is my self-test report:
|
94c395a
to
427d3c0
Compare
dockerAgentLabel = dockerAgents[0] | ||
} | ||
if (!dockerAgentLabel.equals('')) { | ||
if (dockerAgentLabel.equals('default') && SPEC.equals('linux_riscv64')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we combine those two ifs to one? Feels like if (!dockerAgentLabel.equals('')) is unnecessary?
if (!dockerAgentLabel.equals('')) {
if (dockerAgentLabel.equals('default') && SPEC.equals('linux_riscv64')) {
LABEL = LABEL.minus("ci.role.test&&") | ||
LABEL += '&&ci.agent.dynamic' | ||
println "Cannot find any idle nodes. Starting dynamic vm, nodeLabel: '${LABEL}', dynamicLabel: '${params.CLOUD_PROVIDER}'" | ||
} else if (params.CLOUD_PROVIDER != null && params.CLOUD_PROVIDER in dockerAgents) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those two cases if (params.CLOUD_PROVIDER != null && params.CLOUD_PROVIDER in dockerAgents)
and else if (dockerAgents.size() >= 1)
same? For both dockerAgentLabel=default
?
LGTM other than two minor comments. For the extra property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we can revise the naming in a later PR as required as I suspect we will continue to play in this area of code in the upcoming months. Thanks @Accelerator1996 and @sophia-guo for this enhancement and the discussions that it triggered.
Support risc-v test in docker containers with qemu to solve the problem that someone may not have risc-v machine. We test it locally and it was fine.