Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelization performance severely degraded since Tycho 1.7.0 #342

Closed
k-spa opened this issue Oct 19, 2021 · 11 comments · Fixed by #394
Closed

Parallelization performance severely degraded since Tycho 1.7.0 #342

k-spa opened this issue Oct 19, 2021 · 11 comments · Fixed by #394
Labels
performance 🚨 regression Regression compared to previous release - blocks upcoming release and other merges
Milestone

Comments

@k-spa
Copy link
Contributor

k-spa commented Oct 19, 2021

We have been using Java 11 / Tycho 1.6.0 in our project since the release of Java 11.

The build is quite lengthy and, without any special configuration, takes about 3 hours. Therefore we set the -T 1C command line option to build several projects in parallel. This logs warnings about unsupported parallelization but has always been working fine. Building in parallel drops the build time to 2 hours.

Anticipating the upcoming Java 17 support I tried upgrading the build system to Tycho 2.4.0 to test for any major roadblocks.

With Tycho 2.4.0 the unparallelized build takes the same 3 hours as with Tycho 1.6.0, but setting the parallelization option actually increases the build time to 3 hours and 47 minutes!

I tested with older versions of Tycho and it seems that the severe slowdown happens since version 1.7.0. I did not expect that - minor changes in the build time are fine, but I expected the build time to stay at least in the area of 2 hours.

with -T 1C without -T 1C
Tycho 1.6.0 2 hours 3 hours
Tycho 1.7.0+ ~4 hours 3 hours

Going from our current, parallelized, 2 hour build, that is an increase of almost 100 % - but even increasing the build time to 3 hours by building unparallelized is hardly an option.

Is there a problem with running the build parallelized since Tycho 1.7.0? Do I have to configure something differently?

Our maven call looks like this:
mvn install -Dmaven.repo.local=<repository location> -P <profile name> -T 1C -fae

The profile just determines which tests are executed and should not behave differently whether the -T 1C parameter is present or not.

@akurtakov
Copy link
Member

Can you provide small reproducer where this is visible for testing purposes?

@laeubi
Copy link
Member

laeubi commented Oct 20, 2021

I tried upgrading the build system to Tycho 2.4.0

Have you tried tycho 2.5.0? It contains some improvements for large build setups regarding performance.

@k-spa
Copy link
Contributor Author

k-spa commented Oct 21, 2021

Oh, I completely missed the release of Tycho 2.5.0, thanks for the heads up.

The numbers are in:

with -T 1C without -T 1C
Tycho 1.6.0 2 hours 3 hours
Tycho 2.4.0 ~4 hours 3 hours
Tycho 2.5.0 3 h 30 min 3 h 15 min

It is a bit faster but still nowhere near to 1.6.0-durations. The non-parallelized build seems to have become a little bit slower (although maybe within error margins).

I will try to create a reproducible project, but I cannot make promises. I guess it will have to be a bunch of test projects with sleeps in them.

@mickaelistria
Copy link
Contributor

The gap in term of amount of commits between Tycho 1.6.0 and 2.5.0 is huge. It's going to be very hard, and not so motivating, to identify the change that cause this issue. And if issue is between 1.6.0 and 1.7.0, it's also not so motivating to dig into outdated code.
So I recommend that you consider upgrading to a recent release more often if you want to get better support. And this is true for all OSS projects.

@Bananeweizen
Copy link
Contributor

I would guess the root cause to be around making the mojos thread-safe. We did that by simply introducing an artificial lock object in each and every mojo. Therefore if your builds use a huge amount of time in just one identical mojo for a big number of maven modules, that one mojo may dominate the build and cannot be run in parallel anymore. I.e. you can run tycho-foo and tycho-bar in parallel, but not tycho-foo 2 times, since that change.
Using a lock object is the most simple implementation, and for some mojos it might be possible to lift that restriction after further analysis whether they actually access some shared objects or not. But for that it would be good to see which tycho mojo is the one dominating your builds.
You might want to run Yourkit YouMonitor (not be confused with Yourkit Profiler) to get all the insights of your build times: https://www.yourkit.com/youmonitor/features/

@k-spa
Copy link
Contributor Author

k-spa commented Nov 21, 2021

First of all: This might seriously be the best tool recommendation I have gotten in quite a while, thanks a lot!

I monitored builds with tycho 1.7.0 and 2.5.0 and the data I got fully supports your suspicion.

As I mentioned earlier, we run the build with the -T 1C option, which means that a lot of maven modules are being build around the same time.

This is visible in the following screenshot that shows several builds starting at around the same timestamp. This behavior is identical, whether tycho 1.7.0 or tycho 2.5.0 is used.
01_parallel_builds

But when single module builds are examined, it gets really interesting. Please have a look at the duration of the tycho-surefire goal and the duration of the started test process in the following two screenshots:
02_test_much_faster03_even_more_delay

This is totally different from the results I could observe with tycho 1.7.0. To me it seems that the tycho-surefire goal is started and then waits for sometimes over two hours until the actual JUnit process is started that then usually finishes in a matter of minutes. I actually plotted the build timeline and it is clearly visible that with tycho 2.4.0 only one JUnit process is running at a time - so even with multiple surefire goals started in parallel, the actual JUnit runs are locked against each other.

As the tests take up a majority of our build time, this really hurts us.

I do suspect though that it might be possible to lessen that restriction for tycho-surefire - as said before, we've been successfully building with -T 1C for years now.

Last but not least - thanks for someone taking interest. I fully understand that this might not be the most fun or motivating issue to analyze or work on and I certainly wish we would have noticed the problem sooner, but as we all cannot change the past, this is what it is now. You can be assured that it is not in my personal interest to not upgrade infrastructure software for such a long time either, but that should not be the topic here.

Anyhow, if you need further data, I'll do my best to provide it. I hope the analysis helped!

@laeubi
Copy link
Member

laeubi commented Nov 22, 2021

It seems the lock was introduced with Tycho 1.7.x 8afcd61 from a quick look at the code it seems that at least we do not need to lock globally as each project is using its own work-directory.
The tricky part would be to check if the used services are thread save (even as the could be), but at least I think we can leave the lock at the runTest method.

My suggestion would be, move that method call outside the synchronized block and check if that helps in your case or causing issues. If it helps we can then integrate this change permanently into Tycho.

@k-spa
Copy link
Contributor Author

k-spa commented Nov 24, 2021

Your suspicion seems to be correct. I cloned the repository and tested your suggested change:

2021-11-24_09h31_18

The build is successful (no multi threading failures, although there are up to 16 test projects running in parallel) and the build time is back down to approximately 2 hours.

I'm a bit unsure though on how to continue from here. As far as I have seen I would need to sign the Eclipse ECA in order to commit the patch myself. I have not done that yet and would have to check with my company first if I would actually be allowed to do that. Should I be going forward with that or would you rather have someone of you commit the change yourself? I'd be fine with that too.

@laeubi
Copy link
Member

laeubi commented Nov 24, 2021

@k-spa please check first with your company as it is generally a great opportunity to contribute (small or larger) patches to (eclipse) OSS projects and such a small change is a good starting-point to get familiar with the process and tools.

If it would become too complicated let me know and I'll write an own patch for this but as you are currently the only one complaining its your time-scedule and probably it helps to convince your company to contribute if they are directly affected by this.

@laeubi laeubi added performance 🚨 regression Regression compared to previous release - blocks upcoming release and other merges labels Nov 24, 2021
@mickaelistria
Copy link
Contributor

Should I be going forward with that or would you rather have someone of you commit the change yourself?

We'd rather have you submitting the patch on your own name for proper and fair attribution, and also -since you've demonstrated some interest in the project and proven you have skills to improve it- to get you familiar with the contribution process as it's very likely you'll want to suggest some more changes in the future.

k-spa added a commit to k-spa/tycho that referenced this issue Nov 25, 2021
Running tests while the lock is present leads to the tests being executed in series which severely degrades performance when the build is running with parallel maven projects.
@laeubi laeubi linked a pull request Nov 25, 2021 that will close this issue
laeubi added a commit that referenced this issue Nov 25, 2021
Fixes #342: Run surefire tests outside of lock
@Bananeweizen
Copy link
Contributor

We should probably mention this in the release notes. If there are people with SWTBot UI tests and parallel build enabled, they would previously not run into issues, but after this change multiple bot tests might try to manipulate windows and controls at the same time. That will probably fail, at least on Windows (due to not being able to run with multiple separate interactive sessions in parallel). In fact, in our company a big amount of tests follows that SWTBot pattern, and that's why we don't use the parallel build for surefire testing. Otherwise I might have noticed this specific bug much earlier myself. :)

Bananeweizen added a commit to Bananeweizen/tycho that referenced this issue Nov 29, 2021
Bananeweizen added a commit that referenced this issue Nov 30, 2021
Document parallel testing changes (fixes #342)
@laeubi laeubi added this to the 2.6 milestone Jan 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance 🚨 regression Regression compared to previous release - blocks upcoming release and other merges
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants