Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] Backpropagation does not work #93

Closed
mworchel opened this issue Jan 15, 2020 · 23 comments
Closed

[Windows] Backpropagation does not work #93

mworchel opened this issue Jan 15, 2020 · 23 comments

Comments

@mworchel
Copy link
Contributor

mworchel commented Jan 15, 2020

When running the pose_estimation sample under Windows, the optimization part does not actually perform any optimization. The loss seems to vary randomly and the final estimate does not visually differ from the initial one:

redner_backprop

It seems as if the parameter updates are not correctly computed (or are way too small since there is no visual difference between iterations) for whatever reason.

System:

  • Windows 10 x64
  • Python 3.7.4
  • Redner 0.3.2 (CPU)
@mworchel
Copy link
Contributor Author

I tested the pose_estimation sample in colab in both, CPU and GPU mode. The results are very different:

CPU:
redner_colab_cpu

GPU:
redner_colab_gpu

The CPU mode seems to have some convergence issues in general. Maybe deep down it's somehow related to the Windows backprop issue mentioned above.

@BachiLi Any idea, what could be the cause for the discrepancy?

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

Interesting. Looking into this.

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

CPU mode runs fine on my mac...I'm really confused.

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

It also runs fine on my linux machine. This seems like a Colab-specific issue?

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

This issue exists on Colab for all redner versions I tested. I have no idea why there is a discrepancy between Colab and my Linux machine.

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

This issue also exists on the Tensorflow side. Actually the tensorflow version crashes on Colab occasionally.

@mworchel
Copy link
Contributor Author

Typical case of 'but it runs on my machine' :D That is really strange. Maybe has something to do with the type of CPU?

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

Yes, something is wrong on Colab.

@BachiLi
Copy link
Owner

BachiLi commented Jan 15, 2020

I have a deadline next week and have to work on something else now. Please let me know if you find anything suspicious.

@mworchel
Copy link
Contributor Author

mworchel commented Jan 15, 2020

Tested the pose estimation with my (custom) Windows GPU branch which is currently based on redner 0.2.3 and the backpropagation works without issues:

redner_win_gpu

However, the CPU mode fails as above. So it really seems to be some CPU related issue that exists at least since 0.2.3.

I'll keep my eye open. Good luck with your deadline for now!

@BachiLi
Copy link
Owner

BachiLi commented Feb 2, 2020

@mworchel This should be fixed by the commit above (20af170). This is, unsurprisingly, caused by access to uninitialized buffers. In particular the code didn't consider the case where max_bounces=0. Thanks a lot for reporting this and please let me know if this fixes the problem on your side.

@mworchel
Copy link
Contributor Author

mworchel commented Feb 3, 2020

That's good news. At least in Colab it seems to work now. However, for Windows it still doesn't work. Same behavior on the CPU as before.

There is a small chance that it's still due to some different initialization behavior of MSVC and GCC/Clang. I also just discovered that I didn't properly port one of the compiler intrinsics. My version of ffs gives the index of the first set MSB, not LSB. I fixed that in PR #104 (and double checked with https://github.com/nemequ/portable-snippets/tree/master/builtin). However, this doesn't fix the backpropagation issue either.

Is there a way to verify the integrity of the edge tree or another way to verify that the stuff relying on intrinsics works the same on all systems?

@BachiLi
Copy link
Owner

BachiLi commented Feb 3, 2020

Hmm. Maybe we can set up some unit tests for the intrinsic. I can potentially do it this weekend. Thank you so much for your time by the way.

@BachiLi
Copy link
Owner

BachiLi commented Feb 3, 2020

@mworchel Have you checked if the atomics are working properly on windows?

@mworchel
Copy link
Contributor Author

mworchel commented Feb 4, 2020

Unfortunately, I didn't fully verify that the atomics work. The file atomic_msvc.h is taken from a 3rd party repo (as noted in the comment above the header) and the functions looked reasonable to me. Are the atomic operations only required by the backward pass?

My hope is, that the CPU path is a little bit easier to debug. Would be great if we had some basic tests. Like I said before, it's a great research project and I'm happy I can contribute something :)

BachiLi added a commit that referenced this issue Mar 17, 2020
TensorFlow support WIP. This should addresses
#7 and
#93
@BachiLi
Copy link
Owner

BachiLi commented Mar 17, 2020

Windows machines should be able to pip install redner-gpu and pip install redner from now on. TensorFlow is not supported yet due to some compilation issues (MSVC is not happy with the TensorFlow headers).

@BachiLi BachiLi closed this as completed Mar 17, 2020
@mworchel
Copy link
Contributor Author

Great news! Did you find any new evidence why the CPU backprop didn't work? I suppose most people use the GPU version anyway, so it's not that urgent.

@BachiLi
Copy link
Owner

BachiLi commented Mar 18, 2020

Huh, I thought that was resolved. I'll test on CPU later.

@mworchel
Copy link
Contributor Author

Unfortunately not. Your initialization fix fixed it for colab but not for Windows. I wasn't able to track the issue any further. Hope you'll find something.

@BachiLi BachiLi reopened this Mar 18, 2020
@BachiLi
Copy link
Owner

BachiLi commented Mar 18, 2020

Pretty sure there is a bug in atomic add in windows cpu. Should be fixed soon.

BachiLi added a commit that referenced this issue Mar 18, 2020
Fix windows cpu atomic add (#93)
@BachiLi
Copy link
Owner

BachiLi commented Mar 19, 2020

0.4.3 should fix this. It was a type conversion issue in the atomics: InterlockedCompareExchange takes integers as arguments and we passed float to them without reinterpret_cast.

@mworchel
Copy link
Contributor Author

Nice catch! Will try it as soon as I can.

@mworchel
Copy link
Contributor Author

Indeed fixed in 0.4.3. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants