
Trying a delayed finalizer approach to fix threaded deadlock (#141) #157

Closed

Conversation

@IanButterworth (Contributor) commented Jun 21, 2020

Edit: it seems the Julia crash was due to poor const handling, which is now fixed.

Trying out @vtjnash’s suggestion from #141:

> For now, I recommend only doing trivial things in a finalizer, like appending it to a queue to be cleaned up later.

This PR "fixes" the deadlock seen in the main DFTK example in #141 in the sense that the example finishes, but when the queued & delayed finalizer runs, Julia crashes. It's obviously probably not a proper fix yet as I'm sure there's a better/correct/functional way to do this, but I thought I'd share my workings

```julia
using DFTK

DFTK.FFTW.set_num_threads(Threads.nthreads() * 4)

function main()
    for Ecut in 0.0 .+ collect(3:20)
        println("Ecut = $Ecut")
        a = 10.263141334305942
        lattice = a / 2 .* [[0 1 1.]; [1 0 1.]; [1 1 0.]]
        model = Model(lattice; terms=[Kinetic()], n_electrons=4)
        basis = PlaneWaveBasis(model, Ecut)
    end
end

@show Threads.nthreads()
main()
```
```
Threads.nthreads() = 6
Ecut = 4.0
Ecut = 5.0
Ecut = 6.0
Ecut = 7.0
Ecut = 8.0
Ecut = 9.0
Ecut = 10.0
Ecut = 11.0
Ecut = 12.0
Ecut = 13.0
Ecut = 14.0
Ecut = 15.0
Ecut = 16.0
Ecut = 17.0
Ecut = 18.0
Ecut = 19.0
Ecut = 20.0
```

cc. @stevengj @vtjnash @mfherbst
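
For illustration, here is a minimal sketch of the queue-and-delayed-destroy pattern being tried. It borrows the `destroy_queued_plans` / `DESTROY_QUEUE_TIMER` names that appear in the diff, but everything else (queue name, helper function) is hypothetical; it is not the PR's actual code:

```julia
# Sketch only: finalizers do nothing but enqueue the plan; a Timer drains the
# queue later, outside finalizer context. (Making the queue itself safe to
# touch from a finalizer is glossed over here.)
const DEFERRED_PLANS = Any[]   # hypothetical queue of plans awaiting destruction

# The "trivial thing" a finalizer is allowed to do: append to the queue.
queue_plan_for_destruction(plan) = (push!(DEFERRED_PLANS, plan); nothing)

# Actually free the queued plans, well after planning has finished.
function destroy_queued_plans(timer = nothing)
    while !isempty(DEFERRED_PLANS)
        plan = pop!(DEFERRED_PLANS)
        # the real code would call FFTW's destroy-plan routine on `plan` here
    end
    return nothing
end

# One-shot timer: fires destroy_queued_plans once after `delay` seconds.
# The PR experimented with delays between 0 and 5 seconds.
const DESTROY_QUEUE_TIMER = Ref{Timer}(Timer(destroy_queued_plans, 5))
```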

@IanButterworth (Contributor Author):

If this queue-and-delayed-freeing approach is valid, I think the key is figuring out the right event on which to trigger destroy_queued_plans.

If the timer-based approach isn't appropriate, is there another point at which it would be reliably logical to call destroy_queued_plans?

@IanButterworth (Contributor Author):

Bump. It'd be good to figure out whether this is acceptable.

@giordano did you manage to test it on your deadlocking code?

@giordano (Member):

> @giordano did you manage to test it on your deadlocking code?

Yes, I can confirm that in the code where I was experiencing a deadlock, with this PR and FFTW.set_num_threads(8), fft no longer freezes and the computation goes on!

@IanButterworth IanButterworth marked this pull request as ready for review June 30, 2020 23:08
@IanButterworth (Contributor Author):

@stevengj @vtjnash is this an acceptable solution?

@ararslan ararslan requested a review from stevengj June 30, 2020 23:57

Review thread on this diff hunk:

```julia
return nothing
end

const DESTROY_QUEUE_TIMER = Ref{Timer}(Timer(destroy_queued_plans, 0))
```
Member review comment:

This doesn't really seem reasonable to me; if there was a deadlock before, then this just introduces a race condition, no? Also, a Timer with an interval of 0 seems like a spinloop, which troubles me.

@IanButterworth (Contributor Author) replied:

That timer has a delay of 0, so it fires once, immediately. interval is a kwarg that's not used here.

My take on the original suggestion is to delay the destroying of plans until general fft planning has ended. But I don’t think the timer approach is right. Is there another event that could trigger destroying the queue, other than the finalizer?
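
To make the delay/interval distinction above concrete, here is how Julia's Timer behaves (a small standalone illustration, not code from this PR):

```julia
# Timer(callback, delay; interval = 0) calls callback(timer) once after `delay`
# seconds; it only repeats if `interval > 0`, so Timer(f, 0) is a single
# callback scheduled on the event loop, not a spinloop.
Timer(0.0) do t
    println("fired once")
end

# With a nonzero interval the callback repeats until the timer is closed:
counter = Ref(0)
Timer(0.0; interval = 0.05) do t
    counter[] += 1
    counter[] >= 3 && close(t)
end

sleep(0.5)          # yield to the event loop so both callbacks can run
@show counter[]     # expected: counter[] = 3
```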

Contributor review comment:

You could try to drain the queue manually after any operation returns. Since the timer will wait for yield, I think you can usually use a zero-length timeout (basically just an idle event, but we don't wrap those in libuv)
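
A rough sketch of what "drain the queue after any operation returns" could look like, reusing the hypothetical destroy_queued_plans from the sketch above; the wrapper name and the commented-out usage are illustrative, not FFTW.jl's actual code:

```julia
# Wrap each public FFTW.jl entry point so that, on the way out, it drains
# whatever the finalizers queued while the operation was running.
function with_deferred_destroy(f)
    try
        return f()
    finally
        destroy_queued_plans()   # drain outside any finalizer context
    end
end

# Usage sketch: an execution wrapper (the inner call is illustrative only).
# mul!(y, p::FFTWPlan, x) = with_deferred_destroy(() -> unsafe_execute!(p, x, y))
```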

@IanButterworth (Contributor Author) replied:

I tried changing the 5-second timer to 0 seconds and the deadlock with the MWE returns. 1 second gives the same result as 5 seconds, but 0.1 seconds results in deadlock. It seems too unstable to be considered a fix.

@vtjnash (Contributor) commented Jul 1, 2020

Ah, true. The issue is that it’s using a non-reentrant lock, which is a problem because we’re using Julia’s task system but not Julia’s locks. (So we might also run into similar deadlocks if two threads try to FFT at the same time, possibly?)
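
As an aside: when the lock involved is a Julia lock, the pattern suggested in Julia's own documentation for finalizers that need locks is to trylock and re-register the finalizer on contention. A sketch with hypothetical names follows; part of the difficulty in this PR is that FFTW's internal lock is not a Julia lock, so this cannot be applied directly:

```julia
# Hypothetical handle whose cleanup needs a lock that user code also takes.
mutable struct PlanHandle
    ptr::Ptr{Cvoid}
    lock::ReentrantLock
end

function finalize_handle(h::PlanHandle)
    if trylock(h.lock)
        try
            # the real destroy call for h.ptr would go here
            h.ptr = C_NULL
        finally
            unlock(h.lock)
        end
    else
        # The lock is held, possibly by the very task this finalizer interrupted:
        # don't block, just re-register the finalizer and try again later.
        finalizer(finalize_handle, h)
    end
    return nothing
end

h = PlanHandle(C_NULL, ReentrantLock())
finalizer(finalize_handle, h)
```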

@IanButterworth (Contributor Author):

Is there anything on the horizon that could be, or support, a viable fix for this? It feels like the issue itself is deadlocked...

@Octogonapus (Contributor):

Have there been any updates on this? How can we make progress?

@IanButterworth (Contributor Author):

@vtjnash, I forget whether we came up with any next steps when we discussed this at JuliaCon?

@vtjnash (Contributor) commented Aug 17, 2020

I think we said a couple of things (or at least, I thought them and meant to say them) about what's going on here under the hood. I looked into this a bit closer today and realized that we missed that we can call set_planner_hooks when adding the call to fftw_threads_set_callback. Per the warning in the documentation (http://www.fftw.org/fftw3_doc/Thread-safety.html#Thread-safety), this can cause "the worst of all worlds". What may be desirable here is to tell FFTW to use our Base.ReentrantLock instead in these callback hooks. When we unlock this lock, we can then flush the list of pending finalizers (instead of using the timer here).

(In the less immediate future, we should run all finalizers on a separate thread, which would have prevented this from being a problem in the first place.)
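
A rough Julia-side sketch of that idea, with hypothetical names (not the hooks FFTW.jl actually installs): guard every call into the FFTW planner with one Base.ReentrantLock, and flush queued destructions whenever that lock is released:

```julia
# Single lock guarding every call into the FFTW planner.
const PLANNER_LOCK = ReentrantLock()

function with_planner_lock(f)
    lock(PLANNER_LOCK)
    try
        return f()
    finally
        unlock(PLANNER_LOCK)
        # After releasing the lock, drain the plans queued by finalizers while
        # it was held (the deferred-destroy queue sketched earlier).
        destroy_queued_plans()
    end
end

# Every planner ccall would then be wrapped, e.g. (illustrative only):
# plan = with_planner_lock() do
#     ccall((:fftw_plan_dft_1d, libfftw3), ...)
# end
```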

@stevengj (Member):

#160 uses ReentrantLock and gets rid of the FFTW locks, so it should fix this issue.
