Refactor concurrent uploading of local changes #18449
Closed
Changes from all commits (31 commits, all by ovitrif):

531c4ae Refactor: rename context to appContext and rm comment
7b0eafd Refactor: resolve warnings
0f85412 Test: add test for `queueUploadFromAllSites`
5c0c2ac Update: refactor upload starter to continue on cancellation
59219d7 Update: refactor try-catch block to runCatching
12d4421 Update: rename test for multiple uploads
eccad75 Update: adapt comment docs for `checkConnectionAndUpload`
f1c2f51 Add release notes for mutex bugfix
e3bcf48 Cleanup comment for UploadStarterConcurrentTest class
ecd44b6 Update: log isPage bool as int for improved readability
89481d1 Update: refactor fixtures to check drafts upload order
5b6a02a Refactor: cleanup redundant test setup code
0e96e11 Refactor: extract UploadFixtures to dry test code
caf73d4 Update: improve logs during upload of posts & pages
90dccd2 Update: provide Mutex via DI for testability
c06843a Refactor: cleanup code of concurrent test
21c1160 Update: fix post id counter in test
4c9f4dd Update: fix refs in comment doc of UploadStarter
3e95c6f Add: unit test simulating the crash in resume onCancellation handler
6874e8f Update: refactor test setup code to allow more flexibility
a3bf1b6 Update: rewrite mutex unlock test
cf16fd2 Refactor: cleanup mutex test & reactivate
de1e92c Refactor: group mutex test blocks
fdcedc6 Refactor: remove unneeded advanceUntilIdle
c5f4379 Refactor: more cleanup in mutex test
cc5ac60 Refactor: make trackAutoUploadAction fun foldable
8157e1c Refactor: use run to avoid unnecessary var in mutex test
10df537 Update: remove page from tests
ae858f6 Update: add asserts to test
9f0a4c0 Refactor: make test fail because of exception
c3ab67a Fix: wrap mutex lock/unlock in try/catch block
WordPress/src/test/java/org/wordpress/android/ui/uploads/UploadFixtures.kt (new file, 60 additions, 0 deletions)
```kotlin
package org.wordpress.android.ui.uploads

import androidx.lifecycle.Lifecycle
import androidx.lifecycle.MutableLiveData
import androidx.lifecycle.ProcessLifecycleOwner
import org.mockito.kotlin.any
import org.mockito.kotlin.doReturn
import org.mockito.kotlin.mock
import org.wordpress.android.fluxc.model.PostModel
import org.wordpress.android.fluxc.model.SiteModel
import org.wordpress.android.fluxc.model.post.PostStatus
import org.wordpress.android.fluxc.store.UploadStore
import org.wordpress.android.ui.posts.PostUtilsWrapper
import org.wordpress.android.util.DateTimeUtils
import org.wordpress.android.util.NetworkUtilsWrapper
import org.wordpress.android.viewmodel.helpers.ConnectionStatus
import java.util.Date

internal object UploadFixtures {
    private var postIdIndex = 0

    private fun makePostTitleFromId() = postIdIndex.toString().padStart(2, '0')

    fun resetTestPostIdIndex() { postIdIndex = 0 }

    fun createMockedNetworkUtilsWrapper() = mock<NetworkUtilsWrapper> { on { isNetworkAvailable() } doReturn true }

    fun createConnectionStatusLiveData(initialValue: ConnectionStatus?): MutableLiveData<ConnectionStatus> {
        return MutableLiveData<ConnectionStatus>().apply { value = initialValue }
    }

    fun createMockedPostUtilsWrapper() = mock<PostUtilsWrapper> {
        on { isPublishable(any()) } doReturn true
        on { isPostInConflictWithRemote(any()) } doReturn false
    }

    fun createMockedUploadStore(numberOfAutoUploadAttempts: Int) = mock<UploadStore> {
        on { getNumberOfPostAutoUploadAttempts(any()) } doReturn numberOfAutoUploadAttempts
    }

    fun createMockedUploadServiceFacade() = mock<UploadServiceFacade> {
        on { isPostUploadingOrQueued(any()) } doReturn false
    }

    fun createMockedProcessLifecycleOwner(lifecycle: Lifecycle = mock()) = mock<ProcessLifecycleOwner> {
        on { this.lifecycle } doReturn lifecycle
    }

    fun createLocallyChangedPostModel(postStatus: PostStatus = PostStatus.DRAFT, page: Boolean = false) =
        PostModel().apply {
            setId(++postIdIndex)
            setTitle(makePostTitleFromId())
            setStatus(postStatus.toString())
            setIsLocallyChanged(true)
            setDateLocallyChanged(DateTimeUtils.iso8601FromTimestamp(Date().time / 1000))
            setIsPage(page)
        }

    fun createSiteModel(isWpCom: Boolean = true) = SiteModel().apply { setIsWPCom(isWpCom) }
}
```
I'm not quite sure this will cause the other coroutines (for the other sites) to not be canceled, since this try/catch is around the coroutine `launch`es (which return instantly) and not around the actual code that throws the exception (the call to `update`). Also, this function is creating a new `coroutineScope` which does not use a `SupervisorJob` (line 114), so cancellations will be propagated to sibling coroutine jobs if any job fails. And even though we are talking about coroutine cancellation, this doesn't mean the code will actually stop running immediately, because coroutine cancellation is cooperative.

To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to tell whether the fixes proposed here and in the previous PR would work as expected.
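The scope-cancellation point above can be sketched with a minimal, self-contained example (not the PR's code): a failing child inside a plain `coroutineScope` cancels its sibling, because the scope's Job is not a `SupervisorJob`.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    var siblingCompleted = false
    var siblingCancelled = false
    try {
        coroutineScope { // plain scope: any child failure cancels the whole scope
            launch {
                try {
                    delay(1_000)
                    siblingCompleted = true // never reached
                } catch (e: CancellationException) {
                    siblingCancelled = true // cancelled because its sibling failed
                    throw e
                }
            }
            launch {
                delay(10)
                throw IllegalStateException("simulated upload failure")
            }
        }
    } catch (e: IllegalStateException) {
        // coroutineScope rethrows the failure after cancelling the siblings
    }
    println("siblingCompleted=$siblingCompleted siblingCancelled=$siblingCancelled")
}
```

With `supervisorScope` (or a `SupervisorJob` in the scope's context) the first child would instead be allowed to finish, since child failures are not propagated to siblings.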
This is a good point, and I think it might actually shed light on the original issue. I see that this same possibility exists in an earlier version of this implementation, and I believe the underlying "double-unlock" issue could be a result of this.

One theory / possibility (after reading the backscroll, and this comment) is that the `forEach` is spawning multiple jobs on the `ioDispatcher` (with a thread pool), most of them blocking on the `lock()` invocation. Then, if any of the child jobs throws, the parent job cancels, and all siblings with it, and then things get a little bit tricky, imo, since the `lock()` docs say suspended callers are resumed in first-in-first-out order, and that cancellation of a suspended `lock()` call comes with a "prompt cancellation guarantee".

So, although each site's uploads should be blocked and served in FIFO manner, I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

Though there is a "prompt cancellation guarantee", does this guarantee that a waiter cancels immediately when a sibling cancels? (Since those children can be on different threads, along with the fact that we catch and re-throw, it makes me wonder.)

I have not tested this hypothesis, so please take it with a grain of salt. I just wanted to suggest it as I've skimmed the backscroll, and your comment seems like it might shed some light on the issue.
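The FIFO-plus-cancellation interaction above can be exercised with a small sketch (illustrative only; kept single-threaded so the outcome is deterministic, whereas the production `ioDispatcher` adds thread-pool timing on top of this): several children suspend on `lock()` in order, one of them fails, and a waiter that is still suspended on `lock()` is cancelled and never runs its body.

```kotlin
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex

fun main() = runBlocking {
    val mutex = Mutex()
    val ran = mutableListOf<Int>()
    try {
        coroutineScope {
            repeat(3) { i ->
                launch {
                    mutex.lock() // suspended waiters are queued in FIFO order
                    try {
                        ran.add(i)
                        delay(10) // let the other launches suspend on lock()
                        if (i == 1) error("simulated failure") // cancels the scope
                    } finally {
                        mutex.unlock()
                    }
                }
            }
        }
    } catch (e: IllegalStateException) {
        // rethrown by coroutineScope after cancelling the remaining children
    }
    // waiter 2 is cancelled while suspended in lock() and never runs its body;
    // lock()'s cancellation handler releases the lock it was just handed
    println(ran)
}
```

On a single-threaded dispatcher this prints `[0, 1]`; the comment's question is whether the same "waiter never runs" outcome is guaranteed when the children race on real thread-pool threads.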
Yeah, I was actually trying to reproduce it with some custom code (though I was running it as a unit test on the JVM and not on an actual device, so the thread implementation might be different).

Because of the `mutex`, no upload jobs will be running at the same time, though their coroutines will all have been started and be suspended at the `lock` function. In my tests, I see this definitely happens, though it doesn't seem to be the cause of the crash, because in that scenario the exception thrown by the `unlock` call would be either this or this, but not the one causing the Sentry crash, which is this one.

Since the crash does happen, I am guessing it is also possible that timing makes `unlock` be called right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in `suspendCancellableCoroutineReusable` from the Mutex.

This is a tricky crash, and I still think it could happen even using just `withLock` and not calling `unlock` on our side, though that extra `unlock` on our side definitely doesn't help, so maybe @ovitrif's changes here help solve the issue.

I feel that updating coroutines to 1.7.0 would definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.
Also, just adding more information about this part: what I mean is, I don't understand why we are starting multiple parallel jobs but suspending them all on the same mutex. In this scenario it feels to me the `mutex` (and the concurrency) is not needed, since in practice we are just running each job sequentially.

It looks like all those jobs could be called from one coroutine, with some exception handling, instead of relying on concurrent jobs and a synchronization mechanism (the `mutex` in this case).
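A sketch of that sequential alternative (hypothetical names, not the PR's API): iterate the sites in a single coroutine and isolate per-site failures with `runCatching`, so no `Mutex` is needed and one failing site doesn't stop the rest.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical helper: `uploadSite` stands in for the real per-site upload.
// Note: a production version should rethrow CancellationException rather than
// letting runCatching swallow it.
suspend fun uploadAllSites(
    sites: List<String>,
    uploadSite: suspend (String) -> Unit
): Map<String, Boolean> = sites.associateWith { site ->
    runCatching { uploadSite(site) }.isSuccess // failures are contained per site
}

fun main() = runBlocking {
    val result = uploadAllSites(listOf("a", "b", "c")) { site ->
        if (site == "b") error("network error") // simulated failure for one site
        delay(1)
    }
    println(result)
}
```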
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is exactly what I was thinking might be happening (and thanks for linking to the various sources where `unlock` is called). It is possible that the test dispatcher exhibits different behavior from the `ioDispatcher` actually used in production. If it is indeed a timing issue / race condition of some sort, this might explain why it is hard to reproduce in the test environment.

I also agree, since, as @ovitrif pointed out, it is basically doing the same thing we were doing (i.e. calling `unlock` in a finally).

I agree with this as well, especially since the jobs are blocked at the highest level, before any control flow traverses into the more granular layers of the implementation where service boundaries might offer concurrency advantages (disk vs. network performance characteristics).

Also, considering performance from a high level: even if we hypothetically implemented more granular concurrency, I doubt the advantages would be significant, since the likely bottleneck in most circumstances is the device's network bandwidth. Uploading to multiple sites simultaneously likely won't be much (if at all) faster than uploading to each one sequentially, and it incurs this complexity cost, as well as an increased risk of multiple interrupted transfers instead of just one in the case of network failure.

That said, currently it seems the purpose of the `mutex` is described in the code comment, so maybe more work needs to be done to make those `queueUploadFrom{AllSites,Site}` methods idempotent? I don't have enough context to know what problem this `mutex` solves here. Perhaps the "deduping" should be happening at the layer closer to the queue, with any needed synchronization there? @ParaskP7 👋 😄 I see that this code comment was from 43949f1, but it appears to originate a bit earlier (#10878), so I'm not sure if you have full context of the original reason for that comment. But I wonder what you think about this, i.e. do you think this synchronization / deduping could be done in the service somehow?
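One hedged sketch of what "deduping closer to the queue" could look like (the names are illustrative, not the actual service API): if enqueueing is idempotent, i.e. a post that is already queued is silently skipped, callers no longer need a `Mutex` to avoid double-queueing.

```kotlin
// Hypothetical deduping queue keyed by local post id; @Synchronized gives the
// small amount of synchronization needed at this layer instead of at the callers.
class DedupingUploadQueue {
    private val queued = LinkedHashSet<Int>() // preserves enqueue order

    @Synchronized
    fun enqueue(postId: Int): Boolean = queued.add(postId) // false if duplicate

    @Synchronized
    fun drain(): List<Int> = queued.toList().also { queued.clear() }
}

fun main() {
    val queue = DedupingUploadQueue()
    println(queue.enqueue(1))
    println(queue.enqueue(1)) // duplicate, ignored
    println(queue.enqueue(2))
    println(queue.drain())
}
```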
Thanks a lot for the input @thomashorta and @mkevins 🙇🏻, you're awesome and really helpful 🙏🏻

If I can think outside the box on this one, this is also my view: that `mutex` should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking a `mutex` to force sequential batching) is more a workaround than an actual solution, so we should not take the `mutex` as a given for solving the underlying issue it attempted to solve. Replacing it would also automatically fix the crash we are seeing in Sentry.