experiment: provisionerdaemon - investigate intermittent job wait failure #146
Investigating failures like this: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32
where it looks like the job is completed, but the completion condition is never satisfied
Runs
Issue 1: Data race in `p.acquiredJobDone` context (provisioners.d)
Failure trace: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84
There is a race on the `p.acquiredJobDone` chan. In particular, `Close` can be waiting for the channel to finish with `<-p.acquiredJobDone` while, in parallel, an `acquireJob` has been started, which creates a new channel for `p.acquiredJobDone`.
The fix I tried was to also grab the `acquiredJobMutex` in the `Close` function. At first this caused a deadlock, because there was another place where the mutexes could be grabbed in reverse order (`acquiredJobMutex`, then `closeMutex`). That other place, though, was only storing a bool in an atomic, so it didn't actually need the mutex guard.
Attempted fix here: 42ce721
Still hit a related race: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84
So tried a second fix here: 84dd68a
The second fix didn't work either, so I'm trying to switch from a chan to a wait group: a8725cd