
experiment: provisionerdaemon - investigate intermittent job wait failure #146


Closed


bryphe-coder
Contributor

@bryphe-coder bryphe-coder commented Feb 2, 2022

Investigating failures like this: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32

In these runs it looks like the job completes, but the completion condition is never satisfied.

Runs

Issue 1: Data race in p.acquiredJobDone context (provisionerd)

Failure trace: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84

There is a race on the p.acquiredJobDone chan. In particular, Close can be waiting for the channel to finish with <-p.acquiredJobDone while, in parallel, an acquireJob has started and created a new channel for p.acquiredJobDone.
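
A minimal sketch of one way to make that field access safe, assuming a stripped-down daemon type. Only the names acquiredJobDone and acquiredJobMutex come from this PR; the type, the waitForJob helper, and everything else are hypothetical and not the actual provisionerd code. The idea is that the writer and the waiter both go through the same mutex, and the waiter blocks on a snapshot of the channel rather than on the field itself.

```go
package main

import (
	"fmt"
	"sync"
)

// daemon is a hypothetical stand-in for the provisioner daemon.
type daemon struct {
	acquiredJobMutex sync.Mutex
	acquiredJobDone  chan struct{}
}

// acquireJob swaps in a fresh channel for the new job. The write happens
// under acquiredJobMutex so it cannot race with the read in waitForJob.
func (d *daemon) acquireJob() chan struct{} {
	d.acquiredJobMutex.Lock()
	defer d.acquiredJobMutex.Unlock()
	d.acquiredJobDone = make(chan struct{})
	return d.acquiredJobDone
}

// waitForJob snapshots the channel under the same mutex, releases the
// mutex, and then blocks on the snapshot. It returns immediately if no
// job has been acquired yet.
func (d *daemon) waitForJob() {
	d.acquiredJobMutex.Lock()
	done := d.acquiredJobDone
	d.acquiredJobMutex.Unlock()
	if done != nil {
		<-done
	}
}

func main() {
	d := &daemon{}
	done := d.acquireJob()
	go func() { close(done) }() // the job finishes in the background
	d.waitForJob()
	fmt.Println("job finished")
}
```

The snapshot only makes the field access safe. A job acquired after the snapshot was taken is not waited on in this sketch.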

The fix I tried was to also grab the acquiredJobMutex in the Close function. At first this caused a deadlock, because there was another place where the mutexes could be grabbed in reverse order (acquiredJobMutex, then closeMutex). That other place, though, was only storing a bool in an atomic, so it didn't actually need the mutex guard.

Attempted fix here: 42ce721

Still hit a related race: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:84

So tried a second fix here: 84dd68a

The second fix didn't work either, so now trying to switch from chan -> wait group: a8725cd

  • Run 1: ✅
  • Run 2: peer failure

@bryphe-coder bryphe-coder changed the base branch from main to provisionerdaemon February 2, 2022 22:34
@bryphe-coder bryphe-coder marked this pull request as draft February 2, 2022 22:34
@codecov

codecov bot commented Feb 2, 2022

Codecov Report

Merging #146 (a8725cd) into provisionerdaemon (03ed951) will decrease coverage by 0.02%.
The diff coverage is 75.00%.


@@                  Coverage Diff                  @@
##           provisionerdaemon     #146      +/-   ##
=====================================================
- Coverage              67.35%   67.33%   -0.03%     
=====================================================
  Files                    101      101              
  Lines                   5098     5100       +2     
  Branches                  68       68              
=====================================================
  Hits                    3434     3434              
+ Misses                  1357     1354       -3     
- Partials                 307      312       +5     
| Flag | Coverage Δ |
|---|---|
| unittest-go-macos-latest | 63.48% <70.00%> (-0.71%) ⬇️ |
| unittest-go-ubuntu-latest | 66.55% <75.00%> (+0.30%) ⬆️ |
| unittest-go-windows-latest | 63.67% <70.00%> (+0.09%) ⬆️ |
| unittest-js | 64.92% <ø> (ø) |

| Impacted Files | Coverage Δ |
|---|---|
| provisionerd/provisionerd.go | 72.38% <70.00%> (-0.25%) ⬇️ |
| provisioner/terraform/provision.go | 76.02% <100.00%> (+0.33%) ⬆️ |
| peer/conn.go | 75.19% <0.00%> (-3.62%) ⬇️ |
| peer/channel.go | 84.14% <0.00%> (-3.05%) ⬇️ |
| coderd/provisionerdaemons.go | 47.33% <0.00%> (+3.68%) ⬆️ |


@bryphe-coder bryphe-coder self-assigned this Feb 2, 2022
@bryphe-coder
Contributor Author

Distilled this out into #148 and #149

@kylecarbs kylecarbs deleted the bryphe/provisionerdaemon/history-failure branch March 23, 2022 16:25