Description
Summary: Our current postgres code that handles creating builds and sending them out to provisionerd is subtly bugged from an ACID standpoint, and results in provisionerd failing to acquire a job.
Details:
I'm hitting this error in the TestDelete
CLI test.
t.go:81: 2022-06-16 18:41:25.096 [DEBUG] <github.com/coder/coder/coderd/coderd.go:382> debugLogRequest.func1.1 POST /api/v2/workspaces/7138ac7f-4cbc-49d0-a772-aa2420a0e91d/builds
t.go:81: 2022-06-16 18:41:25.097 [DEBUG] (provisionerd-lucid_ptolemy4) <github.com/coder/coder/coderd/provisionerdaemons.go:154> (*provisionerdServer).AcquireJob locked job from database {"id": "da766fb0-6e98-4a92-892b-a0565570d3cd"}
t.go:81: 2022-06-16 18:41:25.097 [WARN] (provisionerd) <github.com/coder/coder/provisionerd/provisionerd.go:232> (*Server).acquireJob acquire job ...
"error": request job was invalidated: get workspace build: sql: no rows in result set
storj.io/drpc/drpcwire.UnmarshalError:26
storj.io/drpc/drpcstream.(*Stream).HandlePacket:198
storj.io/drpc/drpcmanager.(*Manager).manageReader:211
t.go:81: 2022-06-16 18:41:25.097 [DEBUG] <github.com/coder/coder/coderd/coderd.go:382> debugLogRequest.func1.1 GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
t.go:81: 2022-06-16 18:41:25.104 [DEBUG] <github.com/coder/coder/coderd/coderd.go:382> debugLogRequest.func1.1 GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4/logs
✘ Queued [0ms]:53:
ptytest.go:53: ⧗ Running
ptytest.go:53: ⧗ Running
t.go:81: 2022-06-16 18:41:25.357 [DEBUG] <github.com/coder/coder/coderd/coderd.go:382> debugLogRequest.func1.1 GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
delete_test.go:71:
Error Trace: delete_test.go:71
asm_arm64.s:1263
Error: Target error should be in err chain:
expected: "EOF"
in chain: "get workspace build: sql: no rows in result set"
Test: TestDelete/DifferentUser
✘ Running [0ms]53:
ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✘ Queued [0ms]\r\n⧗ \x1b[;mRunning\x1b[0m \r\n⧗ \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✘ Running [0ms]\r\n"
ptytest.go:97:
Error Trace: ptytest.go:97
delete_test.go:75
Error: Received unexpected error:
EOF
Test: TestDelete/DifferentUser
t.go:81: 2022-06-16 18:41:35.098 [DEBUG] (provisionerd) <github.com/coder/coder/provisionerd/provisionerd.go:1073> (*Server).closeWithError closing server with error {"error": null}
--- FAIL: TestDelete/DifferentUser (10.27s)
Coderd tries to acquire a job and fails with request job was invalidated: get workspace build: sql: no rows in result set
that is to say, that it selected and locked a job in the database, then tried to look up the corresponding workspace build, and the build wasn't there.
How can this be?
When we create the build, we insert the job and build in a single transaction, so you wouldn't expect some other query to be able to find the job but then not the build.
After reading https://www.postgresql.org/docs/current/transaction-iso.html very carefully, I think this has to do with our use of SELECT FOR UPDATE
in the locking query.
it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database. This behavior makes Read Committed mode unsuitable for commands that involve complex search conditions
It seems that the locking query, run in Read Committed mode (default), is able to see the effect of the transaction adding the job prior to that transaction being committed. We may need to upgrade one or both of these transactions to a higher level of isolation.