database race on create build / acquire job

**Summary**:  Our current postgres code that handles creating builds and sending them out to provisionerd is subtly bugged from an ACID standpoint, and results in provisionerd failing to acquire a job.

**Details**:

I'm hitting this error in the `TestDelete` CLI test.

```
    t.go:81: 2022-06-16 18:41:25.096 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	POST /api/v2/workspaces/7138ac7f-4cbc-49d0-a772-aa2420a0e91d/builds
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	(provisionerd-lucid_ptolemy4)	<github.com/coder/coder/coderd/provisionerdaemons.go:154>	(*provisionerdServer).AcquireJob	locked job from database	{"id": "da766fb0-6e98-4a92-892b-a0565570d3cd"}
    t.go:81: 2022-06-16 18:41:25.097 [WARN]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:232>	(*Server).acquireJob	acquire job ...
        "error": request job was invalidated: get workspace build: sql: no rows in result set
                 	storj.io/drpc/drpcwire.UnmarshalError:26
                 	storj.io/drpc/drpcstream.(*Stream).HandlePacket:198
                 	storj.io/drpc/drpcmanager.(*Manager).manageReader:211
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    t.go:81: 2022-06-16 18:41:25.104 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4/logs
✘ Queued [0ms]:53: 
    ptytest.go:53: ⧗  Running 
    ptytest.go:53: ⧗  Running 
    t.go:81: 2022-06-16 18:41:25.357 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    delete_test.go:71: 
        	Error Trace:	delete_test.go:71
        	            				asm_arm64.s:1263
        	Error:      	Target error should be in err chain:
        	            	expected: "EOF"
        	            	in chain: "get workspace build: sql: no rows in result set"
        	Test:       	TestDelete/DifferentUser
✘ Running [0ms]53: 
    ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✘ Queued [0ms]\r\n⧗  \x1b[;mRunning\x1b[0m \r\n⧗  \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✘ Running [0ms]\r\n"
    ptytest.go:97: 
        	Error Trace:	ptytest.go:97
        	            				delete_test.go:75
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestDelete/DifferentUser
    t.go:81: 2022-06-16 18:41:35.098 [DEBUG]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:1073>	(*Server).closeWithError	closing server with error	{"error": null}
    --- FAIL: TestDelete/DifferentUser (10.27s)
```

Coderd tries to acquire a job and fails with `request job was invalidated: get workspace build: sql: no rows in result set` that is to say, that it selected and locked a job in the database, then tried to look up the corresponding workspace build, and the build wasn't there.

How can this be?

When we create the build, we insert the job and build in a single transaction, so you wouldn't expect some other query to be able to find the job but then not the build.

After reading https://www.postgresql.org/docs/current/transaction-iso.html very carefully, I think this has to do with our use of `SELECT FOR UPDATE` in the locking query.

> it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database. This behavior makes Read Committed mode unsuitable for commands that involve complex search conditions

It seems that the locking query, run in Read Committed mode (default), is able to see the effect of the transaction adding the job prior to that transaction being committed.  We may need to upgrade one or both of these transactions to a higher level of isolation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

database race on create build / acquire job #2436

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

database race on create build / acquire job #2436

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions