Skip to content

database race on create build / acquire job #2436

Closed
@spikecurtis

Description

@spikecurtis

Summary: Our current postgres code that handles creating builds and sending them out to provisionerd is subtly bugged from an ACID standpoint, and results in provisionerd failing to acquire a job.

Details:

I'm hitting this error in the TestDelete CLI test.

    t.go:81: 2022-06-16 18:41:25.096 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	POST /api/v2/workspaces/7138ac7f-4cbc-49d0-a772-aa2420a0e91d/builds
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	(provisionerd-lucid_ptolemy4)	<github.com/coder/coder/coderd/provisionerdaemons.go:154>	(*provisionerdServer).AcquireJob	locked job from database	{"id": "da766fb0-6e98-4a92-892b-a0565570d3cd"}
    t.go:81: 2022-06-16 18:41:25.097 [WARN]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:232>	(*Server).acquireJob	acquire job ...
        "error": request job was invalidated: get workspace build: sql: no rows in result set
                 	storj.io/drpc/drpcwire.UnmarshalError:26
                 	storj.io/drpc/drpcstream.(*Stream).HandlePacket:198
                 	storj.io/drpc/drpcmanager.(*Manager).manageReader:211
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    t.go:81: 2022-06-16 18:41:25.104 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4/logs
✘ Queued [0ms]:53: 
    ptytest.go:53: ⧗  Running 
    ptytest.go:53: ⧗  Running 
    t.go:81: 2022-06-16 18:41:25.357 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    delete_test.go:71: 
        	Error Trace:	delete_test.go:71
        	            				asm_arm64.s:1263
        	Error:      	Target error should be in err chain:
        	            	expected: "EOF"
        	            	in chain: "get workspace build: sql: no rows in result set"
        	Test:       	TestDelete/DifferentUser
✘ Running [0ms]53: 
    ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✘ Queued [0ms]\r\n⧗  \x1b[;mRunning\x1b[0m \r\n⧗  \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✘ Running [0ms]\r\n"
    ptytest.go:97: 
        	Error Trace:	ptytest.go:97
        	            				delete_test.go:75
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestDelete/DifferentUser
    t.go:81: 2022-06-16 18:41:35.098 [DEBUG]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:1073>	(*Server).closeWithError	closing server with error	{"error": null}
    --- FAIL: TestDelete/DifferentUser (10.27s)

Coderd tries to acquire a job and fails with request job was invalidated: get workspace build: sql: no rows in result set that is to say, that it selected and locked a job in the database, then tried to look up the corresponding workspace build, and the build wasn't there.

How can this be?

When we create the build, we insert the job and build in a single transaction, so you wouldn't expect some other query to be able to find the job but then not the build.

After reading https://www.postgresql.org/docs/current/transaction-iso.html very carefully, I think this has to do with our use of SELECT FOR UPDATE in the locking query.

it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database. This behavior makes Read Committed mode unsuitable for commands that involve complex search conditions

It seems that the locking query, run in Read Committed mode (default), is able to see the effect of the transaction adding the job prior to that transaction being committed. We may need to upgrade one or both of these transactions to a higher level of isolation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    apiArea: HTTP API

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions