replicasync: deleting temporarily unhealthy replica breaks node even after it recovers #8858

Closed
@spikecurtis

Description

With multiple Coderd replicas, each Coderd periodically runs a query that deletes replicas that haven't updated in 3x the configured interval (3 x 5s = 15s by default). If a replica becomes temporarily unhealthy or disconnected from the DB, the other Coderds delete its row. When it reconnects, it attempts a SQL UPDATE on its own row, which fails because the row no longer exists. It then keeps failing this way in a loop forever.
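The sequence above can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses an in-memory SQLite database in place of the real one, and the `replicas` schema, column names, and function names are hypothetical, not the actual replicasync queries. Here the failed heartbeat shows up as an UPDATE that touches zero rows; the real query may surface the failure differently.

```python
import sqlite3

UPDATE_INTERVAL_S = 5                   # default interval from the description
STALE_AFTER_S = 3 * UPDATE_INTERVAL_S   # 3 x 5s = 15s


def setup():
    # Hypothetical, simplified schema: one row per replica.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE replicas (id TEXT PRIMARY KEY, updated_at REAL NOT NULL)"
    )
    return db


def heartbeat(db, replica_id, now):
    """Plain UPDATE heartbeat; returns the number of rows it touched."""
    cur = db.execute(
        "UPDATE replicas SET updated_at = ? WHERE id = ?", (now, replica_id)
    )
    return cur.rowcount


def delete_stale(db, now):
    """Cleanup that every replica runs against rows that stopped updating."""
    cur = db.execute(
        "DELETE FROM replicas WHERE updated_at < ?", (now - STALE_AFTER_S,)
    )
    return cur.rowcount


def simulate():
    db = setup()
    db.execute("INSERT INTO replicas VALUES ('a', 0.0), ('b', 0.0)")
    # Replica 'b' is disconnected for 20s while 'a' keeps heartbeating.
    heartbeat(db, "a", 20.0)
    deleted = delete_stale(db, 20.0)    # 'a' removes the stale row for 'b'
    # 'b' reconnects and tries its usual UPDATE, but its row is gone.
    updated = heartbeat(db, "b", 21.0)
    return deleted, updated
```

In this sketch `simulate()` reports one deleted row and zero updated rows, and every later `heartbeat` call for `'b'` keeps touching zero rows, because nothing ever re-creates the row.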

I have personally observed this in a test environment.

We should change this UPDATE query into an "upsert", e.g. INSERT INTO ... ON CONFLICT ... DO UPDATE, so that a replica whose row was deleted re-inserts itself on its next update.
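A minimal sketch of what such an upsert could look like, again assuming an in-memory SQLite database as a stand-in (its ON CONFLICT ... DO UPDATE clause needs SQLite 3.24 or newer) and a hypothetical two-column `replicas` table. The real query would carry more columns and run against the production database.

```python
import sqlite3

# Hypothetical upsert heartbeat: insert the row if it is missing,
# otherwise refresh updated_at on the existing row.
UPSERT_SQL = """
INSERT INTO replicas (id, updated_at) VALUES (?, ?)
ON CONFLICT (id) DO UPDATE SET updated_at = excluded.updated_at
"""


def upsert_heartbeat(db, replica_id, now):
    db.execute(UPSERT_SQL, (replica_id, now))


def demo():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE replicas (id TEXT PRIMARY KEY, updated_at REAL NOT NULL)"
    )
    upsert_heartbeat(db, "b", 0.0)    # row missing: behaves like INSERT
    upsert_heartbeat(db, "b", 5.0)    # row present: behaves like UPDATE
    # Another replica deletes the row while 'b' is unreachable.
    db.execute("DELETE FROM replicas WHERE id = 'b'")
    upsert_heartbeat(db, "b", 21.0)   # row missing again: re-inserted
    return db.execute("SELECT id, updated_at FROM replicas").fetchall()
```

With this shape the heartbeat no longer depends on the row surviving, so a replica that was cleaned up while unreachable recovers on its next heartbeat without a restart.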

A workaround is to restart the Coderd so that it re-inserts itself into the table.

Finally, the log message describing this problem is a "WARNING", but it should definitely be an "ERROR": if we can't update the replicas table, it's a big problem.

Metadata

Labels

s1: Bugs that break core workflows. Only humans may set this.
