replicasync: deleting temporarily unhealthy replica breaks node even after it recovers

With multiple Coderd replicas, each Coderd periodically runs a query that deletes replicas that haven't updated in 3x the configured interval (3 x 5s = 15s by default). If a replica becomes temporarily unhealthy or disconnected from the DB, the other Coderds delete its row. When it reconnects, it attempts a SQL `UPDATE` on its own row, which fails because the row no longer exists. It then keeps retrying and failing the same way indefinitely.

I have personally observed this in a test environment.

We should modify this `UPDATE` query to be an "upsert", e.g. `INSERT INTO ... ON CONFLICT ... DO UPDATE`, so that a replica whose row was deleted re-creates it on its next update.

A workaround is to restart the Coderd so that it re-inserts itself into the table.

Finally, the log describing this problem is a "WARNING", but it should definitely be an "ERROR": if we can't update the replicas table, it's a big problem.

replicasync: deleting temporarily unhealthy replica breaks node even after it recovers #8858
