Description
With multiple Coderd replicas, each Coderd periodically runs a query that deletes replicas that haven't updated in 3x the configured interval (3 x 5s = 15s by default). If a replica becomes temporarily unhealthy or disconnected from the DB, other Coderds will delete its row. Then, when it reconnects, it attempts a SQL UPDATE on its row, which fails because the row has been deleted. It keeps failing like this in a loop forever.
I have personally observed this in a test environment.
We should modify this UPDATE query to be an "upsert", e.g. INSERT INTO ... ON CONFLICT ... DO UPDATE
A workaround is to restart the Coderd so that it re-inserts itself into the table.
Finally, the log describing this problem is a "WARNING", but it should definitely be an "ERROR". If we can't update the replicas table, it's a big problem.