Description
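When Coder is deployed in HA mode, multiple servers compete to perform the same lifecycle execution tasks. As a result, we occasionally get false positives: a server picks up a workspace as eligible for a transition that another server has already performed, then skips it with an error.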
2024-12-08 07:43:46.918 [debu] autobuild: run stats elapsed=0s transitions={}
2024-12-08 07:43:46.917 [debu] autobuild: skipping workspace workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8 workspace_name=<name> ...
error= last transition not valid for autostart or autostop:
github.com/coder/coder/v2/coderd/autobuild.getNextTransition
/home/runner/work/coder/coder/coderd/autobuild/lifecycle_executor.go:450
2024-12-08 07:43:36.184 [debu] autobuild: run stats elapsed=0s transitions={}
2024-12-08 07:43:36.183 [debu] autobuild: skipping workspace workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8 workspace_name=<name> ...
error= last transition not valid for autostart or autostop:
github.com/coder/coder/v2/coderd/autobuild.getNextTransition
/home/runner/work/coder/coder/coderd/autobuild/lifecycle_executor.go:450
2024-12-08 07:43:29.880 [debu] autobuild: run stats elapsed=0s transitions={"7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8":"stop"}
2024-12-08 07:43:29.877 [info] autobuild: scheduling workspace transition workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8 workspace_name=<name> transition=stop reason=autostop
2024-12-08 07:43:29.834 [debu] autobuild: auto building workspace workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8 workspace_name=<name> transition=stop
As the example logs show (newest entries first), lifecycle executions happen at 07:43:29, 07:43:36 and 07:43:46. Each of these represents a different Coder server performing the lifecycle execution. The first one to run successfully transitions the workspace to a stopped state. However, the workspace is still returned as eligible by our GetWorkspacesEligibleForTransition query when the other servers run it, so they hit the "last transition not valid for autostart or autostop" error and skip the workspace.
This could likely be solved by taking a lock so that only one server at a time performs lifecycle execution.
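A minimal sketch of that idea, not the actual Coder implementation. The names `executorLock`, `runExclusive` and `simulateTick` are hypothetical, and the lock here is an in-process `sync.Mutex` standing in for whatever shared lock the servers would really use (for example a Postgres advisory lock taken with `pg_try_advisory_lock`, which is an assumption about the eventual design). Each replica tries the lock without blocking and skips its run if another replica already holds it.

```go
package main

import (
	"fmt"
	"sync"
)

// executorLock is a hypothetical stand-in for a lock shared by all servers.
// In an HA deployment it would have to live outside the process
// (assumption: something like a database advisory lock).
type executorLock struct {
	mu sync.Mutex
}

// runExclusive runs fn only if no other replica currently holds the lock.
// It never blocks: a replica that loses the race skips this run.
func (l *executorLock) runExclusive(fn func()) bool {
	if !l.mu.TryLock() {
		return false
	}
	defer l.mu.Unlock()
	fn()
	return true
}

// simulateTick models several replicas attempting lifecycle execution while
// the first replica is still working. It reports how many replicas ran and
// how many skipped.
func simulateTick(replicas int) (ran, skipped int) {
	var lock executorLock
	held := make(chan struct{})
	release := make(chan struct{})
	results := make(chan bool, replicas)

	// Replica 0 wins the lock and holds it while it "transitions workspaces".
	go func() {
		results <- lock.runExclusive(func() {
			close(held)
			<-release
		})
	}()
	<-held

	// The remaining replicas attempt the run while the lock is held.
	var wg sync.WaitGroup
	for i := 1; i < replicas; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- lock.runExclusive(func() {})
		}()
	}
	wg.Wait()
	close(release)

	for i := 0; i < replicas; i++ {
		if <-results {
			ran++
		} else {
			skipped++
		}
	}
	return ran, skipped
}

func main() {
	ran, skipped := simulateTick(3)
	fmt.Printf("ran=%d skipped=%d\n", ran, skipped)
}
```

Note that a lock of this kind only prevents overlapping runs. The logs above show the servers running several seconds apart, so the eligibility query would probably also need to account for the build the first server has already scheduled.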