Lifecycle executor has a false-positive race in HA #15786

Open
@DanielleMaywood

When Coder is deployed in HA mode, multiple servers compete to perform the same lifecycle execution tasks. As a result, we occasionally get false positives: a server tries to act on a workspace that another server has already transitioned.

2024-12-08 07:43:46.918 [debu]  autobuild: run stats  elapsed=0s  transitions={}
2024-12-08 07:43:46.917 [debu]  autobuild: skipping workspace  workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8  workspace_name=<name> ...
    error= last transition not valid for autostart or autostop:
               github.com/coder/coder/v2/coderd/autobuild.getNextTransition
                   /home/runner/work/coder/coder/coderd/autobuild/lifecycle_executor.go:450
2024-12-08 07:43:36.184 [debu]  autobuild: run stats  elapsed=0s  transitions={}
2024-12-08 07:43:36.183 [debu]  autobuild: skipping workspace  workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8  workspace_name=<name> ...
    error= last transition not valid for autostart or autostop:
               github.com/coder/coder/v2/coderd/autobuild.getNextTransition
                   /home/runner/work/coder/coder/coderd/autobuild/lifecycle_executor.go:450
2024-12-08 07:43:29.880 [debu]  autobuild: run stats  elapsed=0s  transitions={"7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8":"stop"}
2024-12-08 07:43:29.877 [info]  autobuild: scheduling workspace transition  workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8  workspace_name=<name>  transition=stop  reason=autostop
2024-12-08 07:43:29.834 [debu]  autobuild: auto building workspace  workspace_id=7b1a6ba7-4f55-4bca-bfa4-b5b06e1a18a8  workspace_name=<name>  transition=stop

As the example logs show (newest entry first), lifecycle executions happen at 07:43:29, 07:43:36 and 07:43:46. Each of these represents a different Coder server performing the lifecycle execution. The first server to run successfully transitions the workspace to a stopped state. However, when the other servers make their queries, the workspace is still returned as eligible by our GetWorkspacesEligibleForTransition query, so they log a "last transition not valid for autostart or autostop" error and skip it.

This could likely be solved by having a lock to only allow one server at a time to perform lifecycle execution.
