Skip to content

chore: optimize GetPrebuiltWorkspaces query #18717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Jul 9, 2025

Conversation

johnstcn
Copy link
Member

@johnstcn johnstcn commented Jul 2, 2025

Hopefully fixes coder/internal#715 this time.

Second attempt at this.
Previous attempt incorrectly returned all rows for which there existed a prebuild that had previously had a successful start transition at some point: #18588

Explain before (71.1ms): https://explain.dalibo.com/plan/9be18ab833b7a000
Explain after (9.8ms): https://explain.dalibo.com/plan/b4a94742gaha229g (EDIT: not selecting from correct CTE)
Explain after (11.2ms): https://explain.dalibo.com/plan/bea42b563ff7fbe7

Manually verified against dogfood db:

$ psql '<db url>' -f a.sql > a.txt
$ psql '<db url>' -f b.sql > b.txt
$ diff a.txt b.txt
28d27
<  18aa3c11-c1eb-4fc6-ae3b-8cbd01f81f21 | prebuild-hiyghekkb3gv2vq | 99064381-8750-407f-9d30-91f38b7911cc | 5d91f37e-5872-4821-b6ed-90574082628e | bdc27adb-7a8e-4a6f-9922-5a0f4a3885da | f     | 2025-07-02 13:34:58.537656+00
32a32
>  18aa3c11-c1eb-4fc6-ae3b-8cbd01f81f21 | prebuild-hiyghekkb3gv2vq | 99064381-8750-407f-9d30-91f38b7911cc | 5d91f37e-5872-4821-b6ed-90574082628e | bdc27adb-7a8e-4a6f-9922-5a0f4a3885da | f     | 2025-07-02 13:34:58.537656+00
$ diff <(sort a.txt) <(sort b.txt)
$ 

Diff is only due to unstable row order, also added additional testing in coderd/database/querier_test.go to validate changes.

If we want to be super careful about this, I can instead break out the updated query into a new function and diff the old versus the new on each reconcile call.
EDIT: decided to go ahead with this for safety.

@johnstcn johnstcn self-assigned this Jul 2, 2025
@johnstcn johnstcn marked this pull request as ready for review July 2, 2025 14:20
@@ -60,7 +111,9 @@ SELECT
FROM workspace_prebuilds p
INNER JOIN workspace_latest_builds b ON b.workspace_id = p.id
WHERE (b.transition = 'start'::workspace_transition
AND b.job_status = 'succeeded'::provisioner_job_status);
AND b.job_status = 'succeeded'::provisioner_job_status)
ORDER BY p.id;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: adding stable ordering for diffing

@johnstcn johnstcn force-pushed the cj/chore/GetRunningPrebuiltWorkspace-Optimize-2 branch from da75520 to 53c3ba5 Compare July 3, 2025 13:08
@johnstcn johnstcn requested a review from Copilot July 3, 2025 16:06
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces and validates an optimized version of the GetRunningPrebuiltWorkspaces query, compares its output against the original for correctness, and wires it through the reconciler, mocks, metrics, and DB layers.

  • Add a new optimized SQL query and corresponding Go method (GetRunningPrebuiltWorkspacesOptimized)
  • Integrate a comparator in the reconciler to log differences between original and optimized results
  • Expand tests and mocks to cover the new optimized method

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
enterprise/coderd/prebuilds/reconcile_test.go Added antagonists setup and CompareGetRunningPrebuiltWorkspacesResults tests
enterprise/coderd/prebuilds/reconcile.go Invoke optimized query, compare results, and log diffs
coderd/database/queries/prebuilds.sql Defined GetRunningPrebuiltWorkspacesOptimized CTE and ordered original query
coderd/database/queries.sql.go Added constant, row type, and method for optimized query
coderd/database/querier.go Extended interface with optimized method
coderd/database/dbmock/dbmock.go Mock recorder for optimized method
coderd/database/dbmetrics/querymetrics.go Metrics wrapper for optimized method
coderd/database/dbmem/dbmem.go Stub (panicking) implementation for optimized method
coderd/database/dbauthz/setup_test.go Skipped optimized method in authz tests
coderd/database/dbauthz/dbauthz_test.go Excluded optimized method from recursive authz test
coderd/database/dbauthz/dbauthz.go Added authorization wrapper for optimized method
Comments suppressed due to low confidence (1)

coderd/database/querier_test.go:5025

  • Consider adding a parallel test for 'GetRunningPrebuiltWorkspacesOptimized' to verify the optimized SQL returns the same results as the original under PostgreSQL.
// TestGetRunningPrebuiltWorkspaces ensures the correct behavior of the

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly alright, mainly just questioning the purpose/point of the workspace_latest_presets CTE.

WHERE workspace_builds.transition = 'start'::workspace_transition
AND workspace_builds.template_version_preset_id IS NOT NULL
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't quite grasp the purpose of this CTE, it seems to pretty much do the same as latest_prebuilds? It's selecting distinct on workspace id from latest_prebuilds and joining workspace_builds on workspace id (we already have this data in the previous CTE, no?), and ordering by build number DESC, combined with distinct on that's == selecting directly from workspace_latest_builds, right? If we can't select the template_version_preset_id directly from the workspace_latest_builds then it'd still be more performant to join on build ID vs workspace ID and avoid t he distinct/order/where transition clause.

Might make more sense to just join workspace_builds in latest_prebuilds instead to get template_version_preset_id if required and remove this CTE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I omitted this CTE it caused different results for one specific prebuild instance, but I'm now not seeing the same discrepancy when I run the queries side-by-side. I'll try the modifications you suggest and see if it still yields the same results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at that CTE, the more I question it too. On the dogfood database, all of the workspace builds owned by the prebuild user have a non-null template_version_preset_id, whereas almost all subsequent ones have a null preset ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, whenever a template version is added, new presets are inserted into the database. I don't think we ever delete presets from the database. Except maybe when a template is deleted?
A prebuild should always correlate to a non-null preset. Could this be an issue when the active template version changes? 🤔

Copy link
Member Author

@johnstcn johnstcn Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After spending some more time looking at this, it appears that the current correct approach is to query the most recent non-null preset ID for a successful start transition. Simply selecting from workspace_latest_builds may not result in the correct preset_id. I discussed with Sas and a future change may result in preset_id not being a thing any more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID. We could optimise this by dropping the workspace_latest_presets CTE from the query and instead rely on the preset ID for build 0 in all cases.

This would probably be much more performant.

The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset. In the optimized case such a restarted prebuilt workspace would be claimed with the wrong preset.

If we're willing to tolerate this edge case, we can optimize by dropping the CTE. Otherwise, the CTE remains necessary in order to find the last set preset id for a prebuilt workspace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave the CTE for now. There is an argument to be made that once a human touches a prebuild that it's no longer valid to be claimed and as such only prebuilds with build_number = 1 should be claimable.

But what do we say to the God of Arguments? Not today.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID.

Why do we nullify the preset_id in subsequent builds? Conceptually, a prebuilt workspace is always associated with a preset, and clearing the field might imply the association no longer exists, which isn't quite accurate. Wouldn’t it make more sense to retain the preset_id to reflect this persistent relationship?

The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset.

How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset? If we consider the association to the preset immutable once the prebuilt workspace is created, this edge case shouldn’t occur, right? Or is there a specific workflow that allows this to happen?

Either way, this is not a blocker for the PR, we can definitely move forward and revisit this discussion later if needed.

Copy link
Contributor

@SasSwart SasSwart Jul 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullify the preset_id in subsequent builds

We don't explicitly nullify them. The fact that they are null on subsequent build is rather because there's nothing to tell a subsequent build what the preset_id should be. We could transfer the preset id from one build to the next, but that would roughly look like running the CTE that we're discussing here on every new build. Presets are set initially and when they change by specification of the user. For normal workspaces this user is a human being who chooses a preset. For prebuilt workspaces this user is the prebuilds system user and they specify the preset in the reconciliation loop.

How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset?

Any workspace, prebuilt or not can be restarted with new parameters as long as those parameters are mutable. The same goes for presets. It is only possible via API right now. Neither the UI nor the CLI support changing a preset.

This means it is theoretically possible for a user with sufficient privileges to send an API call to restart with a new preset, but it is definitely not supported.

// Log the error but continue with original results
c.logger.Error(ctx, "optimized GetRunningPrebuiltWorkspacesOptimized failed", slog.Error(err))
} else {
CompareGetRunningPrebuiltWorkspacesResults(ctx, c.logger, allRunningPrebuilds, optimized)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fun way to test it out 😄

WHERE workspace_builds.transition = 'start'::workspace_transition
AND workspace_builds.template_version_preset_id IS NOT NULL
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, whenever a template version is added, new presets are inserted into the database. I don't think we ever delete presets from the database. Except maybe when a template is deleted?
A prebuild should always correlate to a non-null preset. Could this be an issue when the active template version changes? 🤔

Copy link
Contributor

@SasSwart SasSwart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're going to have to remove the dbmem bit (🥳), but otherwise this looks good. See my comment about potentially further optimising your query at the cost of some robustness. That's non-blocking though. Since we're just logging the diffs for now.

Copy link
Contributor

@ssncferreira ssncferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's try it 🤞

Comment on lines +5102 to +5105
_ = setupFixture(t, db, "stopped-prebuild", false, database.WorkspaceTransitionStop, database.ProvisionerJobStatusSucceeded)
_ = setupFixture(t, db, "failed-prebuild", false, database.WorkspaceTransitionStart, database.ProvisionerJobStatusFailed)
_ = setupFixture(t, db, "canceled-prebuild", false, database.WorkspaceTransitionStart, database.ProvisionerJobStatusCanceled)
_ = setupFixture(t, db, "deleted-prebuild", true, database.WorkspaceTransitionStart, database.ProvisionerJobStatusSucceeded)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 🌟

WHERE workspace_builds.transition = 'start'::workspace_transition
AND workspace_builds.template_version_preset_id IS NOT NULL
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID.

Why do we nullify the preset_id in subsequent builds? Conceptually, a prebuilt workspace is always associated with a preset, and clearing the field might imply the association no longer exists, which isn't quite accurate. Wouldn’t it make more sense to retain the preset_id to reflect this persistent relationship?

The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset.

How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset? If we consider the association to the preset immutable once the prebuilt workspace is created, this edge case shouldn’t occur, right? Or is there a specific workflow that allows this to happen?

Either way, this is not a blocker for the PR, we can definitely move forward and revisit this discussion later if needed.

@johnstcn johnstcn merged commit 0367dba into main Jul 9, 2025
30 checks passed
@johnstcn johnstcn deleted the cj/chore/GetRunningPrebuiltWorkspace-Optimize-2 branch July 9, 2025 10:30
@github-actions github-actions bot locked and limited conversation to collaborators Jul 9, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: GetRunningPrebuiltWorkspaces creates lots of DB load
4 participants