-
Notifications
You must be signed in to change notification settings - Fork 937
chore: optimize GetPrebuiltWorkspaces query #18717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: optimize GetPrebuiltWorkspaces query #18717
Conversation
@@ -60,7 +111,9 @@ SELECT | |||
FROM workspace_prebuilds p | |||
INNER JOIN workspace_latest_builds b ON b.workspace_id = p.id | |||
WHERE (b.transition = 'start'::workspace_transition | |||
AND b.job_status = 'succeeded'::provisioner_job_status); | |||
AND b.job_status = 'succeeded'::provisioner_job_status) | |||
ORDER BY p.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
review: adding stable ordering for diffing
da75520
to
53c3ba5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces and validates an optimized version of the GetRunningPrebuiltWorkspaces
query, compares its output against the original for correctness, and wires it through the reconciler, mocks, metrics, and DB layers.
- Add a new optimized SQL query and corresponding Go method (
GetRunningPrebuiltWorkspacesOptimized
) - Integrate a comparator in the reconciler to log differences between original and optimized results
- Expand tests and mocks to cover the new optimized method
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
enterprise/coderd/prebuilds/reconcile_test.go | Added antagonists setup and CompareGetRunningPrebuiltWorkspacesResults tests |
enterprise/coderd/prebuilds/reconcile.go | Invoke optimized query, compare results, and log diffs |
coderd/database/queries/prebuilds.sql | Defined GetRunningPrebuiltWorkspacesOptimized CTE and ordered original query |
coderd/database/queries.sql.go | Added constant, row type, and method for optimized query |
coderd/database/querier.go | Extended interface with optimized method |
coderd/database/dbmock/dbmock.go | Mock recorder for optimized method |
coderd/database/dbmetrics/querymetrics.go | Metrics wrapper for optimized method |
coderd/database/dbmem/dbmem.go | Stub (panicking) implementation for optimized method |
coderd/database/dbauthz/setup_test.go | Skipped optimized method in authz tests |
coderd/database/dbauthz/dbauthz_test.go | Excluded optimized method from recursive authz test |
coderd/database/dbauthz/dbauthz.go | Added authorization wrapper for optimized method |
Comments suppressed due to low confidence (1)
coderd/database/querier_test.go:5025
- Consider adding a parallel test for 'GetRunningPrebuiltWorkspacesOptimized' to verify the optimized SQL returns the same results as the original under PostgreSQL.
// TestGetRunningPrebuiltWorkspaces ensures the correct behavior of the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly alright, mainly just questioning the purpose/point of the workspace_latest_presets
CTE.
WHERE workspace_builds.transition = 'start'::workspace_transition | ||
AND workspace_builds.template_version_preset_id IS NOT NULL | ||
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't quite grasp the purpose of this CTE, it seems to pretty much do the same as latest_prebuilds
? It's selecting distinct on workspace id from latest_prebuilds
and joining workspace_builds
on workspace id (we already have this data in the previous CTE, no?), and ordering by build number DESC, combined with distinct on that's == selecting directly from workspace_latest_builds
, right? If we can't select the template_version_preset_id
directly from the workspace_latest_builds
then it'd still be more performant to join on build ID vs workspace ID and avoid t he distinct/order/where transition clause.
Might make more sense to just join workspace_builds
in latest_prebuilds
instead to get template_version_preset_id
if required and remove this CTE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I omitted this CTE it caused different results for one specific prebuild instance, but I'm now not seeing the same discrepancy when I run the queries side-by-side. I'll try the modifications you suggest and see if it still yields the same results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at that CTE, the more I question it too. On the dogfood database, all of the workspace builds owned by the prebuild user have a non-null template_version_preset_id, whereas almost all subsequent ones have a null preset ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, whenever a template version is added, new presets are inserted into the database. I don't think we ever delete presets from the database. Except maybe when a template is deleted?
A prebuild should always correlate to a non-null preset. Could this be an issue when the active template version changes? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After spending some more time looking at this, it appears that the current correct approach is to query the most recent non-null preset ID for a successful start transition. Simply selecting from workspace_latest_builds
may not result in the correct preset_id
. I discussed with Sas and a future change may result in preset_id
not being a thing any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID. We could optimise this by dropping the workspace_latest_presets
CTE from the query and instead rely on the preset ID for build 0 in all cases.
This would probably be much more performant.
The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset. In the optimized case such a restarted prebuilt workspace would be claimed with the wrong preset.
If we're willing to tolerate this edge case, we can optimize by dropping the CTE. Otherwise, the CTE remains necessary in order to find the last set preset id for a prebuilt workspace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave the CTE for now. There is an argument to be made that once a human touches a prebuild that it's no longer valid to be claimed and as such only prebuilds with build_number = 1
should be claimable.
But what do we say to the God of Arguments? Not today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID.
Why do we nullify the preset_id
in subsequent builds? Conceptually, a prebuilt workspace is always associated with a preset, and clearing the field might imply the association no longer exists, which isn't quite accurate. Wouldn’t it make more sense to retain the preset_id
to reflect this persistent relationship?
The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset.
How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset? If we consider the association to the preset immutable once the prebuilt workspace is created, this edge case shouldn’t occur, right? Or is there a specific workflow that allows this to happen?
Either way, this is not a blocker for the PR, we can definitely move forward and revisit this discussion later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullify the preset_id in subsequent builds
We don't explicitly nullify them. The fact that they are null on subsequent build is rather because there's nothing to tell a subsequent build what the preset_id
should be. We could transfer the preset id from one build to the next, but that would roughly look like running the CTE that we're discussing here on every new build. Presets are set initially and when they change by specification of the user. For normal workspaces this user is a human being who chooses a preset. For prebuilt workspaces this user is the prebuilds system user and they specify the preset in the reconciliation loop.
How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset?
Any workspace, prebuilt or not can be restarted with new parameters as long as those parameters are mutable. The same goes for presets. It is only possible via API right now. Neither the UI nor the CLI support changing a preset.
This means it is theoretically possible for a user with sufficient privileges to send an API call to restart with a new preset, but it is definitely not supported.
// Log the error but continue with original results | ||
c.logger.Error(ctx, "optimized GetRunningPrebuiltWorkspacesOptimized failed", slog.Error(err)) | ||
} else { | ||
CompareGetRunningPrebuiltWorkspacesResults(ctx, c.logger, allRunningPrebuilds, optimized) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fun way to test it out 😄
WHERE workspace_builds.transition = 'start'::workspace_transition | ||
AND workspace_builds.template_version_preset_id IS NOT NULL | ||
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, whenever a template version is added, new presets are inserted into the database. I don't think we ever delete presets from the database. Except maybe when a template is deleted?
A prebuild should always correlate to a non-null preset. Could this be an issue when the active template version changes? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're going to have to remove the dbmem bit (🥳), but otherwise this looks good. See my comment about potentially further optimising your query at the cost of some robustness. That's non-blocking though. Since we're just logging the diffs for now.
…ebuiltWorkspace-Optimize-2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, let's try it 🤞
_ = setupFixture(t, db, "stopped-prebuild", false, database.WorkspaceTransitionStop, database.ProvisionerJobStatusSucceeded) | ||
_ = setupFixture(t, db, "failed-prebuild", false, database.WorkspaceTransitionStart, database.ProvisionerJobStatusFailed) | ||
_ = setupFixture(t, db, "canceled-prebuild", false, database.WorkspaceTransitionStart, database.ProvisionerJobStatusCanceled) | ||
_ = setupFixture(t, db, "deleted-prebuild", true, database.WorkspaceTransitionStart, database.ProvisionerJobStatusSucceeded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 🌟
WHERE workspace_builds.transition = 'start'::workspace_transition | ||
AND workspace_builds.template_version_preset_id IS NOT NULL | ||
ORDER BY latest_prebuilds.id, workspace_builds.build_number DESC | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prebuilt workspaces should ideally never change their preset. This means that the first build should contain a preset id and all subsequent builds should have a null preset ID.
Why do we nullify the preset_id
in subsequent builds? Conceptually, a prebuilt workspace is always associated with a preset, and clearing the field might imply the association no longer exists, which isn't quite accurate. Wouldn’t it make more sense to retain the preset_id
to reflect this persistent relationship?
The issue with this optimisation is the potentially negligent edge case where a human being restarts a prebuilt workspace with a new preset.
How would this actually happen? Is it really possible for a prebuilt workspace to be restarted with a different preset? If we consider the association to the preset immutable once the prebuilt workspace is created, this edge case shouldn’t occur, right? Or is there a specific workflow that allows this to happen?
Either way, this is not a blocker for the PR, we can definitely move forward and revisit this discussion later if needed.
Hopefully fixes coder/internal#715 this time.
Second attempt at this.
Previous attempt incorrectly returned all rows for which there existed a prebuild that had previously had a successful start transition at some point: #18588
Explain before (71.1ms): https://explain.dalibo.com/plan/9be18ab833b7a000
Explain after (9.8ms): https://explain.dalibo.com/plan/b4a94742gaha229g(EDIT: not selecting from correct CTE)Explain after (11.2ms): https://explain.dalibo.com/plan/bea42b563ff7fbe7
Manually verified against dogfood db:
Diff is only due to unstable row order, also added additional testing in
coderd/database/querier_test.go
to validate changes.If we want to be super careful about this, I can instead break out the updated query into a new function and diff the old versus the new on each reconcile call.EDIT: decided to go ahead with this for safety.