fix: accumulate agentstats until reported and fix insights DAU offset #15832

Merged
merged 9 commits into from
Dec 18, 2024

Conversation

mafredri
Member

@mafredri mafredri commented Dec 11, 2024

This PR addresses a flake in TestDeploymentInsights caused by missing agent network stats. It also corrects the assumption that agent network stats should be discarded rather than accumulated when reporting can't keep up. Without accumulation we risk losing data.

Fixes coder/internal#259

@mafredri mafredri force-pushed the mafredri-fix-agentstats-acc-and-dau-flake branch from a1757f0 to e97f3a9 Compare December 11, 2024 15:32
@@ -89,7 +89,7 @@ func (api *API) returnDAUsInternal(rw http.ResponseWriter, r *http.Request, temp
}
for _, row := range rows {
resp.Entries = append(resp.Entries, codersdk.DAUEntry{
- Date: row.StartTime.Format(time.DateOnly),
+ Date: row.StartTime.In(loc).Format(time.DateOnly),
Member Author

Review: Drive-by fix. The date could be off by one day depending on the time zone.

} else {
s.networkStats = maps.Clone(virtual)
s.unreported = true
}
Member Author

@mafredri mafredri Dec 11, 2024

Review: If the callback was called multiple times before reporting, we lost data, because each update is a snapshot of the traffic since the last one.

This can happen if:

  1. The interval is short (tests)
  2. Report takes a long time

I believe the assumption was that "ConnStatsCallback" reports a realistic count for "now". What it actually returns is closer to an additive diff between this and the previous report. Thus, if two callbacks happen in quick succession, we're effectively zeroing the actual data.
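
The accumulate-until-reported idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in agent/stats.go: `conn`, `counts`, and `collector` are hypothetical stand-ins for the real types, and locking is omitted.

```go
package main

import "fmt"

// conn and counts are hypothetical stand-ins for the connection key and
// per-connection counters delivered to the stats callback.
type conn string

type counts struct {
	TxBytes, RxBytes uint64
}

func (c counts) add(o counts) counts {
	return counts{TxBytes: c.TxBytes + o.TxBytes, RxBytes: c.RxBytes + o.RxBytes}
}

// collector holds deltas that have not been reported yet.
// No mutex here: the sketch assumes single-goroutine use.
type collector struct {
	pending    map[conn]counts
	unreported bool
}

// update merges a delta into the pending stats. If the previous stats were
// already reported, it starts from an empty map instead.
func (s *collector) update(delta map[conn]counts) {
	if !s.unreported {
		s.pending = make(map[conn]counts, len(delta))
		s.unreported = true
	}
	for k, v := range delta {
		s.pending[k] = s.pending[k].add(v)
	}
}

// report hands back the accumulated stats and marks them as reported.
func (s *collector) report() map[conn]counts {
	out := s.pending
	s.pending = nil
	s.unreported = false
	return out
}

func main() {
	c := &collector{}
	// Two callbacks fire before a report goes out.
	c.update(map[conn]counts{"ssh": {TxBytes: 100, RxBytes: 40}})
	c.update(map[conn]counts{"ssh": {TxBytes: 20, RxBytes: 0}})

	fmt.Println(c.report()["ssh"].TxBytes) // 120; overwriting would give 20
}
```

With overwrite semantics the second, smaller delta would replace the first, which is the data loss described above.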

Contributor

Great catch!

@@ -76,7 +86,7 @@ func TestDeploymentInsights(t *testing.T) {
workspace := coderdtest.CreateWorkspace(t, client, template.ID)
coderdtest.AwaitWorkspaceBuildJobCompleted(t, client, workspace.LatestBuild.ID)

- ctx := testutil.Context(t, testutil.WaitLong)
+ ctx := testutil.Context(t, testutil.WaitSuperLong)
Member Author

@mafredri mafredri Dec 11, 2024

Review: In race mode, propagating the agent connection stats can take a while.

@mafredri mafredri changed the title fix: accumulate agentstats until reported and fix insights DAU offset fix: fix insights DAU offset by accumulating agentstats until reported Dec 11, 2024
@mafredri mafredri changed the title fix: fix insights DAU offset by accumulating agentstats until reported fix: accumulate agentstats until reported and fix insights DAU offset Dec 11, 2024
@mafredri mafredri marked this pull request as ready for review December 11, 2024 16:06
Contributor

@dannykopping dannykopping left a comment

LGTM

agent/stats.go Outdated
// Accumulate stats until they've been reported.
if s.unreported {
if s.networkStats == nil && virtual != nil {
s.networkStats = make(map[netlogtype.Connection]netlogtype.Counts)
Contributor

Nit: let's save some allocations.

Suggested change
- s.networkStats = make(map[netlogtype.Connection]netlogtype.Counts)
+ s.networkStats = make(map[netlogtype.Connection]netlogtype.Counts, len(virtual))

Member Author

I've never actually benchmarked how much of a difference a size hint makes for maps, especially ones that don't hold a lot of data. Is there a significant difference?

Your suggestion made me realize this had a better fix 😄.
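
One rough way to check the size-hint question is to count allocations with `testing.AllocsPerRun`. The sketch below uses hypothetical helpers (`fill`, `allocs`) and an `int`-keyed map rather than the real stats map. The numbers depend on the Go version and map size, so no expected output is given.

```go
package main

import (
	"fmt"
	"testing"
)

// fill inserts n entries into m.
func fill(m map[int]int, n int) {
	for i := 0; i < n; i++ {
		m[i] = i
	}
}

// allocs reports the average number of heap allocations needed to build a
// map of n entries, with or without a size hint.
func allocs(n int, hinted bool) float64 {
	return testing.AllocsPerRun(100, func() {
		var m map[int]int
		if hinted {
			m = make(map[int]int, n)
		} else {
			m = make(map[int]int)
		}
		fill(m, n)
	})
}

func main() {
	for _, n := range []int{4, 64, 1024} {
		fmt.Printf("n=%d hinted=%.0f unhinted=%.0f\n", n, allocs(n, true), allocs(n, false))
	}
}
```

For very small maps the two variants may allocate the same amount, so a hint mostly pays off when the map grows past its initial capacity.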

Contributor

@dannykopping dannykopping left a comment

LGTM

@mafredri mafredri merged commit 4c5b737 into main Dec 18, 2024
33 checks passed
@mafredri mafredri deleted the mafredri-fix-agentstats-acc-and-dau-flake branch December 18, 2024 09:26
@github-actions github-actions bot locked and limited conversation to collaborators Dec 18, 2024
Development

Successfully merging this pull request may close these issues.

flake: TestDeploymentInsights timeout
2 participants