feat(cli/ssh): simplify log file flags #7863

Merged · ammario merged 16 commits from log-race into main on Jun 12, 2023
Conversation

ammario
Member

@ammario ammario commented Jun 5, 2023

If we let the user decide where exactly to put the logs, we can remove our opinion about how the file is named and save help page real estate.

Also, fixes a pesky race condition.

Reworks #7646

@ammario ammario requested a review from spikecurtis June 6, 2023 01:00
@ammario ammario enabled auto-merge (squash) June 6, 2023 01:00
@ammario
Member Author

ammario commented Jun 6, 2023

@spikecurtis I can see the benefits of using a dynamic path, but I think we should consider an approach with one log-dir flag. I suspect we already have option bloat (need to verify via telem), and it seems easy enough that a user would do coder ssh --log-dir /tmp/ to enable.

@ammario ammario requested a review from coadler June 6, 2023 15:46
@spikecurtis
Contributor

@spikecurtis I can see the benefits of using a dynamic path, but I think we should consider an approach with one log-dir flag. I suspect we already have option bloat (need to verify via telem), and it seems easy enough that a user would do coder ssh --log-dir /tmp/ to enable.

one log-dir flag sounds fine to me

@ammario ammario requested review from spikecurtis and removed request for coadler June 6, 2023 22:32
@ammario
Member Author

ammario commented Jun 8, 2023

@spikecurtis — can you take a look here when you get the chance?

cli/ssh.go Outdated
// Close stdin so the copy goroutine doesn't
// block indefinitely on Stdin Read.
if rc, ok := inv.Stdin.(io.Closer); ok {
    rc.Close()
}
Contributor

This smells funny in that it shouldn't be the job of the cli command to close stdin.

Also, in the case where rawSSH goes down unexpectedly, but the workspace stays up, I think that we could still be blocked on reading from stdin, creating a deadlock with the waitgroup.

It should be the job of the clibase invoker to close stdin when the command is done -- but here that can create a deadlock with the waitgroup, since the main command won't return until the waitgroup is done.

Instead of considering the waitgroup to be all the command goroutines except the main one, why don't we make the waitgroup all the command goroutines including the main one? Then we can have a separate goroutine that just waits for the group, and closes the log file.

e.g.

var wg sync.WaitGroup
wg.Add(1)
defer wg.Done() // the main command goroutine is part of the group too
if logDirPath != "" {
    logfile, err := os.OpenFile(...)
    if err != nil {...}
    go func() {
        // Close the log file only after every goroutine in the group is done.
        wg.Wait()
        logfile.Close()
    }()
}

wg.Add(1)
go func() {
    defer wg.Done()
    ...
}()

Member Author

I agree my solution is janky. The challenge with closing the logfile in a separate goroutine is it creates a leak — which was the main thing I was trying to fix when opening this PR.

Contributor

Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Contributor

And, it's not just a question of jankiness. As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.
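
A rough sketch of the deadlock being described (hypothetical names, not the PR's actual code): if rawSSH goes away but stdin stays open, the copy goroutine never returns from Read, so the waitgroup never completes and the main command never reaches the code that closes stdin.

var wg sync.WaitGroup
wg.Add(1)
go func() {
    defer wg.Done()
    // Blocked inside stdin.Read() even after rawSSH has gone away,
    // because nothing has closed stdin yet.
    _, _ = io.Copy(rawSSH, stdin)
}()

wg.Wait() // never returns: stdin is only closed after the command returns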

Member Author

Closing in a separate goroutine is not a leak (unless the waitgroup fails to complete). It just waits for the other things to finish, closes the file, then exits.

Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Member Author

As written, it's still bugged in the case where the rawSSH closes for some external reason not related to the workspace being stopped. The command will deadlock.

I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

Contributor

I don't understand this. If rawSSH closes for some external reason, and io.Copy is blocked on write, the write call would immediately fail?

I'm worried about io.Copy blocking on Read() - the same thing you're trying to prevent here, but with rawSSH just closing for a different reason.
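
For context on the mechanism both comments rely on: io.Copy only returns once the source's Read returns, and closing the reader is what forces that. A small self-contained sketch (not from the PR), using an io.Pipe reader to stand in for an idle stdin:

package main

import (
    "fmt"
    "io"
)

func main() {
    // The read half of a pipe blocks in Read until data arrives
    // or the pipe is closed, much like an idle stdin.
    pr, _ := io.Pipe()

    done := make(chan error, 1)
    go func() {
        // Blocked inside pr.Read() because nothing ever writes.
        _, err := io.Copy(io.Discard, pr)
        done <- err
    }()

    // Closing the reader unblocks the pending Read, so io.Copy returns.
    pr.Close()
    fmt.Println("io.Copy returned:", <-done)
}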

Contributor

Maybe leak isn't the best word. It can show up in our leak detector and cause test flakiness, even though in production there is no leak.

Have you tried it and seen goleak firing? If there are goroutines still running, goleak will sleep and retry up to some maximum timeout, exactly so that things like what I'm proposing have time to resolve themselves before goleak complains about them.

I can't imagine this would be the case for waiting for a waitgroup and closing a file, but if for some reason the goroutine takes longer to resolve than goleak's default timeout, we can adjust the timeout.
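
A sketch of how that plays out in a test (hypothetical test, not from the PR), using go.uber.org/goleak:

package cli_test

import (
    "os"
    "sync"
    "testing"

    "go.uber.org/goleak"
)

func TestLogfileClosedByWaitingGoroutine(t *testing.T) {
    defer goleak.VerifyNone(t)

    logfile, err := os.CreateTemp(t.TempDir(), "coder-ssh-*.log")
    if err != nil {
        t.Fatal(err)
    }

    var wg sync.WaitGroup
    wg.Add(1) // stands in for the main command goroutine
    go func() {
        // The proposed pattern: wait for every goroutine in the group,
        // then close the log file.
        wg.Wait()
        _ = logfile.Close()
    }()

    // ... command work would happen here ...
    wg.Done()

    // goleak.VerifyNone retries with short sleeps before reporting a leak,
    // so the closing goroutine above has time to finish on its own.
}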

Member Author

Didn't know about the retry. I'll implement your proposal.

@matifali matifali requested a review from spikecurtis June 9, 2023 20:12
@ammario
Member Author

ammario commented Jun 10, 2023

@spikecurtis — should be good to go

@ammario ammario merged commit 5de1084 into main Jun 12, 2023
@ammario ammario deleted the log-race branch June 12, 2023 05:18
@spikecurtis
Contributor

LGTM

@github-actions github-actions bot locked and limited conversation to collaborators Jun 12, 2023
@mafredri
Member

How was this PR merged even though the lint check failed? Thoughts @ammario?

@ammario
Member Author

ammario commented Jun 12, 2023

There was a bug in a recent CI change that made the lint check not required.
