Dedicated crash handler thread. #21826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

neunhoef wants to merge 15 commits into devel from feature/crash-handler-thread

+261 −6

Member

neunhoef commented Jun 27, 2025 •

edited

Loading

Scope & Purpose

This PR implements a dedicated crash handler thread.
The read signal handler triggers the dedicated thread to
write out useful information before it continues with the crash.
This is useful since a dedicated thread has none of the restrictions
of a signal handler and can for example use system calls.

This allows us to dump a lot more information out at crash time.

[*] 💩 Bugfix
[*] 🍕 New feature

Checklist


          First draft and experiment.

c1d34c2

neunhoef added this to the devel milestone

neunhoef self-assigned this

cla-bot bot added the cla-signed label

neunhoef added 7 commits

June 27, 2025 15:24


          Clean up code.

a0e27a6


          CHANGELOG.

abc4216


          Take out test assertion again.

4a0e661


          Increase backwards compatibility of the output.

27fb623


          Shut down crash handler thread when --version is used.

1cc8b35


          Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

4d39c2c

…r-thread


          Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

5b45310

…r-thread

goedderz requested changes

View reviewed changes

lib/CrashHandler/CrashHandler.cpp

Comment on lines +99 to +100

		/// @brief dedicated crash handler thread
		std::unique_ptr<std::thread> crashHandlerThread = nullptr;

Member

goedderz Jun 30, 2025

Should we use a std::jthread instead?

lib/CrashHandler/CrashHandler.cpp

Comment on lines +94 to +103

+              /// @brief atomic variable for coordinating between signal handler and crash
+              /// handler thread
+              std::atomic<arangodb::CrashHandlerState> crashHandlerState(
+                  arangodb::CrashHandlerState::IDLE);
+              /// @brief dedicated crash handler thread
+              std::unique_ptr<std::thread> crashHandlerThread = nullptr;
+              /// @brief mutex to protect thread creation/destruction
+              std::mutex crashHandlerThreadMutex;

Member

goedderz Jun 30, 2025

I think these variables should be tied together, with crashHandlerThreadFunction(), either into a common struct or a namespace.

lib/CrashHandler/CrashHandler.cpp Outdated

Comment on lines 391 to 393

+                while (true) {
+                  arangodb::CrashHandlerState state =
+                      crashHandlerState.load(std::memory_order_acquire);

Member

goedderz Jun 30, 2025

Note that, since C++20, a std::atomic has methods wait(T value), notify_one() and notify_all(). And if I read https://en.cppreference.com/w/cpp/utility/program/signal correctly, calling these from the signal handler should be OK, as long as the atomic is lock free (which it has to be anyway). I think it's covered by "f is a non-static member function invoked on an object obj, such that obj.is_lock_free() yields true". But please double-check if you agree.

That would simplify this code and the wait function in the crash handler, and make it more polite/efficient.

In addition, if we use std::jthread (see my other comment), we can use stop token handling to address the thread shutdown instead of the atomic. This might be nicer to use (also see my comment on shutdown) and separate the concerns a little.

lib/CrashHandler/CrashHandler.cpp Outdated

Comment on lines 475 to 478

+                  // Signal the dedicated crash handler thread (force trigger even if not
+                  // idle)
+                  ::crashHandlerState.store(arangodb::CrashHandlerState::CRASH_DETECTED,
+                                            std::memory_order_release);

Member

goedderz Jun 30, 2025

I would vote for both commenting and asserting for everything we do here that we're allowed to do it in the signal handler (or that it's not allowed and we're doing it anyway). So here I'd add

static_assert(decltype(::crashHandlerState)::is_always_lock_free);

or so, and a comment that the store is allowed under this condition.

lib/CrashHandler/CrashHandler.cpp Outdated

Comment on lines 470 to 473

+                  // Log signal-specific information in the signal handler
                   logCrashInfo("signal handler invoked", signal, info, ucontext);
+                  // Morally, we should do this after the work of the crash handler thread,
+                  // however, for backwards compatibility, we keep it here as first thing.

Member

goedderz Jun 30, 2025

I think it actually preferable to do it in the crash handler thread as much as possible, to reduce or eliminate the amount of UB we have in the crash handler, at least the actual logging.

lib/CrashHandler/CrashHandler.cpp

Comment on lines +483 to 489

+                  // Now log the backtrace from the crashed thread (only the crashed thread
+                  // can do this)
                   logBacktrace();
-                  logProcessInfo();
+                  // Final cleanup
                   arangodb::Logger::flush();
                   arangodb::Logger::shutdown();

Member

goedderz Jun 30, 2025

Same here, I think it would be preferable to do this in the crash handler thread as much as possible (would need some refactoring/splitting of logBacktrace()).

lib/CrashHandler/CrashHandler.h Outdated

Comment on lines 67 to 68

		/// @brief shuts down the crash handler thread
		static void shutdownCrashHandler();

Member

goedderz Jun 30, 2025

I don't think its workable to have a function that must be called before each call to exit or similar. As of now, you've addressed an exit call in arangod.cpp (there is even a second one there) - and how does this affect the client-tools? And you've addressed to exit()s in the VersionFeature, but it seems we have at least a dozen other places calling exit.

I haven't thought it through, so there may be problems, but some ideas how we might address this:

If we move the crashHandlerThread variable in a struct, we might be able to stop the thread in that thread's destructor
We could install handlers with std::atexit and std::at_quick_exit (I don't think std::set_terminate is needed?)
Don't install the thread globally, but with a scoped variable that we put in runServer, probably where we call CrashHandler::installCrashHandler(). It might then make sense to move installCrashHandler() into the constructor of the object, and maybe even remove the handlers in the destructor - that would clarify exactly in which scope the CrashHandler, including its handlers and thread, is active.

Intuitively I like the third option best, and the second the least.

Member Author

neunhoef Jul 1, 2025

I will try to do this as much as possible. Something with static lifetime will be necessary, or else the signal handler (which can be executed any time from any thread) and the std::terminate handler (which is executed on assertion failure or other catastrophic events, again essentially any time in any thread) do not have access to it. So making it into a totally scoped variable will not be possible.
I will try to avoid the explicit shutdown by additionally creating a scoped CrashHandler object (which can do the normal initialization and shutdown). But an atexit handler will be needed anyway, since we have these places which answer to --version, which use exit to stop the process right away. In this case the scoped variable is no longer cleaned up properly. Let's see what I can come up with.

lib/CrashHandler/CrashHandler.h Outdated

Comment on lines 30 to 36

+              /// @brief States for the crash handler thread coordination
+              enum class CrashHandlerState : int {
+                IDLE = 0,               ///< idle state, waiting for crashes
+                CRASH_DETECTED = 1,     ///< crash detected, handling in progress
+                HANDLING_COMPLETE = 2,  ///< crash handling complete
+                SHUTDOWN = 3            ///< shutdown requested
+              };

Member

goedderz Jun 30, 2025

If this follows (only) specific state transitions, these should be noted in a comment.

Member Author

neunhoef Jul 1, 2025

Done.

neunhoef added 7 commits

June 30, 2025 14:59


          Use atomic wait/notify_all for crash handler thread.

3704b8e


          Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

dded8d4

…r-thread


          Explain state transitions.

f308917


          Cleaner shutdown.

b5950e6


          Take out commented out code.

62e81ff


          clang-format.

1f33096


          First phase of moving stuff to dedicated crash handler thread.

70b04f0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels