Commit Graph

72 Commits

Author SHA1 Message Date
Maximilian Paß
9646542499 Inject Execution Debug Message
for measuring the performance of the until loop of the stderr connection.
2023-11-23 16:17:30 +01:00
Maximilian Paß
64412e1c4b Revert "Inject Execution Debug Message"
This reverts commit 04a2d0ff3b.
2023-11-23 15:27:31 +01:00
Maximilian Paß
04a2d0ff3b Inject Execution Debug Message
for measuring the performance of the until loop of the stderr connection.
2023-11-23 15:25:23 +01:00
Maximilian Paß
cb08787c7d Rephrase Evaluation channel log statement. 2023-11-23 13:22:13 +01:00
Maximilian Paß
c820ff99e6 Fix flaky TestWithSeparateStderr. 2023-11-16 12:10:57 +01:00
Maximilian Paß
543939e5cb Add independent environment reload
in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).
2023-11-09 13:11:39 +01:00
Maximilian Paß
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that will be repeated, i.e., if the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
Maximilian Paß
90d591d4ec Change default behavior in Nomad Event Handling
to not propagate that pending runners are being stopped.
2023-09-18 00:54:26 +02:00
Maximilian Paß
2eb15c8d93 Fix losing of rescheduled runners
that are rescheduled while the previous allocation was still pending.
We fix this by removing the race condition handling that should prevent Poseidon from throwing warnings of unexpected allocation stopping.
2023-09-18 00:54:26 +02:00
Maximilian Paß
788cb0f660 Add regression test for the recent lost runners. 2023-09-18 00:54:26 +02:00
Maximilian Paß
68cd8f43b4 Defuse data race condition of TestWithSeparateStderrReturnsCommandError. 2023-09-11 13:44:29 +02:00
Maximilian Paß
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
Maximilian Paß
354c16cc37 Fix missing rescheduled idle runners.
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was caused by lost Nomad allocations. The allocations got rescheduled, but were not added to Poseidon again.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager, but was an idle runner.
2023-09-05 15:15:39 +02:00
Maximilian Paß
8820938624 Increase severity of two log statements. 2023-09-05 15:15:39 +02:00
Maximilian Paß
90092c48c1 Fix incomplete debug message
that is created by sending SIGQUIT to the bash process
by not processing output after the client disconnected / we have sent the SIGQUIT.
2023-08-14 11:37:51 +02:00
Maximilian Paß
4d661138e9 Revert "Insert debug message into execution tracing"
This reverts commit 72d926ef6c5e9f8ddd0da39dbd1492dad3621c15.
2023-08-14 11:37:51 +02:00
Maximilian Paß
6a1677dea0 Introduce reason for destroying runner
in order to return a specific error for OOM Killed Executions.
2023-07-21 15:30:21 +02:00
Maximilian Paß
40a5f2eca6 Insert debug message into execution tracing
to verify that the date command is sometimes returning an empty string with exit code 5.
2023-07-21 15:05:53 +02:00
Maximilian Paß
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
Maximilian Paß
f031219cb8 Fix Nomad event race condition
that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node.
It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level.
To fix it, we added communication from the upper level that the stop of the job was expected.
2023-06-13 14:20:20 +02:00
Maximilian Paß
b620d0fad7 Introduce Allocation State Tracking
in order to break down the current state and evaluate if it is invalid.
2023-06-13 14:20:20 +02:00
Maximilian Paß
1061b15c3e Fix Influx monitoring by renaming the time tag. 2023-05-12 18:36:34 +01:00
Maximilian Paß
bbc15d9b71 Monitor Job events
and add time to Nomad event monitoring.
2023-05-12 16:35:30 +01:00
Maximilian Paß
9300a82535 Fix missing idle runners.
In the context of #358 we identified that the event with the type `AllocationUpdated` and the client status `pending` is common but not always sent by Nomad.

With this commit, we remove the condition that limits the evaluated Nomad events to the event with the type `AllocationUpdated`. Without the condition, the event of the type `PlanResult` and the status `pending` will be evaluated equally. So far, this event seems to be sent every time.

This restriction led to started allocations not being registered when the `AllocationUpdated` event with client status `pending` was missing.
2023-05-12 16:25:43 +01:00
Maximilian Paß
f377b1376c Add Client Status to Nomad Allocation monitoring
Also add the Nomad Node name as additional debug information.
2023-05-10 19:09:31 +01:00
Maximilian Paß
8f89c14ea1 Cleanup logs for Allocation recovery
on startup. The changes do not have functional consequences as adding the allocation just overwrites the old one.
2023-05-10 18:56:51 +01:00
Maximilian Paß
5a147c4985 Add debug statements for allocation event handling 2023-05-10 18:56:51 +01:00
Maximilian Paß
42efebc194 Monitor the Nomad events
and send all Nomad events to Influxdb.
2023-05-09 00:13:58 +01:00
Maximilian Paß
d8d9abbddd Add Job ID to Nomad Allocation monitoring. 2023-04-23 12:54:57 +01:00
Maximilian Paß
801e4f489e Synchronize Sentry debug message handling. 2023-04-11 20:58:57 +01:00
Maximilian Paß
0c8fa9ccfa Add context to log statements. 2023-04-11 20:45:30 +01:00
Maximilian Paß
a720553dd1 Fix missing Runner-Delete events. 2023-04-01 19:27:09 +02:00
Maximilian Paß
8950ce29d8 Recover Runner Allocations on startup. 2023-04-01 19:27:09 +02:00
Maximilian Paß
038d71ff51 Nomad: Handle Container re-allocation 2023-03-31 14:42:55 +02:00
Maximilian Paß
c3e5afaad0 Fix Concurrent Map Write
when handling the Sentry Debug Messages asynchronously.
2023-03-22 10:36:38 +00:00
Maximilian Paß
e877cd1e52 Rename Sentry Span Descriptions. 2023-03-14 23:42:19 +01:00
Maximilian Paß
e0419c2e58 Fix Sentry Debug Regex
that was ignoring composed messages including a newline.
Also, add regression test.
2023-03-14 23:42:19 +01:00
Maximilian Paß
6e069f5d8a Fix Nomad Exit Code
Due to the wrapping of the command, the exit code could no longer be retrieved correctly.
2023-03-14 23:42:19 +01:00
Maximilian Paß
7dadc5dfe9 Refactor Nomad Command Generation.
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
2023-03-14 23:42:19 +01:00
Maximilian Paß
f309d0f70e Ensure sending of the Sentry End debug message. 2023-03-14 23:42:19 +01:00
Maximilian Paß
4fb6ab980b Implement merge request comments. 2023-03-14 23:42:19 +01:00
Maximilian Paß
cc0c425197 Add Sentry Spans for Bash execution. 2023-03-14 23:42:19 +01:00
Maximilian Paß
4550a4589e Dangerous Context Enrichment
by passing the Sentry Context down our abstraction stack.
This included changes in the complex context management of managing a Command Execution.
2023-02-03 10:29:18 +00:00
Maximilian Paß
0d3c474acc Enrich error message. 2023-01-02 11:23:02 +01:00
Maximilian Paß
8950ab3776 Add single quotes for inner command.
Change to bash as interpreter.
Forbid single quotes for user commands.
2022-11-04 15:15:43 +01:00
Maximilian Paß
4c25473c9e Hide Nomad specific environment variables
from the user environment.
2022-11-04 15:15:43 +01:00
Sebastian Serth
acb4d24c45 Change loglevel for context cancellation to DEBUG 2022-10-26 16:18:35 +02:00
Maximilian Paß
28fb0ca61c Catch context canceled error 2022-10-25 09:36:52 +02:00
Sebastian Serth
1a5a49d7c8 Explicitly switch user for code execution.
Co-authored-by: Maximilian Pass <maximilian.pass@student.hpi.uni-potsdam.de>
2022-09-24 23:09:23 +02:00
Sebastian Serth
7454e577e4 Allow using a local Docker image, e.g., for tests 2022-09-24 23:09:23 +02:00