19e0ae1583
Fix concurrent map write
...
in the Nomad `evaluations` map by replacing the simple map with our concurrency-ready storage object.
2024-04-17 13:19:49 +02:00
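A minimal sketch of the idea behind this fix, assuming a mutex-guarded wrapper around the map. The names `SafeStorage`, `Add`, `Get`, `Delete` and `Length` are hypothetical and are not taken from the project's actual storage object; the sketch only illustrates why guarding every access avoids Go's "concurrent map writes" panic.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeStorage is a hypothetical concurrency-safe replacement for a plain map.
// Every access goes through the RWMutex, so parallel writers cannot corrupt it.
type SafeStorage[T any] struct {
	mu      sync.RWMutex
	objects map[string]T
}

// NewSafeStorage returns an empty, ready-to-use storage.
func NewSafeStorage[T any]() *SafeStorage[T] {
	return &SafeStorage[T]{objects: make(map[string]T)}
}

// Add stores (or overwrites) the object under the given id.
func (s *SafeStorage[T]) Add(id string, o T) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[id] = o
}

// Get returns the object for the id and whether it was present.
func (s *SafeStorage[T]) Get(id string) (T, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	o, ok := s.objects[id]
	return o, ok
}

// Delete removes the object for the id, if any.
func (s *SafeStorage[T]) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objects, id)
}

// Length returns the number of stored objects.
func (s *SafeStorage[T]) Length() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.objects)
}

func main() {
	// Assumption: evaluations are keyed by an ID and map to a result channel.
	evaluations := NewSafeStorage[chan error]()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			evaluations.Add(fmt.Sprintf("eval-%d", i), make(chan error, 1))
		}(i)
	}
	wg.Wait()
	fmt.Println(evaluations.Length()) // prints 100
}
```

With a plain `map[string]chan error`, the same hundred goroutines could trigger the runtime's concurrent map write detection.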
9646542499
Inject Execution Debug Message
...
for measuring the performance of the until loop of the stderr connection.
2023-11-23 16:17:30 +01:00
64412e1c4b
Revert "Inject Execution Debug Message"
...
This reverts commit 04a2d0ff3b.
2023-11-23 15:27:31 +01:00
04a2d0ff3b
Inject Execution Debug Message
...
for measuring the performance of the until loop of the stderr connection.
2023-11-23 15:25:23 +01:00
cb08787c7d
Rephrase Evaluation channel log statement.
2023-11-23 13:22:13 +01:00
c820ff99e6
Fix flaky TestWithSeparateStderr.
2023-11-16 12:10:57 +01:00
543939e5cb
Add independent environment reload
...
in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).
2023-11-09 13:11:39 +01:00
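The reload condition described above can be sketched as follows. Only the option names `PrewarmingPoolThreshold` and `PrewarmingPoolReloadTimeout` come from the commit message. The type `poolMonitor`, its fields, and the interpretation of the threshold as a minimum fraction of idle runners are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// demoStart is an arbitrary fixed timestamp that keeps the example deterministic.
var demoStart = time.Date(2023, time.November, 9, 13, 0, 0, 0, time.UTC)

// poolMonitor is a hypothetical helper that remembers since when the
// prewarming pool of one environment has been below its threshold.
type poolMonitor struct {
	threshold     float64       // assumed meaning: minimum fraction of idle runners
	reloadTimeout time.Duration // how long the pool may stay depleted
	depletedSince time.Time     // zero value: the pool is currently healthy
}

// ShouldReload reports whether the environment should be reloaded, i.e. the
// pool was already depleting and is still depleting after the timeout.
func (m *poolMonitor) ShouldReload(idle, desired int, now time.Time) bool {
	if desired <= 0 || float64(idle)/float64(desired) >= m.threshold {
		m.depletedSince = time.Time{} // pool recovered, reset the timer
		return false
	}
	if m.depletedSince.IsZero() {
		m.depletedSince = now // first observation of the depletion
		return false
	}
	return now.Sub(m.depletedSince) >= m.reloadTimeout
}

func main() {
	m := &poolMonitor{threshold: 0.5, reloadTimeout: time.Minute}
	fmt.Println(m.ShouldReload(1, 10, demoStart))                    // false: depletion just started
	fmt.Println(m.ShouldReload(1, 10, demoStart.Add(2*time.Minute))) // true: still depleted after the timeout
	fmt.Println(m.ShouldReload(8, 10, demoStart.Add(3*time.Minute))) // false: pool recovered
}
```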
6b69a2d732
Refactor Nomad Recovery
...
from an approach that loaded the runners only once at startup
to a method that is repeated, i.e., whenever the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
90d591d4ec
Change default behavior in Nomad Event Handling
...
to not propagate that pending runners are being stopped.
2023-09-18 00:54:26 +02:00
2eb15c8d93
Fix loss of rescheduled runners
...
that are rescheduled while the previous allocation was still pending.
We fix this by removing the race condition handling that was meant to prevent Poseidon from emitting warnings about allocations stopping unexpectedly.
2023-09-18 00:54:26 +02:00
788cb0f660
Add regression test for the recent lost runners.
2023-09-18 00:54:26 +02:00
68cd8f43b4
Defuse data race condition of TestWithSeparateStderrReturnsCommandError.
2023-09-11 13:44:29 +02:00
3abd4d9a3d
Refactor all tests to use the MemoryLeakTestSuite.
2023-09-11 13:44:29 +02:00
354c16cc37
Fix missing rescheduled idle runners.
...
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was caused by lost Nomad allocations. The allocations got rescheduled, but were not added to Poseidon again.
The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager but was an idle runner.
2023-09-05 15:15:39 +02:00
8820938624
Increase severity of two log statements.
2023-09-05 15:15:39 +02:00
90092c48c1
Fix incomplete debug message
...
that is created by sending SIGQUIT to the bash process
by not processing output after the client disconnected / we have sent the SIGQUIT.
2023-08-14 11:37:51 +02:00
4d661138e9
Revert "Insert debug message into execution tracing"
...
This reverts commit 72d926ef6c5e9f8ddd0da39dbd1492dad3621c15.
2023-08-14 11:37:51 +02:00
6a1677dea0
Introduce reason for destroying runner
...
in order to return a specific error for OOM Killed Executions.
2023-07-21 15:30:21 +02:00
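One way to carry a destroy reason through to the caller is a sentinel error that is wrapped and later checked with `errors.Is`. The sketch below assumes this approach. The identifiers `ErrOOMKilled`, `ErrDestroyedByAPI`, `runner`, `Destroy`, `ExecutionError` and `isOOMKilled` are hypothetical and do not claim to match the project's code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors describing why a runner was destroyed.
var (
	ErrOOMKilled      = errors.New("the runner was killed due to out of memory")
	ErrDestroyedByAPI = errors.New("the runner was destroyed by an API request")
)

// runner is a reduced stand-in for a real runner object.
type runner struct {
	id            string
	destroyReason error
}

// Destroy records the reason; the first reason wins.
func (r *runner) Destroy(reason error) {
	if r.destroyReason == nil {
		r.destroyReason = reason
	}
}

// ExecutionError returns nil for a live runner, otherwise an error that
// wraps the destroy reason so that callers can inspect it.
func (r *runner) ExecutionError() error {
	if r.destroyReason == nil {
		return nil
	}
	return fmt.Errorf("execution on runner %s aborted: %w", r.id, r.destroyReason)
}

// isOOMKilled lets a caller answer with a specific message or status code.
func isOOMKilled(err error) bool {
	return errors.Is(err, ErrOOMKilled)
}

func main() {
	r := &runner{id: "runner-1"}
	r.Destroy(ErrOOMKilled)
	err := r.ExecutionError()
	fmt.Println(err)
	fmt.Println(isOOMKilled(err)) // prints true
}
```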
40a5f2eca6
Insert debug message into execution tracing
...
to verify that the date command is sometimes returning an empty string with exit code 5.
2023-07-21 15:05:53 +02:00
e7df777db4
Always log Runner and Environment ID.
...
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
f031219cb8
Fix Nomad event race condition
...
that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node.
It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level.
To fix it we added communication from the upper level that the stop of the job was expected.
2023-06-13 14:20:20 +02:00
b620d0fad7
Introduce Allocation State Tracking
...
in order to break down the current state and evaluate if it is invalid.
2023-06-13 14:20:20 +02:00
1061b15c3e
Fix Influx monitoring by renaming the time tag.
2023-05-12 18:36:34 +01:00
bbc15d9b71
Monitor Job events
...
and add time to Nomad event monitoring.
2023-05-12 16:35:30 +01:00
9300a82535
Fix missing idle runners.
...
In the context of #358 we identified that the event with the type `AllocationUpdated` and the client status `pending` is common but not always sent by Nomad.
With this commit we remove the condition that limits the evaluated Nomad events to the event with the type `AllocationUpdated`. Without the condition, the event of the type `PlanResult` and the status `pending` will be evaluated equally. By now, this event seems to be sent every time.
This restriction led to started allocations not being registered when the `AllocationUpdated` event with client status `pending` was missing.
2023-05-12 16:25:43 +01:00
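The effect of dropping the type condition can be illustrated with a reduced event model. The struct `allocationEvent` and both predicate functions are hypothetical simplifications; only the event types `AllocationUpdated` and `PlanResult` and the client status `pending` are taken from the commit message.

```go
package main

import "fmt"

// allocationEvent is a simplified, hypothetical view of a Nomad allocation event.
type allocationEvent struct {
	Type         string // e.g. "AllocationUpdated" or "PlanResult"
	ClientStatus string // e.g. "pending" or "running"
}

// isPendingBefore models the old behavior: only AllocationUpdated events count.
func isPendingBefore(e allocationEvent) bool {
	return e.Type == "AllocationUpdated" && e.ClientStatus == "pending"
}

// isPendingAfter models the new behavior: any event with status pending counts.
func isPendingAfter(e allocationEvent) bool {
	return e.ClientStatus == "pending"
}

func main() {
	// Assumed scenario: Nomad sent only the PlanResult event for a new allocation.
	e := allocationEvent{Type: "PlanResult", ClientStatus: "pending"}
	fmt.Println(isPendingBefore(e)) // false: the allocation would have been missed
	fmt.Println(isPendingAfter(e))  // true: the allocation is registered
}
```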
f377b1376c
Add Client Status to Nomad Allocation monitoring
...
Also add the Nomad Node name as additional debug information.
2023-05-10 19:09:31 +01:00
8f89c14ea1
Cleanup logs for Allocation recovery
...
on startup. The changes do not have functional consequences as adding the allocation just overwrites the old one.
2023-05-10 18:56:51 +01:00
5a147c4985
Add debug statements for allocation event handling
2023-05-10 18:56:51 +01:00
42efebc194
Monitor the Nomad events
...
and send all Nomad events to Influxdb.
2023-05-09 00:13:58 +01:00
d8d9abbddd
Add Job ID to Nomad Allocation monitoring.
2023-04-23 12:54:57 +01:00
801e4f489e
Synchronize Sentry debug message handling.
2023-04-11 20:58:57 +01:00
0c8fa9ccfa
Add context to log statements.
2023-04-11 20:45:30 +01:00
a720553dd1
Fix missing Runner-Delete events.
2023-04-01 19:27:09 +02:00
8950ce29d8
Recover Runner Allocations on startup.
2023-04-01 19:27:09 +02:00
038d71ff51
Nomad: Handle Container re-allocation
2023-03-31 14:42:55 +02:00
c3e5afaad0
Fix Concurrent Map Write
...
when handling the Sentry Debug Messages asynchronously.
2023-03-22 10:36:38 +00:00
e877cd1e52
Rename Sentry Span Descriptions.
2023-03-14 23:42:19 +01:00
e0419c2e58
Fix Sentry Debug Regex
...
that was ignoring composed messages including a newline.
Also, add regression test.
2023-03-14 23:42:19 +01:00
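In Go's `regexp` package, `.` does not match a newline unless the `s` flag is set, which is a common cause of this kind of bug. The sketch below shows the difference. The `<debug>…</debug>` marker format and the function names are assumptions for illustration and not the project's actual debug message format.

```go
package main

import (
	"fmt"
	"regexp"
)

// Without the s flag, "." stops at newlines, so multi-line messages are ignored.
var naivePattern = regexp.MustCompile(`<debug>(.*?)</debug>`)

// With (?s), "." also matches "\n" and composed messages are found.
var fixedPattern = regexp.MustCompile(`(?s)<debug>(.*?)</debug>`)

// extract returns the first debug message found by the given pattern.
func extract(p *regexp.Regexp, output string) (string, bool) {
	m := p.FindStringSubmatch(output)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// extractNaive and extractFixed are convenience wrappers for the two patterns.
func extractNaive(output string) (string, bool) { return extract(naivePattern, output) }
func extractFixed(output string) (string, bool) { return extract(fixedPattern, output) }

func main() {
	output := "<debug>line one\nline two</debug>"
	_, ok := extractNaive(output)
	fmt.Println(ok) // false: the newline breaks the match
	msg, ok := extractFixed(output)
	fmt.Printf("%v %q\n", ok, msg) // true "line one\nline two"
}
```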
6e069f5d8a
Fix Nomad Exit Code
...
Due to the wrapping of the command, the exit code could no longer be retrieved correctly.
2023-03-14 23:42:19 +01:00
7dadc5dfe9
Refactor Nomad Command Generation.
...
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
2023-03-14 23:42:19 +01:00
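The double-quote-only nesting listed above can be sketched as an escaping step followed by wrapping. The function `wrapInBash` and the exact set of escaped characters (backslash, double quote, dollar sign, backtick) are assumptions based on general bash quoting rules, not a copy of the project's generator.

```go
package main

import (
	"fmt"
	"strings"
)

// doubleQuoteEscaper escapes the characters that keep a special meaning inside
// bash double quotes. The replacer works in a single pass, so the backslashes
// it inserts are not escaped a second time.
var doubleQuoteEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	`$`, `\$`,
	"`", "\\`",
)

// wrapInBash nests a command inside `bash -c "..."` using only double quotes.
func wrapInBash(command string) string {
	return `bash -c "` + doubleQuoteEscaper.Replace(command) + `"`
}

func main() {
	inner := `echo "$HOME"`
	once := wrapInBash(inner)
	fmt.Println(once) // bash -c "echo \"\$HOME\""

	// Nesting again escapes the quotes and backslashes of the previous level.
	fmt.Println(wrapInBash(once))
}
```

Because each level escapes the output of the previous one, the command can be nested repeatedly without resorting to single quotes.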
f309d0f70e
Ensure sending of the Sentry End debug message.
2023-03-14 23:42:19 +01:00
4fb6ab980b
Implement merge request comments.
2023-03-14 23:42:19 +01:00
cc0c425197
Add Sentry Spans for Bash execution.
2023-03-14 23:42:19 +01:00
4550a4589e
Dangerous Context Enrichment
...
by passing the Sentry Context down our abstraction stack.
This included changes to the complex context management of a Command Execution.
2023-02-03 10:29:18 +00:00
0d3c474acc
Enrich error message.
2023-01-02 11:23:02 +01:00
8950ab3776
Add single quotes for inner command.
...
Change to bash as interpreter.
Forbid single quotes for user commands.
2022-11-04 15:15:43 +01:00
4c25473c9e
Hide Nomad specific environment variables
...
from the user environment.
2022-11-04 15:15:43 +01:00
acb4d24c45
Change loglevel for context cancellation to DEBUG
2022-10-26 16:18:35 +02:00
28fb0ca61c
Catch context canceled error
2022-10-25 09:36:52 +02:00
1a5a49d7c8
Explicitly switch user for code execution.
...
Co-authored-by: Maximilian Pass <maximilian.pass@student.hpi.uni-potsdam.de>
2022-09-24 23:09:23 +02:00