Commit Graph

24 Commits

Author SHA1 Message Date
Maximilian Paß
ab938bfc22 Refactor MemoryLeakTestSuite
as we identified two issues where the goroutine count from before differs from after the test.

1) In rare cases, a goroutine that belongs to the Go runtime appeared to be running before the test. To avoid this, we introduced a short timeout before looking up the goroutines.
Another solution might be to do the lookup twice and check if the count matches.

2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.
2024-02-28 11:52:51 +01:00
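
The before/after comparison described in this commit can be sketched as follows. This is a minimal illustration, not the actual MemoryLeakTestSuite: the names `settleDelay`, `goroutineCount` and `leakedGoroutines` are hypothetical, the delay value is an assumption, and `runtime.NumGoroutine` may differ from how the real suite looks up goroutines.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// settleDelay is a hypothetical value; the commit only mentions "a short timeout".
const settleDelay = 50 * time.Millisecond

// goroutineCount waits briefly so that short-lived goroutines can finish,
// then returns the current number of goroutines.
func goroutineCount() int {
	time.Sleep(settleDelay)
	return runtime.NumGoroutine()
}

// leakedGoroutines runs f and reports how many more goroutines exist
// afterwards than before.
func leakedGoroutines(f func()) int {
	before := goroutineCount()
	f()
	return goroutineCount() - before
}

func main() {
	stop := make(chan struct{})
	leaky := func() {
		go func() { <-stop }() // keeps running after leaky returns
	}
	clean := func() {
		done := make(chan struct{})
		go func() { close(done) }()
		<-done // the goroutine has finished its work before clean returns
	}
	fmt.Println("leaky:", leakedGoroutines(leaky))
	fmt.Println("clean:", leakedGoroutines(clean))
	close(stop) // clean up the goroutine started by leaky
}
```

The alternative mentioned in the commit, doing the lookup twice and checking that both counts match, would replace the fixed delay with a comparison loop.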
Maximilian Paß
eaa022282c Block Webserver during first Nomad recovery.
No requests are accepted while Poseidon is recovering Nomad environments and runners.
2024-01-15 16:05:35 +00:00
Maximilian Paß
7a446fee26 Fix flaky TestUpdateRunnersLogsErrorFromWatchAllocation. 2023-11-16 12:10:57 +01:00
Maximilian Paß
7b82300ff7 Refactor PrewarmingPoolAlert triggering
from route-based to Nomad-Event-Stream-based.
2023-11-09 13:11:39 +01:00
Maximilian Paß
543939e5cb Add independent environment reload
in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).
2023-11-09 13:11:39 +01:00
Maximilian Paß
f259d65aa4 Add unit tests for runner recovery. 2023-10-31 15:49:56 +01:00
Maximilian Paß
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that is repeated, e.g. if the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
Maximilian Paß
460b8b2065 Refactor TestReturnReturnsErrorWhenApiCallFailed
to handle the retry mechanism.
2023-09-11 13:44:29 +02:00
Maximilian Paß
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
Maximilian Paß
b708dddd23 Add Nomad Manager test case
that ensures that `onAllocationStopped` returns true when the runner was previously deleted by the inactivity timer.
This feature is required for handling a race condition with the event handling of a rescheduled allocation.
2023-09-05 15:15:39 +02:00
Maximilian Paß
354c16cc37 Fix missing rescheduled idle runners.
During today's unattended upgrade, we saw the prewarming pool size drop to (near) zero. This was caused by lost Nomad allocations. The allocations got rescheduled, but were not added to Poseidon again.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner had not been removed by the manager but was an idle runner.
2023-09-05 15:15:39 +02:00
Maximilian Paß
67297ec5a2 Add regression test for rescheduled idle runner. 2023-09-05 15:15:39 +02:00
Maximilian Paß
6a1677dea0 Introduce reason for destroying runner
in order to return a specific error for OOM Killed Executions.
2023-07-21 15:30:21 +02:00
Maximilian Paß
bfb5977d24 Destroy runner on allocation stopped
Destroying the runner when Nomad informs us about its allocation being stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.
2023-07-21 15:30:21 +02:00
Maximilian Paß
527aaf713f Fix decreased prewarming pool due to inactivity timer.
When allocations fail and restart, they are added to the idle runners again. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.
2023-06-16 17:27:45 +01:00
Maximilian Paß
b620d0fad7 Introduce Allocation State Tracking
in order to break down the current state and evaluate if it is invalid.
2023-06-13 14:20:20 +02:00
Maximilian Paß
e0db1bafe8 Fix multiple user Runner use
A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.
2023-03-31 14:42:55 +02:00
Maximilian Paß
0c6c48c3cf #190 Add unit tests for runner recovery. 2022-11-26 13:33:44 +00:00
Maximilian Paß
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
Maximilian Paß
c6e65c14bb Monitor Nomad allocation startup duration. 2022-07-31 19:42:35 +02:00
Maximilian Paß
498e8f5ff5 #110 Refactor influxdb monitoring
to use it as singleton.
This makes it possible to monitor processes that are independent of an incoming request.
2022-07-01 15:29:31 +02:00
Maximilian Paß
34040162c2 #89 Generalise the three Storage interfaces and structs into one generic storage manager. 2022-06-29 16:21:19 +02:00
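
As an illustration of what a single generic storage manager might look like, here is a minimal sketch using Go generics. The type `Storage`, its methods and the string IDs are assumptions for this example and are not taken from the Poseidon code base.

```go
package main

import (
	"fmt"
	"sync"
)

// Storage is a hypothetical generic, concurrency-safe store keyed by string IDs.
type Storage[T any] struct {
	mu      sync.RWMutex
	objects map[string]T
}

// NewStorage returns an empty storage for values of type T.
func NewStorage[T any]() *Storage[T] {
	return &Storage[T]{objects: make(map[string]T)}
}

// Add stores the object under id, replacing any previous value.
func (s *Storage[T]) Add(id string, object T) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[id] = object
}

// Get returns the object stored under id and whether it exists.
func (s *Storage[T]) Get(id string) (T, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	object, ok := s.objects[id]
	return object, ok
}

// Delete removes the object stored under id, if any.
func (s *Storage[T]) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.objects, id)
}

// Length returns the number of stored objects.
func (s *Storage[T]) Length() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.objects)
}

func main() {
	// The same implementation serves differently typed stores.
	runners := NewStorage[string]()
	timeouts := NewStorage[int]()
	runners.Add("runner-1", "idle")
	timeouts.Add("runner-1", 30)
	state, _ := runners.Get("runner-1")
	fmt.Println(state, runners.Length(), timeouts.Length())
}
```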
Maximilian Paß
0ef5a4e39f Make Execution Environment interface Nomad independent 2022-02-28 14:54:40 +01:00
Maximilian Paß
ba43f667c2 Add architecture for multiple managers
using the chain of responsibility pattern.
2022-02-28 14:54:40 +01:00