poseidon

Author	SHA1	Message	Date
Maximilian Paß	cbcd5f233e	Fix idle runner being memory leaked when its allocation is restarted by Nomad. Fix logic created in `354c16cc`.	2024-06-06 09:46:49 +02:00
Maximilian Paß	ab938bfc22	Refactor MemoryLeakTestSuite as we identified two issues where the goroutine count from before differs from after the test. 1) It seemed like a Go runtime specific Goroutine appeared in rare cases before the test. To avoid this, we introduced a short timeout before looking up the Goroutines. Another solution might be to do the lookup twice and check if the count matches. 2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.	2024-02-28 11:52:51 +01:00
Maximilian Paß	eaa022282c	Block Webserver during first Nomad recovery. No requests are accepted while Poseidon is recovering Nomad environments and runners.	2024-01-15 16:05:35 +00:00
Maximilian Paß	7a446fee26	Fix flaky TestUpdateRunnersLogsErrorFromWatchAllocation.	2023-11-16 12:10:57 +01:00
Maximilian Paß	7b82300ff7	Refactor PrewarmingPoolAlert triggering from route-based to Nomad-Event-Stream-based.	2023-11-09 13:11:39 +01:00
Maximilian Paß	543939e5cb	Add independent environment reload in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).	2023-11-09 13:11:39 +01:00
Maximilian Paß	f259d65aa4	Add unit tests for runner recovery.	2023-10-31 15:49:56 +01:00
Maximilian Paß	6b69a2d732	Refactor Nomad Recovery from an approach that loaded the runners only once at startup to a method that is repeated, i.e. whenever the Nomad Event Stream connection is interrupted.	2023-10-31 15:49:56 +01:00
Maximilian Paß	460b8b2065	Refactor TestReturnReturnsErrorWhenApiCallFailed to handle the retry mechanism.	2023-09-11 13:44:29 +02:00
Maximilian Paß	3abd4d9a3d	Refactor all tests to use the MemoryLeakTestSuite.	2023-09-11 13:44:29 +02:00
Maximilian Paß	b708dddd23	Add Nomad Manager test case that ensures that `onAllocationStopped` returns true when the runner was deleted before by the inactivity timer. This feature is required for handling a race condition with the event handling of a rescheduled allocation.	2023-09-05 15:15:39 +02:00
Maximilian Paß	354c16cc37	Fix missing rescheduled idle runners. In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was caused by lost Nomad allocations. The allocations got rescheduled, but were not added to Poseidon again. The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager but was an idle runner.	2023-09-05 15:15:39 +02:00
Maximilian Paß	67297ec5a2	Add regression test for rescheduled idle runner.	2023-09-05 15:15:39 +02:00
Maximilian Paß	6a1677dea0	Introduce reason for destroying runner in order to return a specific error for OOM Killed Executions.	2023-07-21 15:30:21 +02:00
Maximilian Paß	bfb5977d24	Destroy runner on allocation stopped. Destroying the runner when Nomad informs us about its allocation being stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.	2023-07-21 15:30:21 +02:00
Maximilian Paß	527aaf713f	Fix decreased prewarming pool due to inactivity timer. When allocations fail and restart they are added again to the idle runners. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.	2023-06-16 17:27:45 +01:00
Maximilian Paß	b620d0fad7	Introduce Allocation State Tracking in order to break down the current state and evaluate if it is invalid.	2023-06-13 14:20:20 +02:00
Maximilian Paß	e0db1bafe8	Fix multiple user Runner use. A previously unknown Nomad reload added already known runners to the idle runners again, even if they were already in use.	2023-03-31 14:42:55 +02:00
Maximilian Paß	0c6c48c3cf	#190 Add unit tests for runner recovery.	2022-11-26 13:33:44 +00:00
Maximilian Paß	160df3d9e6	Add retry-mechanism for sample, mark-as-used and return of Nomad runners.	2022-10-24 22:12:09 +01:00
Maximilian Paß	c6e65c14bb	Monitor Nomad allocation startup duration.	2022-07-31 19:42:35 +02:00
Maximilian Paß	498e8f5ff5	#110 Refactor influxdb monitoring to use it as singleton. This enables the possibility to monitor processes that are independent of an incoming request.	2022-07-01 15:29:31 +02:00
Maximilian Paß	34040162c2	#89 Generalise the three Storage interfaces and structs into one generic storage manager.	2022-06-29 16:21:19 +02:00
Maximilian Paß	0ef5a4e39f	Make Execution Environment interface Nomad independent	2022-02-28 14:54:40 +01:00
Maximilian Paß	ba43f667c2	Add architecture for multiple managers using the chain of responsibility pattern.	2022-02-28 14:54:40 +01:00

25 Commits