poseidon

Author	SHA1	Message	Date
Elmar Kresse	12ff205bd2	added k8s stub adapter for execution environment	2024-09-18 10:43:38 +02:00
Maximilian Pass	8f819de2e0	Adjust Nomad restart and reschedule behavior to limit the number of reschedules as we cannot guarantee an error-free job definition. We also set reasonable delays and intervals to increase the likelihood that one deployment has enough time, but there is no second deployment within the interval.	2024-06-27 16:53:30 +02:00
Maximilian Pass	ead96f6f18	Remove Nomad Job Scaling option because our current Poseidon-Nomad architecture has a 1:1 runner-job relationship and there is no need to have more than the one task per job.	2024-06-27 16:53:30 +02:00
Maximilian Paß	cbcd5f233e	Fix idle runner being memory leaked when its allocation is restarted by Nomad. Fix logic created in `354c16cc`.	2024-06-06 09:46:49 +02:00
Maximilian Paß	ec3b2a93db	Fix Golangci-lint configuration	2024-05-07 14:48:17 +02:00
Maximilian Paß	19e0ae1583	Fix concurrent map write in the Nomad `evaluations` map by replacing the simple map with our concurrency-ready storage object.	2024-04-17 13:19:49 +02:00
Maximilian Paß	9deee186a7	Fix Runner DNS resolution by adding public nameservers to the CNI secure bridge configuration.	2024-04-03 10:14:24 +02:00
Maximilian Paß	ab938bfc22	Refactor MemoryLeakTestSuite as we identified two issues where the goroutine count from before differs from after the test. 1) It seemed like a Go runtime specific Goroutine appeared in rare cases before the test. To avoid this, we introduced a short timeout before looking up the Goroutines. Another solution might be to do the lookup twice and check if the count matches. 2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.	2024-02-28 11:52:51 +01:00
Maximilian Paß	939904d406	Fix linter emptyStringTest rule by replacing the length check with a string comparison. This rule got introduced with the new GolangCI lint version.	2024-02-27 15:53:47 +01:00
Maximilian Paß	317590d3ea	Revert "Debug Health route latency." This reverts commit `213628b958`.	2024-01-26 22:51:55 +01:00
Maximilian Paß	213628b958	Debug Health route latency.	2024-01-26 14:36:16 +01:00
Maximilian Paß	57590457a8	Add logging filter token. The token is used to filter out request logs when the user agent matches a randomly generated string.	2024-01-24 17:21:00 +01:00
Maximilian Paß	221a6ff1b2	Watchdog: Verify Server TLS Certificate	2024-01-24 17:21:00 +01:00
Maximilian Paß	eaddc65989	Configure Systemd Socket Activation as new way for Poseidon to accept connections. This should reduce our issues caused by deployments.	2024-01-15 16:05:35 +00:00
Maximilian Paß	eaa022282c	Block Webserver during first Nomad recovery. No requests are accepted while Poseidon is recovering Nomad environments and runners.	2024-01-15 16:05:35 +00:00
Maximilian Paß	9646542499	Inject Execution Debug Message for measuring the performance of the until loop of the stderr connection.	2023-11-23 16:17:30 +01:00
Maximilian Paß	64412e1c4b	Revert "Inject Execution Debug Message" This reverts commit `04a2d0ff3b`.	2023-11-23 15:27:31 +01:00
Maximilian Paß	04a2d0ff3b	Inject Execution Debug Message for measuring the performance of the until loop of the stderr connection.	2023-11-23 15:25:23 +01:00
Maximilian Paß	cb08787c7d	Rephrase Evaluation channel log statement.	2023-11-23 13:22:13 +01:00
Maximilian Paß	ab12c9046d	Decrease Log Severity of errors trying to read the request body.	2023-11-22 19:14:42 +01:00
Maximilian Paß	096ac9874f	Remove misleading warning.	2023-11-19 12:09:51 +00:00
Maximilian Paß	7a446fee26	Fix flaky TestUpdateRunnersLogsErrorFromWatchAllocation.	2023-11-16 12:10:57 +01:00
Maximilian Paß	3ce014f5f3	Fix flaky TestSendsSignalAfterTimeout.	2023-11-16 12:10:57 +01:00
Maximilian Paß	c820ff99e6	Fix flaky TestWithSeparateStderr.	2023-11-16 12:10:57 +01:00
Maximilian Paß	70c108aebf	Unify the representation of the three dots.	2023-11-09 13:11:39 +01:00
Maximilian Paß	0f7e98f78e	Refactor PrewarmingPoolAlert triggering from route-based to Nomad-Event-Stream-based.	2023-11-09 13:11:39 +01:00
Maximilian Paß	7b82300ff7	Refactor PrewarmingPoolAlert triggering from route-based to Nomad-Event-Stream-based.	2023-11-09 13:11:39 +01:00
Maximilian Paß	543939e5cb	Add independent environment reload in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).	2023-11-09 13:11:39 +01:00
Maximilian Paß	c46a09eeae	Add Prewarming Pool Alert that checks for every environment if the filled share of the prewarming pool is at least the specified threshold.	2023-11-09 13:11:39 +01:00
Sebastian Serth	1d93f3895f	Reduce severity of "Too many idle runners"	2023-11-01 18:42:45 +01:00
Maximilian Paß	f259d65aa4	Add unit tests for runner recovery.	2023-10-31 15:49:56 +01:00
Maximilian Paß	160a097d07	Fix flaky test TestDestroysRunnerAfterTimeoutAndSignal.	2023-10-31 15:49:56 +01:00
Maximilian Paß	d0dd5c08cb	Remove usage of context.DeadlineExceeded for internal decisions as this error is strongly used by other packages. By checking such wrapped errors the internal decision can be influenced accidentally. In this case the retry mechanism checked if the error is context.DeadlineExceeded and assumed it would be created by the internal context. This assumption was wrong.	2023-10-31 15:49:56 +01:00
Maximilian Paß	6b69a2d732	Refactor Nomad Recovery from an approach that loaded the runners only once at the startup to a method that will be repeated i.e. if the Nomad Event Stream connection interrupts.	2023-10-31 15:49:56 +01:00
Maximilian Paß	b2898f9183	Fix List of the Environments with fetch. Previously, the List function dropped all idleRunners of all environments when fetch was set. Additionally, the replaced environment was not destroyed properly, so a goroutine for it and for all its idle runners remained running.	2023-10-31 15:49:56 +01:00
Maximilian Paß	2713e8672c	Add error for empty list file system execution. Normally, the result of executing the `lsCommand` should never be empty. However, we have observed that CodeOcean sometimes receives an empty JSON result if the runner is being deleted while the list file system request is processed. Therefore, we add a check if something has been written to CodeOcean and otherwise report an error.	2023-10-29 15:23:40 +01:00
Maximilian Paß	14b012486d	Formalize Memory Monitoring by extracting the interval and threshold into configuration options. Related to `f670b07e`.	2023-10-12 16:16:46 +02:00
Maximilian Paß	3d252492fe	Fix rescheduled used runners being removed. As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.	2023-09-18 01:06:35 +02:00
Maximilian Paß	6dc83ca7b5	Add regression test for rescheduled used runners being removed. As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.	2023-09-18 01:06:35 +02:00
Maximilian Paß	90d591d4ec	Change default behavior in Nomad Event Handling to not propagate that pending runners are being stopped.	2023-09-18 00:54:26 +02:00
Maximilian Paß	2eb15c8d93	Fix losing rescheduled runners that are rescheduled while the previous allocation was still pending. We fix this by removing the race condition handling that should prevent Poseidon from throwing warnings of unexpected allocation stopping.	2023-09-18 00:54:26 +02:00
Maximilian Paß	788cb0f660	Add regression test for the recent lost runners.	2023-09-18 00:54:26 +02:00
Maximilian Paß	68cd8f43b4	Defuse data race condition of TestWithSeparateStderrReturnsCommandError.	2023-09-11 13:44:29 +02:00
Maximilian Paß	6159f2a045	Fix Goroutine Leak of Nomad execute command that was triggered when [the execution timeout got exceeded, the runner got destroyed, or the WebSocket connection to CodeOcean closed] and the Allocation did not react to the SIGQUIT within the grace period.	2023-09-11 13:44:29 +02:00
Maximilian Paß	59da36303c	Fix Goroutine Leak of Environment Get that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it in case that we just copy its configuration to the existing environment.	2023-09-11 13:44:29 +02:00
Maximilian Paß	460b8b2065	Refactor TestReturnReturnsErrorWhenApiCallFailed to handle the retry mechanism.	2023-09-11 13:44:29 +02:00
Maximilian Paß	3abd4d9a3d	Refactor all tests to use the MemoryLeakTestSuite.	2023-09-11 13:44:29 +02:00
Maximilian Paß	e3161637a9	Extract the WatchEventStream retry mechanism into the utils including all other retry mechanisms. With this change we fix that the WatchEventStream goroutine does not stop directly when the context is done (but previously only one second after).	2023-09-11 13:44:29 +02:00
Maximilian Paß	0d6b4f660c	Refactor NewAbstractManager to require a context used for the monitoring.	2023-09-11 13:44:29 +02:00
Maximilian Paß	b708dddd23	Add Nomad Manager test case that ensures that `onAllocationStopped` returns true when the runner was deleted before by the inactivity timer. This feature is required for handling a race condition with the event handling of a rescheduled allocation.	2023-09-05 15:15:39 +02:00

222 Commits