222 Commits

Author SHA1 Message Date
12ff205bd2 added k8s stub adapter for execution environment 2024-09-18 10:43:38 +02:00
8f819de2e0 Adjust Nomad restart and reschedule behavior
to limit the number of reschedules as we cannot guarantee an error-free job definition.
We also set reasonable delays and intervals to increase the likelihood that one deployment has enough time, but there is no second deployment within the interval.
2024-06-27 16:53:30 +02:00
ead96f6f18 Remove Nomad Job Scaling option
because our current Poseidon-Nomad architecture has a 1:1 runner-job relationship and there is no need to have more than one task per job.
2024-06-27 16:53:30 +02:00
cbcd5f233e Fix idle runner being memory leaked
when its allocation is restarted by Nomad.

Fix logic created in 354c16cc.
2024-06-06 09:46:49 +02:00
ec3b2a93db Fix Golangci-lint configuration 2024-05-07 14:48:17 +02:00
19e0ae1583 Fix concurrent map write
in the Nomad `evaluations` map by replacing the simple map with our concurrency-ready storage object.
2024-04-17 13:19:49 +02:00
9deee186a7 Fix Runner DNS resolution
by adding public nameservers to the CNI secure bridge configuration.
2024-04-03 10:14:24 +02:00
ab938bfc22 Refactor MemoryLeakTestSuite
as we identified two issues where the goroutine count from before differs from after the test.

1) It seemed like a Go runtime specific Goroutine appeared in rare cases before the test. To avoid this, we introduced a short timeout before looking up the Goroutines.
Another solution might be to do the lookup twice and check if the count matches.

2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.
2024-02-28 11:52:51 +01:00
939904d406 Fix linter emptyStringTest rule
by replacing the length check with a string comparison.
This rule got introduced with the new GolangCI lint version.
2024-02-27 15:53:47 +01:00
317590d3ea Revert "Debug Health route latency."
This reverts commit 213628b958.
2024-01-26 22:51:55 +01:00
213628b958 Debug Health route latency. 2024-01-26 14:36:16 +01:00
57590457a8 Add logging filter token
The token is used to filter out request logs when the user agent matches a randomly generated string.
2024-01-24 17:21:00 +01:00
221a6ff1b2 Watchdog: Verify Server TLS Certificate 2024-01-24 17:21:00 +01:00
eaddc65989 Configure Systemd Socket Activation
as a new way for Poseidon to accept connections. This should reduce our issues caused by deployments.
2024-01-15 16:05:35 +00:00
eaa022282c Block Webserver during first Nomad recovery.
No requests are accepted while Poseidon is recovering Nomad environments and runners.
2024-01-15 16:05:35 +00:00
9646542499 Inject Execution Debug Message
for measuring the performance of the until loop of the stderr connection.
2023-11-23 16:17:30 +01:00
64412e1c4b Revert "Inject Execution Debug Message"
This reverts commit 04a2d0ff3b.
2023-11-23 15:27:31 +01:00
04a2d0ff3b Inject Execution Debug Message
for measuring the performance of the until loop of the stderr connection.
2023-11-23 15:25:23 +01:00
cb08787c7d Rephrase Evaluation channel log statement. 2023-11-23 13:22:13 +01:00
ab12c9046d Decrease Log Severity
of errors trying to read the request body.
2023-11-22 19:14:42 +01:00
096ac9874f Remove misleading warning. 2023-11-19 12:09:51 +00:00
7a446fee26 Fix flaky TestUpdateRunnersLogsErrorFromWatchAllocation. 2023-11-16 12:10:57 +01:00
3ce014f5f3 Fix flaky TestSendsSignalAfterTimeout. 2023-11-16 12:10:57 +01:00
c820ff99e6 Fix flaky TestWithSeparateStderr. 2023-11-16 12:10:57 +01:00
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
0f7e98f78e Refactor PrewarmingPoolAlert triggering
from route-based to Nomad-Event-Stream-based.
2023-11-09 13:11:39 +01:00
7b82300ff7 Refactor PrewarmingPoolAlert triggering
from route-based to Nomad-Event-Stream-based.
2023-11-09 13:11:39 +01:00
543939e5cb Add independent environment reload
in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).
2023-11-09 13:11:39 +01:00
c46a09eeae Add Prewarming Pool Alert
that checks for every environment if the filled share of the prewarming pool is at least the specified threshold.
2023-11-09 13:11:39 +01:00
1d93f3895f Reduce severity of "Too many idle runners" 2023-11-01 18:42:45 +01:00
f259d65aa4 Add unit tests for runner recovery. 2023-10-31 15:49:56 +01:00
160a097d07 Fix flaky test TestDestroysRunnerAfterTimeoutAndSignal. 2023-10-31 15:49:56 +01:00
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions as this error is widely used by other packages. By checking such wrapped errors, the internal decision can be influenced accidentally.
In this case the retry mechanism checked if the error is context.DeadlineExceeded and assumed it would be created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that will be repeated, e.g., if the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
b2898f9183 Fix List of the Environments with fetch.
Before, the List function dropped all idleRunners of all environments when fetch was set.

Additionally, the replaced environment was not destroyed properly so that a goroutine for it and for all its idle runners remained running.
2023-10-31 15:49:56 +01:00
2713e8672c Add error for empty list file system execution.
Normally, the result of executing the `lsCommand` should never be empty. However, we have observed that CodeOcean sometimes receives an empty JSON result if the runner is being deleted while the list file system request is processed. Therefore, we add a check if something has been written to CodeOcean and otherwise report an error.
2023-10-29 15:23:40 +01:00
14b012486d Formalize Memory Monitoring
by extracting the interval and threshold into configuration options.

Related to f670b07e.
2023-10-12 16:16:46 +02:00
3d252492fe Fix rescheduled used runners being removed.
As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.
2023-09-18 01:06:35 +02:00
6dc83ca7b5 Add regression test for rescheduled used runners being removed.
As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.
2023-09-18 01:06:35 +02:00
90d591d4ec Change default behavior in Nomad Event Handling
to not propagate that pending runners are being stopped.
2023-09-18 00:54:26 +02:00
2eb15c8d93 Fix losing of rescheduled runners
that are rescheduled while the previous allocation was still pending.
We fix this by removing the race condition handling that should prevent Poseidon from throwing warnings of unexpected allocation stopping.
2023-09-18 00:54:26 +02:00
788cb0f660 Add regression test for the recent lost runners. 2023-09-18 00:54:26 +02:00
68cd8f43b4 Defuse data race condition of TestWithSeparateStderrReturnsCommandError. 2023-09-11 13:44:29 +02:00
6159f2a045 Fix Goroutine Leak of Nomad execute command
that was triggered when [the execution timeout got exceeded, the runner got destroyed, or the WebSocket connection to CodeOcean closed] and the Allocation did not react to the SIGQUIT within the grace period.
2023-09-11 13:44:29 +02:00
59da36303c Fix Goroutine Leak of Environment Get
that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it in case that we just copy its configuration to the existing environment.
2023-09-11 13:44:29 +02:00
460b8b2065 Refactor TestReturnReturnsErrorWhenApiCallFailed
to handle the retry mechanism.
2023-09-11 13:44:29 +02:00
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

With this change, the WatchEventStream goroutine stops immediately when the context is done (previously, it stopped only one second later).
2023-09-11 13:44:29 +02:00
0d6b4f660c Refactor NewAbstractManager
to require a context used for the monitoring.
2023-09-11 13:44:29 +02:00
b708dddd23 Add Nomad Manager test case
that ensures that `onAllocationStopped` returns true when the runner was deleted before by the inactivity timer.
This feature is required for handling a race condition with the event handling of a rescheduled allocation.
2023-09-05 15:15:39 +02:00