66 Commits

Author SHA1 Message Date
12ff205bd2 added k8s stub adapter for execution environment 2024-09-18 10:43:38 +02:00
8f819de2e0 Adjust Nomad restart and reschedule behavior
to limit the number of reschedules as we cannot guarantee an error-free job definition.
We also set reasonable delays and intervals to increase the likelihood that one deployment has enough time, but there is no second deployment within the interval.
2024-06-27 16:53:30 +02:00
ead96f6f18 Remove Nomad Job Scaling option
because our current Poseidon-Nomad architecture has a 1:1 runner-job relationship and there is no need to have more than the one task per job.
2024-06-27 16:53:30 +02:00
cbcd5f233e Fix idle runner being memory leaked
when its allocation is restarted by Nomad.

Fix logic created in 354c16cc.
2024-06-06 09:46:49 +02:00
9deee186a7 Fix Runner DNS resolution
by adding public nameservers to the CNI secure bridge configuration.
2024-04-03 10:14:24 +02:00
ab938bfc22 Refactor MemoryLeakTestSuite
as we identified two issues where the goroutine count from before differs from after the test.

1) It seemed like a Go runtime specific Goroutine appeared in rare cases before the test. To avoid this, we introduced a short timeout before looking up the Goroutines.
Another solution might be to do the lookup twice and check if the count matches.

2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.
2024-02-28 11:52:51 +01:00
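
The goroutine-count comparison described in ab938bfc22 can be sketched as follows. This is only an illustration under assumptions, not the actual MemoryLeakTestSuite: `goroutineDelta`, `cleanWorker`, `leakyWorker`, and the `settle` duration are hypothetical names and values chosen for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// settle is an assumed pause that lets short-lived goroutines start or exit
// before they are counted (the "short timeout" mentioned in the commit).
const settle = 50 * time.Millisecond

// goroutineDelta is a hypothetical helper: it counts the goroutines before
// and after running f and returns the difference.
func goroutineDelta(f func()) int {
	time.Sleep(settle) // wait so that runtime-internal goroutines can settle
	before := runtime.NumGoroutine()
	f()
	time.Sleep(settle) // give goroutines stopped by f time to exit
	return runtime.NumGoroutine() - before
}

// cleanWorker starts a goroutine and waits until it has signaled completion.
func cleanWorker() {
	done := make(chan struct{})
	go func() { close(done) }()
	<-done
}

// leakyWorker starts a goroutine that blocks on a channel nobody closes.
func leakyWorker() {
	never := make(chan struct{})
	go func() { <-never }()
}

func main() {
	fmt.Println("clean delta:", goroutineDelta(cleanWorker))
	fmt.Println("leaky delta:", goroutineDelta(leakyWorker))
}
```

A nonzero difference hints at a goroutine that was started but never stopped. The alternative mentioned in the commit, looking up the count twice and comparing, would replace the fixed pause.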
096ac9874f Remove misleading warning. 2023-11-19 12:09:51 +00:00
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
c46a09eeae Add Prewarming Pool Alert
that checks for every environment if the filled share of the prewarming pool is at least the specified threshold.
2023-11-09 13:11:39 +01:00
1d93f3895f Reduce severity of "Too many idle runners" 2023-11-01 18:42:45 +01:00
f259d65aa4 Add unit tests for runner recovery. 2023-10-31 15:49:56 +01:00
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions as this error is widely used by other packages. Checking such wrapped errors can accidentally influence the internal decision.
In this case the retry mechanism checked if the error is context.DeadlineExceeded and assumed it would be created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
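
The pitfall described in d0dd5c08cb can be illustrated with a small sketch. It is not Poseidon's code: `errInternalTimeout`, `callDependency`, `shouldStopFragile`, and `shouldStopRobust` are hypothetical names. The idea shown is that a wrapped `context.DeadlineExceeded` from another package matches `errors.Is`, whereas a dedicated sentinel error only matches when the internal logic created it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errInternalTimeout is a hypothetical sentinel error that only the internal
// retry logic would return.
var errInternalTimeout = errors.New("internal retry deadline reached")

// callDependency simulates a call into another package whose own context
// times out and whose error wraps context.DeadlineExceeded.
func callDependency() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done()
	return fmt.Errorf("dependency call failed: %w", ctx.Err())
}

// shouldStopFragile treats every DeadlineExceeded as an internal timeout,
// which is the assumption the commit describes as wrong.
func shouldStopFragile(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// shouldStopRobust only reacts to the dedicated internal sentinel error.
func shouldStopRobust(err error) bool {
	return errors.Is(err, errInternalTimeout)
}

func main() {
	err := callDependency()
	fmt.Println("fragile check stops retrying:", shouldStopFragile(err))
	fmt.Println("robust check stops retrying:", shouldStopRobust(err))
}
```

In this sketch the foreign timeout triggers the fragile check but not the robust one, so an unrelated deadline no longer influences the internal decision.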
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that will be repeated, e.g., if the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
b2898f9183 Fix List of the Environments with fetch.
Before, the List function dropped all idleRunners of all environments when fetch was set.

Additionally, the replaced environment was not destroyed properly so that a goroutine for it and for all its idle runners remained running.
2023-10-31 15:49:56 +01:00
59da36303c Fix Goroutine Leak of Environment Get
that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it in case that we just copy its configuration to the existing environment.
2023-09-11 13:44:29 +02:00
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

With this change we fix that the WatchEventStream goroutine does not stop directly when the context is done (but previously only one second after).
2023-09-11 13:44:29 +02:00
0d6b4f660c Refactor NewAbstractManager
to require a context used for the monitoring.
2023-09-11 13:44:29 +02:00
354c16cc37 Fix missing rescheduled idle runners.
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was based on lost Nomad allocations. The allocations got rescheduled, but not added again to Poseidon.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager, but an idle runner.
2023-09-05 15:15:39 +02:00
13a9da95e5 Introduce a context for RetryExponential
as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because the previous commit started retrying the Nomad environment recovery. This always fails for unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment got deleted.
2023-08-18 09:28:23 +02:00
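
A minimal sketch of the idea in 13a9da95e5, exponential retrying that stops either after a maximum number of attempts or when a context is done. The function `retryExponential`, its signature, and the helper `attemptsUntilStop` are assumptions for illustration and do not reproduce Poseidon's actual `RetryExponential`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryExponential is a hypothetical helper: it calls f until it succeeds,
// maxAttempts is reached, or ctx is done. The delay doubles after each failure.
func retryExponential(ctx context.Context, maxAttempts int, initialDelay time.Duration, f func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var err error
	delay := initialDelay
	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Second stop criterion: do not start an attempt once ctx is done.
		if ctxErr := ctx.Err(); ctxErr != nil {
			return fmt.Errorf("stopped retrying: %w", ctxErr)
		}
		if err = f(); err == nil {
			return nil
		}
		if attempt == maxAttempts-1 {
			break // no need to wait after the last attempt
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("stopped retrying: %w", ctx.Err())
		case <-time.After(delay):
			delay *= 2
		}
	}
	return err
}

// attemptsUntilStop runs an always-failing function and reports how often
// it was called, optionally with an already canceled context.
func attemptsUntilStop(canceled bool) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if canceled {
		cancel()
	}
	calls := 0
	_ = retryExponential(ctx, 3, time.Millisecond, func() error {
		calls++
		return errors.New("always fails")
	})
	return calls
}

func main() {
	fmt.Println("attempts with live context:", attemptsUntilStop(false))
	fmt.Println("attempts with canceled context:", attemptsUntilStop(true))
}
```

With a context tied to the lifetime of a test or an environment, the retrying ends as soon as that context is canceled instead of running through all remaining attempts.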
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
bcab46d746 Allow unlimited Nomad reschedules
With this measure, we want to avoid template jobs being removed on the second rescheduling.
2023-06-13 14:20:20 +02:00
f7339570ae Fix increased prewarming pool size
by checking the number of required runners before creating an additional runner.
2023-05-28 23:47:07 +01:00
2650efbb38 Sentry Tracing Identifier 2023-02-03 10:29:18 +00:00
f2c205a8ed Add additional performance spans 2023-02-03 10:29:18 +00:00
0c6c48c3cf #190 Add unit tests for runner recovery. 2022-11-26 13:33:44 +00:00
81d777c9cb Increase minimal memory usage
as we collected new insights about the actual memory usage.
2022-11-09 23:19:25 +01:00
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
7119f3e012 Fix not canceling monitoring events for removed environments
and runners.
2022-10-24 13:15:14 +02:00
5d54b0f786 Fix wrong environment id at monitoring
data for created or updated environments.
2022-10-24 13:15:14 +02:00
d372e37d1a Add cni/secure-bridge to isolate host network 2022-09-18 19:02:04 +02:00
1eef26cc83 Add environment id to periodical monitoring events. 2022-08-20 09:17:43 +02:00
5590c50e14 #110 Add periodical monitoring events. 2022-08-19 20:48:46 +02:00
021530d5a7 Apply GoFmt fixes 2022-08-10 19:34:05 +02:00
18daa1152c Save the environment id for runner monitoring. 2022-07-31 19:42:35 +02:00
39cfdbf635 Apply suggestions from code review 2022-07-01 15:29:31 +02:00
498e8f5ff5 #110 Refactor influxdb monitoring
to use it as singleton.
This enables the possibility to monitor processes that are independent of an incoming request.
2022-07-01 15:29:31 +02:00
34040162c2 #89 Generalise the three Storage interfaces and structs into one generic storage manager. 2022-06-29 16:21:19 +02:00
0f8a1fa25a Specify AWS Functions as list
to conform with the yaml standard of list definition.
2022-06-08 09:01:46 +02:00
795c83f7b2 Fix deleting nonexistent environments
which caused a panic when an environment was not found and the nonexistent runner manager at the end of the chain was asked for it.
2022-06-07 15:54:48 +02:00
3570f18202 Apply suggestions from code review
Co-authored-by: Sebastian Serth <MrSerth@users.noreply.github.com>
2022-04-09 16:35:53 +02:00
136f596dc2 Add aws environments to the statistics
but only with the field usedRunners.
2022-04-09 16:35:53 +02:00
a41659eed4 Enable memory oversubscription (#102)
* Enable memory oversubscription

* Fix and add e2e test
2022-03-18 08:31:27 +01:00
2cf890ab91 Implement review comments 2022-02-28 14:54:40 +01:00
4ffbb712ed Parametrize e2e tests to also check AWS environments.
- Fix destroy runner after timeout.
- Add file deletion
2022-02-28 14:54:40 +01:00
d603a8ebb0 Refactor static AWS functions
from a magic number in the code to a configurable list in configuration.yaml
2022-02-28 14:54:40 +01:00
f6d9a6ddbb Add unit tests 2022-02-28 14:54:40 +01:00
6123d20525 Implement core functionality of AWS integration 2022-02-28 14:54:40 +01:00
dd41e0d5c4 Generate structures for an AWS environment and runner 2022-02-28 14:54:40 +01:00