Commit Graph

60 Commits

Author SHA1 Message Date
096ac9874f Remove misleading warning. 2023-11-19 12:09:51 +00:00
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
c46a09eeae Add Prewarming Pool Alert
that checks for every environment if the filled share of the prewarmin pool is at least the specified threshold.
2023-11-09 13:11:39 +01:00
1d93f3895f Reduce severity of "Too many idle runners" 2023-11-01 18:42:45 +01:00
f259d65aa4 Add unit tests for runner recovery. 2023-10-31 15:49:56 +01:00
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions as this error is strongly used by other packages. By checking such wrapped errors the internal decision can be influenced accidentally.
In this case the retry mechanism checked if the error is context.DeadlineExceeded and assumed it would be created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that will be repeated i.e. if the Nomad Event Stream connection interrupts.
2023-10-31 15:49:56 +01:00
b2898f9183 Fix List of the Environments with fetch.
Before the List function dropped all idleRunners of all environments when fetch was set.

Additionally, the replaced environment was not destroyed properly so that a goroutine for it and for all its idle runners remained running.
2023-10-31 15:49:56 +01:00
59da36303c Fix Goroutine Leak of Environment Get
that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it in case that we just copy its configuration to the existing environment.
2023-09-11 13:44:29 +02:00
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

With this change we fix that the WatchEventStream goroutine does not stop directly when the context is done (but previously only one second after).
2023-09-11 13:44:29 +02:00
0d6b4f660c Refactor NewAbstractManager
to require a context used for the monitoring.
2023-09-11 13:44:29 +02:00
354c16cc37 Fix missing rescheduled idle runners.
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was based on lost Nomad allocations. The allocations got rescheduled, but not added again to Poseidon.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager, but an idle runner.
2023-09-05 15:15:39 +02:00
13a9da95e5 Introduce a context for RetryExponential
as second criteria (next to the maximum number of attempts) for canceling the retrying. This is required as we started with the previous commit to retry the nomad environment recovery. This always fails for unit tests (as they are not connected to an Nomad cluster). Before, we ignored the one error but the retrying leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment got deleted.
2023-08-18 09:28:23 +02:00
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
bcab46d746 Allow unlimited Nomad reschedules
With this measure, we want to avoid template jobs being removed on the second rescheduling.
2023-06-13 14:20:20 +02:00
f7339570ae Fix increased prewarming pool size
by checking the number of required runners before creating an additional runner.
2023-05-28 23:47:07 +01:00
2650efbb38 Sentry Tracing Identifier 2023-02-03 10:29:18 +00:00
f2c205a8ed Add additional performance spans 2023-02-03 10:29:18 +00:00
0c6c48c3cf #190 Add unit tests for runner recovery. 2022-11-26 13:33:44 +00:00
81d777c9cb Increase minimal memory usage
as we collected new insights about the actual memory usage.
2022-11-09 23:19:25 +01:00
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
7119f3e012 Fix not canceling monitoring events for removed environments
and runners.
2022-10-24 13:15:14 +02:00
5d54b0f786 Fix wrong environment id at monitoring
data for created or updated environments.
2022-10-24 13:15:14 +02:00
d372e37d1a Add cni/secure-bridge to isolate host network 2022-09-18 19:02:04 +02:00
1eef26cc83 Add environment id to periodical monitoring events. 2022-08-20 09:17:43 +02:00
5590c50e14 #110 Add periodical monitoring events. 2022-08-19 20:48:46 +02:00
021530d5a7 Apply GoFmt fixes 2022-08-10 19:34:05 +02:00
18daa1152c Save the environment id for runner monitoring. 2022-07-31 19:42:35 +02:00
39cfdbf635 Apply suggestions from code review 2022-07-01 15:29:31 +02:00
498e8f5ff5 #110 Refactor influxdb monitoring
to use it as singleton.
This enables the possibility to monitor processes that are independent of an incoming request.
2022-07-01 15:29:31 +02:00
34040162c2 #89 Generalise the three Storage interfaces and structs into one generic storage manager. 2022-06-29 16:21:19 +02:00
0f8a1fa25a Specify AWS Functions as list
to conform with the yaml standard of list definition.
2022-06-08 09:01:46 +02:00
795c83f7b2 Fix deleting non existent environments
that is an error caused by throwing a panic when an environment is not found and a nonexistent runner manager at the end of the chain is asked for it.
2022-06-07 15:54:48 +02:00
3570f18202 Apply suggestions from code review
Co-authored-by: Sebastian Serth <MrSerth@users.noreply.github.com>
2022-04-09 16:35:53 +02:00
136f596dc2 Add aws environments to the statistics
but only with the field usedRunners.
2022-04-09 16:35:53 +02:00
a41659eed4 Enable memory oversubscription (#102)
* Enable memory oversubscription

* Fix and add e2e test
2022-03-18 08:31:27 +01:00
2cf890ab91 Implement review comments 2022-02-28 14:54:40 +01:00
4ffbb712ed Parametrize e2e tests to also check AWS environments.
- Fix destroy runner after timeout.
- Add file deletion
2022-02-28 14:54:40 +01:00
d603a8ebb0 Refactor static AWS functions
from a magic number in the code to a configurable list in configuration.yaml
2022-02-28 14:54:40 +01:00
f6d9a6ddbb Add unit tests 2022-02-28 14:54:40 +01:00
6123d20525 Implement core functionality of AWS integration 2022-02-28 14:54:40 +01:00
dd41e0d5c4 Generate structures for an AWS environment and runner 2022-02-28 14:54:40 +01:00
0ef5a4e39f Make Execution Environment interface Nomad independent 2022-02-28 14:54:40 +01:00
ba43f667c2 Add architecture for multiple managers
using the chain of responsibility pattern.
2022-02-28 14:54:40 +01:00
c22b76720c Add documentation for guarding the Nomad tasks 2021-12-22 17:30:16 +01:00
d57a0c07b8 Implement review suggestions 2021-12-22 17:30:16 +01:00
0571b10b5c Recreate runners on execution environment update
Solves #69 and #48
2021-12-22 17:30:16 +01:00
2bf9b10564 Update default image in template-environment-job.hcl
* The image previously used is not available publicly and not maintained any longer
* The new base image is not bound to any specific programming environment
2021-12-22 15:54:46 +01:00