31 Commits

Author SHA1 Message Date
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions, as this error is widely used by other packages. Checking for such wrapped errors can accidentally influence internal decisions.
In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at the startup
to a method that is repeated, e.g. whenever the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

This change also fixes that the WatchEventStream goroutine did not stop immediately when the context was done (previously, it stopped only one second later).
2023-09-11 13:44:29 +02:00
0d6b4f660c Refactor NewAbstractManager
to require a context used for the monitoring.
2023-09-11 13:44:29 +02:00
354c16cc37 Fix missing rescheduled idle runners.
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was based on lost Nomad allocations. The allocations got rescheduled, but not added again to Poseidon.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner had not been removed by the manager but was an idle runner.
2023-09-05 15:15:39 +02:00
a7d27e8f65 Add missing error log statements.
When "markRunnerAsUsed" failed, we silently ignored it. Only when returning the runner failed as well did we report the error.

When a Runner is destroyed, we are only notified that Nomad removed the allocation, but cannot tell the reason.

For "the execution did not stop after SIGQUIT" we did not log the corresponding runner id.
2023-08-21 22:40:37 +02:00
13cd19ed58 Refactor Nomad Event Stream log message. 2023-08-18 09:28:23 +02:00
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
eb818f92f7 Refactor Runner Destroy Reason Masking
and ignore expected reasons, such as when the runner got destroyed by an API request.
2023-07-24 11:48:14 +01:00
6a1677dea0 Introduce reason for destroying runner
in order to return a specific error for OOM Killed Executions.
2023-07-21 15:30:21 +02:00
bfb5977d24 Destroy runner on allocation stopped
Destroying the runner when Nomad informs us that its allocation has been stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.
2023-07-21 15:30:21 +02:00
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
527aaf713f Fix decreased prewarming pool due to inactivity timer.
When allocations fail and restart they are added again to the idle runners. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.
2023-06-16 17:27:45 +01:00
f031219cb8 Fix Nomad event race condition
that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node.
It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level.
To fix it, the upper level now communicates that stopping the job was expected.
2023-06-13 14:20:20 +02:00
b620d0fad7 Introduce Allocation State Tracking
in order to break down the current state and evaluate if it is invalid.
2023-06-13 14:20:20 +02:00
8f89c14ea1 Cleanup logs for Allocation recovery
on startup. The changes do not have functional consequences as adding the allocation just overwrites the old one.
2023-05-10 18:56:51 +01:00
0c8fa9ccfa Add context to log statements. 2023-04-11 20:45:30 +01:00
038d71ff51 Nomad: Handle Container re-allocation 2023-03-31 14:42:55 +02:00
e0db1bafe8 Fix multiple user Runner use
A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.
2023-03-31 14:42:55 +02:00
a78ee22e67 Reduce time racetrack of delete and listFileSystem route. 2023-01-02 11:23:02 +01:00
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
9677253b35 Change Influx field name for the startup duration
due to a currently not resolvable type mismatch.
2022-08-10 20:46:17 +02:00
89e15c5c2f Fix startup time format
Previously, it was a string. To use it efficiently, we want it to be a number, in this case in nanoseconds.
2022-08-05 21:16:58 +02:00
c6e65c14bb Monitor Nomad allocation startup duration. 2022-07-31 19:42:35 +02:00
34040162c2 #89 Generalise the three Storage interfaces and structs into one generic storage manager. 2022-06-29 16:21:19 +02:00
b7a20e3114 Introduce method "Environment" to the Runners interface.
This way we can determine which environment a runner belongs to.
2022-04-18 13:17:49 +02:00
136f596dc2 Add aws environments to the statistics
but only with the field usedRunners.
2022-04-09 16:35:53 +02:00
6123d20525 Implement core functionality of AWS integration 2022-02-28 14:54:40 +01:00
dd41e0d5c4 Generate structures for an AWS environment and runner 2022-02-28 14:54:40 +01:00
0ef5a4e39f Make Execution Environment interface Nomad independent 2022-02-28 14:54:40 +01:00
ba43f667c2 Add architecture for multiple managers
using the chain of responsibility pattern.
2022-02-28 14:54:40 +01:00