poseidon

Author	SHA1	Message	Date
Maximilian Paß	d0dd5c08cb	Remove usage of context.DeadlineExceeded for internal decisions, as this error is widely used by other packages. Checking for such wrapped errors can accidentally influence the internal decision. In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.	2023-10-31 15:49:56 +01:00
Maximilian Paß	6b69a2d732	Refactor Nomad Recovery from an approach that loaded the runners only once at startup to a method that is repeated, i.e. if the Nomad Event Stream connection is interrupted.	2023-10-31 15:49:56 +01:00
Maximilian Paß	e3161637a9	Extract the WatchEventStream retry mechanism into the utils including all other retry mechanisms. With this change we fix that the WatchEventStream goroutine does not stop directly when the context is done (but previously only one second after).	2023-09-11 13:44:29 +02:00
Maximilian Paß	0d6b4f660c	Refactor NewAbstractManager to require a context used for the monitoring.	2023-09-11 13:44:29 +02:00
Maximilian Paß	354c16cc37	Fix missing rescheduled idle runners. In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was caused by lost Nomad allocations. The allocations got rescheduled, but were not added to Poseidon again. The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager, but was an idle runner.	2023-09-05 15:15:39 +02:00
Maximilian Paß	a7d27e8f65	Add missing error log statements. When "markRunnerAsUsed" failed, we silently ignored it. Only when the return of the runner also failed did we throw the error. When a Runner is destroyed, we are only notified that Nomad removed the allocation, but cannot tell the reason. For "the execution did not stop after SIGQUIT" we did not log the corresponding runner id.	2023-08-21 22:40:37 +02:00
Maximilian Paß	13cd19ed58	Refactor Nomad Event Stream log message.	2023-08-18 09:28:23 +02:00
Maximilian Paß	73759f8a3c	Retry Environment Recovery	2023-08-18 09:28:23 +02:00
Maximilian Paß	eb818f92f7	Refactor Runner Destroy Reason Masking and ignore expected reasons, such as when the runner got destroyed by an API request.	2023-07-24 11:48:14 +01:00
Maximilian Paß	6a1677dea0	Introduce reason for destroying runner in order to return a specific error for OOM Killed Executions.	2023-07-21 15:30:21 +02:00
Maximilian Paß	bfb5977d24	Destroy runner on allocation stopped. Destroying the runner when Nomad informs us about its allocation being stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.	2023-07-21 15:30:21 +02:00
Maximilian Paß	e7df777db4	Always log Runner and Environment ID. Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.	2023-07-15 21:46:56 +02:00
Maximilian Paß	527aaf713f	Fix decreased prewarming pool due to inactivity timer. When allocations fail and restart they are added again to the idle runners. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.	2023-06-16 17:27:45 +01:00
Maximilian Paß	f031219cb8	Fix Nomad event race condition that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node. It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level. To fix it we added communication from the upper level that the stop of the job was expected.	2023-06-13 14:20:20 +02:00
Maximilian Paß	b620d0fad7	Introduce Allocation State Tracking in order to break down the current state and evaluate if it is invalid.	2023-06-13 14:20:20 +02:00
Maximilian Paß	8f89c14ea1	Cleanup logs for Allocation recovery on startup. The changes do not have functional consequences as adding the allocation just overwrites the old one.	2023-05-10 18:56:51 +01:00
Maximilian Paß	0c8fa9ccfa	Add context to log statements.	2023-04-11 20:45:30 +01:00
Maximilian Paß	038d71ff51	Nomad: Handle Container re-allocation	2023-03-31 14:42:55 +02:00
Maximilian Paß	e0db1bafe8	Fix multiple user Runner use. A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.	2023-03-31 14:42:55 +02:00
Maximilian Paß	a78ee22e67	Reduce time racetrack of delete and listFileSystem route.	2023-01-02 11:23:02 +01:00
Maximilian Paß	160df3d9e6	Add retry mechanism for sample, mark-as-used and return of Nomad runners.	2022-10-24 22:12:09 +01:00
Maximilian Paß	9677253b35	Change Influx field name for the startup duration due to a currently not resolvable type mismatch.	2022-08-10 20:46:17 +02:00
Maximilian Paß	89e15c5c2f	Fix startup time format. Before, it was a string. To use it efficiently we want it to be a number, in this case in nanoseconds.	2022-08-05 21:16:58 +02:00
Maximilian Paß	c6e65c14bb	Monitor Nomad allocation startup duration.	2022-07-31 19:42:35 +02:00
Maximilian Paß	34040162c2	#89 Generalise the three Storage interfaces and structs into one generic storage manager.	2022-06-29 16:21:19 +02:00
Maximilian Paß	b7a20e3114	Introduce method "Environment" to the Runners interface. This way we can relate to which environment a runner belongs.	2022-04-18 13:17:49 +02:00
Maximilian Paß	136f596dc2	Add AWS environments to the statistics, but only with the field usedRunners.	2022-04-09 16:35:53 +02:00
Maximilian Paß	6123d20525	Implement core functionality of AWS integration	2022-02-28 14:54:40 +01:00
Maximilian Paß	dd41e0d5c4	Generate structures for an AWS environment and runner	2022-02-28 14:54:40 +01:00
Maximilian Paß	0ef5a4e39f	Make Execution Environment interface Nomad independent	2022-02-28 14:54:40 +01:00
Maximilian Paß	ba43f667c2	Add architecture for multiple managers using the chain of responsibility pattern.	2022-02-28 14:54:40 +01:00

31 Commits