that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments, but not removing it in the case that we only copy its configuration to the existing environment.
into the utils package, which includes all other retry mechanisms.
With this change, we fix that the WatchEventStream goroutine did not stop immediately when the context was done, but only one second later.
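A minimal sketch of the pattern (function and parameter names are assumed, not our actual implementation): the retry pause now races against the context, so the goroutine returns immediately once the context is done.

```go
package nomad

import (
	"context"
	"time"
)

// watchEventStreamLoop sketches the fix: the one-second retry pause is a select
// against ctx.Done(), so cancellation is observed immediately instead of only
// after the pause has elapsed.
func watchEventStreamLoop(ctx context.Context, watch func(context.Context) error) {
	for ctx.Err() == nil {
		_ = watch(ctx) // on error, reconnect after the pause below

		select {
		case <-ctx.Done():
			return // previously: time.Sleep(time.Second) delayed this return
		case <-time.After(time.Second):
		}
	}
}
```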
that ensures that `onAllocationStopped` returns true when the runner was already deleted by the inactivity timer.
This feature is required for handling a race condition with the event handling of a rescheduled allocation.
In today's unattended upgrade, we saw the prewarming pool size drop to (near) zero. This was caused by lost Nomad allocations: the allocations got rescheduled, but were not added to Poseidon again.
The reason for this is a miscommunication between the Event Handling and the Nomad Manager: `removedByPoseidon` was true even when the runner was not removed by the manager but was an idle runner.
Only the Sentry hook uses the values of the passed context. Therefore, when we shifted the values from an extra `WithField` call to the context, we effectively removed them from our log statements.
We fix this behavior by introducing a Logrus Hook that copies a fixed set of context values to the logging data.
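A minimal sketch of such a hook (the key list and its type are illustrative, not our actual implementation):

```go
package logging

import (
	"fmt"

	"github.com/sirupsen/logrus"
)

// ContextHook copies the values of a fixed set of context keys into the entry data,
// so they show up in every log output and not only in the Sentry hook.
type ContextHook struct {
	Keys []interface{} // context keys whose values should be logged
}

func (hook *ContextHook) Levels() []logrus.Level {
	return logrus.AllLevels
}

func (hook *ContextHook) Fire(entry *logrus.Entry) error {
	if entry.Context == nil {
		return nil
	}
	for _, key := range hook.Keys {
		if value := entry.Context.Value(key); value != nil {
			entry.Data[fmt.Sprint(key)] = value
		}
	}
	return nil
}
```

The hook is registered once via `logrus.AddHook`, and every `log.WithContext(ctx)` statement then carries the selected values automatically.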
When "markRunnerAsUsed" fails, we silently ignored it. Only, when additionally the return of the runner failed, we threw the error.
When a Runner is destroyed, we are only notified that Nomad removed the allocation, but cannot tell the reason.
For "the execution did not stop after SIGQUIT" we did not log the belonging runner id.
as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started retrying the Nomad environment recovery. This always fails for unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment has been deleted.
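A sketch of the idea under assumed names (the actual helper in the utils package may differ): the loop stops either after the maximum number of attempts or as soon as the passed context is done, which also covers the case that the environment of a to-be-created runner gets deleted.

```go
package util

import (
	"context"
	"time"
)

// RetryWithContext retries attempt until it succeeds, the maximum number of attempts
// is reached, or ctx is done. Canceling ctx (e.g. because the environment was deleted
// or a unit test has no Nomad cluster) stops the retrying immediately.
func RetryWithContext(ctx context.Context, maxAttempts int, interval time.Duration, attempt func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = attempt(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return err
}
```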
With Go 1.21, the WithoutCancel context was introduced. It allows us to keep the values of a parent context in a new context without the new context being canceled together with its parent. This suits two occurrences where we previously had to explicitly copy one required value instead of implicitly keeping all values.
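A small, self-contained illustration of `context.WithoutCancel` (the key is only an example, not one of our actual context keys):

```go
package main

import (
	"context"
	"fmt"
)

type contextKey string

const runnerIDKey contextKey = "runner_id" // example key

func main() {
	parent, cancel := context.WithCancel(context.Background())
	parent = context.WithValue(parent, runnerIDKey, "example-runner")

	// The detached context keeps all values of parent but ignores its cancellation.
	detached := context.WithoutCancel(parent)
	cancel()

	fmt.Println(parent.Err())                // context canceled
	fmt.Println(detached.Err())              // <nil>
	fmt.Println(detached.Value(runnerIDKey)) // example-runner
}
```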
due to the Nomad request exiting before the allocation is stopped. We catch this behavior by waiting a short period for the allocation to be stopped, if and only if the exit code is 128.
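A hedged sketch of this check (the function name, the channel, and the grace period are assumptions):

```go
package nomad

import "time"

// exitedBecauseAllocationStopped returns true when an exit code of 128 coincides with
// the allocation being stopped within a short grace period; in that case the exit code
// stems from the Nomad request exiting early and is not a real execution result.
func exitedBecauseAllocationStopped(exitCode int, allocationStopped <-chan struct{}) bool {
	const gracePeriod = 100 * time.Millisecond // assumed duration

	if exitCode != 128 {
		return false
	}
	select {
	case <-allocationStopped:
		return true
	case <-time.After(gracePeriod):
		return false
	}
}
```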
Before, Nomad executions often got stopped because the runner was deleted.
With the previous commit, we cover the exception to this behavior by stopping the execution on the Poseidon side.
These different approaches lead to different context error messages.
In this commit, we move the check of the passed timeout so that we respond with the corresponding client message again.
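A sketch of how the two cases can be told apart (names assumed): instead of relying on the exact wording of the context error, the code checks the context error explicitly and sends the timeout client message only when the passed timeout was the cause.

```go
package runner

import (
	"context"
	"errors"
)

// notifyOnTimeout sends the timeout client message only if the execution context
// ended because its deadline (the passed timeout) was exceeded, not because the
// runner or allocation was stopped for another reason.
func notifyOnTimeout(ctx context.Context, sendTimeoutMessage func()) {
	switch {
	case errors.Is(ctx.Err(), context.DeadlineExceeded):
		sendTimeoutMessage() // the execution ran into the passed timeout
	case errors.Is(ctx.Err(), context.Canceled):
		// stopped deliberately (e.g. the allocation was destroyed); no timeout message
	}
}
```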
Destroying the runner when Nomad informs us that its allocation has been stopped fixes the error of executions running into their timeout even though the allocation was stopped long ago.
When allocations fail and restart, they are added to the idle runners again. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.
that was triggered by the simultaneous deletion of the runner due to inactivity and the rescheduling of its allocation due to a lost node.
It led to the allocation first being rescheduled and then being stopped. This caused an unexpected stop of a pending runner at a lower level.
To fix it, we added communication from the upper level indicating that the stop of the job was expected.
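One way to model this communication (a sketch with assumed names, not our actual structure): the upper level marks the job as expected to stop right before stopping it, and the event handling consumes that mark when Nomad reports the stopped allocation.

```go
package nomad

import "sync"

// expectedStops records jobs whose stop was requested on purpose by the upper level.
type expectedStops struct {
	mu   sync.Mutex
	jobs map[string]bool
}

// Expect is called right before the upper level stops the job.
func (e *expectedStops) Expect(jobID string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.jobs == nil {
		e.jobs = make(map[string]bool)
	}
	e.jobs[jobID] = true
}

// WasExpected is called by the event handling for a stopped allocation. It returns
// true exactly once per expected stop, so a later unexpected stop is still detected.
func (e *expectedStops) WasExpected(jobID string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	expected := e.jobs[jobID]
	delete(e.jobs, jobID)
	return expected
}
```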
In the context of #358, we identified that the event of type `AllocationUpdated` with client status `pending` is common but not always sent by Nomad.
With this commit, we remove the condition that limits the evaluated Nomad events to those of type `AllocationUpdated`. Without the condition, events of type `PlanResult` with status `pending` are evaluated equally. By now, this event seems to be sent every time.
This restriction led to started allocations not being registered when the `AllocationUpdated` event with client status `pending` was missing.
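A simplified sketch of the evaluation (the callback and the commented-out string comparison are assumptions; the Nomad API types are real): with the type condition removed, a `PlanResult` event carrying a pending allocation is handled the same way as an `AllocationUpdated` event.

```go
package nomad

import (
	nomadApi "github.com/hashicorp/nomad/api"
)

// handlePendingAllocationEvent registers a pending allocation regardless of whether
// Nomad delivered it via an AllocationUpdated or a PlanResult event.
func handlePendingAllocationEvent(event *nomadApi.Event, onPending func(*nomadApi.Allocation)) error {
	// Previously: if event.Type != "AllocationUpdated" { return nil }
	alloc, err := event.Allocation()
	if err != nil || alloc == nil {
		return err
	}
	if alloc.ClientStatus == nomadApi.AllocClientStatusPending {
		onPending(alloc)
	}
	return nil
}
```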