to limit the number of reschedules as we cannot guarantee an error-free job definition.
We also set reasonable delays and intervals so that a single deployment is likely to have enough time to complete while no second deployment starts within the same interval.
as we identified two issues where the goroutine count before the test differs from the count after it.
1) In rare cases, a goroutine that seems to belong to the Go runtime appeared before the test. To avoid this, we introduced a short delay before looking up the goroutines.
Another solution might be to do the lookup twice and check whether the counts match.
2) In rare cases, a goroutine that periodically monitors some storage was closed unexpectedly. As we could not identify the cause, we removed the leaking goroutines by cleaning up properly.
for internal decisions as this error is widely used by other packages. Checking such wrapped errors can accidentally influence the internal decision.
In this case, the retry mechanism checked whether the error is context.DeadlineExceeded and assumed that it originated from the internal context. This assumption was wrong.
from an approach that loaded the runners only once at the startup
to a method that is repeated, e.g. whenever the Nomad Event Stream connection is interrupted.
Previously, the List function dropped all idleRunners of all environments when fetch was set.
Additionally, the replaced environment was not destroyed properly, so the goroutines for it and for all of its idle runners kept running.
Normally, the result of executing the `lsCommand` should never be empty. However, we have observed that CodeOcean sometimes receives an empty JSON result if the runner is deleted while the list file system request is being processed. Therefore, we check whether anything has been written to CodeOcean and report an error otherwise.
that are rescheduled while the previous allocation was still pending.
We fix this by removing the race condition handling that was meant to prevent Poseidon from logging warnings about allocations stopping unexpectedly.
that was triggered when [the execution timeout got exceeded, the runner got destroyed, or the WebSocket connection to CodeOcean closed] and the Allocation did not react to the SIGQUIT within the grace period.
that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it when we only copy its configuration to the existing environment.
into the utils including all other retry mechanisms.
With this change, the WatchEventStream goroutine stops immediately when the context is done; previously, it stopped only one second later.
that ensures that `onAllocationStopped` returns true when the runner has already been deleted by the inactivity timer.
This feature is required for handling a race condition with the event handling of a rescheduled allocation.