In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was based on lost Nomad allocations. The allocations got rescheduled, but not added again to Poseidon.
The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was not removed by the manager, but an idle runner.
that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node.
It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level.
To fix it we added communication from the upper level that the stop of the job was expected.
In the context of #358 we identified that the event with the type `AllocationUpdated` and the client status `pending` is common but not always send by Nomad.
With this Commit we remove the condition that limits the evaluated Nomad events to the event with the type `AllocationUpdated`. Without the condition the event of the type `PlanResult` and the status `pending` will be evaluated equally. By now, this event seems to be sent every time.
This restriction led to started allocation not being registered when the `AllocationUpdated` event with client status `pending` was missing.
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
* Close evaluation stream for Nomad Job creation
when set event handler have been finished
* Remove evaluation event stream requests
by handling the events via the main Nomad event handler.
* #9 Implement routes to list, get and delete execution environments.
A refactoring was required to introduce the ExecutionEnvironment interface.
* Fix MR comments, linting issues and bug that lead to e2e test failure
* Add e2e tests
* Add unit tests
When the context passed to Nomad Allocation Exec is cancelled, the
process is not terminated. Instead, just the WebSocket connection is
closed. In order to terminate long-running processes, a special
character is injected into the standard input stream. This character is
parsed by the tty line discipline (tty has to be true). The line
discipline sends a SIGQUIT signal to the process, terminating it and
producing a core dump (in a file called 'core'). The SIGQUIT signal can
be caught but isn't by default, which is why the runner is destroyed if
the program does not terminate during a grace period after the signal
was sent.
We previously didn't really had any structure in our project apart
from creating a new folder for each package in our project root.
Now that we have accumulated some packages, we use the well-known
Golang project layout in order to clearly communicate our intent
with packages. See https://github.com/golang-standards/project-layout