Previously, the template job HCL file was hardcoded using go:embed
in the binary. However, this did not allow users running Poseidon
to change its content. Now, users can change the content of the
template job HCL file using the configuration option.
As of version 1.1.2 of Nomad, the CLI monitors job deployments by
default until they are finished. Thus our custom job deployment
watcher script is not required anymore.
Previously, the server sometimes crashed due to concurrent writes
to the websocket connection. Now, we ensure that only one concurrent
function writes to the websocket at a time by enclosing the WriteMessage
function with a mutex.
When the context passed to Nomad Allocation Exec is cancelled, the
process is not terminated. Instead, just the WebSocket connection is
closed. In order to terminate long-running processes, a special
character is injected into the standard input stream. This character is
parsed by the tty line discipline (tty has to be true). The line
discipline sends a SIGQUIT signal to the process, terminating it and
producing a core dump (in a file called 'core'). The SIGQUIT signal can
be caught but isn't by default, which is why the runner is destroyed if
the program does not terminate during a grace period after the signal
was sent.
The TestCreateOrUpdateEnvironment function would previously use
the python:latest Docker image in its execution environment request.
However, this lead to pull rate limiting by Docker Hub in our CI.
Previously we used this file to deploy a job on Nomad that our API
used for e2e tests. Now that we create the environments in the e2e
tests, we don't need the demo job anymore.
Previously we had a dependency to the tests package. As the
nullreader package is in the pkg directory it should be publicly
available. However, having the tests dependency could lead to a
transitive dependency to an internal package, if the tests package
would import one. Thus, we removed it.
We previously didn't really had any structure in our project apart
from creating a new folder for each package in our project root.
Now that we have accumulated some packages, we use the well-known
Golang project layout in order to clearly communicate our intent
with packages. See https://github.com/golang-standards/project-layout
We now store the mapped ports returned by Nomad locally in our runner
struct and return them when requesting the runner. The returned ip
address is in most Nomad setups not reachable from external users.
Previously we would create as much runners as needed based on the
local idleRunnersCount and the desiredIdleRunnersCount. This is
problematic if two runners are claimed shortly after one another.
As we only add a runner to the idleRunners list once we get the
event from Nomad, the second runner claim in a short timeframe
would create two new runners. This has been fixed now.
To be able to restore the runner timeouts even after a Poseidon restart,
the timeout is stored in the Nomad metadata. The timeout will restart,
but at least the runner will be returned at all.
By removing runners after a specified timeout they no longer stay
around indefinitely and block Nomads capacities. The timeout can be set
individually per runner when requesting the provide route. If it is set
to 0, the runner is never removed automatically.
The timeout is reset when activity is detected. Currently that is when
something gets executed or the filesystem gets modified.
ChannelReceivesSomething (formerly WaitForChannel) originally was
located in the helpers package.
This move was done to remove a cyclic dependency with the nomand package.
When running an execution, Nomad continuously reads from the stdin
reader. Because the readers we implemented (codeOceanToRawReader and
nullReader) return zero if there is no input available, this leads to
busy waiting and a high CPU load on Poseidon. By waiting indefinitely in
case of the nullReader and for at least one byte on case of the
codeOceanToRawReader before returning, we prevent this issue.
As the environment is no longer stored in the meta information,
Poseidon wasn't able to recover environments. It expected the
environment id to be found in the meta data. We now recover
the environment id from the job id.
Previously, low level Nomad job creation was done in the environment manager.
It used many functions of the nomad package so we felt like this logic
better belongs to the nomad package.
As we can't control which allocations are destroyed when downscaling a job, we decided
to use Nomad jobs as our runners. Thus for each runner we prewarm for an environment,
a corresponding job is created in Nomad. We create a default job that serves as a template
for the runners. Using this, already existing execution environments can easily be restored,
once Poseidon is restarted.
Previously the stderr fifo would not be removed, leaving unwanted
artifacts from the execution behind. We now remove the stderr fifo
after the command finished.
When running a command interactively, we previously would get stdout
and stderr both served on stdout by Nomad. To circumvent this issue,
we now start a separate execution inside the allocation to split
both streams.