Commit Graph

182 Commits

Author SHA1 Message Date
a1366a9f76 Split README documentation into multiple files inside the docs folder 2021-07-29 15:03:41 +00:00
de6edeedcc Add docs on how to avoid Nomad rate limiting
Without this configuration, Nomad caps the maximum concurrent connections
of a unique client to 100. This is not sufficient for our use case.
2021-07-29 14:07:22 +00:00
f03d07cd54 Add LICENSE 2021-07-29 13:42:39 +00:00
c21b85b32a Add missing copyright header 2021-07-29 13:42:39 +00:00
c9d6cd5996 Move runner interactivity timer to own file
Previously, the interactivity timer was implemented in the same file
as the runner. This made the file long and the project structure more
complicated.
2021-07-29 13:30:46 +00:00
5c9f975285 Update api.tpl.nomad to allow configuring the Nomad ACL token for Poseidon 2021-07-29 12:49:17 +00:00
67ebdbd650 Add option to configure template job HCL file
Previously, the template job HCL file was hardcoded using go:embed
in the binary. However, this did not allow users running Poseidon
to change its content. Now, users can change the content of the
template job HCL file using the configuration option.
2021-07-29 11:54:36 +00:00
12da813081 Describe how Poseidon abstracts from Nomad 2021-07-29 11:32:52 +00:00
81eccbdf9c Remove custom deployment watcher script
As of version 1.1.2 of Nomad, the CLI monitors job deployments by
default until they are finished. Thus our custom job deployment
watcher script is not required anymore.
2021-07-29 09:57:04 +00:00
3564cf767e Update Nomad dependencies to 1.1.2 2021-07-29 09:57:04 +00:00
210a048b5e Update api.tpl.nomad to allow configuring TLS to Nomad through gitlab 2021-07-29 09:43:21 +00:00
01d16600b0 Document activating TLS between Poseidon and Nomad 2021-07-29 09:43:21 +00:00
6a60b6cd89 Add config option to enable (m)TLS between Poseidon and Nomad 2021-07-29 09:43:21 +00:00
e2d71a11ad Avoid concurrent writes to the websocket connection
Previously, the server sometimes crashed due to concurrent writes
to the websocket connection. Now, we ensure that only one function
writes to the websocket at a time by guarding the WriteMessage
function with a mutex.
2021-07-29 09:21:15 +00:00
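The fix described in e2d71a11ad can be sketched as follows. This is a minimal illustration, not Poseidon's actual code: the messageWriter interface, lockedWriter, and countingConn are hypothetical names, and the WriteMessage signature is only assumed to mirror the one offered by common Go WebSocket libraries.

```go
package main

import (
	"fmt"
	"sync"
)

// messageWriter is a hypothetical interface standing in for a WebSocket
// connection that must not be written to concurrently.
type messageWriter interface {
	WriteMessage(messageType int, data []byte) error
}

// lockedWriter serializes all writes to the wrapped connection.
type lockedWriter struct {
	mu   sync.Mutex
	conn messageWriter
}

// WriteMessage holds the mutex for the duration of the underlying write,
// so at most one goroutine writes at a time.
func (w *lockedWriter) WriteMessage(messageType int, data []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.conn.WriteMessage(messageType, data)
}

// countingConn is a stand-in connection. Its counter is deliberately
// unsynchronized and relies on the caller's lock.
type countingConn struct{ messages int }

func (c *countingConn) WriteMessage(_ int, _ []byte) error {
	c.messages++
	return nil
}

// writeConcurrently sends n messages from n goroutines through a
// lockedWriter and returns how many reached the connection.
func writeConcurrently(n int) int {
	conn := &countingConn{}
	w := &lockedWriter{conn: conn}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// 1 is the text frame opcode in RFC 6455.
			_ = w.WriteMessage(1, []byte("output"))
		}()
	}
	wg.Wait()
	return conn.messages
}

func main() {
	fmt.Println("messages written:", writeConcurrently(100))
}
```

Wrapping the connection keeps the locking in one place, so callers cannot forget to take the mutex before writing.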
6929169cb5 Add test for nullio.ReadWriter 2021-07-29 10:28:47 +02:00
8d24bda61a Send SIGQUIT when cancelling an execution
When the context passed to Nomad Allocation Exec is cancelled, the
process is not terminated. Instead, just the WebSocket connection is
closed. In order to terminate long-running processes, a special
character is injected into the standard input stream. This character is
parsed by the tty line discipline (tty has to be true). The line
discipline sends a SIGQUIT signal to the process, terminating it and
producing a core dump (in a file called 'core'). The SIGQUIT signal can
be caught but isn't by default, which is why the runner is destroyed if
the program does not terminate during a grace period after the signal
was sent.
2021-07-29 10:28:47 +02:00
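A rough sketch of the cancellation flow from 8d24bda61a is shown below. The commit does not name the injected character; the sketch assumes it is ASCII FS (0x1C, Ctrl+\), which is the default quit character of the Linux tty line discipline. The names requestQuit, waitOrDestroy, and injectedBytes are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"time"
)

// quitChar is ASCII FS (Ctrl+\). Assumption: this is the character meant by
// the commit; it is the default VQUIT character on Linux ttys.
const quitChar byte = 0x1C

// requestQuit writes the quit character to the stdin stream of the running
// execution. With a tty attached, the line discipline turns it into SIGQUIT.
func requestQuit(stdin io.Writer) error {
	_, err := stdin.Write([]byte{quitChar})
	return err
}

// waitOrDestroy waits for the process to exit within the grace period.
// If it does not, destroy is called (for example, to remove the runner).
// It reports whether the process exited in time.
func waitOrDestroy(exited <-chan struct{}, grace time.Duration, destroy func()) bool {
	select {
	case <-exited:
		return true
	case <-time.After(grace):
		destroy()
		return false
	}
}

// injectedBytes shows what requestQuit sends, using an in-memory buffer
// in place of the real stdin stream.
func injectedBytes() []byte {
	var buf bytes.Buffer
	_ = requestQuit(&buf)
	return buf.Bytes()
}

func main() {
	fmt.Printf("injected: %v\n", injectedBytes())

	exited := make(chan struct{})
	close(exited) // simulate a process that terminated right away
	ok := waitOrDestroy(exited, time.Second, func() { fmt.Println("destroying runner") })
	fmt.Println("terminated in time:", ok)
}
```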
91537a7364 Use test docker image in e2e tests
The TestCreateOrUpdateEnvironment function would previously use
the python:latest Docker image in its execution environment request.
However, this led to pull rate limiting by Docker Hub in our CI.
2021-07-27 15:26:53 +00:00
fe240c82b4 Remove demo job HCL file
Previously we used this file to deploy a job on Nomad that our API
used for e2e tests. Now that we create the environments in the e2e
tests, we don't need the demo job anymore.
2021-07-27 16:31:03 +02:00
f323bdf169 Add documentation on authenticating against Nomad 2021-07-27 11:35:55 +00:00
3aa1227db6 Use authentication token from config for communication with Nomad 2021-07-27 11:35:55 +00:00
23b726cef9 Correct behavior when WebSocket closes. 2021-07-26 06:47:55 +00:00
909f347d2f Remove tests dependency from nullreader test
Previously we had a dependency on the tests package. As the
nullreader package is in the pkg directory, it should be publicly
available. However, having the tests dependency could lead to a
transitive dependency on an internal package if the tests package
imported one. Thus, we removed it.
2021-07-21 12:55:35 +02:00
8b26ecbe5f Restructure project
We previously didn't really have any structure in our project apart
from creating a new folder for each package in our project root.
Now that we have accumulated some packages, we use the well-known
Golang project layout in order to clearly communicate our intent
with packages. See https://github.com/golang-standards/project-layout
2021-07-21 12:55:35 +02:00
2f1383b743 Add tests for returning mapped ports of runners 2021-07-21 08:22:10 +02:00
64764a9809 Return mapped ports when requesting runners
We now store the mapped ports returned by Nomad locally in our runner
struct and return them when requesting the runner. In most Nomad
setups, the returned IP address is not reachable by external users.
2021-07-20 23:22:58 +02:00
d7c1787b57 Disable allow-failure for linting pipeline
Now that all linting issues are fixed, we disable allow-failure for
the linting step to ensure that later commits adhere to the linter.
2021-07-13 08:59:25 +02:00
c7606f3d5f Fix a lot of linting issues
Since we introduced the linter, we haven't really touched the old code.
This commit fixes all linting issues that currently exist.
2021-07-13 08:59:25 +02:00
bd7fb53385 Fix bug that the count of the default task group is set to the prewarming pool size 2021-07-07 09:21:57 +02:00
68eacae7fe Fix bug that config task group is not added to the template job (and the faulty tests) 2021-07-06 10:09:36 +02:00
bbc1ce12ca Delete idle runners when the environment is scaled down 2021-07-02 13:00:13 +02:00
66d04fde2a Remove unused function ScaleAllEnvironments 2021-07-01 09:21:09 +00:00
50a2a22b74 Only create exactly one new runner when one runner is claimed
Previously we would create as many runners as needed based on the
local idleRunnersCount and the desiredIdleRunnersCount. This is
problematic if two runners are claimed shortly after one another.
As we only add a runner to the idleRunners list once we get the
event from Nomad, the second runner claim in a short timeframe
would create two new runners. This has been fixed now.
2021-06-29 09:11:21 +02:00
e0e254a6af Persist runner timeout in metadata
To be able to restore the runner timeouts even after a Poseidon restart,
the timeout is stored in the Nomad metadata. The timeout starts over
after a restart, but at least the runner is eventually returned.
2021-06-23 11:07:17 +02:00
ae08e37106 Add end to end test for inactivity timeout 2021-06-23 11:04:19 +02:00
6c887de6f1 Move NullReader from nomad to util package. 2021-06-23 11:04:19 +02:00
14f8a096eb Add unit and integration tests for runner inactivity timeout. 2021-06-23 11:04:19 +02:00
4b2cae0bd1 Add inactivity timeout for runners.
By removing runners after a specified timeout they no longer stay
around indefinitely and block Nomad's capacity. The timeout can be set
individually per runner when requesting the provide route. If it is set
to 0, the runner is never removed automatically.

The timeout is reset when activity is detected. Currently that is when
something gets executed or the filesystem gets modified.
2021-06-23 11:04:18 +02:00
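The behavior described in 4b2cae0bd1 could look roughly like the sketch below. It is not taken from Poseidon: inactivityTimer, newInactivityTimer, reset, and firesWithin are hypothetical names, built only on the standard library's time.AfterFunc.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// inactivityTimer calls onExpire once no activity has been reported for
// the configured duration. A duration of 0 disables the timer.
type inactivityTimer struct {
	mu       sync.Mutex
	timer    *time.Timer
	duration time.Duration
}

func newInactivityTimer(d time.Duration, onExpire func()) *inactivityTimer {
	t := &inactivityTimer{duration: d}
	if d > 0 {
		t.timer = time.AfterFunc(d, onExpire)
	}
	return t
}

// reset restarts the countdown. It is meant to be called whenever activity
// is detected, such as an execution or a filesystem change.
func (t *inactivityTimer) reset() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.timer == nil {
		return // timeout of 0: the runner is never removed automatically
	}
	// Stop reports false if the timer already fired; in that case the
	// runner is gone and the timer must not be revived.
	if t.timer.Stop() {
		t.timer.Reset(t.duration)
	}
}

// firesWithin reports whether a timer with the given timeout expires
// before the wait period is over.
func firesWithin(timeout, wait time.Duration) bool {
	fired := make(chan struct{})
	newInactivityTimer(timeout, func() { close(fired) })
	select {
	case <-fired:
		return true
	case <-time.After(wait):
		return false
	}
}

func main() {
	fmt.Println("10ms timeout fires:", firesWithin(10*time.Millisecond, time.Second))
	fmt.Println("0 timeout fires:", firesWithin(0, 50*time.Millisecond))

	t := newInactivityTimer(time.Hour, func() { fmt.Println("runner removed") })
	t.reset() // activity detected, countdown starts over
}
```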
c7ed54942d Move ChannelReceivesSomething to tests package.
ChannelReceivesSomething (formerly WaitForChannel) originally was
located in the helpers package.
This move was done to remove a cyclic dependency with the nomad package.
2021-06-21 10:54:07 +02:00
92f1af83ae Add tests for codeOceanToRaw and null readers
The tests ensure the readers do not return when there is no data
available.
2021-06-21 08:20:04 +00:00
17c1e379c2 Fix busy waiting on stdin
When running an execution, Nomad continuously reads from the stdin
reader. Because the readers we implemented (codeOceanToRawReader and
nullReader) return zero if there is no input available, this leads to
busy waiting and a high CPU load on Poseidon. By waiting indefinitely in
case of the nullReader and for at least one byte in case of the
codeOceanToRawReader before returning, we prevent this issue.
2021-06-21 08:20:04 +00:00
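The idea behind 17c1e379c2 can be illustrated with two blocking readers. This is a sketch under assumptions rather than the actual implementation: channelReader is a hypothetical stand-in for codeOceanToRawReader, and both readers use a context so that a blocked Read can be released, which the commit does not mention.

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// nullReader never yields data. Read blocks until the context is cancelled
// instead of returning zero bytes immediately, which would make the caller spin.
type nullReader struct{ ctx context.Context }

func (r nullReader) Read(_ []byte) (int, error) {
	<-r.ctx.Done()
	return 0, io.EOF
}

// channelReader blocks until at least one byte is available, then returns
// whatever else is already buffered without blocking again.
type channelReader struct {
	ctx  context.Context
	data <-chan byte
}

func (r channelReader) Read(p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	// Block for the first byte.
	select {
	case b := <-r.data:
		p[0] = b
	case <-r.ctx.Done():
		return 0, io.EOF
	}
	// Drain further bytes only while they are immediately available.
	n := 1
	for n < len(p) {
		select {
		case b := <-r.data:
			p[n] = b
			n++
		default:
			return n, nil
		}
	}
	return n, nil
}

// readBuffered queues a non-empty input and performs a single Read
// with a 16-byte buffer.
func readBuffered(input string) string {
	ch := make(chan byte, len(input))
	for i := 0; i < len(input); i++ {
		ch <- input[i]
	}
	r := channelReader{ctx: context.Background(), data: ch}
	buf := make([]byte, 16)
	n, _ := r.Read(buf)
	return string(buf[:n])
}

func main() {
	fmt.Printf("%q\n", readBuffered("ls\n"))

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // without this, the Read below would block indefinitely
	n, err := nullReader{ctx: ctx}.Read(make([]byte, 8))
	fmt.Println(n, err)
}
```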
0b9e5a5ba5 Update README
* Update port to 7200
* Update linter instructions
* Update Docker instructions
2021-06-18 07:31:24 +00:00
f5f7521a18 Fix environment recovery
As the environment is no longer stored in the meta information,
Poseidon wasn't able to recover environments. It expected the
environment id to be found in the meta data. We now recover
the environment id from the job id.
2021-06-18 08:39:54 +02:00
2e4a975588 Implement even more merge request comments 2021-06-15 12:05:51 +02:00
ff582805b4 Move Nomad job creation to Nomad package
Previously, low level Nomad job creation was done in the environment manager.
It used many functions of the nomad package, so we felt this logic
belongs in the nomad package.
2021-06-15 11:38:02 +02:00
87f823756b Implement merge request comments 2021-06-15 11:37:47 +02:00
25d78df557 Restore existing jobs and fix rebase (7c99eff3) issues 2021-06-15 11:37:35 +02:00
0020590c96 Update all runners when updating environment
Previously only the default job would be updated to the newest specs.
Now all Nomad jobs that belong to the given environment are updated
accordingly.
2021-06-15 11:35:59 +02:00
c7d59810e5 Use Nomad jobs as runners instead of allocations
As we can't control which allocations are destroyed when downscaling a job, we decided
to use Nomad jobs as our runners. Thus for each runner we prewarm for an environment,
a corresponding job is created in Nomad. We create a default job that serves as a template
for the runners. Using this, already existing execution environments can easily be restored,
once Poseidon is restarted.
2021-06-15 11:35:54 +02:00
8de489929e Remove stderr fifo after interactive execution with stderr finished
Previously the stderr fifo would not be removed, leaving unwanted
artifacts from the execution behind. We now remove the stderr fifo
after the command finished.
2021-06-14 15:04:09 +02:00
d3300e839e Add unit tests for separate stdout and stderr on execution 2021-06-11 08:47:25 +00:00