Commit Graph

114 Commits

Author SHA1 Message Date
cbcd5f233e Fix memory leak of an idle runner
when its allocation is restarted by Nomad.

Fix logic created in 354c16cc.
2024-06-06 09:46:49 +02:00
ec3b2a93db Fix Golangci-lint configuration 2024-05-07 14:48:17 +02:00
ab938bfc22 Refactor MemoryLeakTestSuite
as we identified two issues where the goroutine count before the test differs from the count after it.

1) It seemed like a Go runtime specific Goroutine appeared in rare cases before the test. To avoid this, we introduced a short timeout before looking up the Goroutines.
Another solution might be to do the lookup twice and check if the count matches.

2) A Goroutine that periodically monitors some storage unexpectedly got closed in rare cases. As we could not identify the cause for this, we removed the leaking Goroutines by properly cleaning up.
2024-02-28 11:52:51 +01:00
eaa022282c Block Webserver during first Nomad recovery.
No requests are accepted while Poseidon is recovering Nomad environments and runners.
2024-01-15 16:05:35 +00:00
7a446fee26 Fix flaky TestUpdateRunnersLogsErrorFromWatchAllocation. 2023-11-16 12:10:57 +01:00
3ce014f5f3 Fix flaky TestSendsSignalAfterTimeout. 2023-11-16 12:10:57 +01:00
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
0f7e98f78e Refactor PrewarmingPoolAlert triggering
from route-based to Nomad-Event-Stream-based.
2023-11-09 13:11:39 +01:00
7b82300ff7 Refactor PrewarmingPoolAlert triggering
from route-based to Nomad-Event-Stream-based.
2023-11-09 13:11:39 +01:00
543939e5cb Add independent environment reload
in the case that the prewarming pool is depleting (see PrewarmingPoolThreshold) and is still depleting after a timeout (PrewarmingPoolReloadTimeout).
2023-11-09 13:11:39 +01:00
f259d65aa4 Add unit tests for runner recovery. 2023-10-31 15:49:56 +01:00
160a097d07 Fix flaky test TestDestroysRunnerAfterTimeoutAndSignal. 2023-10-31 15:49:56 +01:00
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions, as this error is widely used by other packages. Checking for such wrapped errors can accidentally influence the internal decision.
In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed that it had been created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at startup
to a method that is repeated, i.e., whenever the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
2713e8672c Add error for empty list file system execution.
Normally, the result of executing the `lsCommand` should never be empty. However, we have observed that CodeOcean sometimes receives an empty JSON result if the runner is being deleted while the list file system request is processed. Therefore, we check whether anything has been written to CodeOcean and report an error otherwise.
2023-10-29 15:23:40 +01:00
3d252492fe Fix rescheduled used runners being removed.
As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.
2023-09-18 01:06:35 +02:00
6dc83ca7b5 Add regression test for rescheduled used runners being removed.
As they are already rescheduled and therefore recreated they do not need to be removed, but can be handled as a new runner.
2023-09-18 01:06:35 +02:00
6159f2a045 Fix Goroutine Leak of Nomad execute command
that was triggered when [the execution timeout got exceeded, the runner got destroyed, or the WebSocket connection to CodeOcean closed] and the Allocation did not react to the SIGQUIT within the grace period.
2023-09-11 13:44:29 +02:00
59da36303c Fix Goroutine Leak of Environment Get
that was caused by creating an intermediate environment `fetchedEnvironment` when fetching the environments but not removing it in case that we just copy its configuration to the existing environment.
2023-09-11 13:44:29 +02:00
460b8b2065 Refactor TestReturnReturnsErrorWhenApiCallFailed
to handle the retry mechanism.
2023-09-11 13:44:29 +02:00
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

This change also fixes the WatchEventStream goroutine not stopping immediately when the context is done (previously, it stopped only one second later).
2023-09-11 13:44:29 +02:00
0d6b4f660c Refactor NewAbstractManager
to require a context used for the monitoring.
2023-09-11 13:44:29 +02:00
b708dddd23 Add Nomad Manager test case
that ensures that `onAllocationStopped` returns true when the runner was previously deleted by the inactivity timer.
This feature is required for handling a race condition with the event handling of a rescheduled allocation.
2023-09-05 15:15:39 +02:00
354c16cc37 Fix missing rescheduled idle runners.
In today's unattended upgrade, we have seen how the prewarming pool size dropped to (near) zero. This was based on lost Nomad allocations. The allocations got rescheduled, but not added again to Poseidon.

The reason for this is a miscommunication between the Event Handling and the Nomad Manager. `removedByPoseidon` was true even if the runner was an idle runner that had not been removed by the manager.
2023-09-05 15:15:39 +02:00
67297ec5a2 Add regression test for rescheduled idle runner. 2023-09-05 15:15:39 +02:00
c0a3fb12c3 Fix UpdateFileSystem Context
to be done when either the runner is destroyed (case ignored before) or the request is interrupted.
2023-08-21 22:49:09 +02:00
306512bf9c Fix Context Values are not logged.
Only the Sentry hook uses the values of the passed context. Therefore, the values disappeared from our log statements when we moved them from an extra `WithField` call to the context.
We fix this behavior by introducing a Logrus Hook that copies a fixed set of context values to the logging data.
2023-08-21 22:40:37 +02:00
a7d27e8f65 Add missing error log statements.
When "markRunnerAsUsed" failed, we silently ignored it. Only when the return of the runner additionally failed did we report the error.

When a Runner is destroyed, we are only notified that Nomad removed the allocation, but cannot tell the reason.

For "the execution did not stop after SIGQUIT", we did not log the corresponding runner id.
2023-08-21 22:40:37 +02:00
13cd19ed58 Refactor Nomad Event Stream log message. 2023-08-18 09:28:23 +02:00
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
eb818f92f7 Refactor Runner Destroy Reason Masking
and ignore expected reasons, such as when the runner got destroyed by an API request.
2023-07-24 11:48:14 +01:00
8ef5f4e7c5 Fix OOM Kill race condition
due to the Nomad request exiting before the allocation is stopped. We handle this by waiting for a limited time period for the allocation to be stopped iff the exit code is 128.
2023-07-21 15:30:21 +02:00
6a1677dea0 Introduce reason for destroying runner
in order to return a specific error for OOM Killed Executions.
2023-07-21 15:30:21 +02:00
b3fedf274c Handle Runner Timeout
Before, Nomad executions often got stopped because the runner was deleted.
With the previous commit, we cover the exception to this behaviour by stopping the execution Poseidon-side.
These different approaches lead to different context error messages.
In this commit, we move the check of the passed timeout to respond with the corresponding client message again.
2023-07-21 15:30:21 +02:00
bfb5977d24 Destroy runner on allocation stopped
Destroying the runner when Nomad informs us about its allocation being stopped fixes the error of executions running into their timeout even if the allocation was stopped long ago.
2023-07-21 15:30:21 +02:00
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
527aaf713f Fix decreased prewarming pool due to inactivity timer.
When allocations fail and restart they are added again to the idle runners. The bug fixed with this commit is that the inactivity timer was not stopped at the restart. This led to the idle runner being removed when the timer expired.
2023-06-16 17:27:45 +01:00
f031219cb8 Fix Nomad event race condition
that was triggered by simultaneous deletion of the runner due to inactivity, and the allocation being rescheduled due to a lost node.
It led to the allocation first being rescheduled, and then being stopped. This caused an unexpected stopping of a pending runner on a lower level.
To fix it we added communication from the upper level that the stop of the job was expected.
2023-06-13 14:20:20 +02:00
b620d0fad7 Introduce Allocation State Tracking
in order to break down the current state and evaluate if it is invalid.
2023-06-13 14:20:20 +02:00
8f89c14ea1 Cleanup logs for Allocation recovery
on startup. The changes do not have functional consequences as adding the allocation just overwrites the old one.
2023-05-10 18:56:51 +01:00
0c8fa9ccfa Add context to log statements. 2023-04-11 20:45:30 +01:00
038d71ff51 Nomad: Handle Container re-allocation 2023-03-31 14:42:55 +02:00
e0db1bafe8 Fix Runner being used by multiple users
A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.
2023-03-31 14:42:55 +02:00
7dadc5dfe9 Refactor Nomad Command Generation.
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
2023-03-14 23:42:19 +01:00
4550a4589e Dangerous Context Enrichment
by passing the Sentry Context down our abstraction stack.
This included changes to the complex context management of a Command Execution.
2023-02-03 10:29:18 +00:00
a9581ac1d9 Performance for ListFileSystem 2023-02-03 10:29:18 +00:00
f2c205a8ed Add additional performance spans 2023-02-03 10:29:18 +00:00
a78ee22e67 Reduce the time window for the race between the delete and listFileSystem routes. 2023-01-02 11:23:02 +01:00
0c6c48c3cf #190 Add unit tests for runner recovery. 2022-11-26 13:33:44 +00:00