89 Commits

Author SHA1 Message Date
12ff205bd2 added k8s stub adapter for execution environment 2024-09-18 10:43:38 +02:00
895dd8879f Revert "Debug HTTPLoggingMiddleware latency."
This reverts commit ae86b1c261.
2024-02-06 19:34:45 +00:00
08c3a3d53d Decouple InfluxDB writings from request handling.
With #451, we found that writing an InfluxDB data point might block and lead to high latencies.
2024-01-28 10:57:01 +01:00
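One way to decouple metric writes from request handling is sketched below. It assumes a buffered channel drained by a single background goroutine; `point`, `asyncWriter`, and the `sink` callback are hypothetical names for illustration and are not taken from the commit.

```go
package main

import (
	"fmt"
	"sync"
)

// point is a hypothetical stand-in for an InfluxDB data point.
type point struct{ name string }

// asyncWriter hands points to a background goroutine so that callers
// never wait on the (possibly blocking) sink.
type asyncWriter struct {
	queue chan point
	wg    sync.WaitGroup
}

// newAsyncWriter starts one goroutine that drains the queue into sink.
func newAsyncWriter(size int, sink func(point)) *asyncWriter {
	w := &asyncWriter{queue: make(chan point, size)}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for p := range w.queue {
			sink(p) // may block; only this goroutine waits
		}
	}()
	return w
}

// write enqueues without blocking and reports false if the buffer is full.
func (w *asyncWriter) write(p point) bool {
	select {
	case w.queue <- p:
		return true
	default:
		return false
	}
}

// close stops accepting points and waits until the queue is drained.
func (w *asyncWriter) close() {
	close(w.queue)
	w.wg.Wait()
}

// demo writes three points and returns how many reached the sink.
func demo() int {
	written := 0
	w := newAsyncWriter(8, func(point) { written++ })
	for i := 0; i < 3; i++ {
		w.write(point{name: fmt.Sprintf("request-%d", i)})
	}
	w.close()
	return written
}

func main() {
	fmt.Println("points written:", demo())
}
```

The trade-off in this sketch is that a full buffer drops the point instead of delaying the request.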
ae86b1c261 Debug HTTPLoggingMiddleware latency. 2024-01-26 22:51:55 +01:00
57590457a8 Add logging filter token
The token is used to filter out request logs when the user agent matches a randomly generated string.
2024-01-24 17:21:00 +01:00
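A filter of this kind could look roughly like the following sketch. It assumes a plain `net/http` middleware; `newFilterToken`, `logRequests`, and the `logf` callback are hypothetical names, and the real middleware may differ.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newFilterToken returns a random hex string that a trusted client
// (e.g., a health checker) can send as its User-Agent.
func newFilterToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// logRequests logs every request except those whose User-Agent equals token.
func logRequests(token string, logf func(string), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if r.UserAgent() == token {
			return // filtered: no log entry for this request
		}
		logf(r.Method + " " + r.URL.Path)
	})
}

// demo sends one ordinary and one token-bearing request and returns the number of log entries.
func demo() int {
	token, err := newFilterToken()
	if err != nil {
		panic(err)
	}
	logged := 0
	handler := logRequests(token, func(string) { logged++ }, http.NotFoundHandler())
	for _, agent := range []string{"curl/8.0", token} {
		req := httptest.NewRequest(http.MethodGet, "/health", nil)
		req.Header.Set("User-Agent", agent)
		handler.ServeHTTP(httptest.NewRecorder(), req)
	}
	return logged
}

func main() {
	fmt.Println("logged requests:", demo())
}
```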
e3a8d202ac Adjust InfluxDB buffering
as we have experienced silent packet drops. This issue is not fixed; it is only made less probable.
2023-12-03 01:27:49 +01:00
c9922e2539 Decrease Log severity
of failing requests because it's likely that another error with more information has already been reported.
2023-11-30 16:44:22 +01:00
ab12c9046d Decrease Log Severity
of errors trying to read the request body.
2023-11-22 19:14:42 +01:00
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
c46a09eeae Add Prewarming Pool Alert
that checks for every environment whether the filled share of the prewarming pool is at least the specified threshold.
2023-11-09 13:11:39 +01:00
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions, as this error is widely used by other packages. Checking such wrapped errors can accidentally influence the internal decision.
In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
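The pitfall can be illustrated with a small sketch. It assumes a retry loop that wants to stop only when its own context has ended; `foreignTimeout` and `shouldStop` are hypothetical helpers, not the project's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// foreignTimeout simulates a call that fails with a deadline error
// produced by some other context (e.g., inside a library).
func foreignTimeout() error {
	return fmt.Errorf("library call: %w", context.DeadlineExceeded)
}

// shouldStop asks our own context directly instead of inspecting the error.
func shouldStop(ctx context.Context) bool {
	return ctx.Err() != nil
}

// demo reports what the error check and the context check each conclude.
func demo() (looksLikeTimeout, ownContextDone bool) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	err := foreignTimeout()
	// The error matches, although our context is still alive.
	looksLikeTimeout = errors.Is(err, context.DeadlineExceeded)
	ownContextDone = shouldStop(ctx)
	return looksLikeTimeout, ownContextDone
}

func main() {
	timeout, done := demo()
	fmt.Println("error matches DeadlineExceeded:", timeout)
	fmt.Println("own context done:", done)
}
```

Checking `ctx.Err()` keeps the decision tied to the context the retry loop actually owns.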
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at startup
to a method that is repeated, e.g., if the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

With this change, we fix that the WatchEventStream goroutine did not stop immediately when the context was done (but only one second later).
2023-09-11 13:44:29 +02:00
b28b87d56f Refactor periodicallySendMonitoringData
in order to return directly when the context is done and not just at the next iteration.
2023-09-11 13:44:29 +02:00
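Returning as soon as the context is done is commonly achieved with a `select` over the ticker and `ctx.Done()`. The sketch below assumes this pattern; `runPeriodically` is a hypothetical name and not the signature of `periodicallySendMonitoringData`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls task on every tick and returns as soon as ctx is done,
// without waiting for the next iteration.
func runPeriodically(ctx context.Context, interval time.Duration, task func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			task()
		}
	}
}

// demo cancels a loop with a one-hour interval and reports whether it returned promptly.
func demo() bool {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		runPeriodically(ctx, time.Hour, func() {})
		close(done)
	}()
	cancel()
	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("returned promptly:", demo())
}
```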
188d012bc4 Fix Memory Leak caused by the merge_context.
The now-removed statement that sent an empty struct into the channel blocked the goroutine until someone listened on the Done channel. This led to a goroutine leak, as one does not necessarily have to call the Done function of a context.

We fix this issue by removing this send. It was unnecessary anyway, as a receive from a closed channel always returns the zero value of the channel's element type.
2023-08-26 22:51:22 +02:00
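The channel behavior this fix relies on can be shown in isolation. The following is a minimal sketch with a hypothetical `signalDone` helper, not the merge_context code itself.

```go
package main

import "fmt"

// signalDone closes the channel instead of sending into it.
// Closing never blocks, so no goroutine has to wait for a receiver.
func signalDone(done chan struct{}) {
	close(done)
}

// demo receives several times from the closed channel.
func demo() (received int, ok bool) {
	done := make(chan struct{})
	signalDone(done)
	for i := 0; i < 3; i++ {
		<-done // returns immediately with the zero value
		received++
	}
	_, ok = <-done // ok is false because the channel is closed
	return received, ok
}

func main() {
	received, ok := demo()
	fmt.Println("receives:", received, "ok:", ok)
}
```

An unbuffered send such as `done <- struct{}{}` would instead block until some goroutine receives, which is the leak described in the commit.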
09604997a7 Implement MergeContext
that has multiple contexts as parent and chooses the earliest deadline.
2023-08-21 22:49:09 +02:00
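A simplified sketch of the idea follows. It assumes Go 1.21 or newer for `context.AfterFunc`, does not merge context values, and uses the hypothetical names `earliestDeadline` and `merge`; the actual MergeContext implementation may work differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// earliestDeadline returns the earliest deadline among the parents, if any has one.
func earliestDeadline(parents ...context.Context) (time.Time, bool) {
	var earliest time.Time
	found := false
	for _, p := range parents {
		if d, ok := p.Deadline(); ok && (!found || d.Before(earliest)) {
			earliest, found = d, true
		}
	}
	return earliest, found
}

// merge returns a context that is done when any parent is done
// and that carries the earliest parent deadline.
func merge(parents ...context.Context) (context.Context, context.CancelFunc) {
	var merged context.Context
	var cancel context.CancelFunc
	if d, ok := earliestDeadline(parents...); ok {
		merged, cancel = context.WithDeadline(context.Background(), d)
	} else {
		merged, cancel = context.WithCancel(context.Background())
	}
	stops := make([]func() bool, 0, len(parents))
	for _, p := range parents {
		// AfterFunc runs cancel in its own goroutine once p is done.
		stops = append(stops, context.AfterFunc(p, cancel))
	}
	return merged, func() {
		for _, stop := range stops {
			stop() // detach from parents that are still alive
		}
		cancel()
	}
}

// demo checks the merged deadline and the propagation of a parent's cancellation.
func demo() (sameDeadline, canceled bool) {
	a, cancelA := context.WithTimeout(context.Background(), time.Hour)
	defer cancelA()
	b, cancelB := context.WithTimeout(context.Background(), time.Minute)
	defer cancelB()

	merged, cancel := merge(a, b)
	defer cancel()

	want, _ := b.Deadline()
	got, ok := merged.Deadline()
	sameDeadline = ok && got.Equal(want)

	cancelA()
	select {
	case <-merged.Done():
		canceled = true
	case <-time.After(2 * time.Second):
	}
	return sameDeadline, canceled
}

func main() {
	same, canceled := demo()
	fmt.Println("earliest deadline chosen:", same, "canceled with parent:", canceled)
}
```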
306512bf9c Fix Context Values are not logged.
Only the Sentry hook uses the values of the passed context. Therefore, we removed the values from our log statements when we shifted them from an extra `WithField` call to the context.
We fix this behavior by introducing a Logrus Hook that copies a fixed set of context values to the logging data.
2023-08-21 22:40:37 +02:00
13a9da95e5 Introduce a context for RetryExponential
as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started to retry the Nomad environment recovery. This always fails for unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment got deleted.
2023-08-18 09:28:23 +02:00
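Combining an attempt limit with a context could be sketched as follows. The function name `retryExponential`, its parameters, and the doubling delay are assumptions for illustration and do not reproduce the project's RetryExponential.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryExponential calls f until it succeeds, the attempts are used up,
// or ctx is done, doubling the delay after every failure.
func retryExponential(ctx context.Context, attempts int, initial time.Duration, f func() error) error {
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = f(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no need to wait after the last attempt
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("stopped retrying: %w (last error: %v)", ctx.Err(), err)
		case <-time.After(delay):
			delay *= 2
		}
	}
	return err
}

// demo counts the calls with an already canceled context and with a function
// that succeeds on its third call.
func demo() (callsWhenCanceled, callsUntilSuccess int) {
	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	_ = retryExponential(canceled, 10, time.Hour, func() error {
		callsWhenCanceled++
		return errors.New("no Nomad cluster")
	})
	_ = retryExponential(context.Background(), 10, time.Millisecond, func() error {
		callsUntilSuccess++
		if callsUntilSuccess < 3 {
			return errors.New("not yet")
		}
		return nil
	})
	return callsWhenCanceled, callsUntilSuccess
}

func main() {
	canceled, success := demo()
	fmt.Println("calls with canceled context:", canceled)
	fmt.Println("calls until success:", success)
}
```

With a canceled context, the function is still tried once but no further waiting occurs, which avoids the unit test timeouts mentioned above.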
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
0fd6e42487 Add regression e2e test for incomplete debug message.
See #325.
2023-08-14 11:37:51 +02:00
731b60acd6 Remove Sentry Exceptions
as a workaround to get a usable title for the issue groups (not the error type).
2023-07-25 21:07:02 +01:00
75f2f9b290 Add Sentry Stack Traces
and exceptions for logs containing errors.
2023-07-25 21:07:02 +01:00
ee26cf13e5 Sentry: Make runner and environment searchable
by converting them into Sentry Tags.

Also, replace the unstructured Extra attribute by using a Sentry Context.
2023-07-15 21:46:56 +02:00
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
0bfef5e105 Degrade InfluxDB Retry Write log. 2023-07-14 18:54:57 +02:00
5b64725faa Fix golangci-lint errors
that appeared due to the new version v1.53.1.
2023-06-04 11:54:42 +01:00
f377b1376c Add Client Status to Nomad Allocation monitoring
Also add the Nomad Node name as additional debug information.
2023-05-10 19:09:31 +01:00
42efebc194 Monitor the Nomad events
and send all Nomad events to InfluxDB.
2023-05-09 00:13:58 +01:00
d8d9abbddd Add Job ID to Nomad Allocation monitoring. 2023-04-23 12:54:57 +01:00
0c8fa9ccfa Add context to log statements. 2023-04-11 20:45:30 +01:00
43221c717e Add context to Sentry Hook.
With this context, tracing information stored in the context can be associated with sentry events/issues.
2023-04-11 20:45:30 +01:00
038d71ff51 Nomad: Handle Container re-allocation 2023-03-31 14:42:55 +02:00
e0db1bafe8 Fix multiple user Runner use
A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.
2023-03-31 14:42:55 +02:00
e877cd1e52 Rename Sentry Span Descriptions. 2023-03-14 23:42:19 +01:00
7dadc5dfe9 Refactor Nomad Command Generation.
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
2023-03-14 23:42:19 +01:00
a4599f2cf9 Fix panic on influx shutdown.
Influx was shut down before Poseidon terminated. In the meantime, the profiling data is written. Also in the meantime, a periodic Influx event triggers a panic because Influx is already shut down.

We implemented two changes, each fixing this scenario.
2023-03-13 15:21:24 +01:00
aa9d4d30e2 Actually retry sending InfluxDB data
Previously, we always logged the error on the first failure and (nevertheless) tried to send the data within 3 minutes (default configuration).

Fixes POSEIDON-1H
Closes #262
2023-02-28 23:47:35 +01:00
2650efbb38 Sentry Tracing Identifier 2023-02-03 10:29:18 +00:00
a9581ac1d9 Performance for ListFileSystem 2023-02-03 10:29:18 +00:00
8950ab3776 Add single quotes for inner command.
Change to bash as interpreter.
Forbid single quotes for user commands.
2022-11-04 15:15:43 +01:00
5e5e13806e Monitor file download. 2022-10-26 01:33:26 +02:00
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
b9c923da8a Remove unused and deprecated Storer interface. 2022-10-24 22:12:09 +01:00
7119f3e012 Fix not canceling monitoring events for removed environments
and runners.
2022-10-24 13:15:14 +02:00
3509109b6f Fix Ls2JsonWriter
by allowing more spaces in the ls response and
by sending the error response of the list file system route only when no content has been written.
2022-10-05 12:11:47 +01:00
195f88177e Add Content-Length and Content-Disposition Header
for GetFileContent route.
2022-10-05 12:11:47 +01:00
847e5cda65 Extend ls2json reader
by also parsing the link target, permissions, group and owner.
2022-10-05 12:11:47 +01:00
fc77f11d4d Enquote file path for shell execution.
Also, fix json of 500 response.
2022-10-05 12:11:47 +01:00
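Quoting a path before passing it to a shell might look like the sketch below. It assumes POSIX single-quote escaping and a hypothetical `shellQuote` helper; the commit does not state which quoting style it uses.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps path in single quotes so that a POSIX shell treats it as one word.
// An embedded single quote is emitted as '\'' (close, escaped quote, reopen).
func shellQuote(path string) string {
	return "'" + strings.ReplaceAll(path, "'", `'\''`) + "'"
}

func main() {
	fmt.Println("ls -l " + shellQuote("/workspace/my file.txt"))
}
```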
152b77afe5 Add listing of the runner's file system. 2022-10-05 12:11:47 +01:00