89 Commits

Author SHA1 Message Date
Elmar Kresse
12ff205bd2 added k8s stub adapter for execution environment 2024-09-18 10:43:38 +02:00
Maximilian Paß
895dd8879f Revert "Debug HTTPLoggingMiddleware latency."
This reverts commit ae86b1c261.
2024-02-06 19:34:45 +00:00
Maximilian Paß
08c3a3d53d Decouple InfluxDB writings from request handling.
With #451, we found that writing an InfluxDB data point might block and lead to high latencies.
2024-01-28 10:57:01 +01:00
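One way to decouple the writes from request handling is a buffered channel that a background goroutine drains. This is only a minimal sketch: `Point`, `AsyncWriter`, and the drop-when-full policy are assumptions for illustration, not the code of this commit or the InfluxDB client API.

```go
package main

import (
	"fmt"
	"sync"
)

// Point is a hypothetical stand-in for an InfluxDB data point.
type Point struct {
	Name  string
	Value float64
}

// AsyncWriter hands points to a background goroutine so that the
// request handler never waits for the (possibly slow) write.
type AsyncWriter struct {
	points chan Point
	wg     sync.WaitGroup
}

// NewAsyncWriter starts one worker that calls write for every queued point.
func NewAsyncWriter(buffer int, write func(Point)) *AsyncWriter {
	w := &AsyncWriter{points: make(chan Point, buffer)}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for p := range w.points {
			write(p)
		}
	}()
	return w
}

// Enqueue never blocks: it drops the point and reports false if the buffer is full.
func (w *AsyncWriter) Enqueue(p Point) bool {
	select {
	case w.points <- p:
		return true
	default:
		return false
	}
}

// Close stops accepting points and waits until the queue is drained.
func (w *AsyncWriter) Close() {
	close(w.points)
	w.wg.Wait()
}

func main() {
	var written []Point
	w := NewAsyncWriter(16, func(p Point) { written = append(written, p) })
	w.Enqueue(Point{Name: "request_latency", Value: 0.042})
	w.Close()
	fmt.Println(len(written)) // 1
}
```

Dropping on a full buffer trades completeness of the monitoring data for bounded request latency; a real implementation might prefer to count the drops.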
Maximilian Paß
ae86b1c261 Debug HTTPLoggingMiddleware latency. 2024-01-26 22:51:55 +01:00
Maximilian Paß
57590457a8 Add logging filter token
The token is used to filter out request logs when the user agent matches a randomly generated string.
2024-01-24 17:21:00 +01:00
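The filtering idea can be sketched as follows. The function names are hypothetical, and treating "matches" as exact string equality is an assumption; the commit itself does not say how the comparison is done.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newFilterToken returns a random hex string that is practically impossible to guess.
func newFilterToken() (string, error) {
	raw := make([]byte, 16)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

// shouldLogRequest reports whether a request with the given User-Agent
// should appear in the request log (assumption: exact match filters it out).
func shouldLogRequest(userAgent, filterToken string) bool {
	return filterToken == "" || userAgent != filterToken
}

func main() {
	token, err := newFilterToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(shouldLogRequest("Mozilla/5.0", token)) // true
	fmt.Println(shouldLogRequest(token, token))         // false
}
```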
Maximilian Paß
e3a8d202ac Adjust InfluxDB buffering
as we have experienced silent packet drops. This issue is not fixed; it is only made less probable.
2023-12-03 01:27:49 +01:00
Maximilian Paß
c9922e2539 Decrease Log severity
of failing requests because it's likely that another error with more information has already been reported.
2023-11-30 16:44:22 +01:00
Maximilian Paß
ab12c9046d Decrease Log Severity
of errors trying to read the request body.
2023-11-22 19:14:42 +01:00
Maximilian Paß
70c108aebf Unify the representation of the three dots. 2023-11-09 13:11:39 +01:00
Maximilian Paß
c46a09eeae Add Prewarming Pool Alert
that checks for every environment whether the filled share of the prewarming pool is at least the specified threshold.
2023-11-09 13:11:39 +01:00
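The check described above amounts to comparing a ratio against a threshold. A minimal sketch, assuming hypothetical parameter names and a threshold expressed as a fraction between 0 and 1:

```go
package main

import "fmt"

// poolNeedsAlert reports whether the filled share of the prewarming pool
// (idle runners divided by the configured pool size) is below the threshold.
// An environment without a prewarming pool never triggers the alert.
func poolNeedsAlert(idleRunners, poolSize uint, threshold float64) bool {
	if poolSize == 0 {
		return false
	}
	return float64(idleRunners)/float64(poolSize) < threshold
}

func main() {
	fmt.Println(poolNeedsAlert(1, 10, 0.5)) // true: only 10% of the pool is filled
	fmt.Println(poolNeedsAlert(8, 10, 0.5)) // false: 80% of the pool is filled
}
```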
Maximilian Paß
d0dd5c08cb Remove usage of context.DeadlineExceeded
for internal decisions as this error is widely used by other packages. Checking for such wrapped errors can accidentally influence the internal decision.
In this case, the retry mechanism checked whether the error was context.DeadlineExceeded and assumed it had been created by the internal context. This assumption was wrong.
2023-10-31 15:49:56 +01:00
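The pitfall can be illustrated with a small sketch. `internalContextDone` and `demo` are hypothetical names; the point is only that an error from another package can wrap `context.DeadlineExceeded` while the internal context is still alive, so the decision is better based on the context itself.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// internalContextDone bases the decision on our own context, not on the error value.
func internalContextDone(ctx context.Context) bool {
	return ctx.Err() != nil
}

// demo returns what the error check suggests and what the internal context actually says.
func demo() (looksLikeDeadline, internalDone bool) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	// An error produced elsewhere that merely wraps context.DeadlineExceeded.
	foreign := fmt.Errorf("some client call: %w", context.DeadlineExceeded)

	return errors.Is(foreign, context.DeadlineExceeded), internalContextDone(ctx)
}

func main() {
	fmt.Println(demo()) // true false
}
```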
Maximilian Paß
6b69a2d732 Refactor Nomad Recovery
from an approach that loaded the runners only once at startup
to a method that is repeated, e.g. whenever the Nomad Event Stream connection is interrupted.
2023-10-31 15:49:56 +01:00
Maximilian Paß
3abd4d9a3d Refactor all tests to use the MemoryLeakTestSuite. 2023-09-11 13:44:29 +02:00
Maximilian Paß
e3161637a9 Extract the WatchEventStream retry mechanism
into the utils including all other retry mechanisms.

With this change, we fix that the WatchEventStream goroutine did not stop immediately when the context was done (previously, it stopped only one second later).
2023-09-11 13:44:29 +02:00
Maximilian Paß
b28b87d56f Refactor periodicallySendMonitoringData
in order to return directly when the context is done and not just at the next iteration.
2023-09-11 13:44:29 +02:00
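Returning directly on cancellation is commonly done with a `select` over the context and a ticker. The following is a sketch of that pattern with hypothetical names, not the refactored function itself.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls send every interval and returns as soon as ctx is done,
// without waiting for the current interval to elapse.
func runPeriodically(ctx context.Context, interval time.Duration, send func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			send()
		}
	}
}

// returnsPromptly cancels a loop with a one-hour interval and reports
// whether it returned within one second.
func returnsPromptly() bool {
	ctx, cancel := context.WithCancel(context.Background())
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		runPeriodically(ctx, time.Hour, func() {})
	}()
	cancel()
	select {
	case <-finished:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println(returnsPromptly()) // true
}
```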
Maximilian Paß
188d012bc4 Fix Memory Leak caused by the merge_context.
The now-removed statement that sent an empty struct into the channel blocked the goroutine until someone listened on the Done channel. This led to a goroutine leak, as one does not necessarily have to call the Done function of a context.

We fix this issue by removing the send. It was unnecessary anyway, as receiving from a closed channel always returns the zero value of the element type.
2023-08-26 22:51:22 +02:00
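The channel behavior the fix relies on can be checked in isolation. In this small sketch (the function name is hypothetical), closing the channel alone is enough to unblock any number of receivers, so no value has to be sent:

```go
package main

import "fmt"

// receiveAll reads n times from a channel that was closed without ever sending a value.
func receiveAll(n int) int {
	done := make(chan struct{})
	close(done) // no send needed: closing unblocks all current and future receivers

	received := 0
	for i := 0; i < n; i++ {
		<-done // returns the zero value struct{}{} immediately
		received++
	}
	return received
}

func main() {
	fmt.Println(receiveAll(3)) // 3
}
```

A send on an unbuffered channel, in contrast, blocks until a receiver arrives, which is what kept the goroutine alive when nobody called Done.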
Maximilian Paß
09604997a7 Implement MergeContext
that has multiple contexts as parent and chooses the earliest deadline.
2023-08-21 22:49:09 +02:00
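The deadline selection can be sketched as below. This covers only the `Deadline` part under a hypothetical function name; a complete merged context would also have to implement `Done`, `Err`, and `Value` across all parents.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// earliestDeadline returns the earliest deadline among ctxs.
// ok is false if none of the contexts has a deadline.
func earliestDeadline(ctxs ...context.Context) (deadline time.Time, ok bool) {
	for _, ctx := range ctxs {
		d, has := ctx.Deadline()
		if has && (!ok || d.Before(deadline)) {
			deadline, ok = d, true
		}
	}
	return deadline, ok
}

// demo merges a context without a deadline, one expiring in an hour,
// and one expiring in a minute, and checks that the minute wins.
func demo() bool {
	now := time.Now()
	long, cancelLong := context.WithDeadline(context.Background(), now.Add(time.Hour))
	defer cancelLong()
	short, cancelShort := context.WithDeadline(context.Background(), now.Add(time.Minute))
	defer cancelShort()

	deadline, ok := earliestDeadline(context.Background(), long, short)
	return ok && deadline.Equal(now.Add(time.Minute))
}

func main() {
	fmt.Println(demo()) // true
}
```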
Maximilian Paß
306512bf9c Fix Context Values are not logged.
Only the Sentry hook uses the values of the passed context. Therefore, the values disappeared from our log statements when we moved them from an extra `WithField` call to the context.
We fix this behavior by introducing a Logrus Hook that copies a fixed set of context values to the logging data.
2023-08-21 22:40:37 +02:00
Maximilian Paß
13a9da95e5 Introduce a context for RetryExponential
as a second criterion (next to the maximum number of attempts) for canceling the retrying. This is required because, with the previous commit, we started to retry the Nomad environment recovery. The recovery always fails in unit tests (as they are not connected to a Nomad cluster). Before, we ignored the single error, but the retrying leads to unit test timeouts.
Additionally, we now stop retrying to create a runner when the environment has been deleted.
2023-08-18 09:28:23 +02:00
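A retry loop with both stop criteria might look like the sketch below. The signature and the doubling of the delay are assumptions for illustration, not the actual `RetryExponential` from the utils package.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("not ready")

// retryExponential calls f until it succeeds, maxAttempts is reached,
// or ctx is done. The delay doubles after every failed attempt.
func retryExponential(ctx context.Context, maxAttempts int, delay time.Duration, f func() error) error {
	for attempt := 1; ; attempt++ {
		err := f()
		if err == nil || attempt >= maxAttempts {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
}

// countAttempts runs an always-failing function and returns how often it was called.
func countAttempts(cancelFirst bool) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	calls := 0
	_ = retryExponential(ctx, 3, time.Millisecond, func() error {
		calls++
		return errNotReady
	})
	return calls
}

func main() {
	fmt.Println(countAttempts(false), countAttempts(true)) // 3 1
}
```

With a canceled context, the loop gives up after the first failure instead of exhausting all attempts, which is what keeps unit tests without a Nomad cluster from timing out.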
Maximilian Paß
73759f8a3c Retry Environment Recovery 2023-08-18 09:28:23 +02:00
Maximilian Paß
0fd6e42487 Add regression e2e test for incomplete debug message.
See #325.
2023-08-14 11:37:51 +02:00
Maximilian Paß
731b60acd6 Remove Sentry Exceptions
as a workaround to get a usable title for the issue groups (instead of the error type).
2023-07-25 21:07:02 +01:00
Maximilian Paß
75f2f9b290 Add Sentry Stack Traces
and exceptions for logs containing errors.
2023-07-25 21:07:02 +01:00
Maximilian Paß
ee26cf13e5 Sentry: Make runner and environment searchable
by converting it into a Sentry Tag.

Also, replace the unstructured Extra attribute by using a Sentry Context.
2023-07-15 21:46:56 +02:00
Maximilian Paß
e7df777db4 Always log Runner and Environment ID.
Systematically log the runner id and the environment id by adding the information at the findRunnerMiddleware.
2023-07-15 21:46:56 +02:00
Maximilian Paß
0bfef5e105 Degrade InfluxDB Retry Write log. 2023-07-14 18:54:57 +02:00
Maximilian Paß
5b64725faa Fix golangci-lint errors
that appeared due to the new version v1.53.1.
2023-06-04 11:54:42 +01:00
Maximilian Paß
f377b1376c Add Client Status to Nomad Allocation monitoring
Also add the Nomad Node name as additional debug information.
2023-05-10 19:09:31 +01:00
Maximilian Paß
42efebc194 Monitor the Nomad events
and send all Nomad events to InfluxDB.
2023-05-09 00:13:58 +01:00
Maximilian Paß
d8d9abbddd Add Job ID to Nomad Allocation monitoring. 2023-04-23 12:54:57 +01:00
Maximilian Paß
0c8fa9ccfa Add context to log statements. 2023-04-11 20:45:30 +01:00
Maximilian Paß
43221c717e Add context to Sentry Hook.
With this context, tracing information stored in the context can be associated with sentry events/issues.
2023-04-11 20:45:30 +01:00
Maximilian Paß
038d71ff51 Nomad: Handle Container re-allocation 2023-03-31 14:42:55 +02:00
Maximilian Paß
e0db1bafe8 Fix multiple user Runner use
A previously unknown Nomad reload adds already known runners to the idle runners again, even if they are already in use.
2023-03-31 14:42:55 +02:00
Maximilian Paß
e877cd1e52 Rename Sentry Span Descriptions. 2023-03-14 23:42:19 +01:00
Maximilian Paß
7dadc5dfe9 Refactor Nomad Command Generation.
- Abstracting from the exec form while generating.
- Removal of single quotes (usage of only double-quotes).
- Bash-nesting using escaping of special characters.
2023-03-14 23:42:19 +01:00
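Nesting a command inside a double-quoted Bash string requires escaping the characters that stay special there. The sketch below is an assumption about how such escaping could look (the function name and the `bash -c` wrapper are hypothetical); it escapes backslash, double quote, dollar sign, and backtick.

```go
package main

import (
	"fmt"
	"strings"
)

// doubleQuoteEscaper escapes the characters that Bash still interprets
// inside double quotes. The backslash comes first only for readability;
// strings.Replacer never rescans text it has already replaced.
var doubleQuoteEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	`$`, `\$`,
	"`", "\\`",
)

// wrapForBash nests command as a single double-quoted argument of bash -c.
func wrapForBash(command string) string {
	return `bash -c "` + doubleQuoteEscaper.Replace(command) + `"`
}

func main() {
	fmt.Println(wrapForBash(`echo "$HOME"`)) // bash -c "echo \"\$HOME\""
}
```

Because each nesting level only adds escapes, the wrapper can be applied repeatedly, which is not possible with single quotes.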
Maximilian Paß
a4599f2cf9 Fix panic on influx shutdown.
Influx was shut down before Poseidon was terminated. In the meantime, the profiling data was written. Also in the meantime, a periodic Influx event triggered a panic since Influx was already shut down.

We implemented two changes, each fixing this scenario.
2023-03-13 15:21:24 +01:00
Sebastian Serth
aa9d4d30e2 Actually retry sending InfluxDB data
Previously, we always logged the error on first failure and (nevertheless) tried to send the data within 3 minutes (default configuration).

Fixes POSEIDON-1H
Closes #262
2023-02-28 23:47:35 +01:00
Maximilian Paß
2650efbb38 Sentry Tracing Identifier 2023-02-03 10:29:18 +00:00
Maximilian Paß
a9581ac1d9 Performance for ListFileSystem 2023-02-03 10:29:18 +00:00
Maximilian Paß
8950ab3776 Add single quotes for inner command.
Change to bash as interpreter.
Forbid single quotes for user commands.
2022-11-04 15:15:43 +01:00
Maximilian Paß
5e5e13806e Monitor file download. 2022-10-26 01:33:26 +02:00
Maximilian Paß
160df3d9e6 Add retry-mechanism for sample, mark-as-used and return
of Nomad runners.
2022-10-24 22:12:09 +01:00
Maximilian Paß
b9c923da8a Remove unused and deprecated Storer interface. 2022-10-24 22:12:09 +01:00
Maximilian Paß
7119f3e012 Fix not canceling monitoring events for removed environments
and runners.
2022-10-24 13:15:14 +02:00
Maximilian Paß
3509109b6f Fix Ls2JsonWriter
by allowing more spaces in the ls response and
by sending the error response of the list file system route only when no content has been written.
2022-10-05 12:11:47 +01:00
Maximilian Paß
195f88177e Add Content-Length and Content-Disposition Header
for GetFileContent route.
2022-10-05 12:11:47 +01:00
Maximilian Paß
847e5cda65 Extend ls2json reader
by also parsing the link target, permissions, group and owner.
2022-10-05 12:11:47 +01:00
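Parsing the additional fields could be sketched as follows. This assumes the default `ls -l` column layout with three date columns and names without spaces; the `FileEntry` type and `parseLsLine` are hypothetical and not the actual ls2json reader.

```go
package main

import (
	"fmt"
	"strings"
)

// FileEntry holds the fields extracted from one line of `ls -l` output.
// Permissions still includes the leading file type character (e.g. "l" or "d").
type FileEntry struct {
	Permissions, Owner, Group, Name, LinkTarget string
}

// parseLsLine parses a line such as
// "lrwxrwxrwx 1 user group 7 Oct  5 12:11 link -> target".
// It reports false for lines that are not file entries (e.g. "total 4").
func parseLsLine(line string) (FileEntry, bool) {
	fields := strings.Fields(line)
	if len(fields) < 9 {
		return FileEntry{}, false
	}
	entry := FileEntry{
		Permissions: fields[0],
		Owner:       fields[2],
		Group:       fields[3],
		Name:        fields[8],
	}
	if len(fields) == 11 && fields[9] == "->" {
		entry.LinkTarget = fields[10]
	}
	return entry, true
}

func main() {
	entry, ok := parseLsLine("lrwxrwxrwx 1 user group 7 Oct  5 12:11 link -> target")
	fmt.Println(ok, entry.Name, entry.LinkTarget, entry.Owner, entry.Group)
}
```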
Maximilian Paß
fc77f11d4d Quote file path for shell execution.
Also, fix json of 500 response.
2022-10-05 12:11:47 +01:00
Maximilian Paß
152b77afe5 Add listing of the runner's file system. 2022-10-05 12:11:47 +01:00