Message | Time |
---|---|
14 Apr 2021 | ||
* Question: I've got a statefulset running three pods with an ingress controller (AWS ALB) in front of the set. Each pod tends to have long-running connections (mostly large uploads and downloads which take seconds to minutes). When I delete a pod and the statefulset brings it back up, it seems that there is no connection draining happening when there is an explicit delete. Specifically, it seems that the delete sends the pod the shutdown signal before connections are drained, so the ingress controller responds with 502 codes when traffic is routed to the terminating pod, because the service in it has already shut down. So I suppose the question is: when deleting a pod, is traffic drained by the delete action (it seems like this is not the case), or by the liveness probe failing at the threshold count? | 22:45:04 |
15 Apr 2021 | ||
The problem is that these things are unaware of each other. | 06:07:42 | |
In reply to @a:oper.io: Both. | 07:26:53 |
Any of you folks using k3s on a single node setup? What's your experience on it? | 09:16:19 | |
It's great, I use it mainly as a replacement for systemd. Same API for everything. | 09:24:01 |
yep. Will test it around a bit. | 09:28:52 | |
Vadim Rutkovsky: Ah, yes, thanks for sending that link over! It looks like I can more intelligently handle the situation with a carefully crafted preStop hook script that monitors active socket counts, waiting for the number to reduce to 0. So in theory: | 13:56:58 |
Sounds like a hacky way to do it. Don't you have control over the app, so it can drain connections on SIGTERM? | 13:59:43 |
Sadly no. It's a vendor app (and written in Java no less 😰). SIGTERM gracelessly terminates the service, which kills active connections. | 14:09:48 |
A-a-ron: Any chance you could use an external session handler? | 14:10:19 | |
I checked the official vendor helm chart and comically, it also doesn't gracefully handle shutdown. lol | 14:10:20 | |
That's what I did with some apps (redis/nutcracker). This way the session pool remained; only the currently active data-exchanging sessions can't be recovered. | 14:11:05 |
So sessions can float between pods. The trick here is that there are long running file transfers (Artifactory is the software). Folks are downloading files, so if the service shuts down, it can't transition to a new pod. The active connection just terminates when SIGTERM is received. | 14:13:23 | |
That will not transition well. Don't know of a way to do it actually. | 14:14:06 | |
Maybe do a quick calc of the times when this could happen and see if it still falls within your SLO. If it does maybe don't even bother... | 14:17:52 | |
Once things stabilize, pod deletion probably won't happen too often. The stateful set right now is set up with the OnDelete update strategy because this is so problematic. There are unfortunately no downtime windows for these systems, as far too many things depend on them to have a convenient window to shut them down (>100k requests per hour). We just have to choose a window that minimizes impact. Even if it's a bit hacky, having a way to block pod service termination until connections reach 0 or a timeout threshold is reached would be fantastic. | 14:31:09 |
K8s is like Windows 95. You look at the settings, everything is OK, yet it doesn't work. Delete the pod so it restarts, or re-add the egress IP, and everything works again... | 17:29:06 |
True | 17:29:55 | |
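For reference on the termination question at the top of the log: deleting a pod kicks off endpoint removal (which the ALB controller follows) and SIGTERM delivery at the same time, with no ordering between them, which is why in-flight requests can still be routed to a pod whose process has already begun shutting down, and why readiness (not liveness) is what keeps a pod in the target group. Below is a minimal sketch of the usual mitigation, a preStop delay long enough for target deregistration to propagate before SIGTERM; the container name, port, probe path, and timings are illustrative assumptions, not values from the chat.

```yaml
# Pod template fragment for the StatefulSet (sketch only).
spec:
  # Must be longer than the preStop delay plus the app's own shutdown time.
  terminationGracePeriodSeconds: 90
  containers:
    - name: artifactory            # assumed container name
      ports:
        - containerPort: 8081      # assumed service port
      # Readiness (not liveness) controls whether the pod stays in Endpoints
      # and therefore in the ALB target group.
      readinessProbe:
        httpGet:
          path: /artifactory/api/system/ping   # assumed health endpoint
          port: 8081
        periodSeconds: 10
      lifecycle:
        preStop:
          exec:
            # Hold off SIGTERM so endpoint removal and ALB target
            # deregistration can finish before the process shuts down.
            command: ["sh", "-c", "sleep 30"]
```

A fixed sleep only covers the deregistration window; new connections stop arriving once the pod leaves the target group, but transfers already in flight still race against the sleep.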
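The "wait until active sockets drain" preStop idea discussed later in the thread could look roughly like the sketch below. It assumes the image ships the `ss` utility (iproute2) and that client traffic arrives on port 8081; treat the port, the 10-minute deadline, and the grace period as placeholders rather than anything from the vendor chart.

```yaml
# Sketch of a connection-draining preStop hook (not from the vendor chart).
spec:
  # Must exceed the drain deadline below, or the kubelet kills the pod first.
  terminationGracePeriodSeconds: 660
  containers:
    - name: artifactory            # assumed container name
      lifecycle:
        preStop:
          exec:
            command:
              - sh
              - -c
              - |
                # Block until no ESTABLISHED connections remain on the
                # service port, or give up after ~10 minutes so a stuck
                # client cannot hold the pod open forever.
                deadline=$(( $(date +%s) + 600 ))
                while [ "$(ss -Htn state established '( sport = :8081 )' | wc -l)" -gt 0 ]; do
                  [ "$(date +%s)" -ge "$deadline" ] && break
                  sleep 5
                done
```

This only delays SIGTERM; once the grace period expires the kubelet still force-kills the container, so transfers that outlast the deadline are cut off regardless.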