!YjuQUNkcLAURsubBFF:matrix.org

Kubernetes

918 Members | 119 Servers
Container orchestration from Google http://kubernetes.io/



14 Apr 2021
A-a-ron (@a:oper.io) removed their profile picture. [22:36:20]
A-a-ron (@a:oper.io) set a profile picture. [22:36:35]
A-a-ron (@a:oper.io) (edited): Question: I've got a statefulset running three pods with an ingress controller (AWS ALB) in front of the set. Each pod tends to have long-running connections (mostly large uploads and downloads, which take seconds to minutes). When I delete a pod and the statefulset brings it back up, there seems to be no connection draining on an explicit delete. Specifically, it seems that Kubernetes sends the pod shutdown signal before draining connections, so the ingress controller responds with 502s when traffic is routed to the deleting pod, because the service is already shut down. So I suppose the question is: when deleting a pod, is traffic drained by the delete action (it seems like this is not the case), or by the liveness probe failing at the threshold count? [22:45:04]
burpshirt (@burpshirt:matrix.org) left the room. [22:51:10]
15 Apr 2021
Folken (@folken:kabelsalat.ch): The problem is that these things are unaware of each other. [06:07:42]
Vadim Rutkovsky (@vadim:vrutkovs.eu), in reply to @a:oper.io's question above:

> So I suppose the question is: when deleting a pod, is traffic drained by the delete action (it seems like this is not the case), or by the liveness probe failing at the threshold count?

Both.
When a pod is deleted, the container should handle SIGTERM and drain connections within the terminationGracePeriod (see the sequence of events).
When a pod doesn't pass its liveness checks, it won't be included in the endpoints list and won't receive new connections. [07:26:53]
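For reference, a minimal sketch of the two knobs involved, with illustrative names and values (the pod name, image, port, health path, and grace period are assumptions, not from this conversation). Note that endpoint membership is driven by the readiness probe:

  # Sketch: give the container time to drain in-flight connections after
  # SIGTERM, and let a failing probe remove the pod from endpoints.
  apiVersion: v1
  kind: Pod
  metadata:
    name: upload-server                        # hypothetical name
  spec:
    terminationGracePeriodSeconds: 120         # assumed value; the default is 30
    containers:
      - name: app                              # hypothetical container name
        image: example.com/upload-server:1.0   # placeholder image
        readinessProbe:                        # failing readiness removes the pod from endpoints
          httpGet:
            path: /healthz                     # assumed health endpoint
            port: 8080                         # assumed port
          periodSeconds: 5
          failureThreshold: 2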
rubencabrera (@rubencabrera:matrix.org) joined the room. [07:46:18]
Ingvin (@ingvin:matrix.org) joined the room. [08:06:59]
Haley (@neo:solsys.org): Any of you folks using k3s in a single-node setup? What's your experience with it? [09:16:19]
cxz38 (@rio:rio.systems): It's great; I use it mainly as a replacement for systemd. Same API for everything. [09:24:01]
Haley (@neo:solsys.org): Yep, will test it out a bit. [09:28:52]
A-a-ron (@a:oper.io): Vadim Rutkovsky: Ah, yes, thanks for sending that link over! It looks like I can handle the situation more intelligently with a carefully crafted preStop hook script that monitors active socket counts, waiting for the number to drop to 0. So in theory:

  1. Execute the pod delete.
  2. The pod goes into a "Terminating" state, which removes it from all services, disallowing new connections but allowing existing ones.
  3. The preStop hook executes (my script, presumably waiting for socket counts to go down). This blocks until the script exits (within the grace period).
  4. SIGTERM is sent and the app shuts down (normally not gracefully, just killing all remaining connections, which should now be 0).
  5. (The rest of the process I don't care too much about.)

[13:56:58, edited 13:57:37]
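A minimal sketch of the preStop script from step 3 above, assuming the image ships the ss utility and a POSIX shell; the port (8080) and the 110-second budget (kept under terminationGracePeriodSeconds) are illustrative:

  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - |
            # Block until established connections on the app port reach 0,
            # or a deadline elapses; the kubelet sends SIGTERM only after
            # this hook exits (or the grace period runs out).
            deadline=$(( $(date +%s) + 110 ))
            while [ "$(ss -Htn state established '( sport = :8080 )' | wc -l)" -gt 0 ]; do
              [ "$(date +%s)" -ge "$deadline" ] && exit 0   # give up, let SIGTERM proceed
              sleep 2
            done

Because the pod has already been removed from the endpoints list by the time the hook runs (step 2), the connection count can only go down while the script waits.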
cxz38 (@rio:rio.systems): Sounds like a hacky way to do it. Don't you have control over the app, so it can drain connections on SIGTERM itself? [13:59:43]
A-a-ron (@a:oper.io): Sadly no. It's a vendor app (and written in Java, no less 😰). SIGTERM gracelessly terminates the service, which kills active connections. [14:09:48]
Haley (@neo:solsys.org): A-a-ron: Any chance you could use an external session handler? [14:10:19]
A-a-ron (@a:oper.io): I checked the official vendor Helm chart and, comically, it also doesn't handle shutdown gracefully. lol [14:10:20]
Haley (@neo:solsys.org): That's what I did with some apps (redis/nutcracker). This way the session pool remained; only the currently active data-exchanging sessions can't be recovered. [14:11:05]
A-a-ron (@a:oper.io): So sessions can float between pods. The catch here is that these are long-running file transfers (the software is Artifactory). Folks are downloading files, so if the service shuts down, the transfer can't transition to a new pod. The active connection just terminates when SIGTERM is received. [14:13:23]
Haley (@neo:solsys.org): That will not transition well. I don't actually know of a way to do it. [14:14:06]
cxz38 (@rio:rio.systems): Maybe do a quick calculation of the times when this could happen and see if it still falls within your SLO. If it does, maybe don't even bother... [14:17:52]
A-a-ron (@a:oper.io): Once things stabilize, pod deletion probably won't happen too often. The stateful set is currently set up with the OnDelete update strategy because this is so problematic. There are unfortunately no downtime windows for these systems; far too many things depend on them (>100k requests per hour) for there to be a convenient window to shut them down. We just have to choose a window that minimizes impact. Even if it's a bit hacky, having a way to block pod service termination until connections reach 0 or a timeout threshold is reached would be fantastic. [14:31:09]
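The OnDelete strategy mentioned here lives on the StatefulSet spec; a minimal sketch with illustrative names (the image, labels, and grace period are assumptions):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: artifactory                  # hypothetical name
  spec:
    replicas: 3
    serviceName: artifactory
    selector:
      matchLabels:
        app: artifactory
    updateStrategy:
      type: OnDelete                   # pods are replaced only on explicit delete
    template:
      metadata:
        labels:
          app: artifactory
      spec:
        terminationGracePeriodSeconds: 600     # assumed: a long budget for in-flight transfers
        containers:
          - name: artifactory
            image: example.com/artifactory:tag # placeholder image

With OnDelete, a rollout never cycles pods automatically; each pod is updated only when someone deletes it, which lets the team pick the low-impact window themselves.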
Folken (@folken:kabelsalat.ch): K8s is like Windows 95. You look at the settings, everything's OK, yet it doesn't work. Delete the pod so it restarts, or re-add the egress IP, and everything works again... [17:29:06]
snedi (@snedi:matrix.org): True [17:29:55]
@j0ta:matrix.org joined the room. [19:52:09]
@j0ta:matrix.org left the room. [19:56:53]
16 Apr 2021
emdevhci (@emdevhci:matrix.org) joined the room. [17:00:46]
cryptochained (@dramagods:matrix.org) changed their display name from dramagods to cryptochained. [18:31:48]
17 Apr 2021
yamax (@yamax:matrix.org) joined the room. [00:13:03]
Nausiyan Nyan (@nausiyanmeow:matrix.org) joined the room. [05:20:22]


