26 Oct 2021 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org is it possible to remove the taints first? on the node and in the yaml | 15:41:24 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org on the node and in the yaml then at least we know that the service can run on GPU | 15:41:34 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org then at least we know that the service can run on GPU Yes it works on GPU without the taint | 15:44:55 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Yes it works on GPU without the taint Great success! | 15:45:29 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Great success! At least you got the hard part (GPU!) down | 15:45:36 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org At least you got the hard part (GPU!) down Yes! | 15:46:19 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Yes! I guess you could try on a simple pod to see if the node tolerations/taints work | 15:46:38 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org I guess you could try on a simple pod to see if the node tolerations/taints work just to rule out that it's a kfserving thing (I don't think it is) | 15:46:59 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org just to rule out that it's a kfserving thing (I don't think it is) Right, I'll try in a notebook, good idea | 15:47:18 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Right, I'll try on a notebook good idea Good luck! 😄 | 15:48:22 |
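The "simple pod" experiment suggested above could look roughly like this (a minimal sketch; the taint key/value, pod name, and image are illustrative assumptions, not taken from the actual cluster):

```yaml
# Minimal pod carrying an explicit toleration for a custom node taint,
# e.g. a taint applied beforehand with:
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: toleration-test          # illustrative name
spec:
  tolerations:
    - key: dedicated             # must match the taint's key
      operator: Equal
      value: gpu                 # must match the taint's value
      effect: NoSchedule
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
```

If this pod schedules onto the tainted node, the taint/toleration pair itself works and any remaining problem lies elsewhere.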
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Good luck! 😄 Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes.
I realized that `kubectl describe node` would not show any GPU resources after adding the taint back. After removing the taint, the GPU resources show up again.
The issue was that the nvidia-device-plugin did not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. Since the nvidia-device-plugin could not run on the node, my "gpu" node could not expose its GPUs, making it effectively a plain CPU node!
The solution was to rely on the conventional ExtendedResourceToleration admission controller for my GPU node, so I added the taint "nvidia.com/gpu" with effect equal to "NoSchedule". The nvidia-device-plugin tolerates this taint, and the admission controller automatically adds a matching toleration to pods that request a GPU (pods that do not request a GPU do not get the toleration, which is awesome because it prevents CPU-only workloads from ever being scheduled on a GPU node).
The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html
And here is the commit that added the toleration for the nvidia.com/gpu taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7
It seems that GKE automatically adds the nvidia.com/gpu taint to GPU nodes when a GPU node pool is added to a cluster that already has a non-GPU node pool.
Sources: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_pool
This does not seem to be the case for AWS (I was using AWS in my tests).
Dan Sun I suggest we document these findings as it can save people a lot of time. I am open to contribute if possible. What do you think? | 18:59:57 |
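The setup described above can be sketched as follows (hedged; the node name, pod name, and image are illustrative assumptions):

```yaml
# Taint the GPU node so CPU-only pods are never scheduled on it:
#   kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
#
# The nvidia-device-plugin DaemonSet already tolerates this taint, so it
# can still run on the node and expose its GPUs. With the
# ExtendedResourceToleration admission controller enabled, a pod only
# needs the GPU resource request; the toleration shown at the bottom is
# injected automatically and does not have to be written by hand.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                      # illustrative name
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:11.0-base    # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"         # extended resource request
  # Injected by the ExtendedResourceToleration admission controller:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

Because the injection only happens for pods requesting the extended resource, CPU-only workloads never receive the toleration and thus never land on the GPU node.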
Ian Miller | Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂 | 19:05:42 |
Alexandre Brown | In reply to@_slack_kubeflow_U01FC4Y6QBB:matrix.org Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂 Hello Ian, that's a good question. From what I know there is already a component for KFServing; if you really meant KServe as in the 0.7 version, then not to my knowledge: https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml
See this sample: https://github.com/kubeflow/kfp-tekton/blob/master/samples/e2e-mnist/mnist.ipynb
You could also try writing a simple component that uses the KFServingClient (I have to figure out which way I prefer as well). See step 3 here: https://github.com/kubeflow/kfp-tekton/blob/master/samples/e2e-mnist/mnist.ipynb
Hopefully this helps,
Cheers | 19:28:26 |
Ian Miller | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Hello Ian, that's a good question, from what I know there is already a component for KFServing … Cheers Thanks for the response Alexandre! Yeah, we use a version of your first link today with our current KFServing deployment. Looking to do something similar while testing out KServe 0.7. I'll likely just make my own in the style of the existing KFServing one, but updated for the new CRDs and such. Thanks! | 19:32:52 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes. … What do you think? Alexandre Brown Benjamin Tan Awesome work on the investigations! Can we add an example for deploying models on GPU? | 23:20:46 |
Dan Sun | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Alexandre Brown Benjamin Tan Awesome work on the investigations! Can we add an example for deploying models on GPU? Add a section here for the GPU example | 23:21:54 |
Alexandre Brown | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Add a section here for the GPU example I will make sure to include all the required information tomorrow Dan Sun | 23:22:45 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org I will make sure to include all the required information tomorrow Dan Sun Thanks!! that would be really useful for a lot of the users | 23:24:03 |
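The GPU example being discussed would presumably look something along these lines (a hedged sketch only; the service name, model storage URI, and resource values are illustrative assumptions):

```yaml
# Sketch of a KFServing InferenceService requesting one GPU.
# With the nvidia.com/gpu=present:NoSchedule taint on the node and the
# ExtendedResourceToleration admission controller enabled, the GPU
# resource limit below is all that is needed; the matching toleration
# is injected into the predictor pod automatically.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample-gpu          # illustrative name
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-samples/models/tensorflow/flowers  # illustrative URI
      resources:
        limits:
          nvidia.com/gpu: "1"       # schedules the predictor onto a GPU node
```

Without the admission controller (or the built-in cloud behavior GKE provides), the toleration would have to be added to the predictor spec by hand, which is exactly the pitfall described earlier in the thread.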
27 Oct 2021 |
Benjamin Tan | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Thanks!! that would be really useful for a lot of the users Thanks for that detailed solution 💙♥️💛💛💚❤️. Totally not obvious at all | 00:32:13 |
iamlovingit | In reply to@_slack_kubeflow_U02JVFFP213:matrix.org iamlovingit Apologies, but I don't quite understand your question. Sorry for the unclear question; I mean: does the domain example.com correctly resolve to your ingress IP? | 01:08:34 |