26 Oct 2021 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org is it possible to remove the taints first? on the node and in the yaml | 15:41:24 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org on the node and in the yaml then at least we know that the service can run on GPU | 15:41:34 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org then at least we know that the service can run on GPU Yes it works on GPU without the taint | 15:44:55 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Yes it works on GPU without the taint Great success! | 15:45:29 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Great success! At least you got the hard part (GPU!) down | 15:45:36 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org At least you got the hard part (GPU!) down Yes! | 15:46:19 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Yes! I guess you could try on a simple pod to see if the node tolerations/taints work | 15:46:38 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org I guess you could try on a simple pod to see if the node tolerations/taints work just to rule out that it's a kfserving thing (I don't think it is) | 15:46:59 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org just to rule out that it's a kfserving thing (I don't think it is) Right, I'll try in a notebook, good idea | 15:47:18 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Right, I'll try on a notebook good idea Good luck! 😄 | 15:48:22 |
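The "simple pod" experiment suggested above could look roughly like this (a minimal sketch; the taint key/value, pod name, and image are illustrative assumptions, not taken from the actual cluster):

```yaml
# Minimal pod carrying an explicit toleration for a custom node taint,
# e.g. a taint applied beforehand with:
#   kubectl taint nodes <node-name> dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: toleration-test          # illustrative name
spec:
  tolerations:
    - key: dedicated             # must match the taint's key
      operator: Equal
      value: gpu                 # must match the taint's value
      effect: NoSchedule
  containers:
    - name: main
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
```

If this pod schedules onto the tainted node, the taint/toleration pair itself works and any remaining problem lies elsewhere.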
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Good luck! 😄 Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes.
I realized that `kubectl describe node` would not show any GPU resources after adding the taint back. After removing the taint, the GPU resources show up again.
The issue was that the nvidia-device-plugin did not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. Since the nvidia-device-plugin could not run on the node, my "gpu" node could not expose its GPUs, making it effectively a plain CPU node!
The solution was to rely on the conventional ExtendedResourceToleration admission controller for my GPU node, so I added the taint "nvidia.com/gpu" with effect equal to "NoSchedule". The nvidia-device-plugin tolerates this taint, and the admission controller automatically adds a matching toleration to pods that request a GPU (pods that do not request a GPU do not get the toleration, which is awesome because it prevents CPU-only workloads from ever being scheduled on a GPU node).
The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html
And here is the commit that added the toleration for the nvidia.com/gpu taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7
It seems that GKE automatically adds the nvidia.com/gpu taint to GPU nodes when a GPU node pool is added to a cluster that already has a non-GPU node pool.
Sources: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_pool
This does not seem to be the case for AWS (I was using AWS in my tests).
Dan Sun I suggest we document these findings as it can save people a lot of time. I am open to contribute if possible. What do you think? | 18:59:57 |
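The setup described above can be sketched as follows (hedged; the node name, pod name, and image are illustrative assumptions):

```yaml
# Taint the GPU node so CPU-only pods are never scheduled on it:
#   kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
#
# The nvidia-device-plugin DaemonSet already tolerates this taint, so it
# can still run on the node and expose its GPUs. With the
# ExtendedResourceToleration admission controller enabled, a pod only
# needs the GPU resource request; the toleration shown at the bottom is
# injected automatically and does not have to be written by hand.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                      # illustrative name
spec:
  containers:
    - name: cuda-test
      image: nvidia/cuda:11.0-base    # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"         # extended resource request
  # Injected by the ExtendedResourceToleration admission controller:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```

Because the injection only happens for pods requesting the extended resource, CPU-only workloads never receive the toleration and thus never land on the GPU node.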
Ian Miller | Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂 | 19:05:42 |
Alexandre Brown | In reply to@_slack_kubeflow_U01FC4Y6QBB:matrix.org Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂 Hello Ian, that's a good question. From what I know there is already a component for KFServing; if you really meant KServe as in the 0.7 version, then not to my knowledge: https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml
See this sample: https://github.com/kubeflow/kfp-tekton/blob/master/samples/e2e-mnist/mnist.ipynb
You could also try writing a simple component that uses the KFServingClient (I have to figure out which way I prefer as well). See step 3 here: https://github.com/kubeflow/kfp-tekton/blob/master/samples/e2e-mnist/mnist.ipynb
Hopefully this helps,
Cheers | 19:28:26 |
Ian Miller | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Hello Ian, that's a good question, from what I know there is already a component for KFServing … Cheers Thanks for the response Alexandre! Yeah, we use a version of your first link today with our current KFServing deployment. Looking to do something similar while testing out KServe 0.7. I'll likely just make my own in the style of the existing KFServing one, but updated for the new CRDs and such. Thanks! | 19:32:52 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes. … What do you think? Alexandre Brown Benjamin Tan Awesome work on the investigations! Can we add an example for deploying models on GPU? | 23:20:46 |
Dan Sun | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Alexandre Brown Benjamin Tan Awesome work on the investigations! Can we add an example for deploying models on GPU? Add a section here for the GPU example | 23:21:54 |
Alexandre Brown | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Add a section here for the GPU example I will make sure to include all the required information tomorrow Dan Sun | 23:22:45 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org I will make sure to include all the required information tomorrow Dan Sun Thanks!! that would be really useful for a lot of the users | 23:24:03 |
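The GPU example being discussed would presumably look something along these lines (a hedged sketch only; the service name, model storage URI, and resource values are illustrative assumptions):

```yaml
# Sketch of a KFServing InferenceService requesting one GPU.
# With the nvidia.com/gpu=present:NoSchedule taint on the node and the
# ExtendedResourceToleration admission controller enabled, the GPU
# resource limit below is all that is needed; the matching toleration
# is injected into the predictor pod automatically.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample-gpu          # illustrative name
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-samples/models/tensorflow/flowers  # illustrative URI
      resources:
        limits:
          nvidia.com/gpu: "1"       # schedules the predictor onto a GPU node
```

Without the admission controller (or the built-in cloud behavior GKE provides), the toleration would have to be added to the predictor spec by hand, which is exactly the pitfall described earlier in the thread.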
27 Oct 2021 |
Benjamin Tan | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Thanks!! that would be really useful for a lot of the users Thanks for that detailed solution 💙♥️💛💛💚❤️. Totally not obvious at all | 00:32:13 |
iamlovingit | In reply to@_slack_kubeflow_U02JVFFP213:matrix.org iamlovingit Apologies, but I don't quite understand your question. Sorry for the unclear question; I mean: does the domain example.com correctly resolve to your ingress IP? | 01:08:34 |