
kubeflow-kfserving




26 Oct 2021
Benjamin Tan
In reply to Benjamin Tan
is it possible to remove the taints first?
on the node and in the yaml
15:41:24
Benjamin Tan
In reply to Benjamin Tan
on the node and in the yaml
then at least we know that the service can run on GPU
15:41:34
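For reference, removing a taint and confirming the node's GPU capacity is a kubectl one-liner; a minimal sketch, where the node name and the taint key `dedicated` are placeholders, not values from this thread:
```
# Remove the taint (the trailing "-" deletes it); node name and key are placeholders
kubectl taint nodes my-gpu-node dedicated:NoSchedule-

# Confirm the node advertises its GPUs again
kubectl describe node my-gpu-node | grep -A 6 "Capacity"
```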
Alexandre Brown
In reply to Benjamin Tan
then at least we know that the service can run on GPU
Yes it works on GPU without the taint
15:44:55
Benjamin Tan
In reply to Alexandre Brown
Yes it works on GPU without the taint
Great success!
15:45:29
Benjamin Tan
In reply to Benjamin Tan
Great success!
At least you got the hard part (GPU!) down
15:45:36
Alexandre Brown
In reply to Benjamin Tan
At least you got the hard part (GPU!) down
Yes!
15:46:19
Benjamin Tan
In reply to Alexandre Brown
Yes!
I guess you could try on a simple pod to see if the node tolerations/taints work
15:46:38
Benjamin Tan
In reply to Benjamin Tan
I guess you could try on a simple pod to see if the node tolerations/taints work
just to rule out that it's a KFServing thing (I don't think it is)
15:46:59
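A minimal test pod for this check might look like the following sketch; the taint key/value, pod name, and image are assumptions, not values from the thread:
```
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-toleration-test    # placeholder name
spec:
  restartPolicy: Never
  tolerations:
  - key: "dedicated"           # placeholder: use the custom taint key on your node
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # any CUDA-capable image works
    command: ["nvidia-smi"]        # prints the visible GPUs if scheduling worked
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```
If the pod lands on the tainted node and `nvidia-smi` succeeds, the taints/tolerations are fine and the problem lies elsewhere.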
Alexandre Brown
In reply to Benjamin Tan
just to rule out that it's a KFServing thing (I don't think it is)
Right, I'll try it in a notebook, good idea
15:47:18
Benjamin Tan
In reply to Alexandre Brown
Right, I'll try it in a notebook, good idea
Good luck! 😄
15:48:22
Alexandre Brown
In reply to Benjamin Tan
Good luck! 😄
Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes. I realized that `kubectl describe node` would not show any GPU resources after adding the taint back; removing the taint made them show up again. The issue was that the nvidia-device-plugin does not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. And since the nvidia-device-plugin could not be scheduled there, my "GPU" node could not expose its GPUs, making it a plain CPU node! The solution was to use the conventional ExtendedResourceToleration approach to my node, i.e. adding the taint "nvidia.com/gpu" with effect "NoSchedule". The nvidia-device-plugin tolerates this taint, and the ExtendedResourceToleration admission controller automatically adds the matching toleration to any pod that requests a GPU (pods that don't request a GPU don't get the toleration, which is awesome because it prevents CPU-only workloads from ever being scheduled on a GPU node). The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html And here is the commit that added the toleration to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7 It seems that GKE automatically adds the nvidia.com/gpu taint when a GPU node pool is added to a cluster that already has a non-GPU node pool. Source: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_pool This does not seem to be the case for AWS (I was using AWS in my tests). Dan Sun I suggest we document these findings as it can save people a lot of time. I can create a pull request for it, what do you think?
18:59:57
Alexandre Brown
(edited) ... <https://github.com/kubernetes/kubernetes/pull/55839|ExtendedResourceToleration> to my node, ... => ... <https://github.com/kubernetes/kubernetes/pull/55839|ExtendedResourceToleration> for my GPU node, ...
19:00:42
Alexandre Brown
(edited) ... taint "<http://nvidia.com/gpu|nvidia.com/gpu>" with effect equal to "NoSchedule". The *nvidia-device-plugin* tolerates this taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here : <https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html> And here is the commit that added the toleration to ... => ... taint `<http://nvidia.com/gpu|nvidia.com/gpu>` with effect equal to `NoSchedule`. The *nvidia-device-plugin* tolerates this taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here : <https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html> And here is the commit that added the toleration for the `<http://nvidia.com/gpu|nvidia.com/gpu>` taint to ...
19:02:03
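The approach described above boils down to two pieces: taint the GPU node with the extended-resource key, and let the ExtendedResourceToleration admission controller inject the matching toleration into GPU-requesting pods. A sketch, with a placeholder node name:
```
# Taint the GPU node so non-GPU workloads stay off it (node name is a placeholder)
kubectl taint nodes my-gpu-node nvidia.com/gpu=present:NoSchedule

# The admission controller must be enabled on the kube-apiserver
# (some managed platforms enable it for you):
#   --enable-admission-plugins=...,ExtendedResourceToleration

# The nvidia-device-plugin tolerates the taint, so the node keeps exposing GPUs
kubectl describe node my-gpu-node | grep nvidia.com/gpu
```
With this in place, a pod that requests `nvidia.com/gpu` gets the toleration added automatically at admission time, while a pod that doesn't is kept off the node by the taint.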
Ian Miller
Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂
19:05:42
Alexandre Brown
In reply to Ian Miller
Anyone created a KF Component for KServe yet? I know it isn't certified yet or anything. Just looking for a starting spot before building my own 🙂
Hello Ian, that's a good question. From what I know there is already a component for KFServing; if you really meant KServe as in the 0.7 version, then not to my knowledge: https://raw.githubusercontent.com/kubeflow/pipelines/master/components/kubeflow/kfserving/component.yaml See this sample: https://github.com/kubeflow/kfp-tekton/blob/master/samples/e2e-mnist/mnist.ipynb You could also try writing a simple component that uses the KFServingClient (I have to figure out which way I prefer as well); see step 3 of the same notebook. Hopefully this helps,
19:28:26
Alexandre Brown
(edited) ... this helps, => ... this helps, Cheers
19:28:34
Alexandre Brown
(edited) ... "1"``` Dan Sun I did as you suggested and enabled the knative flag in the config map for the toleration and node affinity The node with the affinity is a p3.8xlarge instance (aws) which has 4 GPUs available, I'm only requesting 1 tho. Any ... => ... "1"``` Any ...
19:29:48
Alexandre Brown
(edited) ... a GPU. I'm ... => ... a GPU *with affinity & toleration*. I'm ...
19:30:04
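The "knative flag in the config map" mentioned in the edit above refers to Knative Serving's feature flags for passing affinity and tolerations through to pods; enabling them might look like the following sketch (ConfigMap name and keys per Knative Serving's config-features):
```
# Enable pod-spec affinity and tolerations in Knative Serving's feature flags
kubectl patch configmap config-features -n knative-serving --type merge -p \
  '{"data":{"kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled"}}'
```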
Ian Miller
In reply to Alexandre Brown
Hello Ian, that's a good question. From what I know there is already a component for KFServing […]
Thanks for the response Alexandre! Yeah we use a version of your first link today with our current KFServing deployment. Looking to do something similar while testing out KServe 0.7. I'll likely just make my own in the style of the existing KFServing one but updated for the new CRDs and such. Thanks!
19:32:52
_slack_kubeflow_U02K07X2ALB joined the room.
19:59:36
Alexandre Brown
(edited) ... time. I can create a pull request for it, what do ... => ... time. What do ...
21:54:29
Alexandre Brown
(edited) ... time. What do ... => ... time. I am open to contribute if possible. What do ...
22:15:56
Alexandre Brown
(edited) ... so adding the taint `<http://nvidia.com/gpu|nvidia.com/gpu>` with effect equal to `NoSchedule`. The *nvidia-device-plugin* tolerates this taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here : <https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html> And here is the commit that added the toleration for the `<http://nvidia.com/gpu|nvidia.com/gpu>` taint to the nvidia-device-plugin : <https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7> It seems that GKE automatically adds the taint `<http://nvidia.com/gpu|nvidia.com/gpu>` to a GPU node when added to a non-GPU node pool. Sources: <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_pool> This does not seem to be the case for AWS (I was using AWS in my tests). Dan Sun I suggest we document these findings as it can save people a lot of time. I am open to contribute if possible. What do you think? => ... so by letting the device-plugin add the taint `<http://nvidia.com/gpu|nvidia.com/gpu>` with effect equal to `NoSchedule`. The *nvidia-device-plugin* tolerates this taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here : <https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html> And here is the commit that added the toleration for the `<http://nvidia.com/gpu|nvidia.com/gpu>` taint to the nvidia-device-plugin : <https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7>
23:20:04
Dan Sun
In reply to Alexandre Brown
Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes. […]
Alexandre Brown Benjamin Tan Awesome work on the investigation. Shall we add an example for deploying models on GPU?
23:20:46
Dan Sun
In reply to Dan Sun
Alexandre Brown Benjamin Tan Awesome work on the investigation. Shall we add an example for deploying models on GPU?
Add a section here for the GPU example
23:21:54
Alexandre Brown
In reply to Dan Sun
Add a section here for the GPU example
I will make sure to include all the required information tomorrow Dan Sun
23:22:45
Dan Sun
In reply to Alexandre Brown
I will make sure to include all the required information tomorrow Dan Sun
Thanks!! That would be really useful for a lot of users
23:24:03
27 Oct 2021
Benjamin Tan
In reply to Dan Sun
Thanks!! That would be really useful for a lot of users
Thanks for that detailed solution 💙♥️💛💛💚❤️. Totally not obvious at all
00:32:13
iamlovingit
In reply to @_slack_kubeflow_U02JVFFP213:matrix.org
iamlovingit Apologies, but I don't quite understand your question.
Sorry for the unclear question; I mean whether the domain example.com can be correctly resolved to your ingress IP.
01:08:34
iamlovingit
(edited) ... ingress IP. => ... ingress IP?
01:08:54
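One quick way to check whether a domain resolves to the ingress IP; the service name and namespace below assume a default Istio install, which may differ from this user's setup:
```
# External IP of the ingress gateway (service/namespace may differ per install)
kubectl get svc istio-ingressgateway -n istio-system

# Compare with what the domain actually resolves to
nslookup example.com
```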


