27 Oct 2021 |
Mark Winter | In reply to@_slack_kubeflow_U9673D1KJ:matrix.org Ian Miller please update and contribute back - that would be great. Tommy Li owns the original one. There seems to be one in progress here, but it hasn't been updated in a couple of weeks https://github.com/kubeflow/pipelines/pull/6716 | 12:02:34 |
Jevgeni Martjušev | is there any piece of functionality, which is present in KServe, but not present in Seldon Core? Or more generally - anything which is done better/simpler in KServe vs Seldon Core? | 14:58:43 |
| Marcin Zabłocki joined the room. | 15:11:51 |
Marcin Zabłocki | In reply to@_slack_kubeflow_U01HS89M1U6:matrix.org is there any piece of functionality, which is present in KServe, but not present in Seldon Core? Or more generally - anything which is done better/simpler in KServe vs Seldon Core? One of the examples would be scale to zero | 15:11:51 |
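Scale to zero in KServe comes from its Knative-backed serverless mode; a minimal sketch of what enabling it looks like, assuming the public sklearn example model (the name and storage URI here are illustrative):

```yaml
# Hypothetical InferenceService: minReplicas: 0 lets Knative scale the
# predictor down to zero pods when no traffic arrives.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris               # placeholder name
spec:
  predictor:
    minReplicas: 0                 # enables scale to zero
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```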
Dan Sun | Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg | 16:03:18 |
| Paul Van Eck joined the room. | 16:08:07 |
Paul Van Eck | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in? | 16:08:07 |
Vedant Padwal | In reply to@_slack_kubeflow_U017QCZSQ48:matrix.org Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in? Dan Sun I'm also stuck. You should make it so that everybody can join | 16:10:48 |
| Sriharan Manogaran joined the room. | 16:12:46 |
Sriharan Manogaran | In reply to@_slack_kubeflow_U027LHY3610:matrix.org Dan Sun I'm also stuck. You should make it so that everybody can join Same here | 16:12:47 |
Tommy Li | In reply to@_slack_kubeflow_U01T25HRREK:matrix.org There seems to be one in progress here but it's not been updated in a couple weeks https://github.com/kubeflow/pipelines/pull/6716 We have an issue to track progress on the new KServe component. No one has taken it on full time yet, so if you want to work on it, feel free to assign it to yourself and I'm happy to help along the way.
https://github.com/kserve/kserve/issues/1829 | 16:58:10 |
| John Daciuk joined the room. | 19:09:08 |
John Daciuk | I’m wondering if anyone has insight into the trade offs between using a custom vs triton kserve model. | 19:10:53 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Making sure that the port is the right one too Benjamin Tan Here is what I did:
1. port-forward kubeflow to http://localhost:8080
2. kubectl apply -f monitoring-core.yaml
3. kubectl apply -f monitoring-metrics-prometheus.yaml
4. Create metrics-virtual-service.yaml with the content you provided from the doc and changed the port to 3000
5. kubectl apply -f metrics-virtual-service.yaml
6. Create an inference service (dummy from the doc)
No metrics tab is showing up 😕
(base) user@user-desktop:~/Documents/Kubeflow-install$ kubectl get pods -n knative-monitoring
NAME READY STATUS RESTARTS AGE
grafana-69d8d8dc47-h8p6r 1/1 Running 0 14m
kube-state-metrics-7d4df85595-ckmfm 1/1 Running 0 14m
node-exporter-dvhn4 2/2 Running 0 14m
node-exporter-ktr6d 2/2 Running 0 14m
node-exporter-tz6x4 2/2 Running 0 14m
node-exporter-wcd9d 2/2 Running 0 14m
prometheus-system-0 1/1 Running 0 14m
prometheus-system-1 1/1 Running 0 14m
Am I missing something? | 21:41:43 |
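The virtual service in step 4 can be sketched roughly as follows; this is an assumption based on the Knative monitoring setup of that era, and the gateway name and URI prefix are placeholders to adapt to your cluster:

```yaml
# Hypothetical VirtualService routing /grafana/ on the Kubeflow gateway
# to the Grafana service in knative-monitoring on port 3000.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: grafana
  namespace: knative-monitoring
spec:
  hosts:
  - "*"
  gateways:
  - kubeflow/kubeflow-gateway        # placeholder gateway name
  http:
  - match:
    - uri:
        prefix: /grafana/
    route:
    - destination:
        host: grafana.knative-monitoring.svc.cluster.local
        port:
          number: 3000               # Grafana container port
```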
Alexandre Brown | (edited) ... 14m```
kubectl describe pod grafana-69d8d8dc47-h8p6r -n knative-monitoring
...
```Containers:
  grafana:
    Container ID:   docker://b3f09c0071f9521211d6209a37da1c09a9b024e926f573b7b77033b1e22ced25
    Image:          grafana/grafana:6.3.3
    Image ID:       docker-pullable://grafana/grafana@sha256:926446fd803964b7aa57684a4a3a42c76eac8ecaf7ed8b80bad9013706496d88
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 27 Oct 2021 17:26:26 -0400
    Ready:          True```
...
I'm ... | 22:08:23 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes.
I realized that kubectl describe node would not show any GPU resources after adding the taint back; removing the taint made them show up again.
The issue was that the nvidia-device-plugin did not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. Since the nvidia-device-plugin could not be scheduled there, my "GPU" node could not expose its GPUs, making it a plain CPU node!
The solution was to use the conventional taint, nvidia.com/gpu. On AWS, your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration admission controller for GPU nodes (I was using 1.18, but it is only enabled under 1.19+, and it makes everything much easier). The nvidia-device-plugin tolerates the nvidia.com/gpu taint, and the admission controller automatically adds the toleration to pods that request a GPU (if a pod does not request a GPU, the toleration is not added, which is awesome because it prevents CPU-only workloads from ever being scheduled on a GPU node).
The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html
And here is the commit that added the toleration for the nvidia.com/gpu taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7
I will contribute to the kubeflow doc and include this information Dan Sun Should I create a PR for both the KServe website and for the Kubeflow website? I feel like this information can apply to KServe and the Notebook component as well, since you can set up notebooks with node affinity & tolerations.
Maybe we should add a section like "Using GPU with Kubeflow"?
Let me know what's best | 23:57:01 |
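A rough sketch of the setup described above (the node name, pod name, and image are placeholders): the GPU node carries the conventional nvidia.com/gpu NoSchedule taint, and a pod only needs the extended resource request; with the ExtendedResourceToleration admission controller enabled (EKS 1.19+), the matching toleration is injected automatically:

```yaml
# Taint the GPU node first, e.g.:
#   kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
#
# Hypothetical GPU pod: requesting nvidia.com/gpu is enough. The admission
# controller injects the toleration for the nvidia.com/gpu taint, while
# CPU-only pods (which lack the request) never land on the GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                   # placeholder
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # triggers the automatic toleration
```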
Alexandre Brown |
(edited) ... to KServe and the notebook component as well since you can setup notebook with node affinity & tolerations.
Maybe we should add a section like "Using GPU with Kubeflow" ? ... => ... to any model server under KServe (not just tensorflow) and also to the Notebook Server component as well since you can setup notebook with node affinity & tolerations and face the same issue.
Maybe we should add a section like "Using GPU with Kubeflow" to the kubeflow website ? ... | 23:58:15 |
28 Oct 2021 |
Dan Sun |
(edited) ... community is ... => ... community meeting is ... | 00:07:50 |
Alexandre Brown |
(edited) Dan Sun Should I create a PR for both the KServe website and for the Kubeflow Website ? I feel like this is information can apply to any model server under KServe (not just tensorflow) and ... => docs/modelserving/v1beta1/tensorflow/) and ... | 00:11:26 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Dan Sun Should I create a PR for both the KServe website and for the Kubeflow website? I feel like this information can apply to any model server under KServe (not just tensorflow, I'm saying this because the url is docs/modelserving/v1beta1/tensorflow/) and also to the Notebook Server component as well, since you can set up notebooks with node affinity & tolerations and face the same issue.
Maybe we should add a section like "Using GPU with Kubeflow" to the kubeflow website?
Let me know what's best Let's stick to the KServe website; we have not certified KServe with Kubeflow yet, which will hopefully land in Kubeflow 1.5. | 00:13:28 |
Dan Sun | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Let's stick to kserve website, we have not certified KServe with Kubeflow yet which would hopefully land in Kubeflow 1.5. We can create a separate tutorial example for deploying model on GPU for KServe | 00:14:08 |
Alexandre Brown | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org We can create a separate tutorial example for deploying model on GPU for KServe Ok great let's do as you say.
Do you think we could create a page that's all about GPU with KServe that would include the existing http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource
and the new information I would add ? | 00:16:33 |
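The linked autoscaling doc's GPU InferenceService boils down to adding a nvidia.com/gpu limit to the predictor; a rough sketch, where the name and model URI follow the public KServe flowers example (used here as an assumption):

```yaml
# Hypothetical InferenceService requesting one GPU for the predictor;
# the GPU limit is what makes the pod schedulable only on GPU nodes.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample-gpu         # placeholder name
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-examples/models/tensorflow/flowers
      resources:
        limits:
          nvidia.com/gpu: 1
```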
Alexandre Brown |
(edited) ... existing <http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource>
and ... => ... existing <https://kserve.github.io/website/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource>
and ... | 00:17:33 |
Alexandre Brown |
(edited) ... we could create ... => ... we should create ... | 00:17:52 |