kubeflow-kfserving
27 Oct 2021
Mark Winter
In reply to: Ian Miller please update and contribute back - that will be great. Tommy Li owns the original one
There seems to be one in progress here, but it hasn't been updated in a couple of weeks: https://github.com/kubeflow/pipelines/pull/6716
12:02:34
Jevgeni Martjušev Is there any piece of functionality that is present in KServe but not in Seldon Core? Or, more generally, anything that is done better or more simply in KServe vs Seldon Core? 14:58:43
Marcin Zabłocki joined the room. 15:11:51
Marcin Zabłocki
In reply to: Is there any piece of functionality that is present in KServe but not in Seldon Core? Or, more generally, anything that is done better or more simply in KServe vs Seldon Core?
One example would be scale to zero.
15:11:51
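
One concrete difference mentioned here is scale-to-zero, which KServe inherits from Knative Serving: setting minReplicas: 0 on a predictor lets an idle service scale all the way down. A minimal sketch (the service name and storageUri are illustrative, not from the thread):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                 # illustrative name
spec:
  predictor:
    minReplicas: 0                   # allow Knative to scale the predictor to zero when idle
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/iris  # illustrative model location
```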
Dan Sun Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg 16:03:18
Paul Van Eck joined the room. 16:08:07
Paul Van Eck
In reply to: Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg
Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in?
16:08:07
Vedant Padwal
In reply to: Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in?
Dan Sun I'm also stuck; you should make it so that everybody can join.
16:10:48
Sriharan Manogaran joined the room. 16:12:46
Sriharan Manogaran
In reply to: Dan Sun I'm also stuck; you should make it so that everybody can join.
Same here
16:12:47
Tommy Li
In reply to: There seems to be one in progress here, but it hasn't been updated in a couple of weeks: https://github.com/kubeflow/pipelines/pull/6716
We have an issue to track the progress of the new KServe component. Right now no one has taken this issue on full time. If you want to work on it, feel free to assign it to yourself, and I'm happy to help along the way. https://github.com/kserve/kserve/issues/1829
16:58:10
Alexandre Brown
(edited) ... conventional ExtendedResourceToleration (https://github.com/kubernetes/kubernetes/pull/55839) for my GPU node, so by letting the device-plugin add the taint `nvidia.com/gpu` with effect equal to `NoSchedule`. The ... => ... conventional taint, the `nvidia.com/gpu` taint. Your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration (https://github.com/kubernetes/kubernetes/pull/55839) for GPU nodes. The ...
19:05:57
John Daciuk joined the room. 19:09:08
John Daciuk I'm wondering if anyone has insight into the trade-offs between using a custom vs a Triton KServe model. 19:10:53
Alexandre Brown
(edited) ... taint. Your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration (https://github.com/kubernetes/kubernetes/pull/55839) for GPU nodes. The *nvidia-device-plugin* tolerates this taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html And here is the commit that added the toleration for the `nvidia.com/gpu` taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7 => ... taint. On AWS, your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration (https://github.com/kubernetes/kubernetes/pull/55839) for GPU nodes (I was using 1.18 but it is only enabled under 1.19+ and this makes everything much easier). The *nvidia-device-plugin* tolerates the `nvidia.com/gpu` taint and automatically adds a toleration to nodes with a node affinity of the matching node and requesting GPU (if not requesting GPU then it is not applied which is awesome because it prevents CPU only workload to ever be scheduled on a gpu node). The solution is explained in more details here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html And here is the commit that added the toleration for the `nvidia.com/gpu` taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7 I will contribute to the kubeflow doc and include these information
21:15:37
Alexandre Brown
In reply to: Making sure that the port is the right one too
Benjamin Tan Here is what I did:
1. Port-forward Kubeflow to http://localhost:8080
2. kubectl apply -f monitoring-core.yaml
3. kubectl apply -f monitoring-metrics-prometheus.yaml
4. Create metrics-virtual-service.yaml with the content you provided from the doc, changing the port to 3000
5. kubectl apply -f metrics-virtual-service.yaml
6. Create an InferenceService (the dummy one from the doc)
No metrics tab is showing up 😕
(base) user@user-desktop:~/Documents/Kubeflow-install$ kubectl get pods -n knative-monitoring
NAME                                  READY   STATUS    RESTARTS   AGE
grafana-69d8d8dc47-h8p6r              1/1     Running   0          14m
kube-state-metrics-7d4df85595-ckmfm   1/1     Running   0          14m
node-exporter-dvhn4                   2/2     Running   0          14m
node-exporter-ktr6d                   2/2     Running   0          14m
node-exporter-tz6x4                   2/2     Running   0          14m
node-exporter-wcd9d                   2/2     Running   0          14m
prometheus-system-0                   1/1     Running   0          14m
prometheus-system-1                   1/1     Running   0          14m
Am I missing something?
21:41:43
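
For reference, step 4 above creates an Istio VirtualService that exposes Grafana through the Kubeflow gateway on container port 3000 (the port shown in the pod description below). The actual file contents are not in the thread; this is a sketch of what metrics-virtual-service.yaml roughly looks like, assuming the standard kubeflow-gateway and the knative-monitoring namespace, with an illustrative URI prefix:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: grafana
  namespace: knative-monitoring
spec:
  hosts:
  - "*"
  gateways:
  - kubeflow/kubeflow-gateway        # assumed gateway name
  http:
  - match:
    - uri:
        prefix: /grafana/            # illustrative route prefix
    route:
    - destination:
        host: grafana.knative-monitoring.svc.cluster.local
        port:
          number: 3000               # Grafana's container port, per the thread
```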
Alexandre Brown
(edited) ... 14m``` I'm ... => ... 14m``` ```Containers: grafana: Container ID: docker://b3f09c0071f9521211d6209a37da1c09a9b024e926f573b7b77033b1e22ced25 Image: grafana/grafana:6.3.3 Image ID: docker-pullable://grafana/grafana@sha256:926446fd803964b7aa57684a4a3a42c76eac8ecaf7ed8b80bad9013706496d88 Port: 3000/TCP Host Port: 0/TCP State: Running Started: Wed, 27 Oct 2021 17:26:26 -0400 Ready: True``` I'm ...
22:07:46
Alexandre Brown
(edited) ... 14m``` ```Containers: grafana: Container ID: docker://b3f09c0071f9521211d6209a37da1c09a9b024e926f573b7b77033b1e22ced25 Image: grafana/grafana:6.3.3 Image ID: docker-pullable://grafana/grafana@sha256:926446fd803964b7aa57684a4a3a42c76eac8ecaf7ed8b80bad9013706496d88 Port: 3000/TCP Host Port: 0/TCP State: Running Started: Wed, 27 Oct 2021 17:26:26 -0400 Ready: True``` I'm ... => ... 14m``` kubectl describe pod grafana-69d8d8dc47-h8p6r -n knative-monitoring ... ```Containers: grafana: Container ID: docker://b3f09c0071f9521211d6209a37da1c09a9b024e926f573b7b77033b1e22ced25 Image: grafana/grafana:6.3.3 Image ID: docker-pullable://grafana/grafana@sha256:926446fd803964b7aa57684a4a3a42c76eac8ecaf7ed8b80bad9013706496d88 Port: 3000/TCP Host Port: 0/TCP State: Running Started: Wed, 27 Oct 2021 17:26:26 -0400 Ready: True``` ... I'm ...
22:08:23
Alexandre Brown
In reply to: Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes. I realized that kubectl describe node would not show any GPU resources after adding the taint back; removing the taint made them show up again. The issue was that the nvidia-device-plugin does not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. Since the nvidia-device-plugin could not be scheduled on the node, my "GPU" node could not expose its GPUs, making it a plain CPU node! The solution was to use the conventional taint, nvidia.com/gpu. On AWS, your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration admission controller for GPU nodes (I was using 1.18, but it is only enabled on 1.19+, and it makes everything much easier). The nvidia-device-plugin tolerates the nvidia.com/gpu taint, and the admission controller automatically adds a matching toleration to pods that request a GPU (if a pod does not request a GPU the toleration is not applied, which is great because it prevents CPU-only workloads from ever being scheduled on a GPU node). The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html And here is the commit that added the toleration for the nvidia.com/gpu taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7 I will contribute to the Kubeflow docs and include this information
Dan Sun Should I create a PR for both the KServe website and the Kubeflow website? I feel like this information can apply to KServe and the Notebook component as well, since you can set up notebooks with node affinity & tolerations. Maybe we should add a section like "Using GPUs with Kubeflow"? Let me know what's best
23:57:01
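
For context, the fix above relies on two pieces: the nvidia-device-plugin DaemonSet carries a toleration for the nvidia.com/gpu taint (that is what the linked commit added), and on EKS 1.19+ the ExtendedResourceToleration admission controller injects the same toleration into any pod that requests the nvidia.com/gpu resource. A minimal sketch of a workload under that setup (the node, pod, and image names are illustrative, not from the thread):

```yaml
# Taint the GPU node so nothing without a matching toleration lands on it:
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
#
# A pod that requests a GPU. Because it requests nvidia.com/gpu, the
# ExtendedResourceToleration admission controller auto-injects:
#   tolerations:
#   - key: nvidia.com/gpu
#     operator: Exists
#     effect: NoSchedule
# Pods that do not request a GPU get no such toleration and stay off the node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base  # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # the request that triggers the injected toleration
```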
Alexandre Brown
(edited) ... to KServe and the notebook component as well since you can setup notebook with node affinity & tolerations. Maybe we should add a section like "Using GPU with Kubeflow" ? ... => ... to any model server under KServe (not just tensorflow) and also to the Notebook Server component as well since you can setup notebook with node affinity & tolerations and face the same issue. Maybe we should add a section like "Using GPU with Kubeflow" to the kubeflow website ? ...
23:58:15
28 Oct 2021
Dan Sun
(edited) ... community is ... => ... community meeting is ...
00:07:50
Alexandre Brown
(edited) Dan Sun Should I create a PR for both the KServe website and for the Kubeflow Website ? I feel like this is information can apply to any model server under KServe (not just tensorflow) and ... => docs/modelserving/v1beta1/tensorflow/) and ...
00:11:26
Dan Sun
In reply to: Dan Sun Should I create a PR for both the KServe website and the Kubeflow website? I feel like this information can apply to any model server under KServe (not just TensorFlow; I'm saying this because the URL is docs/modelserving/v1beta1/tensorflow/) and also to the Notebook Server component, since you can set up notebooks with node affinity & tolerations and face the same issue. Maybe we should add a section like "Using GPUs with Kubeflow" to the Kubeflow website? Let me know what's best
Let's stick to the KServe website; we have not certified KServe with Kubeflow yet, which will hopefully land in Kubeflow 1.5.
00:13:28
Dan Sun
In reply to: Let's stick to the KServe website; we have not certified KServe with Kubeflow yet, which will hopefully land in Kubeflow 1.5.
We can create a separate tutorial example for deploying a model on GPU with KServe
00:14:08
Alexandre Brown
In reply to: We can create a separate tutorial example for deploying a model on GPU with KServe
OK great, let's do as you say. Do you think we could create a page that's all about GPUs with KServe, one that would include the existing http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource and the new information I would add?
00:16:33
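
The autoscaling page linked above attaches a GPU to an InferenceService via a resource limit; pairing that with the nvidia.com/gpu taint discussed earlier keeps such services on GPU nodes only. A minimal sketch in the spirit of that doc (the name and storageUri are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample-gpu           # illustrative name
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-examples/models/tensorflow/flowers  # illustrative model
      resources:
        limits:
          nvidia.com/gpu: 1          # lands on a GPU node; toleration auto-injected on EKS 1.19+
```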
Alexandre Brown
(edited) ... existing http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource and ... => ... existing https://kserve.github.io/website/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource and ...
00:17:33
Alexandre Brown
(edited) ... we could create ... => ... we should create ...
00:17:52
