27 Oct 2021 |
Mark Winter | In reply to@_slack_kubeflow_U9673D1KJ:matrix.org Ian Miller please update and contribute back - that would be great. Tommy Li owns the original one. There seems to be one in progress here, but it hasn't been updated in a couple of weeks https://github.com/kubeflow/pipelines/pull/6716 | 12:02:34 |
Jevgeni Martjušev | is there any piece of functionality, which is present in KServe, but not present in Seldon Core? Or more generally - anything which is done better/simpler in KServe vs Seldon Core? | 14:58:43 |
| Marcin Zabłocki joined the room. | 15:11:51 |
Marcin Zabłocki | In reply to@_slack_kubeflow_U01HS89M1U6:matrix.org is there any piece of functionality, which is present in KServe, but not present in Seldon Core? Or more generally - anything which is done better/simpler in KServe vs Seldon Core? One of the examples would be scale to zero | 15:11:51 |
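Scale to zero in KServe comes from its Knative-backed serverless mode; a minimal sketch of what enabling it looks like, assuming the public sklearn example model (the name and storage URI here are illustrative):

```yaml
# Hypothetical InferenceService: minReplicas: 0 lets Knative scale the
# predictor down to zero pods when no traffic arrives.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris               # placeholder name
spec:
  predictor:
    minReplicas: 0                 # enables scale to zero
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```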
Dan Sun | Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg | 16:03:18 |
| Paul Van Eck joined the room. | 16:08:07 |
Paul Van Eck | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Hello everyone, KServe community is happening in meet.google.com/mue-gtwz-fhg Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in? | 16:08:07 |
Vedant Padwal | In reply to@_slack_kubeflow_U017QCZSQ48:matrix.org Dan Sun Hey, I'm stuck in the "Asking to join..." lobby for the meeting. Are you able to let people in? Dan Sun I'm also stuck. You should make it so that everybody can join | 16:10:48 |
| Sriharan Manogaran joined the room. | 16:12:46 |
Sriharan Manogaran | In reply to@_slack_kubeflow_U027LHY3610:matrix.org Dan Sun I'm also stuck. You should make it so that everybody can join Same here | 16:12:47 |
Tommy Li | In reply to@_slack_kubeflow_U01T25HRREK:matrix.org There seems to be one in progress here but it's not been updated in a couple weeks https://github.com/kubeflow/pipelines/pull/6716 We have an issue to track progress on the new KServe component. No one has taken it on full time yet, so if you want to work on it, feel free to assign it to yourself and I'm happy to help along the way.
https://github.com/kserve/kserve/issues/1829 | 16:58:10 |
| John Daciuk joined the room. | 19:09:08 |
John Daciuk | I’m wondering if anyone has insight into the trade offs between using a custom vs triton kserve model. | 19:10:53 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Making sure that the port is the right one too Benjamin Tan Here is what I did:
1. port-forward kubeflow to http://localhost:8080
2. kubectl apply -f monitoring-core.yaml
3. kubectl apply -f monitoring-metrics-prometheus.yaml
4. Create metrics-virtual-service.yaml with the content you provided from the doc and changed the port to 3000
5. kubectl apply -f metrics-virtual-service.yaml
6. Create an inference service (dummy from the doc)
No metrics tab is showing up 😕
(base) user@user-desktop:~/Documents/Kubeflow-install$ kubectl get pods -n knative-monitoring
NAME READY STATUS RESTARTS AGE
grafana-69d8d8dc47-h8p6r 1/1 Running 0 14m
kube-state-metrics-7d4df85595-ckmfm 1/1 Running 0 14m
node-exporter-dvhn4 2/2 Running 0 14m
node-exporter-ktr6d 2/2 Running 0 14m
node-exporter-tz6x4 2/2 Running 0 14m
node-exporter-wcd9d 2/2 Running 0 14m
prometheus-system-0 1/1 Running 0 14m
prometheus-system-1 1/1 Running 0 14m
Am I missing something? | 21:41:43 |
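The virtual service in step 4 can be sketched roughly as follows; this is an assumption based on the Knative monitoring setup of that era, and the gateway name and URI prefix are placeholders to adapt to your cluster:

```yaml
# Hypothetical VirtualService routing /grafana/ on the Kubeflow gateway
# to the Grafana service in knative-monitoring on port 3000.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: grafana
  namespace: knative-monitoring
spec:
  hosts:
  - "*"
  gateways:
  - kubeflow/kubeflow-gateway        # placeholder gateway name
  http:
  - match:
    - uri:
        prefix: /grafana/
    route:
    - destination:
        host: grafana.knative-monitoring.svc.cluster.local
        port:
          number: 3000               # Grafana container port
```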
Alexandre Brown | (edited) ... 14m```
kubectl describe pod grafana-69d8d8dc47-h8p6r -n knative-monitoring
...
```Containers:
  grafana:
    Container ID:   docker://b3f09c0071f9521211d6209a37da1c09a9b024e926f573b7b77033b1e22ced25
    Image:          grafana/grafana:6.3.3
    Image ID:       docker-pullable://grafana/grafana@sha256:926446fd803964b7aa57684a4a3a42c76eac8ecaf7ed8b80bad9013706496d88
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 27 Oct 2021 17:26:26 -0400
    Ready:          True```
...
I'm ... | 22:08:23 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Benjamin Tan After a few hours of digging I was finally able to get the taint and toleration to work with GPU nodes.
I realized that kubectl describe node would not show any GPU resources after adding the taint back; removing the taint made them show up again.
The issue was that the nvidia-device-plugin did not have a toleration for my custom taint and therefore could not be scheduled on my GPU node. Since the nvidia-device-plugin could not be scheduled there, my "GPU" node could not expose its GPUs, making it a plain CPU node!
The solution was to use the conventional taint, nvidia.com/gpu. On AWS, your cluster must use EKS 1.19+ to take advantage of the ExtendedResourceToleration admission controller for GPU nodes (I was using 1.18, but it is only enabled under 1.19+, and it makes everything much easier). The nvidia-device-plugin tolerates the nvidia.com/gpu taint, and the admission controller automatically adds the toleration to pods that request a GPU (if a pod does not request a GPU, the toleration is not added, which is awesome because it prevents CPU-only workloads from ever being scheduled on a GPU node).
The solution is explained in more detail here: https://notes.rohitagarwal.org/2017/12/17/dedicated-node-pools-and-ExtendedResourceToleration-admission-controller.html
And here is the commit that added the toleration for the nvidia.com/gpu taint to the nvidia-device-plugin: https://github.com/NVIDIA/k8s-device-plugin/commit/2d569648dac03252088b67f6333cb9df7c4059a7
I will contribute to the kubeflow doc and include this information Dan Sun Should I create a PR for both the KServe website and for the Kubeflow website? I feel like this information can apply to KServe and the Notebook component as well, since you can set up notebooks with node affinity & tolerations.
Maybe we should add a section like "Using GPU with Kubeflow"?
Let me know what's best | 23:57:01 |
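A rough sketch of the setup described above (the node name, pod name, and image are placeholders): the GPU node carries the conventional nvidia.com/gpu NoSchedule taint, and a pod only needs the extended resource request; with the ExtendedResourceToleration admission controller enabled (EKS 1.19+), the matching toleration is injected automatically:

```yaml
# Taint the GPU node first, e.g.:
#   kubectl taint nodes <gpu-node-name> nvidia.com/gpu=present:NoSchedule
#
# Hypothetical GPU pod: requesting nvidia.com/gpu is enough. The admission
# controller injects the toleration for the nvidia.com/gpu taint, while
# CPU-only pods (which lack the request) never land on the GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                   # placeholder
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # triggers the automatic toleration
```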
Alexandre Brown |
(edited) ... to KServe and the notebook component as well since you can setup notebook with node affinity & tolerations.
Maybe we should add a section like "Using GPU with Kubeflow" ? ... => ... to any model server under KServe (not just tensorflow) and also to the Notebook Server component as well since you can setup notebook with node affinity & tolerations and face the same issue.
Maybe we should add a section like "Using GPU with Kubeflow" to the kubeflow website ? ... | 23:58:15 |
28 Oct 2021 |
Dan Sun |
(edited) ... community is ... => ... community meeting is ... | 00:07:50 |
Alexandre Brown |
(edited) Dan Sun Should I create a PR for both the KServe website and for the Kubeflow Website ? I feel like this is information can apply to any model server under KServe (not just tensorflow) and ... => docs/modelserving/v1beta1/tensorflow/) and ... | 00:11:26 |
Dan Sun | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Dan Sun Should I create a PR for both the KServe website and for the Kubeflow website? I feel like this information can apply to any model server under KServe (not just tensorflow, I'm saying this because the url is docs/modelserving/v1beta1/tensorflow/) and also to the Notebook Server component as well, since you can set up notebooks with node affinity & tolerations and face the same issue.
Maybe we should add a section like "Using GPU with Kubeflow" to the kubeflow website?
Let me know what's best Let's stick to the KServe website; we have not certified KServe with Kubeflow yet, which will hopefully land in Kubeflow 1.5. | 00:13:28 |
Dan Sun | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org Let's stick to kserve website, we have not certified KServe with Kubeflow yet which would hopefully land in Kubeflow 1.5. We can create a separate tutorial example for deploying model on GPU for KServe | 00:14:08 |
Alexandre Brown | In reply to@_slack_kubeflow_UFVUV2UFP:matrix.org We can create a separate tutorial example for deploying model on GPU for KServe Ok great let's do as you say.
Do you think we could create a page that's all about GPU with KServe that would include the existing http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource
and the new information I would add ? | 00:16:33 |
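The linked autoscaling doc's GPU InferenceService boils down to adding a nvidia.com/gpu limit to the predictor; a rough sketch, where the name and model URI follow the public KServe flowers example (used here as an assumption):

```yaml
# Hypothetical InferenceService requesting one GPU for the predictor;
# the GPU limit is what makes the pod schedulable only on GPU nodes.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: flowers-sample-gpu         # placeholder name
spec:
  predictor:
    tensorflow:
      storageUri: gs://kfserving-examples/models/tensorflow/flowers
      resources:
        limits:
          nvidia.com/gpu: 1
```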
Alexandre Brown |
(edited) ... existing <http://127.0.0.1:8000/docs/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource>
and ... => ... existing <https://kserve.github.io/website/modelserving/autoscaling/autoscaling/#create-the-inferenceservice-with-gpu-resource>
and ... | 00:17:33 |
Alexandre Brown |
(edited) ... we could create ... => ... we should create ... | 00:17:52 |