!LuUSGaeArTeoOgUpwk:matrix.org

kubeflow-kfserving

433 Members
2 Servers

Sender  Message  Time
26 Oct 2021
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
  Normal  Scheduled  110s  default-scheduler  Successfully assigned kube-system/nvidia-device-plugin-daemonset-rbvh9 to ip-192-168-175-237.ec2.internal
  Normal  Pulling    109s  kubelet            Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
  Normal  Pulled     104s  kubelet            Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
  Normal  Created    102s  kubelet            Created container nvidia-device-plugin-ctr
  Normal  Started    101s  kubelet            Started container nvidia-device-plugin-ctr
what about the logs themselves?
15:17:20
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
what about the logs themselves?
so kubectl logs <nvidia-pod-name> -n <nvidia-pod-namespace>
15:17:41
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
so kubectl logs <nvidia-pod-name> -n <nvidia-pod-namespace>
2021/10/26 15:11:56 Loading NVML
2021/10/26 15:11:56 Starting FS watcher.
2021/10/26 15:11:56 Starting OS watcher.
2021/10/26 15:11:56 Retreiving plugins.
2021/10/26 15:11:57 Starting GRPC server for 'nvidia.com/gpu'
2021/10/26 15:11:57 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/10/26 15:11:57 Registered device plugin for 'nvidia.com/gpu' with Kubelet
15:21:13
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
excellent this is good
15:21:27
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
excellent this is good
But wait I think I got something
15:21:50
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
But wait I think I got something
if you do a kubectl describe no, see if you can find the GPU
15:21:52
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
if you do a kubectl describe no, see if you can find the GPU
Yes I see it!
15:22:42
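For anyone repeating this check, here is a minimal Python sketch of the same idea: instead of reading through kubectl describe output by eye, it pulls the allocatable nvidia.com/gpu count per node out of kubectl get nodes -o json output. The function name, the second node name and the CPU figures in the sample are assumptions for illustration, not output from this cluster.

```python
import json

# Resource name the device plugin registers with the kubelet (see the plugin log above).
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable(nodes_json):
    """Return {node name: allocatable GPU count} from `kubectl get nodes -o json` output."""
    counts = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes where no GPU plugin has registered simply lack the key; treat that as 0.
        counts[node["metadata"]["name"]] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


# Hypothetical, heavily trimmed sample; real output carries many more fields.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": "ip-192-168-175-237.ec2.internal"},
         "status": {"allocatable": {"cpu": "7910m", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "example-cpu-node"},
         "status": {"allocatable": {"cpu": "3920m"}}},
    ]
})

print(gpu_allocatable(SAMPLE_NODES))
```

Against a live cluster the input would be the text printed by kubectl get nodes -o json; a node reporting 0 here would be one where the device plugin has not registered a GPU.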
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
Yes I see it!
But I think re-applying the nvidia-plugin fixed it. Let me try to create the inference service
15:23:02
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
But I think re-applying the nvidia-plugin fixed it. Let me try to create the inference service
Because after running kubectl describe on the nvidia plugin pod, we see it is assigned to a specific node IP. In my case it was running on a node that I had to delete because of zone availability, so maybe I had to re-apply the plugin with the new node alive.
15:24:20
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
Because after running kubectl describe on the nvidia plugin pod, we see it is assigned to a specific node IP. In my case it was running on a node that I had to delete because of zone availability, so maybe I had to re-apply the plugin with the new node alive.
Let me confirm or refute this
15:24:26
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
Let me confirm or refute this
Yeah, I ran into this too. Sometimes deleting the daemonset and starting over helps, like in your case.
15:25:58
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
Yeah, I ran into this too. Sometimes deleting the daemonset and starting over helps, like in your case.
Benjamin Tan, I can create the inference service if I delete the taint, but with the taint I get the error
15:26:00
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
Benjamin Tan, I can create the inference service if I delete the taint, but with the taint I get the error
0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate, 3 Insufficient nvidia.com/gpu.
15:26:12
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate, 3 Insufficient nvidia.com/gpu.
interesting
15:26:19
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
interesting
this is insufficient CPU
15:26:22
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
this is insufficient CPU
not GPU
15:26:24
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
not GPU
Ah wait I saw the rest
15:26:34
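The scheduler message quoted above is easy to misread because it packs several reasons into one line. A minimal Python sketch, assuming the message keeps the "count reason, count reason." shape seen here, splits it into per-reason counts. The function name and the regular expression are illustrative assumptions, not part of any Kubernetes tooling.

```python
import re

# The scheduler message exactly as pasted in the chat.
MESSAGE = (
    "0/4 nodes are available: 1 Insufficient cpu, "
    "1 node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate, "
    "3 Insufficient nvidia.com/gpu."
)


def parse_unschedulable(message):
    """Split a scheduling failure message into (available, total, [(count, reason), ...])."""
    header, _, body = message.partition(": ")
    match = re.match(r"(\d+)/(\d+) nodes are available", header)
    if match is None:
        raise ValueError("unexpected message format")
    # A reason runs until the next ", <count> " or the end of the message,
    # so the comma inside the taint reason does not split it.
    reasons = [
        (int(count), reason)
        for count, reason in re.findall(r"(\d+) (.+?)(?=, \d+ |\.?$)", body)
    ]
    return int(match.group(1)), int(match.group(2)), reasons


available, total, reasons = parse_unschedulable(MESSAGE)
for count, reason in reasons:
    print(f"{count} of {total} nodes: {reason}")
```

The three counts add up to 5 across 4 nodes, which suggests a single node can be counted under more than one reason: only one node is short on CPU, one is blocked by the taint, and three have no free nvidia.com/gpu.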
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
Ah wait I saw the rest
we're getting close!
15:27:34
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
we're getting close!
indeed
15:27:45
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
indeed
gimme a sec lemme eyeball the yaml a little bit
15:27:54
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
gimme a sec lemme eyeball the yaml a little bit
It's probably an issue with the tolerations definition or something
15:29:30
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
It's probably an issue with the tolerations definition or something
Yeah it looks legit at first glance
15:31:16
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
Yeah it looks legit at first glance
(base) user@user-desktop:~/Documents/Kubeflow-install$ kubectl get nodes -o json | jq '.items[].spec'
{
  "providerID": "aws:///us-east-1b/i-0cd69fa5348fb062d",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "train-gpu-taint",
      "value": "true"
    }
  ]
}
{
  "providerID": "aws:///us-east-1c/i-09586cfcd8e5da983",
  "taints": [
    {
      "effect": "NoSchedule",
      "key": "notebook-cpu-taint",
      "value": "true"
    }
  ]
}
15:31:19
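Since the suspicion at this point is the tolerations definition, here is a minimal Python sketch of how a toleration is compared with a taint, following the documented Kubernetes matching rules (effect, then key, then operator and value). The taint is the one from the node spec above; the toleration dicts are hypothetical examples, because the actual inference service YAML is not shown in this log.

```python
# Taint copied from the `kubectl get nodes` output above.
GPU_TAINT = {"effect": "NoSchedule", "key": "train-gpu-taint", "value": "true"}


def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint.get("effect", ""):
        return False
    key = toleration.get("key", "")
    # An empty key (only meaningful with operator Exists) matches every taint key.
    if key and key != taint.get("key", ""):
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


# Hypothetical tolerations for illustration only.
equal_toleration = {"key": "train-gpu-taint", "operator": "Equal",
                    "value": "true", "effect": "NoSchedule"}
exists_toleration = {"key": "train-gpu-taint", "operator": "Exists"}
wrong_key_toleration = {"key": "notebook-cpu-taint", "operator": "Equal",
                        "value": "true", "effect": "NoSchedule"}

for name, toleration in [("equal", equal_toleration),
                         ("exists", exists_toleration),
                         ("wrong key", wrong_key_toleration)]:
    print(name, tolerates(toleration, GPU_TAINT))
```

If a toleration like the first two is present in the YAML but the pod still reports the taint as untolerated, one possibility worth checking is whether the toleration actually ends up on the pod spec that the inference service generates.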
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
Does it work without the nodeAffinity block?
15:33:57
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
Does it work without the nodeAffinity block?
Oh and make sure that the inference service is using the GPU version (I forgot which one is the default)
15:36:41
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
Oh and make sure that the inference service is using the GPU version (I forgot which one is the default)
No, removing the nodeAffinity didn't help
15:37:32
@_slack_kubeflow_U02AYBVSLSK:matrix.org Alexandre Brown
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
No, removing the nodeAffinity didn't help
The default seems to be the GPU image
15:38:10
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_U02AYBVSLSK:matrix.org
The default seems to be the GPU image
oh cool
15:41:00
@_slack_kubeflow_UM56LA7N3:matrix.org Benjamin Tan
In reply to @_slack_kubeflow_UM56LA7N3:matrix.org
oh cool
is it possible to remove the taints first?
15:41:12
