26 Oct 2021 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org
Normal Scheduled 110s default-scheduler Successfully assigned kube-system/nvidia-device-plugin-daemonset-rbvh9 to ip-192-168-175-237.ec2.internal
Normal Pulling 109s kubelet Pulling image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
Normal Pulled 104s kubelet Successfully pulled image "nvcr.io/nvidia/k8s-device-plugin:v0.9.0"
Normal Created 102s kubelet Created container nvidia-device-plugin-ctr
Normal Started 101s kubelet Started container nvidia-device-plugin-ctr what about the logs itself | 15:17:20 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org what about the logs itself so kubectl logs <nvidia-pod-name> -n <nvidia-pod-namespace> | 15:17:41 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org so kubectl logs <nvidia-pod-name> -n <nvidia-pod-namespace> 2021/10/26 15:11:56 Loading NVML
2021/10/26 15:11:56 Starting FS watcher.
2021/10/26 15:11:56 Starting OS watcher.
2021/10/26 15:11:56 Retreiving plugins.
2021/10/26 15:11:57 Starting GRPC server for 'nvidia.com/gpu'
2021/10/26 15:11:57 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2021/10/26 15:11:57 Registered device plugin for 'nvidia.com/gpu' with Kubelet | 15:21:13 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org (plugin logs quoted above) excellent, this is good | 15:21:27 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org excellent this is good But wait I think I got something | 15:21:50 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org But wait I think I got something if you do a kubectl describe no see if you can find the gpu | 15:21:52 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org if you do a kubectl describe no see if you can find the gpu Yes I see it! | 15:22:42 |
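(For readers following the thread: checking whether the device plugin has advertised the GPU resource on a node might look like the sketch below; `<node-name>` is a placeholder, substitute a real node from `kubectl get nodes`.)

```shell
# Sketch: inspect a node's resource capacity for the GPU resource.
# A node with a registered NVIDIA device plugin should list a line like
#   nvidia.com/gpu:  1
# under both the Capacity and Allocatable sections.
kubectl describe node <node-name> | grep -A 6 "Capacity"
```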
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Yes I see it! But I think that by re-applying the nvidia-plugin it fixed it, let me try to create the inference service | 15:23:02 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org But I think that by re-applying the nvidia-plugin it fixed it, let me try to create the inference service Because after doing kubectl describe to the nvidia plugin pod we see it is installed for a specific node ip and since in my case it was installed on a node that I had to delete because of zone availability, maybe I had to re-apply the plugin with the new node alive. | 15:24:20 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Because after doing kubectl describe to the nvidia plugin pod we see it is installed for a specific node ip and since in my case it was installed on a node that I had to delete because of zone availability, maybe I had to re-apply the plugin with the new node alive. Let me confirm or refute this | 15:24:26 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Let me confirm or refute this Yeah, I hit this too; sometimes deleting the daemonset and starting over helps, like in your case. | 15:25:58 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Yeah, I hit this too; sometimes deleting the daemonset and starting over helps, like in your case. Benjamin Tan I can create the inference service if I delete the taint, but with the taint I get the error | 15:26:00 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org Benjamin Tan I can create the inference service if I delete the taint, but with the taint I get the error 0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate, 3 Insufficient nvidia.com/gpu. | 15:26:12 |
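(The "node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate" part of that error means the pod spec needs a matching toleration. A minimal sketch, assuming the taint is train-gpu-taint=true with effect NoSchedule as the error reports; in a KServe InferenceService this would typically sit under the predictor's pod-spec fields.)

```yaml
# Sketch: a toleration matching the taint named in the scheduler error.
# Key, value, and effect are taken from the error message above.
tolerations:
  - key: "train-gpu-taint"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```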
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org 0/4 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {train-gpu-taint: true}, that the pod didn't tolerate, 3 Insufficient nvidia.com/gpu. interesting | 15:26:19 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org interesting this is insufficient CPU | 15:26:22 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org this is insufficient CPU not GPU | 15:26:24 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org not GPU Ah wait I saw the rest | 15:26:34 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Ah wait I saw the rest we're getting close! | 15:27:34 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org we're getting close! indeed | 15:27:45 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org indeed gimme a sec lemme eyeball the yaml a little bit | 15:27:54 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org gimme a sec lemme eyeball the yaml a little bit It's probably an issue with the tolerations definition or something | 15:29:30 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org It's probably an issue with the tolerations definition or something Yeah it looks legit at first glance | 15:31:16 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Yeah it looks legit at first glance (base) user@user-desktop:~/Documents/Kubeflow-install$ kubectl get nodes -o json | jq '.items[].spec'
{
"providerID": "aws:///us-east-1b/i-0cd69fa5348fb062d",
"taints": [
{
"effect": "NoSchedule",
"key": "train-gpu-taint",
"value": "true"
}
]
}
{
"providerID": "aws:///us-east-1c/i-09586cfcd8e5da983",
"taints": [
{
"effect": "NoSchedule",
"key": "notebook-cpu-taint",
"value": "true"
}
]
} | 15:31:19 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org (node taints quoted above) Does it work without the nodeAffinity block? | 15:33:57 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Does it work without the nodeAffinity block? Oh, and make sure that the inference service is using the GPU version (I forget which one is the default) | 15:36:41 |
Alexandre Brown | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org Oh, and make sure that the inference service is using the GPU version (I forget which one is the default) No, removing the nodeAffinity didn't help | 15:37:32 |
Alexandre Brown | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org No, removing the nodeAffinity didn't help The default seems to be the GPU image | 15:38:10 |
Benjamin Tan | In reply to@_slack_kubeflow_U02AYBVSLSK:matrix.org The default seems to be the GPU image oh cool | 15:41:00 |
Benjamin Tan | In reply to@_slack_kubeflow_UM56LA7N3:matrix.org oh cool is it possible to remove the taints first? | 15:41:12 |
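(Removing a taint uses kubectl's trailing-dash syntax; a sketch, with `<node-name>` as a placeholder for the tainted node from the earlier jq output.)

```shell
# Sketch: remove the NoSchedule taint with key train-gpu-taint from a node.
# The trailing "-" tells kubectl to delete the taint rather than add it;
# running the same command without the trailing "-" re-adds it.
kubectl taint nodes <node-name> train-gpu-taint=true:NoSchedule-
```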