NVIDIA GPU Operator on OpenShift 4

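With NVIDIA GPU Operator v1.9 on OpenShift 4.9.9 and later, installation no longer requires any additional entitlement configuration.

OpenShift 4.9.9 or later [1]

Thanks to the Driver Toolkit, the following steps are no longer required: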

Set up an entitlement
Mirror the RPM packages in a disconnected environment
Configure a proxy to access the package repository

[1] https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/steps-overview.html#entitlement-free-supported-versions
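Which of the two paths applies depends only on the cluster version, so a quick comparison against 4.9.9 can settle it. The following is a minimal sketch: the function name is_entitlement_free is hypothetical, and it assumes GNU sort with the -V (version sort) option is available on the bastion host.

```shell
# Hypothetical helper: succeeds (exit 0) when the given OpenShift version
# is 4.9.9 or newer, i.e. the entitlement-free path applies.
# Assumes GNU sort, which supports -V (version sort).
is_entitlement_free() {
  min="4.9.9"
  # After a version sort, the smaller version comes first. If that is
  # $min, then the version passed as $1 is greater than or equal to it.
  [ "$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)" = "$min" ]
}
```

One way to use it is to pass in the version reported by the cluster, for example the output of oc get clusterversion version -o jsonpath='{.status.desired.version}'.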

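OpenShift 4.9.8 and earlier [2]

Before installing the NVIDIA GPU Operator, you must manually obtain the OCP entitlement certificate and create a MachineConfig that entitles the OCP cluster, which extends the set of images the cluster is authorized to use:

Download the Red Hat OpenShift Container Platform subscription certificate from the Red Hat Customer Portal (enabling the entitlement requires logging in with your OCP credentials).
Create a MachineConfig that enables the subscription manager and provides a valid subscription certificate. Wait for the Machine Config Operator to reboot the nodes and finish applying the MachineConfig.
Verify that the updated entitlement works correctly on every node in the cluster.

Note: Installing the NVIDIA GPU Operator deploys several pods that manage and enable GPUs in OpenShift. Some of these pods need images that are not entitled by default under the Universal Base Image (UBI) that OpenShift uses. The entitlement must therefore be enabled in the OpenShift cluster before the NVIDIA GPU driver containers can run.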

[2] https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/cluster-entitlement.html#enabling-a-cluster-wide-entitlemenent

Install Operator - Node Feature Discovery

[lab-user@bastion ~]$ oc get no
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-135-51.us-east-2.compute.internal    Ready    worker   3h5m    v1.22.3+e790d7f
ip-10-0-142-219.us-east-2.compute.internal   Ready    master   3h14m   v1.22.3+e790d7f
ip-10-0-167-35.us-east-2.compute.internal    Ready    worker   3h5m    v1.22.3+e790d7f
ip-10-0-186-251.us-east-2.compute.internal   Ready    master   3h14m   v1.22.3+e790d7f
ip-10-0-213-103.us-east-2.compute.internal   Ready    master   3h14m   v1.22.3+e790d7f

To verify that the instance has been created, run:

[lab-user@bastion ~]$ oc get pods -n openshift-nfd
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-6f65f47cf6-tg6gj   2/2     Running   0          24m
nfd-master-d7cqw                          1/1     Running   0          35s
nfd-master-j42m9                          1/1     Running   0          35s
nfd-master-r64nv                          1/1     Running   0          35s
nfd-worker-24tzn                          1/1     Running   0          35s
nfd-worker-5rsg2                          1/1     Running   0          35s

A successful deployment shows a Running status for every pod.
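Scanning the STATUS column by eye gets tedious on larger clusters. As a rough sketch, the hypothetical helper below (pods_not_ready is not part of oc) reads the table that oc get pods prints and lists only the pods that are not in the Running or Completed state.

```shell
# Hypothetical helper: reads `oc get pods` table output on stdin and
# prints the names of pods whose STATUS column (third field) is neither
# Running nor Completed. The header row (NR == 1) is skipped.
pods_not_ready() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For the check above it could be used as: oc get pods -n openshift-nfd | pods_not_ready. Empty output means every pod is healthy.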

Installing the NVIDIA GPU Operator

With the Node Feature Discovery Operator installed, you can continue with the final step and install the NVIDIA GPU Operator.

As a cluster administrator, you can install the NVIDIA GPU Operator using the OpenShift Container Platform CLI or the web console.

kind: ClusterPolicy
apiVersion: nvidia.com/v1
metadata:
  name: gpu-cluster-policy
spec:
  dcgmExporter:
    config:
      name: ''
  dcgm:
    enabled: true
  daemonsets: {}
  devicePlugin: {}
  driver:
    enabled: true
    use_ocp_driver_toolkit: true
    repoConfig:
      configMapName: ''
    certConfig:
      name: ''
    licensingConfig:
      nlsEnabled: false
      configMapName: ''
    virtualTopology:
      config: ''
  gfd: {}
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  operator:
    defaultRuntime: crio
    deployGFD: true
    initContainer: {}
  mig:
    strategy: single
  toolkit:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'

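Create the ClusterPolicy custom resource shown above. Once it exists, the operator creates several OCP resources and evaluates the labels on each node in the cluster to decide where to deploy them. To watch the resulting pods, switch to the nvidia-gpu-operator project and list them: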

$ oc project nvidia-gpu-operator 
$ oc get pod -o wide
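The operator pods take a few minutes to roll out, so it can help to poll the ClusterPolicy until it reports that it is ready. The sketch below makes two assumptions that should be verified against your operator version: that the resource exposes its state at .status.state, and that the value is ready once deployment has finished. The function name wait_for_clusterpolicy and its retry parameters are hypothetical.

```shell
# Hypothetical helper: polls the ClusterPolicy until its state is "ready".
# Assumption: the state is exposed at .status.state (verify for your version).
# Arguments: resource name, number of checks, seconds between checks.
wait_for_clusterpolicy() {
  name="${1:-gpu-cluster-policy}"
  attempts="${2:-60}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(oc get clusterpolicy "$name" -o jsonpath='{.status.state}' 2>/dev/null)"
    if [ "$state" = "ready" ]; then
      echo "ClusterPolicy $name is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "ClusterPolicy $name not ready after $attempts checks" >&2
  return 1
}
```

Called without arguments, it checks gpu-cluster-policy up to 60 times at 10-second intervals and returns a non-zero exit code if the policy never becomes ready.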

Validating the GPU availability

[lab-user@bastion ~]$ oc get pod | grep nvidia-device-plugin-daemonset
nvidia-device-plugin-daemonset-bspfh                 1/1     Running     0          21m
nvidia-device-plugin-daemonset-n62dm                 1/1     Running     0          21m
[lab-user@bastion ~]$ oc exec -ti nvidia-device-plugin-daemonset-bspfh -- nvidia-smi
Defaulted container "nvidia-device-plugin-ctr" out of: nvidia-device-plugin-ctr, toolkit-validation (init)
Fri Jan 28 06:47:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

...
...
$ oc exec -ti nvidia-device-plugin-daemonset-n62dm -- nvidia-smi
Defaulted container "nvidia-device-plugin-ctr" out of: nvidia-device-plugin-ctr, toolkit-validation (init)
Fri Jan 28 06:47:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
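Repeating the oc exec call by hand for each device-plugin pod does not scale well beyond a couple of GPU nodes. A minimal sketch of a loop over all of them follows; the function name list_gpus_per_pod is hypothetical, and it assumes the pods live in the nvidia-gpu-operator project as in this walkthrough. It uses nvidia-smi -L, which prints one line per GPU instead of the full table.

```shell
# Hypothetical helper: runs `nvidia-smi -L` in every device-plugin pod.
# Argument: namespace (defaults to nvidia-gpu-operator, as used in this post).
list_gpus_per_pod() {
  ns="${1:-nvidia-gpu-operator}"
  oc get pod -n "$ns" -o name | grep nvidia-device-plugin-daemonset |
    while read -r pod; do
      echo "== $pod"
      # stdin is redirected so that oc exec does not swallow the pod list
      # that the while loop is still reading.
      oc exec -n "$ns" "$pod" -- nvidia-smi -L </dev/null
    done
}
```

Each GPU node should contribute one header line followed by one line per GPU.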

Running a sample GPU Application

Run a simple CUDA VectorAdd sample, which adds two vectors together to ensure the GPUs have bootstrapped correctly.

Run the following:

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created

Check the logs of the container:

[lab-user@bastion ~]$ oc logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
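When this check is part of a script, the Test PASSED line is the part that matters. A small sketch of turning it into an exit code follows; vectoradd_passed is a hypothetical name, and the match string is taken from the log output above.

```shell
# Hypothetical helper: reads the sample's log on stdin and succeeds
# (exit 0) only if the "Test PASSED" line is present.
vectoradd_passed() {
  grep -q 'Test PASSED'
}
```

For example: oc logs cuda-vectoradd | vectoradd_passed && echo "GPU sample OK".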

Getting information about the GPU

The nvidia-smi command shows memory usage, GPU utilization, and GPU temperature. You can test GPU access by running nvidia-smi within a pod.

To view GPU utilization, run nvidia-smi from a pod in the GPU Operator daemonset.

Change to the nvidia-gpu-operator project:

$ oc project nvidia-gpu-operator

Run the following command to view these new pods:

$ oc get pod -owide -lopenshift.driver-toolkit=true 

NAME                                                 READY   STATUS    RESTARTS   AGE   IP             NODE                                        NOMINATED NODE   READINESS GATES
nvidia-driver-daemonset-49.84.202201102104-0-gl557   2/2     Running   0          26m   10.131.0.106   ip-10-0-167-35.us-east-2.compute.internal   <none>           <none>
nvidia-driver-daemonset-49.84.202201102104-0-k9sg5   2/2     Running   0          26m   10.128.2.17    ip-10-0-135-51.us-east-2.compute.internal   <none>           <none>

Run the nvidia-smi command within the pod:

$ oc exec -it nvidia-driver-daemonset-49.84.202201102104-0-gl557 -- nvidia-smi
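The full nvidia-smi table is awkward to consume from scripts. nvidia-smi can also emit selected fields as CSV through its --query-gpu and --format options, and the sketch below turns that CSV into a one-line summary per GPU. The function name format_gpu_csv is hypothetical, and it assumes the five fields are requested in the order name, temperature.gpu, utilization.gpu, memory.used, memory.total with --format=csv,noheader,nounits.

```shell
# Hypothetical helper: formats CSV lines produced by
#   nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# into a short, readable summary. Fields are separated by ", ".
format_gpu_csv() {
  awk -F', ' '{ printf "%s: %s C, %s%% util, %s/%s MiB\n", $1, $2, $3, $4, $5 }'
}
```

To use it, run the nvidia-smi query shown in the comment through oc exec in one of the driver daemonset pods (without the -t flag, so no terminal control characters end up in the output) and pipe the result into format_gpu_csv.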