为深度学习环境配置 CUDA

最近在折腾 OpenNMT 框架，CUDA 与 PyTorch 都是需要的依赖。虽然网络上已经有大量的 CUDA 安装教程，相信也比我写得全面且详细，但手头上的服务器正巧没有安装 CUDA，索性记录下来以备不时之需。

准备工作

在普遍情况下，设备都安装了相应的显卡驱动，那么使用 nvidia-smi 即可查看驱动信息，确定支持的 CUDA 版本。

若没有安装驱动，那么需要在终端中输入 lspci | grep -i nvidia 检测设备的 GPU 型号，接着查找适用的 CUDA 版本，在安装 CUDA 时也会自动安装上匹配的驱动。

CUDA 文档中给出了GPU 驱动与 CUDA 版本的对应关系，一定要选择适配的版本。

在安装 CUDA 前，还要确保设备安装了 gcc 和 make，如果需要使用 C++ 进行 CUDA 编程，还需要安装 g++。这里我只安装 gcc 和 make：

sudo apt update             # 更新 apt
sudo apt install gcc make   # 安装 gcc 和 make

下载 CUDA

确定需要安装的 CUDA 版本后，例如我的设备是 11.4，直接搜索 CUDA 11.4 进入下载页面。

在下载页面中需要选择相应的架构和发行版，如果不确定设备架构可以在终端中输入 uname -m，然后根据输出结果选择即可。

我的设备是 Ubuntu 20.04，选择 Installer Type 为 runfile(local)，网页就给出了下载命令：

# 下载 CUDA
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
# 安装 CUDA
sudo sh cuda_11.4.0_470.42.01_linux.run

安装 CUDA

下载需要较长时间，待下载完成后执行文件开始安装。若已经安装 GPU 驱动，会有以下提示，选择 Continue。

┌──────────────────────────────────────────────────────────────────────┐
│ Existing package manager installation of the driver found. It is     │
│ strongly recommended that you remove this before continuing.         │
│ Abort                                                                │
│ Continue                                                             │
│                                                                      │
│                                                                      │
│ Up/Down: Move | 'Enter': Select                                      │
└──────────────────────────────────────────────────────────────────────┘

接下来输入 accept 接受用户条款。

┌──────────────────────────────────────────────────────────────────────┐
│  End User License Agreement                                          │
│  --------------------------                                          │
│                                                                      │
│  NVIDIA Software License Agreement and CUDA Supplement to            │
│  Software License Agreement. Last updated: October 8, 2021           │
│                                                                      │
│  The CUDA Toolkit End User License Agreement applies to the          │
│  NVIDIA CUDA Toolkit, the NVIDIA CUDA Samples, the NVIDIA            │
│  Display Driver, NVIDIA Nsight tools (Visual Studio Edition),        │
│  and the associated documentation on CUDA APIs, programming          │
│  model and development tools. If you do not agree with the           │
│  terms and conditions of the license agreement, then do not          │
│  download or use the software.                                       │
│                                                                      │
│  Last updated: October 8, 2021.                                      │
│                                                                      │
│                                                                      │
│  Preface                                                             │
│  -------                                                             │
│                                                                      │
│──────────────────────────────────────────────────────────────────────│
│ Do you accept the above EULA? (accept/decline/quit):                 │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

然后出现下方的界面，需要选择安装的内容。若设备已经安装驱动，在这里需要取消 Driver 的选择，否则会出现安装错误。在这个步骤中，我只选择了 CUDA Toolkit 11.4 与 CUDA Documentation 11.4，选择 Intsall 开始安装。

┌──────────────────────────────────────────────────────────────────────┐
│ CUDA Installer                                                       │
│ - [ ] Driver                                                         │
│      [ ] 470.42.01                                                   │
│ + [X] CUDA Toolkit 11.4                                              │
│   [ ] CUDA Samples 11.4                                              │
│   [ ] CUDA Demo Suite 11.4                                           │
│   [X] CUDA Documentation 11.4                                        │
│   Options                                                            │
│   Install                                                            │
│                                                                      │
│                                                                      │
│ Up/Down: Move | Left/Right: Expand | 'Enter': Select |               │
│ 'A': Advanced options                                                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Note 安装过程中出现的 boost::filesystem::copy_file: No such file or directory 错误是由于存储空间不足导致的。

环境变量

当输出 Summary 信息后即表示安装完成，在输出信息中还会提示修改环境变量:

Please make sure that
 -   PATH includes /usr/local/cuda-11.4/
 -   LD_LIBRARY_PATH includes /usr/local/cuda-11.4/lib64, or, add /usr/local/cuda-11.4/lib64 to /etc/ld.so.conf and run ldconfig as root

使用 vim ~/.bashrc，在文件末添加以下内容并保存：

# CUDA
export PATH="/usr/local/cuda-11.4/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.4/lib64:$ LD_LIBRARY_PATH"

最后使用 source ~/.bashrc 更新修改内容，在终端中输入 nvcc -V，输出 nvcc 的版本信息就说明安装完成。

References

Linux 下的 CUDA 安装和使用指南 - 知乎