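Before setup: nvidia-smi reported that no driver was found, and nvcc --version reported that the command did not exist.
1. First took the automatically suggested cuda-tool-kit download and installed it with sudo.
Result: nvcc is now available, but nvidia-smi still reports no driver.
2. Went to the CUDA website, downloaded the driver file for the 3090 card, and started the installation.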
2.1 Error: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
At this point the /etc/modprobe.d/ directory already contained a conf file that blacklists nouveau, so all that was needed was:
sudo update-initramfs -u
sudo reboot
ref: "Detailed steps for installing the NVIDIA graphics driver on Ubuntu (ERROR: The Nouveau kernel driver is currently in use by your system)", wohu1104's column, CSDN blog
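To confirm after the reboot that nouveau is really gone, something like the following rough sketch could be used. It assumes a Linux system where /proc/modules lists one loaded module per line with the module name in the first column; the function names are made up for illustration.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def nouveau_is_loaded(proc_modules_text: str) -> bool:
    """True if a module named exactly 'nouveau' appears in the listing."""
    return "nouveau" in loaded_modules(proc_modules_text)


def check_system() -> bool:
    """Read the running kernel's module list (Linux only)."""
    return nouveau_is_loaded(Path("/proc/modules").read_text())
```

If check_system() still returns True after the reboot, the blacklist has not taken effect and the NVIDIA installer will most likely show the same error again.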
2.2 About the installer prompts:
Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later
------No
Nvidia's 32-bit compatibility libraries?
------No
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up
------Yes
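3. In conda, got the error: NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The installed 10.1 version ("101") only provides compute capability up to sm_70 and cannot provide sm_86, so a newer driver version (together with a matching newer CUDA) had to be installed.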
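The mismatch can be checked with a small sketch like the one below. It is a simplification, assuming that a build counts as usable when it contains at least one sm_XY entry with the same major version as the GPU. The helper names are hypothetical, the arch list in the usage note is only an illustration, and check_current_install relies on torch.cuda.get_arch_list() and torch.cuda.get_device_capability() being available in the installed PyTorch.

```python
def parse_sm(arch: str) -> int:
    """Turn an entry such as 'sm_86' into the integer 86."""
    return int(arch.split("_")[1])


def build_supports_device(arch_list, capability) -> bool:
    """Simplified rule: some sm_XY in the build shares the GPU's major version."""
    major, _minor = capability
    sms = [parse_sm(a) for a in arch_list if a.startswith("sm_")]
    return any(sm // 10 == major for sm in sms)


def check_current_install() -> bool:
    """Apply the rule to the local PyTorch build and GPU 0."""
    import torch  # imported lazily so the helpers above work without PyTorch

    return build_supports_device(
        torch.cuda.get_arch_list(),
        torch.cuda.get_device_capability(0),
    )
```

For example, build_supports_device(["sm_37", "sm_50", "sm_60", "sm_70"], (8, 6)) is False, which matches the situation here: nothing in the 8.x family is present for the RTX 3090.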
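First uninstalled the existing nvidia driver packages with: sudo apt-get autoremove --purge 'nvidia*'
Downloading the matching 455 driver was very slow at first; with a suitable URL it downloads very fast. My guess is that a link carrying a unique code downloads faster.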
Linux x64 (AMD64/EM64T) Display Driver | 455.38 | Linux 64-bit | NVIDIA
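1. Error during reinstallation (an NVIDIA kernel module 'nvidia-drm' appears to already be loaded in the kernel)
Fix, following the link below:
sudo systemctl isolate multi-user.target
sudo modprobe -r nvidia-drm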
https://zhuanlan.zhihu.com/p/135875408#:~:text=An%20NVIDIA%20kernel%20module%20%27nvidia-drm%27%20appears%20to%20already,kernel%20was%20configured%20without%20support%20for%20module%20unloading.
-----------------------------------------------------------------------
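Finally: installed the newer driver + CUDA 11.1
Problem 1: nvcc --version complains that the libstdc++.so.6 inside anaconda does not provide GLIBCXX_3.4.26
Fix: first use sudo find /usr -name "libstdc++.so.6*" to locate all the libstdc++.so.6 files
Many files turn up; to check whether a given one provides GLIBCXX_3.4.26, run strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
This file goes up to GLIBCXX_3.4.28, so simply copy it over the one in the anaconda environment (run from the directory in that environment that holds its libstdc++.so.6): cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 .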
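The strings | grep check can also be done from Python, which makes it easy to loop over every file that find returned. A minimal sketch, assuming the version tags appear in the binary as plain ASCII strings of the form GLIBCXX_x.y.z; the function names are invented for this example.

```python
import re
from pathlib import Path

# Matches tags such as GLIBCXX_3.4.26 inside the raw bytes of the library.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(data: bytes) -> set:
    """Collect every GLIBCXX version number embedded in the given bytes."""
    return {m.group(1).decode("ascii") for m in _GLIBCXX.finditer(data)}


def provides(data: bytes, version: str) -> bool:
    """True if the exact version (e.g. '3.4.26') is present."""
    return version in glibcxx_versions(data)


def file_provides(path: str, version: str) -> bool:
    """Same check, reading the library file from disk."""
    return provides(Path(path).read_bytes(), version)
```

Calling file_provides on the copy inside the anaconda environment before and after the cp shows whether the replacement actually took effect.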
-----------------------------------------------------------------------
Problem 2: CUDA is installed, but the nvcc command is still not found
Fix: add export PATH="$PATH:/usr/local/cuda/bin" to ~/.bashrc, then open a new shell or run source ~/.bashrc
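To verify that the PATH change was picked up, a short check along these lines may help. It is only a sketch: path_has_dir and find_nvcc are hypothetical helper names, and /usr/local/cuda/bin is the directory used in the fix above.

```python
import os
import shutil


def path_has_dir(path_value: str, directory: str) -> bool:
    """True if `directory` is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [p for p in path_value.split(os.pathsep) if p]
    return any(os.path.normpath(p) == wanted for p in entries)


def find_nvcc():
    """Return the full path of nvcc as the shell would resolve it, or None."""
    return shutil.which("nvcc")
```

In a fresh shell, path_has_dir(os.environ["PATH"], "/usr/local/cuda/bin") should be True, and find_nvcc() should return a path rather than None.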
------------------------------------------------------------------------
Problem 3: OSError: libcusparse.so.11: cannot open shared object file: No such file or directory
Passed: the following commands show that it works.
>>> import torch
>>> print(torch.__version__)
1.8.0+cu111
>>> print(torch.version.cuda)
11.1
>>> print(torch.cuda.is_available())
True
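For Problem 3, whether the dynamic loader can find the library from the current environment can be tested directly. This is a minimal sketch using ctypes; can_load is a made-up helper name, and it only reports whether the name resolves, without changing anything.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Try to open a shared library by name; False if the loader cannot find it."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True
```

Running can_load("libcusparse.so.11") inside the same conda environment should return True once the error above no longer appears.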