A note on an error I hit recently: the TinaFace code ran fine on my old server but failed on a new one.

First, install the required packages following the steps on the official site.
My environment:
GPU driver: NVIDIA-SMI 460.73.01 Driver Version: 460.106.00 CUDA Version: 11.2
CUDA toolkit: nvcc -V: Cuda compilation tools, release 11.1, V11.1.74

pytorch       1.8.1
torchvision   0.9.1
mmcv-full     1.4.6
mmdet         2.22.0
cudatoolkit   11.1.1

PS: if mmcv or mmdet fails to install, don't doubt it: your versions are almost certainly mismatched. I ran into this as well.

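Checking the versions above, nothing looks wrong at first glance.
cudatoolkit was installed to match the CUDA version exactly, and mmdet was chosen to match the CUDA and PyTorch versions.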
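
A quick way to double-check the versions listed above from inside the conda environment is sketched below. It relies only on the standard library's importlib.metadata (Python 3.8+). The helper name installed_versions is made up for illustration, and the names passed in are assumed to be the pip distribution names; a package installed under a different name would show up as missing.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    wanted = ["torch", "torchvision", "mmcv-full", "mmdet"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:12s} {version or 'not installed'}")
```

Run it with the Python interpreter of the environment that will do the build, so the report reflects what pip install -v -e . will actually see.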

Sure enough, something did go wrong. The last step, running pip install -v -e . to build vedadet, failed:

Installing collected packages: vedadet
  Running setup.py develop for vedadet
    Running command /mnt/data/cbm/software/anaconda3/envs/lw/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"'; __file__='"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
    running develop
    running egg_info
    writing vedadet.egg-info/PKG-INFO
    writing dependency_links to vedadet.egg-info/dependency_links.txt
    writing requirements to vedadet.egg-info/requires.txt
    writing top-level names to vedadet.egg-info/top_level.txt
    reading manifest file 'vedadet.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'vedadet.egg-info/SOURCES.txt'
    running build_ext
    building 'vedadet.ops.nms.nms_ext' extension
    Emitting ninja build file /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o.d -DWITH_CUDA -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/TH -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/data/cbm/software/anaconda3/envs/lw/include/python3.8 -c -c /mnt/data1/lw/vedadet-main/vedadet/ops/nms/src/cuda/nms_kernel.cu -o /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=nms_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
    FAILED: /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o
    /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o.d -DWITH_CUDA -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/TH -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/data/cbm/software/anaconda3/envs/lw/include/python3.8 -c -c /mnt/data1/lw/vedadet-main/vedadet/ops/nms/src/cuda/nms_kernel.cu -o /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=nms_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1667, in _run_ninja_build
        subprocess.run(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/mnt/data1/lw/vedadet-main/setup.py", line 119, in <module>
        setup(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/develop.py", line 114, in install_for_development
        self.run_command('build_ext')
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
        build_ext.build_extensions(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
        _build_ext.build_extension(self, ext)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
        objects = self.compiler.compile(sources,
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 529, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1354, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
ERROR: Command errored out with exit status 1: /mnt/data/cbm/software/anaconda3/envs/lw/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"'; __file__='"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

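In short, the error is: nvcc fatal : Unsupported gpu architecture 'compute_86'
In other words, my GPU's architecture is too new for the nvcc doing the build, which does not recognize compute_86 (compute capability 8.6). My card is an RTX 3090.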
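
To see whether a given nvcc can target compute_86 at all, one can parse its version banner. The sketch below assumes CUDA 11.1 as the first toolkit release that accepts compute_86 / sm_86; the function names and the constant are made up for illustration, and the nvcc path is the one that appears in the build log above.

```python
import re
import subprocess
from typing import Optional, Tuple

# Assumption: CUDA 11.1 is the first release whose nvcc accepts compute_86 / sm_86.
MIN_RELEASE_FOR_SM86 = (11, 1)


def parse_nvcc_release(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `nvcc -V`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_supports_sm86(version_output: str) -> bool:
    """True if the reported release is at least MIN_RELEASE_FOR_SM86."""
    release = parse_nvcc_release(version_output)
    return release is not None and release >= MIN_RELEASE_FOR_SM86


if __name__ == "__main__":
    # Query the nvcc that the failing build invoked, not the one on PATH.
    nvcc = "/usr/local/cuda/bin/nvcc"
    try:
        out = subprocess.run(
            [nvcc, "-V"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not run {nvcc}: {exc}")
    else:
        print("release:", parse_nvcc_release(out))
        print("can target sm_86:", nvcc_supports_sm86(out))
```

The point of querying the full path rather than plain nvcc is that the two can be different binaries, which turns out to be the case here.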

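Many posts online claim that PyTorch does not support compute_86 yet and suggest lowering the target architecture to compute_75 or similar.
I was not willing to downgrade like that: if the change could not be undone it would be a real loss, and quite a few 3090 owners report all sorts of problems at inference time after lowering the architecture.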

Solution:
Although nvcc -V reports 11.1, the build uses its own CUDA toolkit, normally the one under /usr/local/cuda/:

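This shows that several CUDA versions are installed locally. The cuda entry is a symbolic link to the CUDA version directory used for compilation. So although nvcc -V reports 11.1, the toolkit actually used for compilation is CUDA 10.2, and that causes the error.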
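
The mismatch described above can be checked by resolving both paths. This is a minimal sketch using only the standard library; the helper names are made up for illustration, and /usr/local/cuda is the location taken from the build log.

```python
import os
import shutil
from typing import Optional


def resolve_toolkit(path: str) -> str:
    """Follow symlinks so that e.g. /usr/local/cuda shows its real target."""
    return os.path.realpath(path)


def nvcc_on_path() -> Optional[str]:
    """Real location of the nvcc that a shell would run, or None if absent."""
    nvcc = shutil.which("nvcc")
    return os.path.realpath(nvcc) if nvcc else None


if __name__ == "__main__":
    print("/usr/local/cuda ->", resolve_toolkit("/usr/local/cuda"))
    print("nvcc on PATH    ->", nvcc_on_path() or "not found")
```

If the two lines point into different versioned directories, nvcc -V and the extension build are using different toolkits.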

Temporary workaround:

Downgrade CUDA in the conda environment from 11.1 to 10.2. pip install -v -e . then compiles, but running the code fails with:

UserWarning:  GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the GeForce RTX 30

The CUDA version PyTorch is using is too old for the 3090. Upgrading CUDA in the conda environment back to 11.1 fixes this for the time being.

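Note: both errors are CUDA version problems, but they come from different places. The first appears while rebuilding the local code, so the compile-time CUDA version applies. When the code runs, the version reported by nvcc -V, that is, the CUDA version of the conda environment, is what counts, and after the downgrade it is the one reported as too old.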
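
The run-time side can be inspected from Python. The sketch below prints what the installed PyTorch build was compiled for and applies a simplified compatibility rule: a device is treated as covered if the build contains an sm_XY binary with the same major version and a minor version no newer than the device's. That rule and the function name covered_by_binaries are assumptions for illustration, not PyTorch's exact check.

```python
from typing import Iterable, Tuple


def covered_by_binaries(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Simplified rule: same major version, binary minor <= device minor."""
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # compute_XX (PTX) entries are ignored in this sketch
        digits = arch[len("sm_"):]
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip names this sketch does not understand
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        archs = torch.cuda.get_arch_list()
        print("PyTorch built for CUDA:", torch.version.cuda)
        print("Compiled architectures:", archs)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            print("Device capability:", cap)
            print("Covered by this build:", covered_by_binaries(cap, archs))
```

With the architecture list from the warning above (sm_37 up to sm_75), a device of capability (8, 6) is not covered, which matches the message.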

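However, when working with the mmdetection framework you usually design your own network and then have to rebuild the local code, which again means running pip install -v -e . So the workaround does not really solve the problem.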

Permanent fix:

Upgrade the CUDA version used at compile time to 11.1.
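
One possible way to make a single build use a CUDA 11.1 toolkit, without changing anything system-wide, is to set CUDA_HOME for that build; PyTorch's extension builder consults this variable when locating nvcc. The sketch below is only an illustration: /usr/local/cuda-11.1 is a hypothetical install path, the helper names are made up, and this is not necessarily how the machine in this post was fixed (re-pointing the /usr/local/cuda symlink is another route).

```python
import os
import subprocess
import sys
from typing import Dict, Mapping


def build_env(cuda_home: str, base: Mapping[str, str]) -> Dict[str, str]:
    """Copy `base` and make `cuda_home` the toolkit that the build sees first."""
    env = dict(base)
    env["CUDA_HOME"] = cuda_home
    bin_dir = os.path.join(cuda_home, "bin")
    old_path = env.get("PATH", "")
    env["PATH"] = bin_dir + os.pathsep + old_path if old_path else bin_dir
    return env


def rebuild(project_dir: str, cuda_home: str) -> None:
    """Run `pip install -v -e .` in project_dir with the adjusted environment."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-v", "-e", "."],
        cwd=project_dir,
        env=build_env(cuda_home, os.environ),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical toolkit path; only the resulting variables are printed here.
    env = build_env("/usr/local/cuda-11.1", {"PATH": "/usr/bin"})
    print("CUDA_HOME =", env["CUDA_HOME"])
    print("PATH      =", env["PATH"])
```

Calling rebuild with the vedadet checkout directory and the real toolkit path would then repeat the editable install under that environment.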

Related posts:
nvcc fatal : Unsupported gpu architecture 'compute_86'

When installing CUDA, nvcc --version and cat /usr/local/cuda/version.txt report different versions