ndza

Why can't Linux with Nvidia drivers just work!

What happened?

I shut down my dev work laptop last night, and this morning, when I booted it up, this happened!

A photo of my screen showing Linux kernel panic with a purple background.

It's lucky that I didn't have urgent work commitments.

So, what's going on? I'm not running very old hardware, and I don't think anything about my setup is special. It's a Dell Inc. XPS 15 9510 with Ubuntu 24.04.4 LTS. But it has an NVIDIA GeForce RTX™ 3050 Ti Laptop GPU, so I suspect that the Nvidia drivers or a Linux kernel upgrade are somehow involved. This is such a pain!

Using my other laptop, I was able to do some searching and found a workaround. Luckily, I can boot into an older kernel, which is this version:

nolan-veed@nolan-veed:/boot$ uname -r
6.14.0-37-generic

And so, I don't need to go into rescue mode or anything. I don't have to go around hunting for that USB flash drive.

Diagnosing the problem

Now, my new kernel, which is 6.17.0-14-generic, doesn't work. In the /boot dir, the initrd image for it shows up as broken:

A screenshot of the /boot dir.
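The same check can be done without eyeballing a directory listing. Below is a minimal sketch, assuming that "broken" means the initrd for a kernel is missing, empty, or a dangling symlink. The function name kernels_without_initrd is made up for illustration; it only reads the directory and changes nothing.

```shell
# List kernel versions under a boot directory whose initrd image is
# missing, empty, or a dangling symlink.
# Usage: kernels_without_initrd [boot_dir]   (defaults to /boot)
# Note: the function name is hypothetical, not part of any package.
kernels_without_initrd() {
    local boot_dir="${1:-/boot}"
    local kernel version
    for kernel in "$boot_dir"/vmlinuz-*; do
        # Skip the unexpanded glob when no kernel images exist.
        [ -e "$kernel" ] || continue
        version="${kernel##*/vmlinuz-}"
        # -s is false for a missing file, an empty file, or a dangling symlink.
        if [ ! -s "$boot_dir/initrd.img-$version" ]; then
            printf '%s\n' "$version"
        fi
    done
}
```

Running kernels_without_initrd with no arguments checks /boot and prints one kernel version per line; no output means every installed kernel image has a non-empty initrd next to it.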

So, something got upgraded and broke it. If I try to re-install my kernel, I see the following:

nolan-veed@nolan-veed:/boot$ sudo apt install --reinstall linux-image-generic-hwe-24.04
...
Setting up nvidia-dkms-575 (575.57.08-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)
update-initramfs: Generating /boot/initrd.img-6.14.0-37-generic
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Removing old nvidia/575.57.08 DKMS files...
Module nvidia/575.57.08 for kernel 6.14.0-37-generic (x86_64):
Before uninstall, this module version was ACTIVE on this kernel.
Deleting /lib/modules/6.14.0-37-generic/updates/dkms/nvidia.ko.zst
Deleting /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-modeset.ko.zst
Deleting /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-drm.ko.zst
Deleting /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-uvm.ko.zst
Deleting /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-peermem.ko.zst
Running depmod... done.

Deleting module nvidia/575.57.08 completely from the DKMS tree.
Loading new nvidia/575.57.08 DKMS files...
Building for 6.14.0-37-generic and 6.17.0-14-generic

It then proceeds to build the kernel modules for the old and new kernels...

Building initial module nvidia/575.57.08 for 6.14.0-37-generic
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module(s)............. done.
Signing module /var/lib/dkms/nvidia/575.57.08/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/575.57.08/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/575.57.08/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/575.57.08/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/575.57.08/build/nvidia-peermem.ko
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-modeset.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-drm.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-uvm.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-peermem.ko.zst
Running depmod... done.

Building initial module nvidia/575.57.08 for 6.17.0-14-generic
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module(s)............(bad exit status: 2)
Failed command:
'make' -j16 KERNEL_UNAME=6.17.0-14-generic IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.17.0-14-generic/build LD=/usr/bin/ld.bfd CONFIG_X86_KERNEL_IBT= modules
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-kernel-source-575.0.crash'

Error! Bad return status for module build on kernel: 6.17.0-14-generic (x86_64)
Consult /var/lib/dkms/nvidia/575.57.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-575 (--configure):
 installed nvidia-dkms-575 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-575:
 nvidia-driver-575 depends on nvidia-dkms-575 (= 575.57.08-0ubuntu1); however:
  Package nvidia-dkms-575 is not configured yet.

dpkg: error processing package nvidia-driver-575 (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Setting up linux-image-6.17.0-14-generic (6.17.0-14.14~24.04.1) ...
Setting up linux-image-generic-hwe-24.04 (6.17.0-14.14~24.04.1) ...
Processing triggers for initramfs-tools (0.142ubuntu25.8) ...
update-initramfs: Generating /boot/initrd.img-6.14.0-37-generic
Processing triggers for linux-image-6.17.0-14-generic (6.17.0-14.14~24.04.1) ...
/etc/kernel/postinst.d/dkms:
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Autoinstall of module nvidia/575.57.08 for kernel 6.17.0-14-generic (x86_64)
Building module(s).............(bad exit status: 2)
Failed command:
'make' -j16 KERNEL_UNAME=6.17.0-14-generic IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.17.0-14-generic/build LD=/usr/bin/ld.bfd CONFIG_X86_KERNEL_IBT= modules
ERROR: Cannot create report: [Errno 17] File exists: '/var/crash/nvidia-kernel-source-575.0.crash'

Error! Bad return status for module build on kernel: 6.17.0-14-generic (x86_64)
Consult /var/lib/dkms/nvidia/575.57.08/build/make.log for more information.

Autoinstall on 6.17.0-14-generic failed for module(s) nvidia(10).

Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
run-parts: /etc/kernel/postinst.d/dkms exited with return code 1
dpkg: error processing package linux-image-6.17.0-14-generic (--configure):
 installed linux-image-6.17.0-14-generic package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 nvidia-dkms-575
 nvidia-driver-575
 linux-image-6.17.0-14-generic
E: Sub-process /usr/bin/dpkg returned an error code (1)

But the build fails against the newer one. And the build log shows the Nvidia breakage:

In file included from nvidia-uvm/uvm_common.h:43,
                 from nvidia-uvm/uvm_pmm_gpu.c:163:
/usr/src/linux-headers-6.17.0-14-generic/include/linux/pci-p2pdma.h: In function ‘pci_p2pdma_state’:
nvidia-uvm/uvm_linux.h:390:32: error: ‘struct page’ has no member named ‘pgmap’
  390 | #define page_pgmap(page) (page)->pgmap
      |                                ^~
/usr/src/linux-headers-6.17.0-14-generic/include/linux/pci-p2pdma.h:170:37: note: in expansion of macro ‘page_pgmap’
  170 |                 if (state->pgmap != page_pgmap(page))
      |                                     ^~~~~~~~~~
make[4]: *** [/usr/src/linux-headers-6.17.0-14-generic/scripts/Makefile.build:287: nvidia-uvm/uvm_pmm_gpu.o] Error 1
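The make.log that DKMS points to can be long, so it helps to pull out just the failing lines. Here is a rough sketch, assuming GCC-style diagnostics ("error:") and make's "***" marker for a failed target. The function name show_build_errors is hypothetical.

```shell
# Print compiler errors and failed make targets from a DKMS make.log,
# prefixed with their line numbers in the log.
# Usage: show_build_errors <path-to-make.log>
# Note: the function name is hypothetical, not part of dkms.
show_build_errors() {
    local log="$1"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # GCC marks diagnostics with "error:"; make reports a failed target with "***".
    grep -n -E 'error:|\*\*\*' "$log"
}
```

For the failure above, the call would be show_build_errors /var/lib/dkms/nvidia/575.57.08/build/make.log, which is the path printed in the DKMS error message.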

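It's possible that the kernel was upgraded automatically, and that the 575 driver, which doesn't compile against the 6.17 headers, is what broke the new kernel's setup.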
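One way to see which kernels actually ended up with a driver module is to look where DKMS installs them. This is only a sketch, assuming the /lib/modules/<version>/updates/dkms/ layout that appears in the install log; the function name nvidia_module_report is made up for illustration.

```shell
# Report, for each kernel under a modules root, whether a DKMS-built
# nvidia module is installed.
# Usage: nvidia_module_report [modules_root]   (defaults to /lib/modules)
# Note: the function name is hypothetical, not part of dkms.
nvidia_module_report() {
    local root="${1:-/lib/modules}"
    local dir version
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue
        version="$(basename "$dir")"
        # Match nvidia.ko with any compression suffix (.zst in the log).
        if ls "$dir"updates/dkms/nvidia.ko* >/dev/null 2>&1; then
            printf '%s: nvidia module present\n' "$version"
        else
            printf '%s: nvidia module MISSING\n' "$version"
        fi
    done
}
```

A kernel reported as MISSING here has no Nvidia module to load, which would be consistent with the failed build for 6.17.0-14-generic.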

What if I try to move to a newer driver? According to https://endoflife.date/nvidia, the new LTSB is 580, so that seems like a sensible choice:

A screenshot of endoflife.date webpage.

So, I try to install that:

nolan-veed@nolan-veed:~$ sudo apt install nvidia-driver-580
...
Removing nvidia-driver-575 (575.57.08-0ubuntu1) ...
Removing nvidia-dkms-575 (575.57.08-0ubuntu1) ...
Removing all DKMS Modules
Done.
INFO:Disable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
update-initramfs: deferring update (trigger activated)
Removing libnvidia-gl-575:amd64 (575.57.08-0ubuntu1) ...
Removing nvidia-utils-575 (575.57.08-0ubuntu1) ...
Removing xserver-xorg-video-nvidia-575 (575.57.08-0ubuntu1) ...
dpkg: libnvidia-compute-575:amd64: dependency problems, but removing anyway as you requested:
 nvidia-compute-utils-575 depends on libnvidia-compute-575.

Removing libnvidia-compute-575:amd64 (575.57.08-0ubuntu1) ...
dpkg: libnvidia-cfg1-575:amd64: dependency problems, but removing anyway as you requested:
 nvidia-persistenced depends on libnvidia-cfg1; however:
  Package libnvidia-cfg1 is not installed.
  Package libnvidia-cfg1-575:amd64 which provides libnvidia-cfg1 is to be removed.
...
Selecting previously unselected package libnvidia-gl-580:amd64.
Preparing to unpack .../06-libnvidia-gl-580_580.126.16-1ubuntu1_amd64.deb ...
Unpacking libnvidia-gl-580:amd64 (580.126.16-1ubuntu1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-ezE9Pi/06-libnvidia-gl-580_580.126.16-1ubuntu1_amd64.deb (--unpack):
 trying to overwrite '/usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so', which is also in package libnvidia-extra-575:amd64 575.57.08-0ubuntu1

Sigh! From then on, I kept getting dpkg errors which meant I was unable to move forward.
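The "trying to overwrite" message does name the culprit: a leftover package from the old driver series that still owns the file. A small filter like the one below can pull that out of a saved log. It is a sketch that assumes dpkg's English message format as shown above; the function name overwrite_conflicts is hypothetical.

```shell
# Read dpkg output on stdin and print "<package> <file>" for each
# file-overwrite conflict it reports (English-locale messages only).
# Note: the function name is hypothetical, not part of dpkg.
overwrite_conflicts() {
    sed -n "s/^ *trying to overwrite '\([^']*\)', which is also in package \([^ ]*\).*/\2 \1/p"
}
```

Fed the error line above, it prints libnvidia-extra-575:amd64 followed by the path of nvidia-drm_gbm.so, which points at the 575 package that was still installed when the 580 package was being unpacked.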

So, it was time to remove the drivers as per Nvidia's latest instructions:

nolan-veed@nolan-veed:~$ sudo apt remove --autoremove --purge -V \
   cuda-compat\* \
   cuda-drivers\*  \
   libnvidia-cfg1\* \
   libnvidia-compute\* \
   libnvidia-decode\* \
   libnvidia-encode\* \
   libnvidia-extra\* \
   libnvidia-fbc1\* \
   libnvidia-gl\* \
   libnvidia-gpucomp\* \
   libnvidia-nscq\* \
   libnvsdm\* \
   libxnvctrl\* \
   nvidia-dkms\* \
   nvidia-driver\* \
   nvidia-fabricmanager\* \
   nvidia-firmware\* \
   nvidia-headless\* \
   nvidia-imex\* \
   nvidia-kernel\* \
   nvidia-modprobe\* \
   nvidia-open\* \
   nvidia-persistenced\* \
   nvidia-settings\* \
   nvidia-xconfig\* \
   xserver-xorg-video-nvidia\*
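Before rebooting, it may be worth confirming that nothing from the old driver is left behind. The filter below is a sketch that reads `dpkg -l` output on stdin; the name pattern is my own rough approximation of the package list above, not an official one, and the function name leftover_nvidia_packages is hypothetical.

```shell
# Read `dpkg -l` output on stdin and print "<state> <package>" for every
# NVIDIA-related package still known to dpkg. State "ii" means installed;
# "rc" means removed with config files left behind.
# Note: the function name and the name pattern are assumptions.
leftover_nvidia_packages() {
    awk '$1 ~ /^[uihrp][a-zA-Z]+$/ && $2 ~ /nvidia|nvsdm|nvctrl|cuda-(compat|drivers)/ { print $1, $2 }'
}
```

Usage would be dpkg -l | leftover_nvidia_packages; empty output suggests the removal was complete.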

Once they were removed, I rebooted my system into the latest kernel, and the display was running on the Intel UHD graphics.
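To check which kernel driver is actually driving each GPU, the output of `lspci -k` can be filtered down to the graphics devices. This is a sketch that assumes the usual "Kernel driver in use:" lines in that output; the function name gpu_drivers is made up for illustration.

```shell
# Read `lspci -k` output on stdin and print "<driver>: <device line>" for
# each VGA/3D/display controller that has a kernel driver bound.
# Note: the function name is hypothetical, not part of pciutils.
gpu_drivers() {
    awk '
        # A device line starts with a hex PCI address; remember it only
        # if it describes a graphics device.
        /^[0-9a-f]/ { dev = ($0 ~ /VGA compatible controller|3D controller|Display controller/) ? $0 : "" }
        # Indented detail lines belong to the most recent device line.
        dev != "" && /Kernel driver in use:/ { print $NF ": " dev }
    '
}
```

Usage would be lspci -k | gpu_drivers. A graphics device with no driver bound simply doesn't appear in the output, which is what I would expect for the Nvidia GPU right after the drivers are purged.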

I was then able to reinstall the newer drivers successfully:

nolan-veed@nolan-veed:~$ sudo apt install cuda-drivers-580
...
Setting up nvidia-dkms-580 (580.126.16-1ubuntu1) ...
update-initramfs: deferring update (trigger activated)
INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
Loading new nvidia/580.126.16 DKMS files...
Building for 6.14.0-37-generic and 6.17.0-14-generic

Building initial module nvidia/580.126.16 for 6.14.0-37-generic
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module(s)........... done.
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-peermem.ko
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-modeset.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-drm.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-uvm.ko.zst
Installing /lib/modules/6.14.0-37-generic/updates/dkms/nvidia-peermem.ko.zst
Running depmod... done.

Building initial module nvidia/580.126.16 for 6.17.0-14-generic
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module(s)............. done.
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/580.126.16/build/nvidia-peermem.ko
Installing /lib/modules/6.17.0-14-generic/updates/dkms/nvidia.ko.zst
Installing /lib/modules/6.17.0-14-generic/updates/dkms/nvidia-modeset.ko.zst
Installing /lib/modules/6.17.0-14-generic/updates/dkms/nvidia-drm.ko.zst
Installing /lib/modules/6.17.0-14-generic/updates/dkms/nvidia-uvm.ko.zst
Installing /lib/modules/6.17.0-14-generic/updates/dkms/nvidia-peermem.ko.zst

Summary

When it comes to Linux + Nvidia drivers, I have encountered these pain points several times. The upgrade paths aren't perfect. I'm lucky to have worked with these systems long enough to know my way around, but for some folks, it's far from an attractive setup. Still, there are ways to get out of this mess: removing and reinstalling the Nvidia drivers is simple enough and has always worked well for me. Thanks.