Threadripper boot issues with Fedora 40

September 11, 2024, 2:09pm 1

Hello! I’m having some trouble getting Fedora (or RHEL) working properly on my Threadripper PRO system. It’s a Threadripper PRO 3975WX with an ASUS WRX80E II motherboard, currently using an NVIDIA GPU.

After the system boots into the OS, the status code on my motherboard goes from AA (which ASUS says indicates it’s all good) to 00 (which they say indicates issues with the CPU and memory). The Fedora system also ends up freezing after some time, which I can only resolve by turning off the PC. Right before it boots into the OS I also get 2 warnings. One is from nouveau indicating an unknown chipset and one is from iwlwifi saying invalid buffer destination.

At first I thought it was a hardware issue, but I tried both Ubuntu and Windows and both posted fine with status code AA and I was able to start running tasks without issue. I’m not sure if this is the right forum to ask, but does anyone have any insight into how I might be able to debug this? I’d prefer to use Fedora/RHEL. Thanks for any help!

things i’ve tried:

ing123 (ing mar) Tags updated September 11, 2024, 2:10pm 2

We would need the logs that are related to be certain, but using the nouveau driver might be the issue. If not already doing so then maybe installing and using the nvidia drivers from rpmfusion may solve this issue.
https://rpmfusion.org/Howto/NVIDIA

Please post the output of sudo dmesg | grep -iE "nvidia|nouveau|secure" as well as inxi -Fzxx so we can see details that may be a factor.
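For anyone collecting the same information, here is a minimal sketch of that filter wrapped in a function so it can be run against live `dmesg` output or a saved log. The function name `filter_gpu_lines` and the output file names are just illustrative choices, not anything Fedora provides.

```shell
#!/usr/bin/env bash
# Keep only kernel log lines that mention the GPU drivers or Secure Boot.
# Reads from stdin, so it works on live dmesg output or on a saved log file.
filter_gpu_lines() {
    grep -iE 'nvidia|nouveau|secure'
}

# Typical use on the affected machine (dmesg needs root on Fedora):
#   sudo dmesg | filter_gpu_lines > gpu-dmesg.txt
#   inxi -Fzxx > inxi.txt
```

Saving both outputs to files makes it easier to paste them into a post without losing lines.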

ing123 (ing mar) September 11, 2024, 7:56pm 4

Hey, so I reinstalled Fedora and reinstalled the Nvidia drivers following those docs. I also went through the secure boot steps. Below are all the logs I’m seeing after running those two commands. Unfortunately I still seem to be getting the wrong boot code, but the PC isn’t freezing now!

[sudo] password for ingmar: 
[    0.000000] Command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-6.10.8-200.fc40.x86_64 root=UUID=75c56dbf-b235-42b2-9285-0edb16336737 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau
[    0.000000] secureboot: Secure boot disabled
[    0.005175] secureboot: Secure boot disabled
[    0.376127] Kernel command line: BOOT_IMAGE=(hd1,gpt2)/vmlinuz-6.10.8-200.fc40.x86_64 root=UUID=75c56dbf-b235-42b2-9285-0edb16336737 ro rootflags=subvol=root rhgb quiet rd.driver.blacklist=nouveau modprobe.blacklist=nouveau
[    6.662296] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input13
[    6.664664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input14
[    6.697461] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input15
[    6.768474] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.1/0000:01:00.1/sound/card0/input16
[    7.371586] nvidia: loading out-of-tree module taints kernel.
[    7.371596] nvidia: module license 'NVIDIA' taints kernel.
[    7.371601] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    7.371603] nvidia: module license taints kernel.
[    7.934519] nvidia-nvlink: Nvlink Core is being initialized, major device number 511
[    7.936139] nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[    7.985401] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  555.58.02  Tue Jun 25 01:39:15 UTC 2024
[    8.044496] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[    8.183930] nvidia-uvm: Loaded the UVM driver, major device number 509.
[    8.235356] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  555.58.02  Tue Jun 25 01:10:21 UTC 2024
[    9.285960] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    9.584452] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[    9.603666] nvidia 0000:01:00.0: vgaarb: deactivate vga console
[    9.685700] fbcon: nvidia-drmdrmfb (fb0) is primary device
[    9.685706] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
 
inxi -Fzxx
System:
  Kernel: 6.10.8-200.fc40.x86_64 arch: x86_64 bits: 64 compiler: gcc
    v: 2.41-37.fc40
  Desktop: GNOME v: 46.4 tk: GTK v: 3.24.43 wm: gnome-shell dm: GDM
    Distro: Fedora Linux 40 (Workstation Edition)
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: Pro WS WRX80E-SAGE SE WIFI II v: Rev 1.xx
    serial: <superuser required> part-nu: SKU UEFI: American Megatrends v: 1501
    date: 05/06/2024
Battery:
  Device-1: hidpp_battery_1 model: Logitech M585/M590 Multi-Device Mouse
    serial: <filter> charge: 55% (should be ignored) status: discharging
CPU:
  Info: 32-core model: AMD Ryzen Threadripper PRO 3975WX s bits: 64
    type: MT MCP arch: Zen 2 rev: 0 cache: L1: 2 MiB L2: 16 MiB L3: 128 MiB
  Speed (MHz): avg: 2635 high: 4299 min/max: 2200/4368 boost: enabled cores:
    1: 2200 2: 2200 3: 2200 4: 2200 5: 2200 6: 2200 7: 2200 8: 2200 9: 2200
    10: 2200 11: 2200 12: 2200 13: 2143 14: 2200 15: 2200 16: 3461 17: 2200
    18: 3500 19: 2200 20: 2200 21: 2145 22: 2142 23: 2200 24: 2200 25: 2200
    26: 4292 27: 2200 28: 2200 29: 4292 30: 4286 31: 2200 32: 2200 33: 4286
    34: 2200 35: 2200 36: 2200 37: 4299 38: 2200 39: 2200 40: 3500 41: 2143
    42: 2142 43: 2200 44: 4290 45: 2200 46: 2169 47: 2200 48: 4288 49: 2200
    50: 2200 51: 2200 52: 4292 53: 2145 54: 2095 55: 3819 56: 2200 57: 2143
    58: 4284 59: 2200 60: 4290 61: 2200 62: 2200 63: 2145 64: 4297
    bogomips: 447191
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: NVIDIA AD103 [GeForce RTX 4070 Ti SUPER] vendor: PNY
    driver: nvidia v: 555.58.02 arch: Lovelace pcie: speed: 16 GT/s lanes: 16
    ports: active: none off: HDMI-A-1 empty: DP-1,DP-2,DP-3 bus-ID: 01:00.0
    chip-ID: 10de:2705
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 24.1.2
    compositor: gnome-shell driver: gpu: nvidia,nvidia-nvswitch display-ID: 0
  Monitor-1: HDMI-A-1 model: Samsung C27F398 res: 1920x1080 dpi: 82
    diag: 686mm (27")
  API: OpenGL v: 4.6.0 vendor: nvidia v: 555.58.02 glx-v: 1.4
    direct-render: yes renderer: NVIDIA GeForce RTX 4070 Ti SUPER/PCIe/SSE2
    display-ID: :0.0
  API: EGL Message: EGL data requires eglinfo. Check --recommends.
Audio:
  Device-1: NVIDIA vendor: PNY driver: snd_hda_intel v: kernel pcie:
    speed: 16 GT/s lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:22bb
  Device-2: AMD Starship/Matisse HD Audio vendor: ASUSTeK driver: N/A pcie:
    speed: 16 GT/s lanes: 16 bus-ID: 2c:00.4 chip-ID: 1022:1487
  Device-3: ASUSTek USB Audio driver: hid-generic,snd-usb-audio,usbhid
    type: USB rev: 2.0 speed: 480 Mb/s lanes: 1 bus-ID: 5-6:3 chip-ID: 0b05:1984
  API: ALSA v: k6.10.8-200.fc40.x86_64 status: kernel-api
  Server-1: JACK v: 1.9.22 status: off
  Server-2: PipeWire v: 1.0.7 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
Network:
  Device-1: Intel Ethernet X550 vendor: ASUSTeK driver: ixgbe v: kernel pcie:
    speed: 8 GT/s lanes: 4 port: N/A bus-ID: 24:00.0 chip-ID: 8086:1563
  IF: enp36s0f0 state: down mac: <filter>
  Device-2: Intel Ethernet X550 vendor: ASUSTeK driver: ixgbe v: kernel
    pcie: speed: 8 GT/s lanes: 4 port: N/A bus-ID: 24:00.1 chip-ID: 8086:1563
  IF: enp36s0f1 state: down mac: <filter>
  Device-3: Intel Wi-Fi 6E AX210/AX1675 2x2 [Typhoon Peak] driver: iwlwifi
    v: kernel pcie: speed: 5 GT/s lanes: 1 bus-ID: 25:00.0 chip-ID: 8086:2725
  IF: wlp37s0 state: up mac: <filter>
Bluetooth:
  Device-1: Intel AX210 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 bus-ID: 3-6:3 chip-ID: 8087:0032
  Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 5.3
    lmp-v: 12
Drives:
  Local Storage: total: 3.64 TiB used: 5.9 GiB (0.2%)
  ID-1: /dev/nvme0n1 vendor: Western Digital model: WD BLACK SN850X 2000GB
    size: 1.82 TiB speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 38.9 C
  ID-2: /dev/nvme1n1 vendor: Western Digital model: WD BLACK SN850X 2000GB
    size: 1.82 TiB speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 39.9 C
Partition:
  ID-1: / size: 1.82 TiB used: 5.53 GiB (0.3%) fs: btrfs dev: /dev/nvme0n1p3
  ID-2: /boot size: 973.4 MiB used: 364.1 MiB (37.4%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 19 MiB (3.2%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-4: /home size: 1.82 TiB used: 5.53 GiB (0.3%) fs: btrfs
    dev: /dev/nvme0n1p3
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  Src: ipmi Permissions: Unable to run ipmi sensors. Root privileges required.
  Src: lm-sensors System Temperatures: cpu: 53.6 C mobo: N/A
  Fan Speeds (rpm): N/A
Info:
  Memory: total: 256 GiB note: est. available: 251.52 GiB
    used: 5.08 GiB (2.0%)
  Processes: 961 Power: uptime: 4m wakeups: 0 Init: systemd v: 255
    target: graphical (5) default: graphical
  Packages: Compilers: gcc: 14.2.1 Shell: Bash v: 5.2.26
    running-in: gnome-terminal inxi: 3.3.34

It appears you actually have the nvidia drivers loaded and active.
However, the driver version is not the latest. That would be 560.35.03 if installed from rpmfusion as indicated. I do not know for certain, but it is possible that the RTX 4070 Ti SUPER card is not fully supported by that driver version.

Please show us the output of dnf list installed '*nvidia*'.
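The version comparison being made here (installed 555.58.02 versus the 560.35.03 mentioned above) can be sketched in shell as follows. This assumes GNU `sort` with `-V` (version sort), which is part of coreutils on Fedora; the helper name `version_lt` is made up for this example.

```shell
#!/usr/bin/env bash
# Return success (0) when version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) from coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

installed="555.58.02"   # version reported by inxi in this thread
latest="560.35.03"      # version quoted in this thread as the newest

if version_lt "$installed" "$latest"; then
    echo "driver $installed is older than $latest"
else
    echo "driver $installed is current"
fi
```

With the two versions from this thread it prints `driver 555.58.02 is older than 560.35.03`.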

ing123 (ing mar) September 12, 2024, 9:59pm 6

this is the output I’m seeing:

'*nvidia*'
Installed Packages
akmod-nvidia.x86_64                       3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
kmod-nvidia-6.10.8-200.fc40.x86_64.x86_64 3:555.58.02-1.fc40 @@commandline      
nvidia-gpu-firmware.noarch                20240909-1.fc40    @updates           
nvidia-modprobe.x86_64                    3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
nvidia-persistenced.x86_64                3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64                    3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64                3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64           3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64      3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64        3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64           3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64          3:555.58.02-1.fc40 @rpmfusion-nonfree-nvidia-driver

I’m not sure if I missed a step somewhere from the RPM Fusion website.

It would appear that somehow you have managed to update the kernel and the firmware packages, but that for some reason you do not have the latest nvidia driver version which is 560.35.03…

My suggestion would be to find out why the nvidia driver was not updated (unless that was deliberate) and update it to the latest version. Then find out if the update works any better.

The nvidia driver may be locked to prevent updating in the file /etc/dnf/dnf.conf or in the respective repo file in /etc/yum.repos.d/
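A rough way to check those two locations is to search them for `exclude` or `excludepkgs` lines, which are the dnf options that hold packages back. This is only a sketch: the function name `find_dnf_excludes` is hypothetical, and it does not cover other pinning mechanisms that might be in use.

```shell
#!/usr/bin/env bash
# Print any exclude/excludepkgs lines found in the given dnf config files.
# Exit status is 0 when at least one such line exists, 1 otherwise.
find_dnf_excludes() {
    grep -HnE '^[[:space:]]*(exclude|excludepkgs)[[:space:]]*=' "$@" 2>/dev/null
}

# On a Fedora system:
#   find_dnf_excludes /etc/dnf/dnf.conf /etc/yum.repos.d/*.repo
```

No output means nothing is excluded in those files, so the reason for the older driver has to be somewhere else.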

ing123 (ing mar) September 13, 2024, 1:08pm 8

I’m sorry if this is maybe a silly question, but what is the process to update to a specific version? The RPMFusion site is kind of vague. Do i need to uninstall the driver first and then install a specific version?

No.
The update should be handled with a simple sudo dnf upgrade '*nvidia*'. Verify that the version being installed is 560.35.03. If that is not the version offered, then we need to figure out why the older version is not being replaced.

Which is why I mentioned the possibility that it may be prevented from updating in one of those 2 locations.

anotheruser (Mark) September 13, 2024, 8:49pm 10

The system has only the rpmfusion-nonfree-nvidia-driver repository enabled.
Nvidia driver version 560.35.03 is available in rpmfusion-nonfree-updates.

@ing123 please follow the instructions in "Configuration - RPM Fusion" on how to enable the rpmfusion repositories.

Thank you for the heads-up.

I have the rpmfusion-nonfree & rpmfusion-nonfree-updates repos enabled so did not realize the newer driver was not in the rpmfusion-nonfree-nvidia-driver repo.

Yes, to get a driver newer than the 555 version one must enable the other repos as shown in the rpmfusion config page then an upgrade will pull in the 560 drivers.

Thanks again.
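For reference, enabling the full rpmfusion repos as described above usually comes down to installing the two release packages and then upgrading. The sketch below builds the package URLs in the form used on the RPM Fusion configuration page; the helper `rpmfusion_release_url` is a made-up name, and the commented commands should be checked against that page before running them.

```shell
#!/usr/bin/env bash
# Build the URL of an RPM Fusion release package.
#   $1 = "free" or "nonfree", $2 = Fedora release number (e.g. 40)
rpmfusion_release_url() {
    printf 'https://mirrors.rpmfusion.org/%s/fedora/rpmfusion-%s-release-%s.noarch.rpm\n' \
        "$1" "$1" "$2"
}

# On the affected machine (rpm -E %fedora expands to the release number):
#   rel="$(rpm -E %fedora)"
#   sudo dnf install "$(rpmfusion_release_url free "$rel")" \
#                    "$(rpmfusion_release_url nonfree "$rel")"
#   sudo dnf upgrade --refresh '*nvidia*'
#   dnf repolist --enabled | grep -i rpmfusion
```

After the install, the last command should list rpmfusion-nonfree and rpmfusion-nonfree-updates alongside the nvidia-driver repo.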

anotheruser (Mark) September 13, 2024, 9:45pm 12

I think the rpmfusion-nonfree-nvidia-driver repo would be an ideal place to distribute only the Nvidia recommended drivers, which today are version 550.107.02.

The 555.* and 560.* drivers are either BETA or NFB (the short-lived new feature branch).

computersavvy (Jeff V) September 13, 2024, 11:04pm 13

Why the restriction?
Nvidia does not have the same widespread access to fedora users that is seen by the rpmfusion repos.

Even beta software needs testers to verify usability and having it available on fedora seems an excellent way to have many users and systems that run the driver and test it to prove stability and performance.

Fedora is a leading edge distro with frequent updates so it is an excellent test bed for even beta software. (On many different hardware platforms and configurations.)

The drivers do not get placed into the driver repo until they have been run for some time by users and stability has been proven. They appear in the updates-testing repo (and apparently the updates repo) earlier. It is up to the user to decide which version to install and use.

Personally I have 3 systems.
A laptop with a GTX 1650 card, a desktop used as a server with 2 GTX 1050 cards, and my daily driver with an RTX 3050 card. All are running the latest kernels and the latest 560 drivers from rpmfusion with no problems anywhere.

I am actually planning to upgrade my laptop to fedora 41 which is about to be released as Beta. My f41 VM has encountered no issues.

In point of fact, on my f41 VM I see this:

$ dnf  list akmod-nvidia 
Updating and loading repositories:
Repositories loaded.
Available packages
akmod-nvidia.x86_64 3:560.35.03-1.fc41 rpmfusion-nonfree
akmod-nvidia.x86_64 3:560.35.03-1.fc41 rpmfusion-nonfree-nvidia-driver

so f41 will only have the 560 drivers and newer for current versions of the nvidia cards…
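To see at a glance which repo offers which driver version, the `dnf list` output can be reduced to "repo version" pairs. This is a small sketch that assumes the three-column layout shown above (name.arch, epoch:version-release, repo); `versions_by_repo` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Summarise `dnf list` output for one package as "repo version" pairs.
# Reads the listing on stdin; $1 is the package name without the arch suffix.
versions_by_repo() {
    awk -v pkg="$1" '
        # Package lines look like: name.arch  epoch:version-release  repo
        index($1, pkg ".") == 1 && NF >= 3 {
            ver = $2
            sub(/^[0-9]+:/, "", ver)   # drop the epoch, e.g. "3:"
            sub(/-[^-]*$/, "", ver)    # drop the release, e.g. "-1.fc41"
            print $3, ver
        }'
}

# Example:
#   dnf list --showduplicates akmod-nvidia | versions_by_repo akmod-nvidia
```

Running it on an f40 system with all the rpmfusion repos enabled should show whether 555 and 560 come from different repos, which is what tripped things up earlier in this thread.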

anotheruser (Mark) September 14, 2024, 11:44am 14

Why force it on every user on a stable release f39/ f40?
I have no objections to distribute beta and nfb drivers on beta (f41) or rawhide (f42) but
NFB drivers should be an opt-in ( e.g. an extra-nfb repo ) for stable releases like f39/f40.
An opt-in then also has an easy opt-out: disable the extra-nfb repo, remove the drivers, and reinstall. This can all be documented.

And yet there are many reports on nvidia forums to be found. See also the other topic with proton/wine failing on an Optimus system with a 2nd ext. monitor.
NFB drivers had major issues last(?) year when they would not work at all with desktops configured at refresh rates >90Hz (my desktop is @144Hz). That was the point when I started to build my own rpms for the stable recommended drivers when NFB drivers hit the rpmfusion repos.

It’s impossible to catch all issues with NFB before push to repo. For the nvidia-driver repo, it’s about the initial user experience. I guess @ing123 would have gotten his system up and running with 550.107.02 in no time.

I also currently have 560* installed because I had time and was curious about the explicit-sync Wayland support. So I opted-in, I wouldn’t have if I needed a stable system because of work, RL etc.

No one is ‘forced’ to use the nvidia drivers (they can stay with nouveau) or to upgrade (they can choose to avoid upgrading the version in use). They can even choose to use hardware with an nvidia GPU or with a different GPU.

We see kernel upgrades frequently and are given the same choices. Upgrade or don’t upgrade. Actually that applies to every piece of software in fedora (and every other linux distro).

It is all up to the user to select what they choose to do.

If the option is not available we are taking the choice away from the user and making the decision for them.

Deciding what a user is allowed to do is more within the realm of Apple or Microsoft than the FOSS world. Both those sources design and release their software in such a way that the only choice a user has is to use the OS or not use that OS. A user cannot choose which pieces of the OS software works best for their needs.

Every user has the choice to do exactly what you state. Nothing is forced.

It is impossible to catch all issues ever.
The myriad of hardware and software configs mean there are always some specific situations that are unanticipated and may present issues. Your note about the many reports is noted but is severely skewed in the negative. To be fully cognizant of the situation you should realize that for every single problem reported there are many thousands that have no issues at all.

Many bugs can only be identified by using the software then identifying the conditions where a problem is seen. These are the ‘edge’ cases that may not ever otherwise be found.

Exactly, and that is why the drivers must be available as soon as reasonably possible, and why it is still 100% user choice to use, to upgrade, or not.

Your argument about being forced to upgrade is undermined by the discussion in this thread. Other versions of the drivers are available and as the OP found out, for their system the older driver was more stable.

leigh123linux (Leigh Scott) September 14, 2024, 2:53pm 16

Why would anyone bitch about rpmfusion providing the NFB (560xx) for an unstable distro like fedora.
If you want production quality, use RHEL or Debian.

anotheruser (Mark) September 14, 2024, 8:25pm 17

‘Forced’ is of course an exaggeration! I could have also said ‘blessed’ with new drivers. 🙂

I feel like we have completely different ideas of what a “regular” or “average” user is.

  1. A default Fedora Workstation installation has automatic updates enabled! Assume that most users will not change that. When are those users supposed to review updates!? They get notified when new updates have been installed and will reboot!
  2. user was probably directed to rpmfusion repositories to set up his nvidia GPU.
    Assume rpmfusion-free and rpmfusion-nonfree repositories are enabled (and the appropriate *-updates repositories)

Those users are not interested and not aware of what kind of drivers they have installed. All they want is a stable system and are happy not to use Windows any more.

If more advanced users are wondering why they do not get the ‘shiny’ new drivers, then they can be directed to the optional NFB repository. Simple.
The rest will happily continue to work with the stable branch and migrate to the new drivers as soon as they are considered stable/recommended by upstream.

In the meantime, the regular rpmfusion repositories will continue to receive updates for the stable recommended nvidia drivers as they become available.

The current approach does not serve this type of users very well.

They are released, but only to the interested party who want to try these kind of drivers. That’s the whole point.

Very mature response. Thumbs up!

I checked https://fedoraproject.org/ twice and I don’t see any statement that this is an unstable distro. I should probably ask the admins to put a big warning on the front page.

leigh123linux (Leigh Scott) September 14, 2024, 9:12pm 18

Feel free to submit a production branch review request to address this issue.

https://bugzilla.rpmfusion.org/show_bug.cgi?id=7040

FTR rpmfusion provides the latest versions to maintain compatibility with the cuda repo.

https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/

NOT true!
They either are using the default (nouveau) driver (in which case they may not know) or they knowingly have installed the nvidia drivers (in which case they certainly know it is installed since they had to install it manually)

Exactly, and when made available it is up to the user if they choose to use the newer version.

It is the user’s choice, and if the newer driver is not made available as soon as reasonably possible then there are (and have been) users who complain about the delay in having it available.

I have seen comments like “it was just released by nvidia, why is it not available for fedora”. You seem to want only the most stable but the majority seem to want the latest and greatest instantly.

💯

@anotheruser
This has gone off topic for the thread and should stop here.
Discussion of the driver on the repo is not related to booting.
Please start your own thread if you wish to continue this discussion about the drivers and when they are placed into the repo.

ing123 (ing mar) September 17, 2024, 10:37pm 20

Hi, just to follow up on this thread: the motherboard still posts with Q-Code 00 after updating to the latest drivers (560), but I haven’t experienced any freezing and it seems like workloads are running fine.

I’m not sure what else to check, and I’m not sure how related this is to Fedora anymore. Thanks for all the help!
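One more thing that could be checked after the upgrade is that the loaded kernel module really is the 560 driver. The sketch below pulls the version out of an NVRM banner line like the one in the dmesg output earlier in this thread. It assumes the banner contains "Kernel Module" followed by the version, as it does for the proprietary module shown here; `nvidia_version_from_banner` is a made-up name.

```shell
#!/usr/bin/env bash
# Pull the driver version out of an NVRM banner line read from stdin.
nvidia_version_from_banner() {
    sed -nE 's/.*Kernel Module +([0-9]+(\.[0-9]+)+).*/\1/p' | head -n 1
}

# On the machine itself:
#   nvidia_version_from_banner < /proc/driver/nvidia/version
#   sudo dmesg | nvidia_version_from_banner
#   sudo journalctl -b -p err     # errors logged during the current boot
```

If the reported version matches the installed packages and the journal shows no errors for the current boot, the remaining 00 code is less likely to be a driver problem.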