Anyone Experiencing SSD corruption with Intel Motherboard Rapid Storage Technology RAID 1?

October 9, 2024, 8:30pm 1

We have multiple HP Z2 G9 workstations running Windows 11 Pro and using Intel RST motherboard RAID in RAID 1 configuration. We use Western Digital Black SN850 and Samsung 990 Pro SSDs. Over the past few months, we have experienced multiple problems, to the point of being chronic. There are too many bad SSDs for this to be an SSD issue.

Problems:

  1. After the computer sits overnight, it says “Boot Device not found”. After power cycling or even soft booting, the computer boots normally.
  2. Intel RST reports SMART errors on one or both drives.
  3. Intel RST reports one of the drives as “Unidentified” after a few days or weeks.

One computer has been in service since last November, and only in the past couple of weeks has it started showing the “Boot Device not found” error. It restarts normally after a soft boot. Could this be a driver issue with the latest drivers?

We have sent multiple SSDs in under warranty to WD and Samsung. Finally, on one of the RMA repairs, Samsung reported that the RAID system had corrupted the drive.

All drivers are up to date from HP, including BIOS and Intel RST. All computers have 14th-gen i7 processors.

Any help would be appreciated since we are at a standstill.

Rod-IT (Rod-IT) October 9, 2024, 9:18pm 2

In the BIOS, navigate to Advanced > System Options and uncheck “Configure Storage Controller for VMD”. This has been known to resolve some RAID-related issues on HP Z2 G9 workstations.

Is there a requirement for RAID if you’re using NVMe? (I’m not challenging you, just asking in case setting it to AHCI for stability is an option. That will require an OS reinstall.)

VMD relies heavily on IRST, so if there are any issues with it or with the version, this could cause instability, especially with your hardware. Disabling it might reduce the overall throughput, but that is likely not noticeable with NVMe.

Confirm the IRST drivers are the latest from Intel. Those that come with Windows Update are not always that great and are often old ones.

Be aware that if the CPU is an Intel 13th or 14th gen, these are known to have stability issues, so if the system is crashing or rebooting unexpectedly, this could be the cause of your drive corruption too.

Rod-IT (Rod-IT) October 9, 2024, 9:21pm 3

Michael4000 (Michael4000) October 9, 2024, 11:27pm 4

Rod,

I appreciate the suggestions.

I tried disabling VMD in the BIOS, and that disabled the Intel Rapid Storage Technology RAID. In fact, it was no longer listed in the BIOS as a Third Party program. Windows also would not boot. It went to a blue screen or went into Windows Repair.

I turned VMD back on, and set RST to rebuild the RAID.

I am using the latest drivers for RST from HP. I went to the Intel site, and they recommended using the drivers from the computer manufacturer rather than their site due to customizations that the manufacturers make.

We are using RAID 1 for disk mirroring (failover), rather than performance.

adrian_ych (adrian_ych) October 10, 2024, 4:07am 5

Why not consider using hardware RAID instead, as it gives a better redundancy option?
Also, are the disks certified by HP? There are known issues and/or false positives when using some non-HP-certified components.

Rod-IT (Rod-IT) October 10, 2024, 10:33am 6

I am familiar with the RAID types. I was curious whether you actually need it, especially given it’s causing you issues. I’d sooner have stability and a good backup than redundancy and constant issues. That’s all my point was.

dwhipps (Dwhipps) October 11, 2024, 4:46pm 7

It may not be much of a thing, but is there any particular reason you all are using such different SSDs as well? I know the Samsung are likely much higher throughput and the WDs are almost certainly decidedly slower, though at worst that should just be a performance issue (you only get the speed of the slowest member device, or probably a little bit less here since there’s also the RAID overhead involved).

However, I do recall having read in the past that SSDs in RAID 1 especially have a problem with interfering with wear leveling. The way RAID 1 functions can/often-does prevent normal wear leveling from occurring because it’s duplicating everything from drive to drive below the level that software can typically manage the wear leveling effectively. This of course depends on the specific hardware in use and how good for instance the controller handling the RAID is at working around that issue, but I’ve only heard that issue being a thing on RAID 1 specifically because of the explicit duplication - something that RAID 10 can work around since it is both spanned and duplicated, which allows for more flexibility on a drive-by-drive basis if I’m remembering correctly.

Michael4000 (Michael4000) October 11, 2024, 8:54pm 8

Rod,

We use RAID 1 because of the amount of time it takes us to set up a new computer from scratch these days, by the time we install all the apps, updates, security and settings. We are not worried about the data since it is stored on the servers, including the Windows Desktops. It also keeps the employee working if a drive fails, and we are able to replace it more at our leisure. It avoids emergencies. We have been doing it for years.

Rod-IT (Rod-IT) October 11, 2024, 9:37pm 9

All of what you said is fine - and I’ll repeat it, I’m not challenging you, but I’m not sure this is the best option if you’re having issues with stability.

An external USB drive with Veeam Endpoint Backup would likely be just as valuable. VEB is free, a backup would also give you a way to go back in time, and restoring it to new drives can be done from recovery media. Sure, it will take a little longer, but with SSD life remaining reports you should be able to judge when this time is coming, outside of physical failure. VEB can also back up to a share if that’s easier and less hassle.
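
As a rough illustration of reading an “SSD life remaining” figure, here is a minimal Python sketch. It assumes smartmontools is installed and that `smartctl -j -a` can actually see the NVMe drive, which may not be the case for disks sitting behind Intel RST/VMD. The helper name `nvme_health_summary` is made up for this example, and it only parses JSON text that has already been captured.

```python
import json


def nvme_health_summary(smartctl_json: str) -> dict:
    """Summarize wear and error counters from smartctl's JSON output.

    Assumption: the text comes from `smartctl -j -a <device>` run against
    an NVMe drive, so it carries the NVMe SMART/Health log section.
    """
    report = json.loads(smartctl_json)
    log = report.get("nvme_smart_health_information_log", {})
    used = log.get("percentage_used")  # vendor estimate of life used, in percent
    return {
        # percentage_used may exceed 100, so clamp the remainder at zero
        "life_remaining_pct": None if used is None else max(0, 100 - used),
        "media_errors": log.get("media_errors"),
        "critical_warning": log.get("critical_warning"),
    }
```

Feeding it the saved output for each member of the mirror gives a quick side-by-side view of wear and error counts, which is enough to plan a replacement before a drive reaches the end of its rated life.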

I used to run dual NVMe on my personal setup, but it just got to be too much of a PITA to deal with for varying reasons (especially if you’re not using dedicated controllers).

While I am still not telling you what to do, I am offering you options, in case one sounds good.

Why not look at having the apps on RDS servers? Multiple people can share them, and there is less drama if a PC fails.

One last thing: as mentioned, it’s not good to mix brands, though I expect you meant those are the brands you use, not brands you are mixing (mixing won’t be helping with your issues). But if both drives are installed at the same time and both ‘wear’ at the same rate, since they are mirrors, they will both fail within the same window, unless they have different TBW values.

Potentially, you have the same risks, but in both drives at the same time, as they are both writing the same data.
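
To put rough numbers on the point about mirrored drives wearing out together, here is a small Python sketch. The function name and every figure in it are illustrative assumptions, not measurements from these workstations, and a TBW rating is a warranty endurance figure rather than a prediction of when a drive will fail.

```python
def days_until_tbw(tbw_rating_tb: float, written_tb: float, daily_writes_gb: float) -> float:
    """Estimate the days left until a drive reaches its rated TBW.

    All inputs are hypothetical; 1 TB is treated as 1000 GB.
    """
    if daily_writes_gb <= 0:
        raise ValueError("daily_writes_gb must be positive")
    remaining_tb = max(tbw_rating_tb - written_tb, 0.0)
    return remaining_tb * 1000.0 / daily_writes_gb


# In RAID 1 both members receive the same writes, so two new drives with
# the same rating reach it on the same day.
drive_a = days_until_tbw(tbw_rating_tb=600, written_tb=0, daily_writes_gb=50)
drive_b = days_until_tbw(tbw_rating_tb=600, written_tb=0, daily_writes_gb=50)

# A member with a different rating reaches it at a different time.
drive_c = days_until_tbw(tbw_rating_tb=300, written_tb=0, daily_writes_gb=50)
```

With these made-up inputs, `drive_a` and `drive_b` come out identical while `drive_c` is half as long, which is the “same window unless the TBW values differ” argument in arithmetic form.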

I expect if the drives are mixed, this is for you to know which one has failed (when that time comes). Bear in mind they will probably have different controllers, different features and firmwares, which may not be 100% compatible with the other in this setup.

Michael4000 (Michael4000) October 11, 2024, 10:12pm 10

Dwhipps,

In the tests I have seen, the WD Black SN850s are right in there performance-wise with the Samsung 990 Pros. Since most of these are early failures, I can’t imagine wear leveling has factored in yet. I agree it could be an issue down the line, but surprisingly, we haven’t seen a problem even on the older computers.

Update: I should have made it clear, we don’t mix drive brands and models on a computer RAID.

Michael4000 (Michael4000) October 11, 2024, 10:26pm 11

Rod,

Thank you for all of the ideas.

We don’t mix drive brands or models on a particular computer. They are matched 100%. They are both brand new, too.

We have done RDS for a couple of accounting apps, but we got complaints about performance. Everyone started requesting native apps.

Thank you for the tip on Veeam. I don’t think it will work for most of my customer computers, because they need realtime failover, but I may use it for our office computers here.

UPDATE: I took out one of the Samsung SSDs that showed “Unidentified” in Intel RST. This computer had only been running for about a week. The other drive worked fine. Intel RST just showed the RAID degraded. Rather than sending the drive to Samsung under warranty, I thought I would see if I could repair the drive myself.

I put the SSD in another computer that had running RAID with no issues. I put it in the third slot of an HP Z2 G9. Intel RST identified the drive I put in as another RAID that showed “Failed”. I deleted the failed RAID. I then started Diskpart from the Command Prompt, and ran the Diskpart Clean command to remove the partitions. I installed the cleaned drive back into the original computer. I told Intel RST to repair the RAID by loading the repaired drive. I started Windows, and the RAID rebuilt to 100%. It has been running fine for the past day. We’ll see what happens.

Michael4000 (Michael4000) October 12, 2024, 5:56pm 12

Adrian,

Thank you for the suggestions.

I have looked at hardware RAID. Other than cost, the only holdback I have is that if I put in a RAID card that is not approved by HP, HP won’t warranty the system. I would have to pull the RAID card out and put the original SSD back in. I am going to look into this further. Maybe there is an HP one I can use.

As you mentioned, the use of non-HP SSDs could be related. One problem I am running into is HP has discontinued retail sale of their SSDs, and I am adding an SSD to the one that comes with the computer. They are still available through their repair parts channel, but they want almost $300 for a 512 GB plain old TLC SSD.

adrian_ych (adrian_ych) October 12, 2024, 7:07pm 13

There are always 3 sides to every coin…

Is your work, or the work done on these machines, worth the money? Especially for a company, what is the opportunity cost of spending even $1K in the worst case versus losing all the data or pure downtime (getting a new SSD, then reinstalling and/or recovering files, etc.)? The company turnover is easily in the tens or hundreds of millions, regardless of whether it is making a profit or a loss.

If the HP warranty does not cover it (I know Dell does not bother, as they replace faulty Dell parts even if third-party hardware is added, except for SAN), then why not contact the vendor and get HP parts?

Otherwise, maybe consider using the “free” Veeam Agent for Windows to perform a full backup (with the Veeam bootable DVD for Bare Metal Restore) and run your own backups? I would not risk using consumer SSDs at all in that sense.

dwhipps (Dwhipps) October 14, 2024, 2:33pm 14

FWIW, I did a quick search and saw some refurbished HP PCIe RAID controllers (HPE P410) for around $50. I saw some random sites selling them, and I also saw some on Walmart’s website (though probably just marketplace sellers listing through Walmart). Do you think those might be viable for your use case?

Rod-IT (Rod-IT) October 14, 2024, 7:28pm 15

Unfortunately not. That card supports SAS and SATA, and the OP’s drives are NVMe.

Michael4000 (Michael4000) October 16, 2024, 7:45pm 16

UPDATE: We have not had any additional Samsung SSDs showing up as “Unknown hard disk” in Intel RST after I removed the RAID flag, cleared the partitions, and rebuilt the RAID. We did have one computer this week that came up with “Boot Device Not Found” after sitting running over the weekend, but the computer booted normally after a soft boot. The other computer that was giving us this problem now says “Unknown hard disk (0 bytes)”. I’ll have to go down there and clear this disk.

Michael4000 (Michael4000) October 21, 2024, 10:42pm 17

I have another update. I just built another HP Z2 G9 with RAID 1. I got the operating system and drivers installed. Everything is running normally.

I intentionally broke the RAID by taking one of the disks out. I put the disk in an external USB to NVMe adapter. I plugged it into my laptop, and ran the following commands:

  1. Open Command Prompt as Administrator.
  2. Diskpart
  3. List Disk
  4. Select Disk X, where X is the “unrecognized disk” from the computer.
  5. Clean
  6. List Partition (to confirm partitions erased)
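
The Diskpart steps above could be scripted roughly like this in Python. This is only a sketch: the function names are hypothetical, the disk number still has to be looked up first with List Disk, and Clean is destructive, so the run helper refuses to do anything unless `confirm=True` is passed. It relies on Diskpart’s `/s` switch for running a script file and has to be started from an elevated prompt on Windows.

```python
import subprocess
import tempfile
from pathlib import Path


def build_diskpart_script(disk_number: int) -> str:
    """Return the Diskpart commands from the steps above as script text."""
    if disk_number < 0:
        raise ValueError("disk_number must be non-negative")
    lines = [
        f"select disk {disk_number}",  # the disk identified earlier via 'list disk'
        "clean",                       # removes all partition information from that disk
        "list partition",              # confirm that no partitions remain
    ]
    return "\n".join(lines) + "\n"


def run_diskpart_clean(disk_number: int, confirm: bool = False) -> None:
    """Run the script with 'diskpart /s'. DESTRUCTIVE: wipes the selected disk."""
    if not confirm:
        raise RuntimeError("refusing to clean a disk without confirm=True")
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "clean_disk.txt"
        script.write_text(build_diskpart_script(disk_number))
        # Diskpart must be run as Administrator; check=True raises on failure
        subprocess.run(["diskpart", "/s", str(script)], check=True)
```

Printing `build_diskpart_script(2)` first and checking the disk number against the List Disk output is a sensible habit before calling `run_diskpart_clean(2, confirm=True)`, since selecting the wrong disk would wipe it.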

I then put the SSD back in the computer, started Intel RST in the BIOS, and told it to rebuild the RAID with the cleaned disk. I started Windows and opened the Intel RST program. I saw it was rebuilding the RAID, and it completed successfully.

At least now I can correct these problems on the Samsung disks out in the field without having to bring the SSD back to my office.

If anyone is interested, I used a StarTech M.2 SATA/NVMe SSD Enclosure, part number M2-USB-C-NVME-SATA. Note that SSDs with heatsinks will not fit in the enclosure, but for my temporary purposes that is fine. I just left the tray out during my operation.

Michael4000 (Michael4000) February 9, 2025, 12:17am 18

Here is an update.

HP came out with a new BIOS and new Intel RST drivers on January 7. Unfortunately, they did not correct the problem. The issue is with the latest Samsung 990 Pro and WD SN850 SSDs. I have posted on the Intel forum multiple times, and the “Intel representatives” ask questions, you post huge amounts of information, and then they just disappear after a while. They obviously don’t want to bother with this. This is awful of Intel. It really leaves a bad taste in my mouth after being a customer of theirs for decades. First their processors burn out, and now this. I can see why their stock is in the toilet.

I contacted HP Workstation Support. They gathered quite a bit of information, and will work on the issue. I am keeping my fingers crossed that they can come up with a solution.

spiceuser-8s80 (spiceuser-8s80) February 9, 2025, 4:23am 19

It seems like there may be a deeper issue with the RAID configuration or compatibility. Here are some potential causes and solutions:

  1. Intel RST Driver Issue: Since the problem started after driver updates, it’s possible the latest Intel RST driver is causing instability. Try rolling back to a previous driver version to see if it resolves the issue.
  2. RAID Configuration Corruption: RAID corruption can cause the “Boot Device not found” error. Consider rebuilding the RAID array or testing with a single drive to isolate whether the issue is with the RAID setup itself.
  3. Power Supply or Cable Issues: Ensure the power supply and cables are stable, as intermittent power could be causing the SSDs to lose connection.
  4. Check SSD Firmware: Ensure both WD and Samsung SSDs are running the latest firmware, as some RAID controllers have compatibility issues with specific firmware versions.
  5. Test with Different RAID Setup: If possible, test the system with a non-RAID configuration (e.g., AHCI) to determine whether the RAID 1 setup is causing the issue.

If the issue persists, it may be worth testing with a different RAID controller to rule out hardware issues with the Intel RST.

Michael4000 (Michael4000) February 11, 2025, 10:46pm 20

Thank you for your ideas here.

  1. I have used the current and previous Intel RST drivers, and there was no change.
  2. I have rebuilt the RAID mirroring. The next day or after a few days, the boot issue reappears.
  3. We have had this issue on 10 different HP Z2 G9 computers, and I believe the power is reasonably stable at all locations.
  4. I have checked the firmware on the WD and Samsung drives, and the issue persists even with the latest firmware.
  5. Unfortunately, I don’t have the opportunity to test the system in non-RAID configurations since these are production workstations and need the RAID for failover redundancy.

I am in the process of building some variations on the original builds by using the HP branded Windows rather than the Windows from the Microsoft site. I am building one today with HP branded 512 GB SSDs. The next one will be with WD Black SN850 512 GB SSDs.