Mysterious reboot and then BAM! Permanent errors and degraded volume


Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
My FreeNAS box had been running FreeNAS-11.1-U2 perfectly fine since it came out, and had been up continuously for about three months. Then, a couple of hours ago, BAM! -- it rebooted for no reason I can figure out. (I confirmed that the reboot was real via the uptime, which showed only a few minutes when I checked right afterwards.)

The box is connected to a large UPS, and there was no power-related event in any case (unless it was something that doesn't affect the lights, etc. -- but with the UPS, it wouldn't matter anyway). Furthermore, my power supply is rated for a lot more juice than I pull from it. In short, I doubt this was an electrical issue.

Anyway, while the cause of this reboot is something I'd like to eventually figure out and prevent in the future, the issue I'm dealing with right now is the errors that my main pool (5 x 10 TB WD Gold in Z2) experienced as a result of whatever this event was.

Once I unlocked the pool (it's encrypted), I got a warning that a permanent error had been found in the following file: tank/multimedia:<0x2ab15>. From researching this, it seems that a hex object ID shown in place of a filename means the error is in a filesystem object that can't be mapped to a path -- typically metadata or a deleted file.
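For anyone curious, here's a rough way to peek at what that object ID actually refers to. This is just a sketch -- I'm assuming zdb on FreeNAS 11.1 behaves like it does on stock FreeBSD and that it wants the object number in decimal (0x2ab15 = 174869):

Code:
# Dump the dnode for object 174869 (hex 0x2ab15) in the multimedia dataset.
# Read-only, run against the imported (unlocked) pool; the output is cryptic,
# but the object type line tells you whether it's a plain file or metadata.
zdb -dddd tank/multimedia 174869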

The FreeNAS system mails me nightly SMART reports, and everything has been in good order. Drive temps are between 27ºC and 32ºC, which is where they always are. All five drives had 0 errors of any type. All had perfectly fine SMART results. They've got about 2800 power-on hours. The latest scrub, which was a week ago, showed 0 errors, as always. In short, the drives were in perfect working order right up until this incident happened.
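In case it helps anyone reading along, this is roughly what those nightly reports boil down to -- a sketch only, and the device names are examples (my drives happen to show up as ada0 through ada4; adjust to yours):

Code:
# Temperature, power-on hours and the usual error counters for one drive:
smartctl -a /dev/ada0 | egrep -i 'temperature|power_on|reallocated|pending|uncorrect'
# Kick off a short self-test (check the result later with smartctl -a):
smartctl -t short /dev/ada0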

So I started a scrub. I'm using ZFS for precisely this reason, so I should have no worries! After all, it's designed to prevent errors with features like copy-on-write, and to detect and correct them with all those checksums. And I'm using Z2, which gives me a healthy margin for errors.

Well... as the scrub progresses, more and more errors are being detected. Here is where I stand right now:

Code:
$ zpool status -vx tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
		corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
		entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Jun 30 07:27:34 2018
		8.03T scanned at 1.00G/s, 6.04T issued at 771M/s, 25.8T total
		5.33M repaired, 23.38% done, 0 days 07:28:27 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0   126  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    98  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0   103  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    68  too many errors  (repairing)
            gptid/re-da-ct-e-d.eli  DEGRADED     0     0    95  too many errors  (repairing)

errors: Permanent errors have been detected in the following files:

		tank/multimedia:<0x2ab15>


Since this is an excruciatingly slow process, can anyone tell me (1) what to expect, and (2) what I should do if the scrub can't fix all the errors?

Does "5.33M repaired" mean 5.33 MB of data has been repaired, or 5.33 million errors have been repaired? ("M" is a really bad abbreviation in this context!)

Are these CKSUM errors unrecoverably high, or is this par for the course?

Any other suggestions are quite welcome!

(PS: I would love to use ECC RAM, but I live in South America and no retailer in my country sells it or ECC-capable mobos. And B2B sellers won't give me the time of day.)
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,974
I hope you have a good backup.......
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Yeah, all drives at once (if that's the case) sounds more like a bad disk controller that hates you. Are all of your disks connected to the PEX10-SAT SATA III PCIe host adapter?
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
Yeah, all drives at once (if that's the case) sounds more like a bad disk controller that hates you. Are all of your disks connected to the PEX10-SAT SATA III PCIe host adapter?

Actually, no! I use that card for my secondary zpool. All five of my damaged zpool's drives are connected directly to the mobo.

Perhaps I should add that I was doing some very intensive I/O on the main zpool when this happened -- I was streaming from Plex, downloading Linux ISOs from Usenet, downloading ReactOS ISOs from eMule, torrenting Plan9 ISOs, editing files, and unzipping a bunch of other files. I don't know if this caused the reboot (if so, something is very wrong), but my hope is that most or all of the corruption is in the temp files of everything I was downloading when the reboot happened.
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
By the way, does anyone know what the "too many errors (repairing)" message means? "Too many" would seem to indicate that something is beyond repair... but then it says "repairing"! What makes a number of CKSUM errors be "too many" if they can be repaired?
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
Well, considering you only have 8GB of RAM, this could have been a bit much and caused something to go *BANG* internally if/when it was denied memory (like the disk controller driver).
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
Well, considering you only have 8GB of RAM, this could have been a bit much and caused something to go *BANG* internally if/when it was denied memory (like the disk controller driver).

I've actually got 32 GB of RAM.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
I've actually got 32 GB of RAM.
Whoops! I missed the 4x!
I would still bet the controller freaked out. You could export your pool, drop some spare disks in there (if you can), and try to reproduce the load, or just run some heavy benchmarks while monitoring the temps on the main IC on the controller.
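Something along these lines, purely as a sketch -- the pool name, file size and device names are placeholders, and obviously don't point it at real data. (Drive temps via SMART are only a proxy; the controller chip itself you'd have to check by hand or with an IR thermometer.)

Code:
# Generate sustained write load on a throwaway pool built from spare disks.
# (urandom, because /dev/zero would mostly just exercise lz4 compression.)
dd if=/dev/urandom of=/mnt/testpool/loadfile bs=1M count=100000 &

# Meanwhile, poll drive temps every 30 seconds to see if anything cooks:
while true; do
  for d in ada0 ada1 ada2 ada3 ada4; do
    echo -n "$d: "; smartctl -A /dev/$d | awk '/Temperature_Celsius/ {print $10}'
  done
  sleep 30
done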
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
So besides running scrubs, as well as SMART tests and Memtest, what do you do in this situation?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
You wait and see if ZFS can repair it, or just start restoring from backup.

About the only thing that can explain it is memory corruption.
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
You wait and see if ZFS can repair it, or just start restoring from backup.

Does that mean run more scrubs?

Does it make sense to do zpool clear tank after a scrub and scrub again?

Also, when you say "restore from backup", what exactly do I restore? The error message says the permanent error is in tank/multimedia:<0x2ab15>, which is apparently metadata located in one of my datasets.

Could I just create a parallel dataset, move the files from the damaged one over to it, and then eliminate the damaged dataset?
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
An update and a question.

UPDATE: I've finished the first scrub and ended up with what seem like a lot of errors to me. I then ran a short SMART test of all drives, and every one of them came back perfect. I'm now running a new scrub, and the amount of repaired data and the number of CKSUM errors continue to climb slowly.

QUESTION: In addition to the protection that ZFS's built-in checksumming, copy-on-write, etc. should have been providing, and in addition to the protection that RAIDZ2 should have given, I've got tons of snapshots. Is there no way to use them to recover damaged files automatically? As in, have FreeNAS go through the zpool, find corrupt files, and, if fixing them via redundancy and checksums fails, restore the last known good copy from a snapshot?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
You should really be running a memtest.

Snapshots can help with corrupted data, if they preserved a different version of the metadata. Snapshots are cheap because they only retain what has changed. If nothing changed, there is no other copy.
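If a particular file does turn out to be damaged and an older copy survives in a snapshot, you can usually just copy it back out of the hidden .zfs directory. A sketch (the snapshot and file names here are made up):

Code:
# See which snapshots exist for the dataset:
zfs list -t snapshot -r tank/multimedia
# Each snapshot is browsable under the dataset's mountpoint, even when hidden:
ls /mnt/tank/multimedia/.zfs/snapshot/
# Copy a known-good version of a file back into the live dataset:
cp /mnt/tank/multimedia/.zfs/snapshot/auto-20180623/foo.mkv /mnt/tank/multimedia/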
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
You should really be running a memtest.

Snapshots can help with corrupted data, if they preserved a different version of the metadata. Snapshots are cheap because they only retain what has changed. If nothing changed, there is no other copy.

Memtest is my next step. Thanks.
 

kdragon75

Wizard
Joined
Aug 7, 2016
Messages
2,457
About the only thing that can explain it is memory corruption.
I'm not disagreeing but would like more information on your conclusion. I would think the disk controller would be more likely as only the disks connected to the motherboard were affected.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
Memory corruption explains the reboot. It also explains massive spreading corruption if a checkpoint got corrupted before writing to disk.

Either way, it’s a serious hardware failure which has caused your issue.

You should consider stopping the scrubs and pulling the disks before too much more happens.

It’s probably the memory though.
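For reference, stopping the scrub and taking the pool offline cleanly is just (pool name assumed to be tank, as in your output):

Code:
# Cancel the in-progress scrub:
zpool scrub -s tank
# Export the pool cleanly before physically pulling the disks:
zpool export tank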
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
Memory corruption explains the reboot. It also explains massive spreading corruption if a checkpoint got corrupted before writing to disk.

Either way, it’s a serious hardware failure which has caused your issue.

You should consider stopping the scrubs and pulling the disks before too much more happens.

It’s probably the memory though.

Stux wins the prize this round!

Turns out one of my 4 sticks of RAM was just fried to hell. Not literally, but when I ran Memtest it started erroring out immediately, and the test stopped itself after about two minutes, alleging "Too many errors!".

I really wonder how that happened (especially without anything else being defective).
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,367
So, did you lose the data?

If so, you just proved the “scrub of death”
 

Stranded Camel

Explorer
Joined
May 25, 2017
Messages
79
Well, the data on my NAS falls into two categories: (1) irreplaceable and (2) stuff I'd rather not lose, and may not have another copy of, but whose loss would not result in tears.

All of (1) is on my main computer, my NAS (backed up manually), my lab computer (backed up automatically with Dropbox sync), three 8 GB external drives (backed up manually and kept in two different parts of town), plus Dropbox and Google Drive (using rclone and encryption). So I didn't even bother to check what state that stuff was in -- I just deleted the relevant datasets from FreeNAS.

(2) is made up mostly of movies and TV shows. Some are old and in foreign languages and have been very hard to collect. So I started backing them up when I saw my FreeNAS failing. I got most of them off, but as time went on, more and more errors popped up. What worries me is that even when errors did occur, and FreeNAS warned me about them, it let rsync copy the files off the NAS -- I would have expected these errors (e.g. "no data found for foo.bar.mkv") to prevent the files from being copied.

Where I got well and truly fucked was with my VM (Ubuntu on bhyve) and my iocage jails. I left them for last, for some dumb reason, and by the time I was able to try to back them up, FreeNAS would reboot every time I tried to read from them. And then it got to the point where even trying to decrypt my zpool would cause FreeNAS to reboot.

So now I'm left trying to recreate some very complicated setups, with permissions issues it took days to resolve the first time around. Easily 15-20 hours of work.

On the good side, this has given me a chance to reorganize some stuff that was in odd places for historical reasons. And it's also given me the opportunity to install an SSD for VMs, jails and temp download files from things like Usenet. If you use any of these things more than casually, an SSD is highly recommended -- the performance increase is amazing.

Questions for the masses:
  1. We've got SMART to detect hard drive failures. Is there no equivalent for RAM? (And ECC is not available to me, I'm afraid.)
  2. Does anyone know of any program that can scan video files and detect corrupt ones?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
ECC is the answer. There may also be a special tunable you can turn on that causes ZFS to do extra checking in memory, but it can't provide guarantees.

Did you run a scrub after you removed the bad RAM, and still have errors? If so, then you know the pool was really damaged. If not, it is possible you had no stored damage and the errors were happening in real-time only.
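If you still have the pool (or for anyone else reading this later), the check is straightforward -- sketch only, using the pool name from this thread:

Code:
# With the bad stick removed and memtest passing:
zpool clear tank      # reset the old error counters
zpool scrub tank      # re-scrub with known-good RAM
# ...then, once it finishes:
zpool status -v tank  # any errors that remain were really written to disk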

As for the scrub of death, it can't exist. The worst case is what may have happened here: corruption in RAM of metadata that then becomes a permanent part of the pool.
 