Confusing CRC errors...

Status
Not open for further replies.

Plato

Contributor
Joined
Mar 24, 2016
Messages
101
Hi,

I have a weird problem, and I don't understand how it occurs.

My mainboard is an ASRock C2750D4I with an LSI SAS2008 card, which is detected as a SAS9211-8i. I also have an Intel expander connected to the card, and 20 disks are currently connected to the expander. I'm using 12 of them in an active RAIDZ2 pool; the other 8 are not assigned to any pool yet.

The disks I'm using in the pool are all new WD Red 6 TB disks.

I'm using the pool mainly for media storage (TV shows and movies). I use sabnzbd or deluge for downloading, and sickrage and plex for cataloging.

One day, deluge reported an error in a torrent of around 500 GB (a single torrent, mind you). I'm not sure, but it was around 80-90% complete before the error occurred. When I rehashed the torrent, it said it was now at 0.04%. Of course I was confused, because as far as I know the torrent contained about 200 files. When I checked the directory, I found only three files. All the other files were gone.

Another day, sabnzbd said it could not write a parity file to the disk, which I don't understand, because it could write another parity file into the same directory just fine. When I looked it up, the usual cause of this error is that the path plus filename is too long for the filesystem. That's why I checked another parity file.
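For anyone who wants to rule out the "path plus filename is too long" explanation, here is a minimal sketch of such a check. It assumes the usual FreeBSD limits of 255 bytes per filename (NAME_MAX) and 1024 bytes per full path (PATH_MAX); the function name and the example paths are hypothetical and not taken from sabnzbd.

```python
import os

# Assumed limits (FreeBSD defaults): NAME_MAX excludes the terminating NUL,
# PATH_MAX includes it, so the longest usable path is PATH_MAX - 1 bytes.
NAME_MAX = 255
PATH_MAX = 1024


def path_length_problems(path, name_max=NAME_MAX, path_max=PATH_MAX):
    """Return a list of human-readable problems with the byte length of `path`."""
    problems = []
    # Limits apply to bytes on disk, so non-ASCII names count for more than
    # their character length suggests.
    total = len(os.fsencode(path))
    if total >= path_max:
        problems.append(f"full path is {total} bytes (limit {path_max - 1})")
    for component in path.split("/"):
        size = len(os.fsencode(component))
        if size > name_max:
            problems.append(f"component of {size} bytes exceeds {name_max}")
    return problems


# Hypothetical example path; substitute the file that failed to write.
print(path_length_problems("/mnt/tank/downloads/" + "x" * 300 + ".par2"))
```

An empty list means the path fits within the assumed limits, so the write failure would need a different explanation.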

While all of this was happening, no errors were logged.

But after these failures I scrubbed the pool. Here is the scan result:

scan: scrub repaired 508K in 59h33m with 0 errors on Mon Nov 6 11:46:40 2017

As you can see, there were some problems, but they were fixed. I also saw some CAM errors, and three disks showed as repairing while the scrub was running. But in the end there are no read/write or checksum errors in the status, most probably because the parity information was enough to fix those errors. I should mention, BTW, that I'm using 32 GB of ECC RAM, so I don't think this is caused by bad RAM.
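To keep an eye on scrubs like this one, the summary line can be parsed and flagged whenever data was repaired, even if the error count is zero. The sketch below only handles the exact wording of the line quoted above; the format of the `scan:` line differs between ZFS versions, and the function names are hypothetical.

```python
import re

# Matches the scrub summary wording quoted above; other ZFS versions
# phrase this line differently and would need their own pattern.
SCAN_RE = re.compile(
    r"scrub repaired (?P<repaired>\S+) in (?P<hours>\d+)h(?P<minutes>\d+)m "
    r"with (?P<errors>\d+) errors on (?P<finished>.+)$"
)


def parse_scrub_line(line):
    """Return a dict of fields from a scrub summary line, or None if no match."""
    match = SCAN_RE.search(line.strip())
    if match is None:
        return None
    return {
        "repaired": match.group("repaired"),
        "duration_minutes": int(match.group("hours")) * 60
        + int(match.group("minutes")),
        "errors": int(match.group("errors")),
        "finished": match.group("finished"),
    }


def needs_attention(summary):
    """Flag scrubs that repaired any data or reported unrecoverable errors."""
    return summary["errors"] > 0 or summary["repaired"] not in ("0", "0B")


line = "scan: scrub repaired 508K in 59h33m with 0 errors on Mon Nov 6 11:46:40 2017"
summary = parse_scrub_line(line)
print(summary)
print(needs_attention(summary))
```

A scrub that repaired 508K with 0 errors is still flagged here, since repaired data means at least one disk returned bad blocks.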

What do you think? What may be the cause of the errors? How could an error like this happen?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

Plato

Contributor
Joined
Mar 24, 2016
Messages
101
I already updated my BIOS and IPMI when it came up. So, it's not that.

The thing I don't understand is that parity should fix this problem at runtime, right? Also, how could ~200 files be lost suddenly without any trace? I can only guess that it corrupted the file metadata and they disappeared like that. But it's still unbelievable. I assume I should have seen something in the log.
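To illustrate the runtime repair being asked about: ZFS verifies a checksum on every read and, when it does not match, rebuilds the block from redundancy. The sketch below is a toy single-parity XOR model of that idea, not RAIDZ2's actual double-parity math; all names in it are hypothetical.

```python
import hashlib


def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def checksum(block):
    return hashlib.sha256(block).hexdigest()


# "Write": three data blocks, one parity block, and a checksum per data block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
checksums = [checksum(block) for block in data]


def read_block(index, stored):
    """Return a block, rebuilding it from parity if its checksum does not match."""
    block = stored[index]
    if checksum(block) == checksums[index]:
        return block
    # XOR of the other data blocks and the parity block yields the missing one.
    others = [b for i, b in enumerate(stored) if i != index]
    repaired = xor_blocks(others + [parity])
    if checksum(repaired) != checksums[index]:
        raise IOError("unrecoverable: parity could not rebuild the block")
    return repaired


stored = list(data)
stored[1] = b"BxBB"  # simulate silent corruption on one disk
print(read_block(1, stored))  # prints b'BBBB'
```

With one parity block the toy model survives one bad block per stripe; a second corrupted block in the same stripe raises the error instead of returning data.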
 

Plato

Contributor
Joined
Mar 24, 2016
Messages
101
I should also mention that I previously used this same board with the onboard SATA ports (filling 11 of the 12) without any problem for 2 years.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Plato said:
I already updated my BIOS and IPMI when it came up. So, it's not that.

From what I read, it was a flaw in the CPU and had nothing to do with the BIOS.

Plato said:
I should also mention that I previously used this same board with the onboard SATA ports (filling 11 of the 12) without any problem for 2 years.

How long you previously used something has nothing to do with when it will fail.
 

Plato

Contributor
Joined
Mar 24, 2016
Messages
101
There were two updates released last year: one for the BIOS, the other for IPMI. I installed both of them.

What I meant was that I previously used the same board with the onboard SATA ports. Currently I'm using onboard SATA for two mirrored SSDs, which I use for jails only. I'm now using the LSI and Intel SAS cards to connect the HDDs. It may be related to that. But as I said, even then it should log an error when something happens.

Yes, I have seen mainboards, RAM, or CPUs fail, but most of those failures were due to electrical problems caused by things like a bad PSU, a short circuit, or a sudden power failure. I run my NAS on a 2 kVA online UPS and also monitor it with FreeNAS. Also, the place I live has a generator which kicks in about 10-20 seconds after a power failure occurs. So I haven't had any problems with power.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,909
You might want to read this: https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/
Also look at this thread: https://forums.freenas.org/index.ph...2550d4i-or-c2750d4i-lasted.45445/#post-307647
Several of us here have had these boards replaced under warranty by ASRock. I cannot tell from your description whether your failure could be related.
 