CAM Status Medium Error, still/again...

BlueMagician · Dec 23, 2017

Dear all,

I've posted a thread regarding this before, but that was towards the start of the year, so figured I'd create a new one.

75% of the time, during (or just after) a scrub, I am seeing the following errors:

Code:

[root@freenas] ~# zpool status
  pool: Chamber1
 state: ONLINE
  scan: scrub in progress since Sat Dec 23 01:00:01 2017
        15.6T scanned out of 21.9T at 504M/s, 3h37m to go
        32K repaired, 71.29% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Chamber1                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/7c55507f-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/01b99fdb-bf4c-11e7-8ea3-001e67aa46b9  ONLINE       0     0     0
            gptid/7dd4abd6-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/7e95631f-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/7f54f268-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0
            gptid/80137822-006a-11e5-9af6-001e67aa46b9  ONLINE       0     0     0  (repairing)

errors: No known data errors
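
For anyone wanting to pick the affected member out of that table automatically, here is a minimal Python sketch. It assumes the `zpool status` row layout shown above (name, state, three plain-integer counters, optional note in parentheses); `ROW` and `flagged_devices` are hypothetical names, not part of any ZFS tooling.

```python
import re

# One device row of `zpool status`: name, state, READ/WRITE/CKSUM counters,
# and an optional trailing note such as "(repairing)".
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    r"(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+)"
    r"(?:\s+\((?P<note>[^)]*)\))?\s*$"
)

def flagged_devices(status_text):
    """Return rows that have a non-zero counter or a trailing note."""
    flagged = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, "state:", "scan:", blank lines, etc.
        counters = [int(m[k]) for k in ("read", "write", "cksum")]
        if any(counters) or m["note"]:
            flagged.append((m["name"], m["state"], counters, m["note"]))
    return flagged

# Two rows taken from the output above, plus the header line.
sample = """\
  NAME  STATE  READ WRITE CKSUM
  gptid/7f54f268-006a-11e5-9af6-001e67aa46b9  ONLINE  0  0  0
  gptid/80137822-006a-11e5-9af6-001e67aa46b9  ONLINE  0  0  0  (repairing)
"""
print(flagged_devices(sample))
```

On FreeNAS, `glabel status` lists which partition each gptid label belongs to, which helps match a flagged row to the `daN` device named in dmesg. Counters that zpool abbreviates (such as `1.2K`) are not handled by this sketch.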

And via email or in DMESG I see:

Code:

(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00 length 32768 SMID 772 terminated ioc 804b scsi 0 state 0 xfer 0
(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00
(da4:mps0:0:5:0): CAM status: CCB request completed with an error
(da4:mps0:0:5:0): Retrying command
(da4:mps0:0:5:0): READ(16). CDB: 88 00 00 00 00 01 99 37 27 50 00 00 00 40 00 00
(da4:mps0:0:5:0): CAM status: SCSI Status Error
(da4:mps0:0:5:0): SCSI status: Check Condition
(da4:mps0:0:5:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
(da4:mps0:0:5:0): Info: 0x199372750
(da4:mps0:0:5:0): Error 5, Unretryable error
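
The CDB bytes in those lines can be decoded to see exactly which blocks the failing reads covered. A minimal Python sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and 512-byte logical blocks, which is what the logged `length 32768` for 0x40 blocks implies. `decode_read16` is a hypothetical helper name.

```python
def decode_read16(cdb_hex, block_size=512):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}

first = decode_read16("88 00 00 00 00 01 99 37 26 d0 00 00 00 40 00 00")
second = decode_read16("88 00 00 00 00 01 99 37 27 50 00 00 00 40 00 00")
info_lba = 0x199372750  # "Info:" field from the sense data above

for name, cmd in (("first", first), ("second", second)):
    # Does the LBA reported in the sense data fall inside this request?
    in_range = cmd["lba"] <= info_lba < cmd["lba"] + cmd["blocks"]
    print(name, hex(cmd["lba"]), cmd["blocks"], cmd["bytes"], in_range)
```

The `Info` value 0x199372750 equals the starting LBA of the second read, so the drive is reporting the first block of that request as unreadable. Comparing this LBA across scrubs, and against the drive's SMART error log, is one way to tell a recurring bad spot from errors that wander.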

Over the months that this has been happening, the device number/partition ID described in the error changes occasionally - as does the amount of data repaired, ranging from 64k to 128k etc.

The error does not seem to be connected to a particular physical drive, OR particular HBA port, OR even a common PSU power branch.

At one point, I thought the error was being caused by a particular drive - it was the latest model revision and I had seen another post from someone having very similar issues with that exact revision of drive. I replaced that drive under warranty. I got one good clean scrub, but on subsequent scrubs soon after, the error returned (on the new drive, on the same port).

So I replaced the SAS data cables, but the error continued.

So then I started to suspect the physical port was faulty, so changed my HBA port arrangement so that particular port was avoided. I got one clean scrub, thought the issue was resolved, but then it returned...

So I went hardcore -- replaced the HBA entirely with a new card (identical model) - this time flashed to the latest 20.x.07 IT-mode firmware (the previous was 20.x.04 IT mode).

Today, the first scrub to complete using this latest arrangement has thrown up the error again - but on a completely different port, relating to a completely different drive - and I'm starting to lose the plot.

With no supporting bad SMART data, am I just seeing random and infrequent drive-based UREs that are being dealt with by ZFS and are 'nothing to worry about'? Or is there something strange going on here that I've completely overlooked..?

My system spec is in my signature.

Apologies for the wall of text - I hope it makes sense - this has been going on for so long (on and off) that I've only got random notes to work from to paint a picture...

Thanks for any thoughts and advice,
S.

Johnnie Black · Dec 23, 2017

What disks are you using? There appear to be some issues with some 10TB Seagates and one 6TB WD Red model.

BlueMagician · Dec 23, 2017

Johnnie Black said:
What disks are you using? There appear to be some issues with some 10TB Seagates and one 6TB WD Red model.

I am using 6TB WD Reds with the latest firmware.
I am aware of the potential issue with later revision WD Reds - in fact I contributed to that thread as I was seeing the exact issue as the OP.

However, I only had one drive that was of that particular revision. I replaced it on warranty.

For a time, the issue disappeared, but now it's back...

S.

Inabon · Dec 23, 2017

I am seeing the same errors when moving large files or running scrubs. In my situation the GUI and SSH for FreeNAS hang. Jails continue to work normally and quickly, and SSH access to them with jexec is fine until you exit to the main instance. More details here:

https://forums.freenas.org/index.php?threads/freenas-11-1-gui-and-ssh-become-unresponsive.60043/

joeschmuck · Dec 23, 2017

@BlueMagician
Have you tried to connect all six hard drives into the motherboard SATA ports directly and see what happens? Right now I'm not sure why you are using an HBA since you are booting from USB.

BlueMagician · Dec 23, 2017

@joeschmuck: I bought the majority of the components for my system throughout the end of 2014 and into the first couple of months of 2015, including a couple of new PERC H200s, _before_ I got the S1200BTLR at a massive EOL discount.

I realise my motherboard has 6 SATA ports that I could use, but chose to use one of the HBAs that I already had in stock, because it was a well supported and respected card...

For the first year or so, my system (starting at FreeNAS 9.2 with the HBA on firmware P16) ran with no issues/warnings/errors/hiccups. The only thing that's changed over time is a) age of components, and b) FreeNAS / firmware / driver updates.

Looking back on my previous posts, I'd say these issues began in Q2 2016, around the time I upgraded to v9.10. Thousands of other users have no issues at all, so I doubt it's release/firmware/driver related...

S.

joeschmuck · Dec 23, 2017

The problem is, you need to eliminate the HBA as a possible problem. It could be the combination of the HBA and your motherboard and FreeBSD software changes. You cannot assume anything really and have to do your best to eliminate the issues.

Shutting down your machine, pulling the HBA card, and connecting the hard drives directly to the motherboard should be easy to do; you just need six SATA cables. I feel this is your best next step if you have not done it yet. Give it a few months, and if you don't have any issues, you can reasonably suspect the problem was related to the combination of hardware.

The other thing you could do is to roll back to 9.2 if you are only using FreeNAS as a NAS and if you have not upgraded your pool so that you can go backwards. I doubt this is an option for you but just wanted to toss it out there.

BlueMagician · Dec 23, 2017

@joeschmuck : Thank you, I do appreciate your advice.

It's just my luck that I try to follow the rules of what should be a trouble-free build, buying new, known name brands etc - and end up getting issues when all I needed was a stable platform that needed little intervention once up and running.

It's also typical that, being an OCD nerd, I went out of my way to source two _identical_ revisions of Dell HBA. On one hand, great to have a spare (or matched pair for adding a second vDEV). On the other hand, not great for troubleshooting, because all I've probably done is swap one perfectly good card for a second perfectly good card.

I guess in hopeful ignorance, I'd not considered that there may be compatibility issues between a Dell/LSI card and Intel server-class motherboard.

I'm not sure I can bear to rip the server apart again over Christmas. The whole thing has me a little irked at the moment. Your suggestion of trying to use the ports direct on the mobo is a good one - but I need to source some SATA cables and some willpower first...

S.

joeschmuck · Dec 23, 2017

@BlueMagician
Don't be discouraged, you can try like crazy to buy the best parts out there and still have problems. Case in point is the iXsystems FreeNAS Mini. That is one crappy motherboard. iXsystems didn't think it would have all these problems but it happens.

I can't tell you if your Dell HBA is the problem but it's just another test in order to rule out a piece of hardware. Honestly I'd like it to be the problem just so the problem is gone.

The good thing about taking the server apart is that you only need to pull the HBA out and install six SATA cables. Make sure you buy cables long enough to reach from the ports to the drives. Don't buy cables that are super long if you don't need them that long.

Also, you should stop seeing SCSI-related errors since you are removing the HBA, which looks like a SCSI device. If you get errors they will likely still be CAM errors. Cross your fingers you have no errors at all.

BlueMagician · Dec 23, 2017

joeschmuck said:
@BlueMagician
Don't be discouraged, you can try like crazy to buy the best parts out there and still have problems.

Thank you again for your encouragement. I must confess that I'm almost tempted to ignore it, but I won't.

I've just done a bit of Googling, and it seems that circa 2011-2014, Intel themselves released a few add-on RAID cards based on the SAS2008 chipset as used in the PERC H200, and approved for use with the S1200 series boards.

This adds weight to the 'it should be fine' argument. It also adds weight to the 'it was fine' argument.

Of course, with no BIOS update for the mobo since 2014, one does wonder if things will start to break as drivers move on but the hardware does not...

In my searching, I can find hardly anyone with these exact symptoms - let alone these symptoms whilst using this HBA.

The closest I came was the thread by @tobiasbp where we thought we'd narrowed it down to a problem with the exact revision of WD-Red drive. Alas, I replaced that drive with a new one of a different revision, but it didn't actually help...

S.

Stux · Dec 23, 2017

The SAS2008 cards begin to generate errors if they overheat.

It has been known for the thermal compound on some PERCs to deteriorate over time, leading to overheating and then errors.

I remember @wblock posting about this (iirc)

(Edit: apparently I did recall correctly: https://forums.freenas.org/index.php?threads/dell-h200-versus-h310-heat.45822/)

Anyway, if you can confirm the HBA is a problem, via testing the SATA ports, then you might consider refreshing the heat sink thermal compound, or increasing its cooling.

joeschmuck · Dec 23, 2017

BlueMagician said:
Of course, with no BIOS update for the mobo since 2014, one does wonder if things will start to break as drivers move on but the hardware does not...

We are all in that same boat; however, FreeBSD does maintain quite a bit of backward compatibility, which is good for us. I believe that I will be able to get 15 years out of my current machine, but when I purchased it I was looking at 10 years. The hard drives are the real killer here to our wallets, as I'd rather buy a new motherboard every 5 years than new hard drives.

One last thing... I see a lot of people here in the forums creating these massive NASes holding 20TB or more of storage for video content (movies and TV shows). I say "Really?" That is an enormous amount of money for bragging rights to have the largest video library. Sorry, I'm on my soap box right now... I use my NAS primarily to back up my data files (over 50% of my capacity), photos, music (not stream the music), and I have about 300GB of videos to stream. I can easily live with a 5TB pool if I were to streamline my backups a bit more. Basically I just don't understand why someone would spend a ton of money to build a giant video library. You can only watch a video so many times before you never watch it again, and streaming services like Netflix and Hulu cost a fraction of those hard drives. Okay, I'm stepping off my soap box now. Sorry about that. I guess if someone has the money to burn, they will burn it.

BTW, I was not talking about you, it was a generalization is all.

So will your motherboard be supported for a few more years? I say yes. And maybe the problem is a heat issue.

BlueMagician · Dec 23, 2017

Thank you very much for your replies and thoughts, @joeschmuck @Stux.

I had not immediately considered heat of the HBA - it's a very valid point, but I would like to think that it's not the issue for two reasons:

1) If it were deterioration of the thermal compound causing errors over time, then why would a completely new and identical HBA (fresh out the box) exhibit the same errors within a few days of going into the system, the very first time it was scrubbed?

2) I live in the UK, and the ambient room temperature is currently 22 degrees. My Antec1200 case has 7 x 120mm Noctua fans running at ~7.5v; 4x 'front' intake, 2x 'rear' exhaust, and 1x 'side' exhaust pulling directly out over the HBA slot. I clean the case filters every 6 months, and my HDD temps hardly ever see above 30 deg...

So whilst I appreciate my enclosure isn't quite as chilly as having the server in an air-conditioned comms room, housed in a 19"x3U rack full of screaming high-CFM Deltas blasting across the nuts of the little passively sinked H200 - it's not exactly subject to Mediterranean temperatures either...

Genuinely, I appreciate the input - it's all food for thought.

I still can't, though, shake the feeling that this build was solid and stable for the first 12-16 months of its life - when indeed all components were younger, but also software/firmware was too!

Either something outside my control has changed, or I'm still missing something.

@joeschmuck: I will do my best to try the non-HBA approach in the new year - but if that does solve it, it leaves me in a bit of a quandary...

...if the onboard SATA ports work fine, then it doesn't answer _why_ the HBA was wonky for me, and more importantly, leaves me screwed if I want to add more drives to the build in future.

Regards and thanks again to all,

Simon.

joeschmuck · Dec 24, 2017

BlueMagician said:
...if the onboard SATA ports work fine, then it doesn't answer _why_ the HBA was wonky for me, and more importantly, leaves me screwed if I want to add more drives to the build in future.

This is just an opinion as I have no facts to back me up (I'm too lazy to go looking for facts right now), but I feel the Dell PERC RAID cards are a custom design, and I don't mean just in firmware but also hardware. How many times have you read that the PERC H310 needs to have one pin on the PCIe connector taped over to mask a signal? This is just me, but I will not buy a Dell HBA; I don't care how good someone tells me it is.

If you do need to expand your pool in the future then that issue can be solved at that point in time. You have several options, anywhere from buying a small add-on SATA port card (like the one I use) to replacing the motherboard with more SATA ports.

BlueMagician · Dec 24, 2017

@joeschmuck: Thank you yet again.

Two questions if I may, please:

1) If I were to get another new (or known good used) HBA, what would you recommend? Another SAS2008 chip card like an original LSI-9211, or something else/newer?

2) If I decide to try the ports on the Intel board itself, but don't know if they work or are recognised by FreeNAS...

a) what CLI commands can I run to see if FreeNAS already sees those ports, hopefully currently active but empty?

b) if I move all my drives over to the Intel board but for some stupid reason only 3 out of the 6 are recognised or work - what will happen to my pool?

Is it too hopeful to assume that upon boot, if two or more discs are missing in action due to non-operating ports, then the worst that will happen is that the pool won't mount?

Or will FreeNAS chuck a mental, try its best, fault the pool and force me to rebuild and resilver back on the old HBA?

Just trying to ascertain how much wriggle room I've got for testing things out without risking my data.

Yes, all my critical stuff is backed up, but if I lose some of my other media then it's a LOT of hours to get it back...

Kind regards,
S.

joeschmuck · Dec 24, 2017

BlueMagician said:
1) If I were to get another new (or known good used) HBA, what would you recommend? Another SAS2008 chip card like an original LSI-9211, or something else/newer?

Someone else would have to recommend that, I'm not an HBA expert.

BlueMagician said:
2) If I decide to try the ports on the Intel board itself, but don't know if they work or are recognised by FreeNAS...

a) what CLI commands can I run to see if FreeNAS already sees those ports, hopefully currently active but empty?

FreeNAS will recognize the SATA ports that are in use. I don't see why it wouldn't work fine.

BlueMagician said:
b) if I move all my drives over to the Intel board but for some stupid reason only 3 out of the 6 are recognised or work - what will happen to my pool?

Your pool will not import during the boot process.

BlueMagician said:
Is it too hopeful to assume that upon boot, if two or more discs are missing in action due to non-operating ports, then the worst that will happen is that the pool won't mount?

Yes, the pool will not mount.

BlueMagician said:
Or will FreeNAS chuck a mental, try its best, fault the pool and force me to rebuild and resilver back on the old HBA?

Nope.

You should have nothing to lose by moving the drives to the SATA ports. After all, FreeNAS was born to be very portable. If your motherboard fried then you could slap your hard drives into another machine and boot right on up, by design.

BlueMagician said:
Yes, all my critical stuff is backed up

Very important all the time.
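
On the missing-disk question, the redundancy rule for a raidz vdev can be written down in a few lines. A minimal sketch, assuming a single raidz2 vdev of six disks as in this pool; `raidz_vdev_state` is a hypothetical helper, and the state names follow what `zpool status` reports. RAIDZ2 tolerates up to two missing members, so the pool would generally still import (degraded) with one or two disks absent, and only becomes unimportable once a third is missing.

```python
def raidz_vdev_state(parity, missing):
    """Expected state of a raidz vdev with `missing` absent member disks.

    parity is 1, 2 or 3 for raidz1, raidz2 or raidz3.
    """
    if missing < 0:
        raise ValueError("missing cannot be negative")
    if missing == 0:
        return "ONLINE"
    if missing <= parity:
        return "DEGRADED"   # still readable and writable, redundancy reduced
    return "UNAVAIL"        # not enough members left; the pool cannot import

# The scenario asked about: raidz2, from no disks missing up to three missing.
for missing in range(4):
    print(missing, raidz_vdev_state(parity=2, missing=missing))
```

In the three-out-of-six case the vdev is unavailable and the pool simply does not import, so nothing is written to the disks that are visible and they can be moved back to the HBA as they were.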

BlueMagician · Jan 2, 2018

Just a quick update on this...

I removed the PERC-H200 HBA, purchased 6 x good quality 50cm SATA cables, and did a big swap-out - so my 6 x WD60EFRX's are now attached directly to my S1200BTLR motherboard SATA ports.

I've written and read several TB by moving data from one dataset to another, and also by doing full image backups of the local home workstations to a separate dataset - and so far no errors.

I've also forced a full scrub, and again, no correction needed or errors found.

4 out of the 6 ports on the mobo are only SATA-2, and the Scrub took about 18 hours instead of its usual 13 hours - so there's definitely a performance penalty. However, there's still enough throughput to saturate my 1 x GbE NIC, so in real-terms it doesn't make much difference to me.
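
As a rough cross-check of those scrub times, the pool size and rate from the earlier `zpool status` output can be plugged into a quick calculation. A sketch, assuming zpool's T and M suffixes are binary units (TiB and MiB) and that the amount of data scrubbed is still about 21.9T; the function names are hypothetical.

```python
MIB_PER_TIB = 1024 * 1024

def scrub_hours(data_tib, rate_mib_s):
    """Hours needed to scrub data_tib at a steady rate_mib_s."""
    return data_tib * MIB_PER_TIB / rate_mib_s / 3600

def implied_rate_mib_s(data_tib, hours):
    """Average rate implied by scrubbing data_tib in the given hours."""
    return data_tib * MIB_PER_TIB / (hours * 3600)

# Figures from the thread: 21.9T at 504M/s on the HBA, 18 hours on onboard SATA.
print(round(scrub_hours(21.9, 504), 1))
print(round(implied_rate_mib_s(21.9, 18)))
```

At 504M/s the full 21.9T works out to roughly 12.7 hours, in line with the usual 13, and an 18-hour scrub implies an average of roughly 354M/s across the six drives.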

So to summarise, something was causing my crossflashed H200 card to throw occasional CAM SCSI STATUS errors.

It wasn't any particular individual drive or drive type, or HBA port, or cable, or PCI slot, or the HBA itself (swapped out for second one).

The Intel S1200BTL motherboard is running the latest BIOS available, and I'm fairly certain that the issue only began after moving from firmware P16 to P20 along with a FreeNAS update.

So one has to conclude that there is something funky going on between the Dell card, this mobo, and the latest LSI/Avago firmware in IT mode.

I find it hard to believe that the motherboard itself has an issue with the SAS2008 chipset because Intel themselves produce(d) a riser card for the mobo in 2012 which was SAS2008 based.

So there's something not quite right with my setup and these revisions of H200 specifically.

Unless there's a magical BIOS setting I've missed, when the time comes to add a second vDEV of 6 more discs, I guess I'll be looking to buy either a different mobo or a pair of new HBAs that play nicely together...

Now, for sale: 2 x Dell..... lol.

Thanks everyone for your help and advice. A fairly long, annoying journey, which ultimately has taken years to ignore/rectify... but glad I seem to be there now.

Happy New Year!
S.

joeschmuck · Jan 2, 2018

Glad things are working once again. If you need to purchase a new HBA, ensure you go with a well-known name brand and watch out for fake electronics. If the deal is too good to be true, then it is.

BlueMagician · Jan 2, 2018

joeschmuck said:
If you need to purchase a new HBA, ensure you go with a well known name brand and watch out for fake electronics.

Indeed. I have already seen the plethora of LSI clones on eBay, shipping from China!

I thought I was onto a safe bet, being paranoid enough to go with the Dell cards - they are genuine, and both came in their sealed Dell cardboard boxes and antistatic baggies - complete with their addendum leaflet thingies and port bungs.

Perhaps I shall aim for actual LSI or perhaps IBM next time.. but I can't afford another 6 Reds yet anyway, so not to worry!

Kindest regards,
S.

BlueMagician · Jan 2, 2018

Argh. Guessing I cracked open the champagne too soon...

After exporting a backup set to an external drive (which basically entailed 500GB of data being read from the NAS), I noticed the following error appended to the end of the DMESG output:

Code:

(ada4:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 e4 1f 40 90 02 00 01 00 00
(ada4:ahcich4:0:0:0): CAM status: ATA Status Error
(ada4:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 40 (UNC )
(ada4:ahcich4:0:0:0): RES: 41 40 00 e4 1f 00 90 02 00 00 00
(ada4:ahcich4:0:0:0): Retrying command
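
The status and error bytes in that output can be decoded bit by bit. A minimal Python sketch, assuming the bit names commonly documented for the ATA status and error registers (the same abbreviations dmesg prints); `decode_bits` and the two tables are hypothetical names.

```python
# Bit masks for the ATA status register.
ATA_STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}

# Bit masks for the ATA error register (valid when ERR is set in status).
ATA_ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI",
}

def decode_bits(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in table.items() if value & mask]

# Values from the log line: "ATA status: 41 (DRDY ERR), error: 40 (UNC )"
print(decode_bits(0x41, ATA_STATUS_BITS))
print(decode_bits(0x40, ATA_ERROR_BITS))
```

UNC means the drive itself reported uncorrectable data for the read. That is the ATA-level counterpart of the `MEDIUM ERROR ... (Unrecovered read error)` sense data seen earlier through the HBA, so this looks more like the same kind of drive-reported read error surfacing through a different driver than a new class of fault.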

The SMART counters for this drive look fine, and logs show it's never failed a short or long test.

Is this a completely new problem? Or just the same error in different clothing, due to the drives now being accessed with a different driver?

Thanks again...
S.
