Replaced Disk, new suspicious behavior

Status
Not open for further replies.

Saladman

Cadet
Joined
Jun 23, 2018
Messages
3
Greetings Everyone,

I've been running a FreeNAS box for about 3 years, but I'm by no means an expert. Now I'm suspicious of what I've set up, and I wonder whether I'm cruising for a bigger failure. Here's the current state:

System Information

Hostname switzerland.local
Build FreeNAS-9.3-STABLE-201601181840
Platform Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
Memory 16331MB
System Time Sat Jun 23 08:38:19 PDT 2018
Uptime 8:38AM up 50 mins, 0 users
Load Average 0.01, 0.04, 0.06

Here's the Zpool Status:

pool: Switzerland_Vol1
state: ONLINE
scan: resilvered 108K in 0h0m with 0 errors on Sat Jun 23 07:50:05 2018
config:

NAME                                            STATE     READ WRITE CKSUM
Switzerland_Vol1                                ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/3fb233a3-0750-11e5-acb8-d05099788ce1  ONLINE       0     0     0
    gptid/4023f47c-0750-11e5-acb8-d05099788ce1  ONLINE       0     0     0
  mirror-1                                      ONLINE       0     0     0
    gptid/cf1eaa36-7503-11e8-a300-d05099788ce1  ONLINE       0     0     0
    gptid/40f6b22a-0750-11e5-acb8-d05099788ce1  ONLINE       0     0     0
  mirror-2                                      ONLINE       0     0     0
    gptid/426dd794-db07-11e7-b697-d05099788ce1  ONLINE       0     0     0
    gptid/4376d577-db07-11e7-b697-d05099788ce1  ONLINE       0     0     0
  mirror-3                                      ONLINE       0     0     0
    gptid/5b15727a-db07-11e7-b697-d05099788ce1  ONLINE       0     0     0
    gptid/5c3bd7f5-db07-11e7-b697-d05099788ce1  ONLINE       0     0     0

errors: No known data errors

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0h5m with 0 errors on Sat Jun 16 03:51:01 2018
config:

NAME          STATE     READ WRITE CKSUM
freenas-boot  ONLINE       0     0     0
  da0p2       ONLINE       0     0     0

errors: No known data errors

One month ago one of the 4TB drives failed, so I offlined it and replaced it with an HGST 8TB. It resilvered, and I thought I was off to the races. But this morning I received these emails, and the system was DOWN.


switzerland.local kernel log messages:
> ahcich3: Timeout on slot 17 port 0
> ahcich3: is 00000000 cs 001e0000 ss 001e0000 rs 001e0000 tfd 8451 serr 00000000 cmd 10009117
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 37 7a 40 1b 01 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 14 port 0
> ahcich3: is 00000000 cs 0003c000 ss 0003c000 rs 0003c000 tfd 8451 serr 00000000 cmd 10008e17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f0 9e cf 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 19 port 0
> ahcich3: is 00000000 cs 00780000 ss 00780000 rs 00780000 tfd 8451 serr 00000000 cmd 10009317
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 f9 d1 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 18 port 0
> ahcich3: is 00000000 cs 003c0000 ss 003c0000 rs 003c0000 tfd 451 serr 00000000 cmd 10009217
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 38 d1 d8 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 78 8b de 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a0 1a e1 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 28 port 0
> ahcich3: is 00000000 cs f0000000 ss f0000000 rs f0000000 tfd 8451 serr 00000000 cmd 10009c17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 08 34 e9 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 48 43 ed 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 48 44 ed 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a0 84 ef 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 1 port 0
> ahcich3: is 00000000 cs 0000003e ss 0000003e rs 0000003e tfd 8451 serr 00000000 cmd 10008117
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c8 10 f3 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b8 78 f6 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b0 41 f7 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 34 fc 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 35 fc 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 2 port 0
> ahcich3: is 00000000 cs 0000003c ss 0000003c rs 0000003c tfd 8451 serr 00000000 cmd 10008217
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 c8 be fd 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 50 80 00 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 50 81 00 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 00 57 02 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 14 port 0
> ahcich3: is 00000000 cs 0007c000 ss 0007c000 rs 0007c000 tfd 8451 serr 00000000 cmd 10008e17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b8 ce 1a 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 48 fb 19 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 18 a3 1a 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 18 a4 1a 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 5 port 0
> ahcich3: is 00000000 cs 000001e0 ss 000001e0 rs 000001e0 tfd 451 serr 00000000 cmd 10008517
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 88 b5 4b 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 17 port 0
> ahcich3: is 00000000 cs 001e0000 ss 001e0000 rs 001e0000 tfd 8451 serr 00000000 cmd 10009117
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 40 e2 57 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 70 1b 5d 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 18 port 0
> ahcich3: is 00000000 cs 003c0000 ss 003c0000 rs 003c0000 tfd 451 serr 00000000 cmd 10009217
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 70 21 5d 40 55 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 4 port 0
> ahcich3: is 00000000 cs 000000f0 ss 000000f0 rs 000000f0 tfd 8451 serr 00000000 cmd 10008417
> (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 60 10 9c 40 4a 00 00 00 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f0 77 3d 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b0 5f c5 40 52 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 29 port 0
> ahcich3: is 00000000 cs e0000001 ss e0000001 rs e0000001 tfd 8451 serr 00000000 cmd 10009d17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b0 a1 c5 40 52 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b8 d2 3e 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 b8 d3 3e 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 68 7f 85 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 68 93 85 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 28 port 0
> ahcich3: is 00000000 cs f0000000 ss f0000000 rs f0000000 tfd 8451 serr 00000000 cmd 10009c17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a8 12 f5 40 52 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 5 port 0
> ahcich3: is 00000000 cs 000001e0 ss 000001e0 rs 000001e0 tfd 8451 serr 00000000 cmd 10008517
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 40 31 de 40 52 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 9f 55 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 5b 56 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 5c 56 40 53 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 28 port 0
> ahcich3: is 00000000 cs f0000000 ss f0000000 rs f0000000 tfd 8451 serr 00000000 cmd 10009c17
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 f8 45 45 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> ahcich3: Timeout on slot 5 port 0
> ahcich3: is 00000000 cs 000001e0 ss 000001e0 rs 000001e0 tfd 8451 serr 00000000 cmd 10008517
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 08 53 45 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command

-- End of security output --


Followed by this email:


The volume Switzerland_Vol1 (ZFS) state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.

And then this:

The volume Switzerland_Vol1 (ZFS) state is UNAVAIL: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.

And then by this:

The volume Switzerland_Vol1 (ZFS) state is UNAVAIL: One or more devices could not be opened. There are insufficient replicas for the pool to continue functioning.

And then this:

The volume Switzerland_Vol1 (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.

And finally a few minutes later:

Device: /dev/ada5, unable to open device
The volume Switzerland_Vol1 (ZFS) state is UNAVAIL: One or more devices are faulted in response to IO failures.
Device: /dev/ada3, unable to open device
Device: /dev/ada2, unable to open device
Device: /dev/ada4, unable to open device

When I attempted to log in, the web service was non-responsive, so I took a deep breath. And it rebooted.

It all appears to be back online now. But clearly I am extremely anxious and suspicious. Am I missing clues of an impending failure? Have I misconfigured the way my pool is operating and made all the mistakes you warn us newbies about?

Thanks for all the help.
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
Hi,

From my experience with this board, I wouldn't be surprised if the drive throwing the errors (ada3) is connected to one of the Marvell SATA connectors on the board....


upload_2018-6-23_19-13-43.png



I get these kinds of errors on my system from time to time; I just reboot and everything is fine.
The Marvell controllers are not very stable, so it's annoying (but in my case it's my backup and test system so not too annoying).

In my case, I only have one pool with 8 drives (in RAIDz2), and when this issue happens the pool goes into a degraded state; after a reboot, everything is back online again. In your case, with several mirrors, I don't know whether the behavior is different. I'm surprised by the last message saying it is unable to open devices ada2 to ada5, unless those 4 drives are on the Marvell controller...

The web GUI is not responsive and neither is the terminal (I can't ssh into the server). The only way to get back control is a reset of the system.

I usually run a scrub after the reset, just to be sure all the data is fine.


I would say: check which drives are connected to the Marvell controller and monitor the system. This failure will almost certainly recur with other (or the same) drives hooked up to the Marvell controller.
Ultimately... well... don't use the Marvell controller! :-O
 

Saladman

Cadet
Joined
Jun 23, 2018
Messages
3
Hmm.. That's interesting. Since I'm running 8 drives I'm definitely on the Marvell controller for at least part of the system, but I can't say I ever remember this happening before. (Not saying it didn't; maybe this was just the first time I was smart enough to catch it and understand it?)

Since I was considering building a new NAS and downgrading this one to a backup anyway, what is your preferred architecture if I start over?

Or do you prefer an above-board SATA controller over those Marvell wonders?
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
That's surprising indeed... How long has your system been running (I suppose it's up 24/7)?

Those CAM status errors can also come from bad cables or from the disk itself, so that is something to consider as well. Since you're using the C2750, I know that another possible cause is those Marvell controllers (and that's most probably it).
Just to be sure: are the SMART tests OK on your drives? And you are running them regularly, I suppose? If not, check them soon.
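
To get a quick overview of what the kernel is complaining about, the CAM status lines can be tallied per device and channel. This is only a rough sketch in Python: the function name tally_cam_errors is made up for illustration, and the sample lines are copied from the kernel log email earlier in this thread. A log dominated by CRC errors would usually make me look at the link (cable, connector, controller) first, but that is a rule of thumb, not a diagnosis.

```python
import re
from collections import Counter

# Matches lines such as:
#   (ada3:ahcich3:0:0:0): CAM status: Command timeout
CAM_STATUS = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+?)\s*$")


def tally_cam_errors(log_text):
    """Count CAM status messages per (device, channel, status)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAM_STATUS.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# A few lines copied from the kernel log email earlier in this thread.
SAMPLE = """\
> ahcich3: Timeout on slot 17 port 0
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 60 37 7a 40 1b 01 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Command timeout
> (ada3:ahcich3:0:0:0): Retrying command
> (ada3:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 78 8b de 40 54 00 00 01 00 00
> (ada3:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
> (ada3:ahcich3:0:0:0): Retrying command
"""

for (device, channel, status), n in sorted(tally_cam_errors(SAMPLE).items()):
    print(f"{device} on {channel}: {n} x {status}")
```

Feeding it the whole email instead of SAMPLE shows at a glance whether the errors are confined to one device and channel or spread over several.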

My advice (*) would be.... to wait and see! ;-)
If this never happened to you before, then wait and see when you get the next occurrence of this issue. I would also check which disk is affected, to confirm that it is one connected to the Marvell controller (by the way, is ada3 connected to a Marvell controller?).
This would also confirm that the issue is caused by the Marvell controller (and not by a random cable problem, for example).

And the first thing I would do: make sure that 6 drives are on the Intel controller ports (connectors 27 to 32 in the C2750 manual; see the picture in my last post) and only 2 on the Marvell. What I did on my system: I plugged one drive into the 9172 controller and one into the 9230 controller (yes, there are two different controllers, and both are Marvell!!).
So when the issue occurs on my system, I can check which drive causes the problem and therefore which controller (to see whether one of the controllers is causing more problems than the other...... but so far I have always forgotten to check the drive and rebooted first! :-D).
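
If you would rather not trace cables, the drive-to-controller mapping can usually be read out of the boot messages. Below is a minimal sketch, again in Python. The three line shapes in the comments are my assumption of what the FreeBSD ahci and ada drivers print at boot, so compare them with your own dmesg output before trusting the result; the function name disks_by_controller is made up for illustration.

```python
import re

# Assumed boot-message shapes (verify against your own dmesg output):
#   ahci1: <controller description> ... on pci3
#   ahcich6: <AHCI channel> at channel 0 on ahci1
#   ada3 at ahcich6 bus 0 scbus6 target 0 lun 0
CONTROLLER = re.compile(r"^(ahci\d+): <([^>]+)>")
CHANNEL = re.compile(r"^(ahcich\d+): <AHCI channel> at channel \d+ on (ahci\d+)\b")
DISK = re.compile(r"^(ada\d+) at (ahcich\d+) ")


def disks_by_controller(dmesg_text):
    """Map each adaN disk to (channel, controller description)."""
    controllers = {}  # ahciN   -> description between < and >
    channels = {}     # ahcichN -> ahciN
    disks = {}        # adaN    -> ahcichN
    for raw_line in dmesg_text.splitlines():
        line = raw_line.strip()
        m = CONTROLLER.match(line)
        if m:
            controllers[m.group(1)] = m.group(2)
            continue
        m = CHANNEL.match(line)
        if m:
            channels[m.group(1)] = m.group(2)
            continue
        m = DISK.match(line)
        if m:
            disks[m.group(1)] = m.group(2)
    # A disk whose channel or controller line was not found maps to "unknown".
    return {
        disk: (channel, controllers.get(channels.get(channel), "unknown"))
        for disk, channel in disks.items()
    }
```

On the NAS itself, something like disks_by_controller(open("/var/run/dmesg.boot").read()) should do, assuming the boot messages are saved in that file as is usual on FreeBSD. Any disk reported on a controller with Marvell in its description is one to watch.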

Then, if the issue recurs too frequently and becomes annoying, I would consider changing.
Downgrading the current server to a backup server is a good solution for home/personal usage (for professional usage I wouldn't).

Keeping the current hardware (and adding an extra controller) or changing completely, well, that depends on you (cost considerations, motivation, ...) and on how you use your server (home/professional, number of users, use cases, ...).
If you want advice, you have to tell us more about the use cases and how you want to use it (for example, why did you put the disks in mirrors rather than a RAIDz2 volume?). I suggest you start a new thread in the section "Will it FreeNAS?"




(*): I assume that your drives are OK (SMART tests and all OK) and that your data is backed up safely and regularly; otherwise I would advise differently. ;-) I also assume you're using your server for personal use with a single user (or at least only a few).
 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
I was reading this thread, where the OP writes about a 4-port controller (Syba SI-PEX40062). This could be a solution if you want to keep your current setup.

It seems that card works well with FreeNAS (even though it's a Marvell controller).
Two experienced members (MrToddsFriends and joeschmuck) confirmed this in their respective posts here and here.
But another experienced member (Chris Moore) makes a good point as well a little further down.

It's worth reading and you can build your own opinion. ;-)
 