fabiob
Dabbler
- Joined: Dec 4, 2017
- Messages: 15
Hello community,
this is my first post here, so please forgive any mistakes (and my bad English! :D)
This is my third FreeNAS build. I'm a happy user of two 9.10.2U6 servers and a (less happy) user of the 11.0U4 server described below:
MB: Gigabyte GA-X150M-PRO-ECC
CPU: Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40GHz
RAM: 4 * ECC Kingston ValueRAM KVR21E15D8K2/32I
PSU: Corsair VS650 CP-9020098-EU 650W
Boot:
ada0: <DREVO X1 SSD Q0111A> ACS-2 ATA SATA 3.x device
Storage: 4 * Seagate 8 TB Archive HDD in a RAID10 pool (two striped mirrors):
ada1: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada2: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada3: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
ada4: <ST8000AS0002-1NA17Z AR17> ACS-2 ATA SATA 3.x device
I'm using the onboard HBA, which dmesg describes as:
ahci0: <Intel Sunrise Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf020-0xf03f mem 0xeff48000-0xeff49fff,0xeff4c000-0xeff4c0ff,0xeff4b000-0xeff4b7ff irq 16 at device 23.0 on pci0
Here is the status of both zpools:
Code:
pool: STORAGE
state: ONLINE
scan: resilvered 9.77G in 0h5m with 0 errors on Tue Dec 5 09:39:08 2017
config:

NAME                                          STATE   READ WRITE CKSUM
STORAGE                                       ONLINE     0     0     0
  mirror-0                                    ONLINE     0     0     0
    gptid/93567432-67a3-11e7-9dac-1c1b0da2bc7d ONLINE    0     0     0
    gptid/bd90ba0c-67a2-11e7-9dac-1c1b0da2bc7d ONLINE    0     0     0
  mirror-1                                    ONLINE     0     0     0
    gptid/9f0647e1-6700-11e7-be5e-1c1b0da2bc7d ONLINE    0     0     0
    gptid/7c2543c6-67a3-11e7-9dac-1c1b0da2bc7d ONLINE    0     0     0

errors: No known data errors

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Sun Dec 3 03:45:31 2017
config:

NAME          STATE   READ WRITE CKSUM
freenas-boot  ONLINE     0     0     0
  ada0p2      ONLINE     0     0     0

errors: No known data errors
This server was first booted on July 12, 2017, and the first CRITICAL email alert arrived shortly afterwards:
Code:
Jul 12 16:07:13 freenas ahcich5: Timeout on slot 13 port 0
Jul 12 16:07:13 freenas ahcich5: is 00000000 cs 00002000 ss 00000000 rs 00002000 tfd c0 serr 00000000 cmd 0000cd17
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:07:13 freenas (ada4:ahcich5:0:0:0): Retrying command
Jul 12 16:07:46 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:07:52 freenas collectd[2843]: aggregation plugin: Unable to read the current rate of "freenas.local/cpu-2/cpu-user".
Jul 12 16:07:52 freenas collectd[2843]: utils_vl_lookup: The user object callback failed with status 2.
Jul 12 16:08:16 freenas ahcich5: Timeout on slot 14 port 0
Jul 12 16:08:16 freenas ahcich5: is 00000000 cs 00004000 ss 00000000 rs 00004000 tfd 80 serr 00000000 cmd 0000ce17
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:08:16 freenas (aprobe0:ahcich5:0:0:0): Retrying command
Jul 12 16:08:49 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:09:19 freenas ahcich5: Timeout on slot 15 port 0
Jul 12 16:09:19 freenas ahcich5: is 00000000 cs 00008000 ss 00000000 rs 00008000 tfd 80 serr 00000000 cmd 0000cf17
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:09:19 freenas (aprobe0:ahcich5:0:0:0): Error 5, Retries exhausted
Jul 12 16:09:52 freenas ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
Jul 12 16:10:22 freenas ahcich5: Timeout on slot 16 port 0
Jul 12 16:10:22 freenas ahcich5: is 00000000 cs 00010000 ss 00000000 rs 00010000 tfd 80 serr 00000000 cmd 0000d017
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): CAM status: Command timeout
Jul 12 16:10:22 freenas (aprobe0:ahcich5:0:0:0): Error 5, Retry was blocked
Jul 12 16:10:22 freenas ada4 at ahcich5 bus 0 scbus5 target 0 lun 0
Jul 12 16:10:22 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
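The `tfd` values in the log above (`c0`, then `80`) can be decoded with a few lines of Python. This is only a sketch, and it rests on an assumption not confirmed anywhere in this thread: that `tfd` is the AHCI port task file data register, with the ATA status byte in the low 8 bits and the ATA error byte in the next 8. The bit names come from the standard ATA status register; `decode_tfd` is a hypothetical helper name.

```python
# Sketch: decode the "tfd" value printed by the ahci driver.
# Assumption: low byte = ATA status register, next byte = ATA error register.

# Standard ATA status register bits (only the commonly used ones).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error
}


def decode_tfd(tfd_hex):
    """Split a hex tfd string into status byte, error byte and flag names."""
    value = int(tfd_hex, 16)
    status = value & 0xFF
    error = (value >> 8) & 0xFF
    flags = [name for bit, name in ATA_STATUS_BITS.items() if status & bit]
    return {"status": status, "error": error, "flags": flags}


if __name__ == "__main__":
    # Values taken from the log lines above.
    for raw in ("c0", "80", "00000080"):
        print(raw, decode_tfd(raw))
```

Under that assumption, every `tfd` in the log has BSY set and no error bits, which would mean the drive never answers rather than answering with an error.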
I saw lots of CKSUM errors. My first attempt was to replace the SATA cable on ada4 and launch a scrub. That worked for almost 2 months.
Then it started again, on other disks:
Code:
Sep 22 10:08:48 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
At this point I replaced all the SATA cabling, but the problem intensified:
Code:
Nov 17 13:13:51 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 19 05:45:40 freenas ada1: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Nov 20 09:05:52 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 25 14:31:30 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
I was panicking: the machine hosts a Windows Server 2016 VM with a lot of services on it.
We then migrated the VM to a failover machine and continued troubleshooting at a more relaxed pace.
I checked the PSU voltages: all fine, with only +5 V slightly low (4.97 V) but within ATX specs.
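As a quick sanity check of that reading, here is a minimal sketch assuming the usual ATX tolerance of ±5% on the +5 V rail (a window of 4.75 V to 5.25 V). The function names are made up for illustration. Keep in mind that a static reading like this says nothing about ripple or short transients.

```python
# Sketch: check a measured rail voltage against an assumed ATX tolerance.
ATX_TOLERANCE = 0.05  # assumption: +/-5 % on the +5 V rail


def rail_limits(nominal, tolerance=ATX_TOLERANCE):
    """Return the (low, high) voltage window for a nominal rail voltage."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def within_spec(measured, nominal, tolerance=ATX_TOLERANCE):
    """True if the measured voltage lies inside the tolerance window."""
    low, high = rail_limits(nominal, tolerance)
    return low <= measured <= high


def deviation_percent(measured, nominal):
    """Signed deviation from nominal, in percent."""
    return (measured - nominal) / nominal * 100


if __name__ == "__main__":
    low, high = rail_limits(5.0)
    print(f"+5 V window: {low:.2f} V to {high:.2f} V")
    print("4.97 V within spec:", within_spec(4.97, 5.0))
    print(f"deviation: {deviation_percent(4.97, 5.0):.1f} %")
```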
Next I replaced all the SATA power cabling, swapped the "Y" cables for straight ones, and switched to a Marvell 88SE9215 PCIe HBA... guess what?
Code:
Nov 26 07:36:14 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Nov 29 01:46:42 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
Nov 30 11:54:42 freenas ada3: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Finally, I replaced the PSU, even though the reported voltages were correct. I thought: "maybe there is some weird noise on the +5 V rail". I also checked the BIOS settings, upgraded the BIOS to the latest revision, and disabled ALPM and hot swap, hoping for the best.
Nope.
Code:
Dec 1 14:00:59 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached
Dec 5 02:27:17 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached
The CAM status error is always the same. It looks like the HDD disconnects from the SATA channel before the FLUSHCACHE48 command. Note that a disconnected HDD comes back online only after a power cycle; if I soft-reboot the machine, the disk stays offline.
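To see how the detach events are spread across the disks, the "detached" lines quoted in this post can be tallied per device. This is a minimal sketch: the regular expression is written against the line format shown above, `tally_detaches` is a hypothetical helper, and the sample list simply repeats the lines from this post. Device names can change when disks move between controllers, and the serial numbers here are masked, so the per-device counts are only a rough guide.

```python
# Sketch: count "detached" events per ada device in syslog-style lines.
import re
from collections import Counter

# Matches e.g.:
# "Nov 17 13:13:51 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached"
DETACH_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"(?P<dev>ada\d+): <(?P<model>[^>]+)> s/n (?P<serial>\S+) detached$"
)


def tally_detaches(lines):
    """Return a Counter mapping device name to number of detach events."""
    counts = Counter()
    for line in lines:
        match = DETACH_RE.match(line.strip())
        if match:
            counts[match.group("dev")] += 1
    return counts


# The detach lines quoted in this post, in chronological order.
SAMPLE = [
    "Jul 12 16:10:22 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached",
    "Sep 22 10:08:48 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
    "Nov 17 13:13:51 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
    "Nov 19 05:45:40 freenas ada1: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached",
    "Nov 20 09:05:52 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
    "Nov 25 14:31:30 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
    "Nov 26 07:36:14 freenas ada4: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached",
    "Nov 29 01:46:42 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
    "Nov 30 11:54:42 freenas ada3: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached",
    "Dec 1 14:00:59 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z841**** detached",
    "Dec 5 02:27:17 freenas ada2: <ST8000AS0002-1NA17Z AR17> s/n Z840**** detached",
]

if __name__ == "__main__":
    for dev, count in tally_detaches(SAMPLE).most_common():
        print(dev, count)
```

On a live system the same function could be fed the lines of the system log instead of `SAMPLE`.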
The same problem on 4 disks... could all 4 disks really be defective?
Any suggestions, prayers or magic rituals will be much appreciated!
EDIT: attached smartctl outputs