After months of service, FreeNAS giving errors, unworkable :(

Status
Not open for further replies.

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
I've been very happily using my FreeNAS system for a while now, and after the initial tweaking we all have to do, it's been rock solid.

However, all of a sudden it has become an unusable p.o.s. (sorry, I'm frustrated!).

It boots, and might work for a (short!) little bit, then - doesn't matter if I'm reading or writing - it crashes and reports:

siisch1: Timeout on Slot 30

siisch1: sii_timeout is 0040000 ss 40000000 rs 40000000 es 80000000 sts 801f0040 serr 00000000

siisch1: Error while READ LOG EXT

I have not changed any setting, it's just been sitting in its corner, working properly.

Any clue what I can do?


edit: sometimes it crashes -with the same message- DURING BOOTING or right when I start the first file transfer.
edit2: I have to add, after crashing, it doesn't shut down or reboot properly. It hangs on the message where it says some processes would not die, ps axl advised.
 

William Grzybowski

Wizard
iXsystems
Joined
May 27, 2011
Messages
1,754
Looks like a disk is failing and screwing up the array... Is it always the same slot?
The controller could be FUBAR as well...
 

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
It's always slot 30.

The disks don't show anything strange so I'm leaning towards blaming the controller.

How can I find out which controller is the siisch1?

I have six hdds, connected to two motherboard sata slots and to two two-slot PCI-E sata controllers.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Pull one at a time, boot into singleuser mode, compare the kernel messages.

Or

If you marked your drives (as you should but everyone seems to forget to do) when you were building your system, and you know which disk is "ad0" for example, track back through the kernel messages to see what drive is attached to what controller. If you know "ad0" is attached to that controller, then just "follow the cable".
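A minimal sketch of the "track back through the kernel messages" step. It assumes the CAM-style attach lines that FreeBSD normally prints (for example `ada4 at siisch1 bus 0 scbus5 target 0 lun 0`); the function name `disk_channels` is made up for this illustration, and older `adN`-style messages would need a different pattern.

```shell
#!/bin/sh
# disk_channels: read kernel messages on stdin and print "disk channel" pairs.
# Assumption: attach lines look like "ada4 at siisch1 bus 0 scbus5 target 0 lun 0".
disk_channels() {
    # Field 1 is the disk (adaN or daN), field 2 is "at", field 3 is the channel.
    awk '$2 == "at" && $1 ~ /^a?da[0-9]+$/ { print $1, $3 }' | sort -u
}
```

Typical use would be `dmesg | disk_channels`, or `disk_channels < /var/run/dmesg.boot` if the boot messages have already scrolled out of the buffer.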

That's two basic answers to your question, but you'll want to know where to go next.

I *strongly* suggest you do not try booting multiuser mode if you pretty much know something's rotten in hardwareland. What you might want to do is to boot singleuser, and then go and test each individual drive to see what's up. You can test easily enough by going into singleuser mode, and doing

# dd if=/dev/ada0 of=/dev/null bs=1048576

Substitute your device names as appropriate (use "camcontrol devlist" to obtain list of attached devices). For example on this old box with a 3Ware 9550 RAID controller, the disks come out as

# camcontrol devlist
<AMCC 9550SX-4LP DISK 3.01> at scbus2 target 0 lun 0 (pass0,da0)
<AMCC 9550SX-4LP DISK 3.01> at scbus2 target 1 lun 0 (pass1,da1)
<AMCC 9550SX-4LP DISK 3.01> at scbus2 target 2 lun 0 (pass2,da2)
<AMCC 9550SX-4LP DISK 3.01> at scbus2 target 3 lun 0 (pass3,da3)
< Flash Disk > at scbus3 target 0 lun 0 (pass4,da4)


so I would use "da0" through "da3". Your normal ATA controller will have "adaN" devices. Now, this is not a thorough test, but it ought to be harmless to the data on the disks, and it is usually quite enough to flush major hardware bugs out of the woodwork. You can use "control-T" while dd is running to get session statistics. All your drives should run at around the same speed; a drive that is way off is a Big Red Flag for some sort of trouble.
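The per-drive read test above can be wrapped in a small loop so every disk gets the same treatment. This is only a sketch: the function name `read_test` is invented here, and the device names you pass in are whatever "camcontrol devlist" reports on your box.

```shell
#!/bin/sh
# read_test DEVICE...: read each device end to end and throw the data away.
# Read-only: dd only ever writes to /dev/null, never to the disks.
read_test() {
    failed=0
    for dev in "$@"; do
        echo "== $dev"
        # dd stays in the foreground, so control-T still shows its progress.
        if ! dd if="$dev" of=/dev/null bs=1048576; then
            echo "READ FAILED: $dev"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `read_test /dev/ada0 /dev/ada1 /dev/ada2`. dd prints its own transfer summary for each drive, so the speeds can be compared side by side; the function returns non-zero if any read failed.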
 

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
Thank you so much for all that excellent info, jgreco. I'm off to sleep and then two days at work, but when I get back will follow your procedures and report back.

(I labeled my drives, this is not my first hardware rodeo! ;) )
 

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
Well, those "two days" turned into a month. Work, vacation, people, pfff, they all get in the way of the important stuff! ;)

I've yanked the system out of the attic and put it in my study/computerlab to troubleshoot.

Of course, it started without a hitch ... So not knowing what to do, I went and looked into SMART info for each drive and noticed that one had been getting quite a few UDMA CRC ERRORS. Now, this shouldn't impact the functioning of the NAS system, but I still replaced the SATA cable of the offending drive.

It's been up and running for near 24 hours and I'm still watching the status monitor like a scared little hawk, but so far, so g...... (not going to jinx it!)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
UDMA CRC errors are basically communication errors between the controller and hard drive. Of course, if either device is faulty you can get those errors. But changing the cable is an easy solution and usually fixes that issue ;)
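Because the CRC counter is cumulative, what matters is whether it keeps climbing after a cable swap. A rough sketch for pulling just that number out of `smartctl -A` output follows; it assumes the usual ten-column attribute table with the raw value in the last column and the CRC counter at attribute ID 199, and the function name `crc_errors` is made up for this example.

```shell
#!/bin/sh
# crc_errors: read "smartctl -A" output on stdin, print the raw CRC error count.
# Assumption: attribute 199 is the UDMA CRC error counter and column 10 is RAW_VALUE.
crc_errors() {
    awk '$1 == "199" { print $10 }'
}
```

Usage would be something like `smartctl -A /dev/ada0 | crc_errors`, run once per drive and noted down, then run again a day later to see which count (if any) moved.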
 

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
Aaaargh!

and.... it's back!

This time a different siisch, but same slot:


siisch2: Timeout on Slot 30

siisch2: sii_timeout is 03040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00280000
(or, occasionally, sii_timeout is 07040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00680000)

siisch2: Error while READ LOG EXT

:(
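The `serr` value in those messages can be picked apart bit by bit. The sketch below assumes that `serr` is the port's SATA SError register with the standard layout (error bits in the low half, diagnostic bits in the high half); that is an assumption about the siis driver output, not something confirmed in this thread, and the function name `decode_serr` is invented here.

```shell
#!/bin/sh
# decode_serr HEX: list the bits set in a SATA SError value such as 00280000.
# Assumption: "serr" follows the standard SATA SError bit layout.
decode_serr() {
    val=$(( 0x$1 ))
    while read -r bit name; do
        # Print the name of every bit that is set in the value.
        if [ $(( ($val >> $bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $name"
        fi
    done <<'EOF'
0 recovered data integrity error
1 recovered communications error
8 transient data integrity error
9 persistent communication or data integrity error
10 protocol error
11 internal error
16 PhyRdy change
17 PHY internal error
18 comm wake
19 10b-to-8b decode error
20 disparity error
21 CRC error
22 handshake error
23 link sequence error
24 transport state transition error
25 unrecognized FIS type
26 device exchanged
EOF
}
```

Under that assumption, `decode_serr 00280000` reports bits 19 and 21 (10b-to-8b decode error, CRC error), and `decode_serr 00680000` adds bit 22 (handshake error). Those are link-level errors, which would point at the path between controller and drive rather than at the platters, but it does not by itself say whether the cable, the port, or the drive's interface is at fault.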

I have tested each disk individually and they all seem to be performing normally. So once again, I'm back to suspecting the controller(s).

EDIT -- there used to be questions here about identifying hardware; I think I've been able to do that using the dmesg command:

As I can see it now, I have these controllers:

siis0, which is my SiI3132 controller on pci4 and
siis1, which is my SiI3132 controller on pci5.

siis0 owns channels siisch0 and 1 and
siis1 owns channels siisch2 and 3.

The onboard controller is
ahci0, which owns ahcich0,1,2,3
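The channel-to-controller mapping above can be pulled out of the boot messages in one go. This sketch assumes probe lines of the form `siisch0: <SIIS channel> at channel 0 on siis0` (and the ahcich equivalent); the function name `channel_parents` is made up for this illustration.

```shell
#!/bin/sh
# channel_parents: read kernel messages on stdin, print "channel controller" pairs.
# Assumption: probe lines end in "... on <controller>", e.g.
#   siisch0: <SIIS channel> at channel 0 on siis0
channel_parents() {
    awk '$1 ~ /^(siisch|ahcich)[0-9]+:$/ && $(NF-1) == "on" {
        ch = $1
        sub(/:$/, "", ch)   # strip the trailing colon from the channel name
        print ch, $NF
    }' | sort -u
}
```

Run as `dmesg | channel_parents`. Error lines such as "siisch1: Timeout on Slot 30" are skipped because "on" is not their second-to-last word.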

Since I have NEVER seen an error coming from the ahcich* controller, but first from exclusively the siis0 and now recently from the siis1, I'm thinking of replacing these two bad boys.

It was nice putting together the system on a shoestring, but if it's not working properly, I'm willing to throw the €150 or so at it, for an 8-channel SAS/SATA controller and a few breakout cables.

Or are there other courses of action?
I mean, it *is* quite strange that it used to be siisch1 and ONLY siisch1, while today it suddenly was siisch2. But I may be grasping at nothing here.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The best advice I can give is to replace the part you think may be the problem. If the problem goes away then obviously you've found the bad part.
 

StephenFry

Contributor
Joined
Apr 9, 2012
Messages
171
I've just checked the SMART info once again, and *one* drive had an increased UDMA_CRC_ERROR count.
I will need to check if this drive is attached to the siis1 controller, but if it is, I'll first of all change the cables on that one as well, just like I did with siis0.

After that, I'll throw money at the problem ;)

One thing I'm unsure of, is what it means that the Timeout is always on 'Slot 30'. What I see from other people with these -kind of- problems, is that they always have different slots with each timeout.

EDIT one day on: That's it. I've ordered an M1015 (it sounds like an automatic weapon!) and cabling. I'm going to show this NAS who's boss!
 