Errors in kernel log

g0del · Aug 4, 2013

I've got a home-built machine running on an AMD E-350 board with 6 drives attached directly and 8 attached to an IBM M1015 card, which I bought pre-flashed from eBay. It's been running fairly well for the last 6 months, but a few days ago I happened to check my security run emails and noticed a fair number of errors in the kernel logs like this:

+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4f 30 0 0 28 0 length 20480 SMID 651 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 46 c0 0 0 28 0 length 20480 SMID 370 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4e a8 0 0 30 0 length 24576 SMID 433 terminated ioc 804b scsi 0 state c xfer 16388
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4e d8 0 0 28 0 length 20480 SMID 985 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
+(da2:mps0:0:3:0): CAM status: SCSI Status Error
+(da2:mps0:0:3:0): SCSI status: Check Condition
+(da2:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 c0 0 0 b0 0 length 90112 SMID 139 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 90 0 0 30 0 length 24576 SMID 927 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 90 28 0 0 e0 0 length 114688 SMID 110 terminated ioc 804b scsi 0 state c xfer 32772
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 38 0 0 58 0 length 45056 SMID 438 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): WRITE(10). CDB: 2a 0 8f 15 8e f0 0 0 30 0
+(da5:mps0:0:6:0): CAM status: SCSI Status Error
+(da5:mps0:0:6:0): SCSI status: Check Condition
+(da5:mps0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
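
For readers trying to interpret the CDB bytes in the log lines above, here is a minimal Python sketch that decodes a READ(10)/WRITE(10) command block into its starting LBA and transfer length. The function name `decode_rw10` is hypothetical, and the 512-byte logical block size is an assumption about these drives, not something stated in the thread.

```python
def decode_rw10(cdb_hex, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    block_size=512 is an assumed logical sector size, not taken from the log.
    """
    cdb = [int(byte, 16) for byte in cdb_hex.split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB, got %d bytes" % len(cdb))

    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError("unsupported opcode 0x%02x" % cdb[0])

    # Bytes 2-5 hold the starting logical block address (big-endian).
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7-8 hold the transfer length in blocks (big-endian).
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")

    return {
        "op": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    # First da2 line from the log above.
    print(decode_rw10("28 0 86 e9 4f 30 0 0 28 0"))
```

For the first da2 line, the decoded transfer is 40 blocks, which at 512 bytes per block is the 20480 bytes reported in that line's "length" field.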

Timestamps showed that they were occurring every few hours, with no pattern that I could see. The errors only seem to occur on da1, da2, da3, and da5 (so if it's bad cabling, it must be both cables). SMART shows no issues (both long and short self-tests pass) and zpool status shows a few CKSUM errors, but they don't correspond in time with the errors in the logs (i.e. there will be errors in the logs with no corresponding CKSUM errors in zpool status).
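
One way to check which drives show up most often is to tally the "terminated ioc" lines per device. The following is a rough Python sketch of that idea; the names `DEVICE_RE`, `count_terminated`, and `SAMPLE` are made up for illustration, and the sample holds only three lines copied from the log excerpt, so the counts say nothing about the full log.

```python
import re
from collections import Counter

# Matches the "(da2:mps0:0:3:0)" prefix used in the kernel log lines.
DEVICE_RE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):\d+:(?P<target>\d+):\d+\)"
)

# Three lines copied from the excerpt; a real run would read the whole log.
SAMPLE = """\
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 4f 30 0 0 28 0 length 20480 SMID 651 terminated ioc 804b scsi 0 state c xfer 0
+(da2:mps0:0:3:0): READ(10). CDB: 28 0 86 e9 46 c0 0 0 28 0 length 20480 SMID 370 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): READ(10). CDB: 28 0 86 ea 91 c0 0 0 b0 0 length 90112 SMID 139 terminated ioc 804b scsi 0 state c xfer 0
+(da5:mps0:0:6:0): CAM status: SCSI Status Error
"""


def count_terminated(lines):
    """Count 'terminated ioc' log lines per device name."""
    counts = Counter()
    for line in lines:
        if "terminated ioc" not in line:
            continue  # skip CAM status / sense lines
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    for dev, n in count_terminated(SAMPLE.splitlines()).most_common():
        print(dev, n)
```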

This morning it seemed much worse - errors every few minutes, and accessing files over the network seemed to slow down or hitch a little whenever the errors were occurring. Zpool still shows no data loss and no increase in errors, and SMART continues to show no issues.

Do I have four drives failing silently on me, or could there be something else going on? I've got a spare drive and could try replacing the one that shows up most often in the logs (da5), but I'm kind of afraid that the stress of resilvering might cause others to fail if that's really the problem. Does anyone have any suggestions? I can provide more information if necessary.

cyberjock · Aug 4, 2013

Can you post the output of:

smartctl -a -q noserial /dev/da1
smartctl -a -q noserial /dev/da2
smartctl -a -q noserial /dev/da3
smartctl -a -q noserial /dev/da4
smartctl -a -q noserial /dev/da5
smartctl -a -q noserial /dev/da6

(If you want to post all of your disks you are welcome to).

Please post the output as a text file or in CODE tags so it's not an eyesore. FYI, hard drive testing isn't a very good indicator of a failing disk. If the test fails, then you definitely have a failing disk. But if it passes, there's still a significant chance it's failing.

Do you do scrubs regularly?

What's your hardware and zpool configuration?

What's your zpool configuration in relation to your controllers? (What I mean is are all of your errors from hard drives on the M1015 or a mix? etc.)

What version of FreeNAS are you using?

g0del · Aug 4, 2013

cyberjock said:
Can you post the output of:

smartctl -a -q noserial /dev/da1
smartctl -a -q noserial /dev/da2
smartctl -a -q noserial /dev/da3
smartctl -a -q noserial /dev/da4
smartctl -a -q noserial /dev/da5
smartctl -a -q noserial /dev/da6

(If you want to post all of your disks you are welcome to).

I've attached (hopefully) smart.txt, which contains the output of that command for da0 through da7, concatenated into one file. Hopefully that's not an issue.

Do you do scrubs regularly?

Scheduled for every two weeks. The last one resulted in some CKSUM errors in zpool status, but no data errors. One was running this morning when I woke up, but I stopped it to work on other issues. It had completed about 25%, again no data errors (and no new CKSUM errors that I could see), but the logs were seeing quite a few of the errors in the first post.

What's your hardware and zpool configuration?

Asus Dual-Core AMD Fusion APU E350/ AMD FCH A50/ SATA3/ A&V&GbE/ Mini ITX Motherboard (E35M1-I)
Kingston ValueRAM 8GB 1066MHz DDR3 Non-ECC CL7 DIMM (Kit of 2) Desktop Memory
Corsair Builder Series CX V2 430-Watt 80 Plus Certified Power Supply Compatible with Intel and AMD Platforms - CMPSU-430CXV2
6 of Seagate Barracuda Green 2TB SATA 6Gb/s 64MB Cache 3.5-Inch Internal Bare Drive ST2000DL003 (These are the drives hooked directly to the motherboard, no SCSI errors from them.)
8 of Seagate Barracuda 7200 3 TB 7200RPM SATA 6 Gb/s NCQ 64MB Cache 3.5-Inch Internal Bare Drive ST3000DM001 (These are on the M1015)
Plus the M1015.

Whats your zpool configuration in relation to your controllers? (What I mean is are all of your errors from hard drives on the M1015 or a mix? etc.)

Started in 2012 with just the 6 drives connected to the motherboard. In Feb. of this year I added the M1015 with the 8 other drives connected to it. The zpool has two raidz2 vdevs, one with the initial 6 drives and one with the later 8 drives.

What version of FreeNAS are you using?

FreeNAS-8.3.0-RELEASE-p1-x64 (r12825

And thank you preemptively for any help. I realize that this is all volunteer, and am very grateful for any help that you're able to provide.

cyberjock · Aug 4, 2013

Here's my new "to do" list.

1. Update to 8.3.1-p2 (p1 had a fatal flaw. I forget the exact issue, but going to p2 is prudent).
2. Assuming the disk info is in order starting with da1 you have a bad/failing disk.

da4 has an "End-to-End_Error" of 6. I'm not familiar with that exact parameter, but it is listed as "FAILING_NOW". You may not be able to RMA the disk though. If it is in warranty and you want to RMA it, I'd try the Seagate diagnostic tools and see if it fails. If it doesn't fail then RMA isn't an option. If you aren't having any other problems with the disk from scrubs or long tests, then I'd ignore that parameter unless it changes. You could also do a badblocks test on the disk (but it's destructive, so you'll lose 1 disk of redundancy).

3. I had issues with Seagates a few years ago. They had timeout errors in Windows and I bailed on Seagate because of it. Not sure if that's related to your issue or not, but I know several people have had the same problems with current gen drives. What's odd is that they may run fine for a while, then suddenly start having timeout issues without warning and without any other errors being logged by SMART.

4. Post the output of camcontrol devlist.

I don't see anything that could be causing those problems. It could be a power supply issue, disk age, cabling, or one of a dozen other issues. I am a little concerned because of the "UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)". I believe that means that the disk has "reset" itself internally. Any data that was supposed to be written to the disk may have been lost.

Edit: Ah-ha! End_to_End errors...

This attribute is a part of Hewlett-Packard's SMART IV technology, as well as part of other vendors' IO Error Detection and Correction schemas, and it contains a count of parity errors which occur in the data path to the media via the drive's cache RAM.

To me that means a failing hard drive controller (the on-disk controller, not your SATA/SAS controller) or possibly voltage fluctuations leading to errors. Maybe try a spare power supply if you have one?

g0del · Aug 4, 2013

8.3.1-p2 is downloading just in case. Here's camcontrol devlist

Code:

<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 3 lun 0 (pass2,da2)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 4 lun 0 (pass3,da3)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 5 lun 0 (pass4,da4)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 6 lun 0 (pass5,da5)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 7 lun 0 (pass6,da6)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 8 lun 0 (pass7,da7)
<ST2000DL003-9VT166 CC3C>          at scbus1 target 0 lun 0 (pass8,ada0)
<ST2000DL003-9VT166 CC3C>          at scbus2 target 0 lun 0 (pass9,ada1)
<ST2000DL003-9VT166 CC3C>          at scbus3 target 0 lun 0 (pass10,ada2)
<ST2000DL003-9VT166 CC3C>          at scbus4 target 0 lun 0 (pass11,ada3)
<ST2000DL003-9VT166 CC3C>          at scbus5 target 0 lun 0 (pass12,ada4)
<ST2000DL003-9VT166 CC3C>          at scbus6 target 0 lun 0 (pass13,ada5)
<LEXAR JD FIREFLY 1100>            at scbus7 target 0 lun 0 (pass14,da8)
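
The listing above is what ties the target numbers in the kernel log (e.g. `da2:mps0:0:3:0`, target 3) to device names. As a rough illustration, the Python sketch below parses `camcontrol devlist` text into a (bus, target) to device-name map. `DEVLIST_RE`, `parse_devlist`, and `SAMPLE` are hypothetical names, and the sample contains only three lines copied from the listing.

```python
import re

# One devlist line looks like:
# <ATA ST3000DM001-1CH1 CC24>   at scbus0 target 3 lun 0 (pass2,da2)
DEVLIST_RE = re.compile(
    r"<(?P<model>[^>]+)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<names>[^)]+)\)"
)

SAMPLE = """\
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 3 lun 0 (pass2,da2)
<ATA ST3000DM001-1CH1 CC24>        at scbus0 target 6 lun 0 (pass5,da5)
<ST2000DL003-9VT166 CC3C>          at scbus1 target 0 lun 0 (pass8,ada0)
"""


def parse_devlist(text):
    """Return {(bus, target): disk_name} from camcontrol devlist output."""
    mapping = {}
    for line in text.splitlines():
        match = DEVLIST_RE.search(line)
        if not match:
            continue
        # The parenthesised list holds a passN node plus the disk node.
        names = match.group("names").split(",")
        disk = next((n for n in names if not n.startswith("pass")), None)
        if disk is not None:
            key = (int(match.group("bus")), int(match.group("target")))
            mapping[key] = disk
    return mapping


if __name__ == "__main__":
    for (bus, target), disk in sorted(parse_devlist(SAMPLE).items()):
        print("scbus%d target %d -> %s" % (bus, target, disk))
```

With the sample lines, target 3 on scbus0 resolves to da2 and target 6 to da5, which agrees with the device prefixes in the kernel log excerpt.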

I have a spare 3 TB drive (unopened) from the last time I had to replace a failed drive (one of them completely failed a few weeks after I received them), so I can replace the drive with the SMART error (/dev/da3, which is one that is also showing the timeout errors in the logs). I hadn't done anything yet because I see a lot more errors in the logs when the drives are busy, and resilvering is about as busy as they get. I didn't want to replace one only to have three more fail during the resilver.

If it's a cabling issue, then it's happening with all three different cables I've tried. Disk age would be unlikely I'd think, the drives are only about 6 months old. I'll see if I can find a spare power supply somewhere to test that.

Would you recommend replacing /dev/da3 (with the SMART failure) with another drive, or waiting for more troubleshooting first?

cyberjock · Aug 4, 2013

I'd say that the failing disk may have silent corruption, but unless you are racking up tons of errors in zpool status, I'd leave the "bad" disk in place for now until there is a more solid answer on what's going on, or until the disk fails outright and isn't functioning at all.

As for disk age, I started having problems at about 90 days. Of course, too late to return the drives since it started happening to a whole bunch at the same time. Really sucked as I was scared that if I tried to resilver that more disks would fail.

My first hunch is that you should try a spare power supply. Flaky power supplies can cause issues with hard drives and other components that are sensitive to voltage fluctuations.

An easy test to rule out other odd issues is a RAM test. I'm not getting the feeling it's a RAM issue, but it's easy to rule out. Just download Memtest86+ from memtest.org and give that a go. 3 passes with no errors is good. This usually takes overnight.

So, of the disks that are throwing errors, how many are on your motherboard controller? Have you done anything that could somehow be related to the errors (BIOS setting change, BIOS update, FreeNAS upgrade, hardware upgrade, etc.)?

g0del · Aug 4, 2013

I ran a single pass memtest this morning just in case, no errors. As I said before, I'm really not seeing any errors in zpool status. It's currently showing 11 CKSUM errors on /dev/da5, but nothing else. No known data errors.

All the disks throwing errors are on the M1015, none directly on the motherboard. No configuration changes or updates of anything in quite a while.

I'll keep an eye on things until I can find another power supply to test it with. If that doesn't work I'll try pulling one of the affected drives and seeing if I can get a failure in seatools for an RMA.

cyberjock · Aug 4, 2013

Maybe your M1015 is going bad?

I'd say just keep an eye on things and see what happens. I'd definitely back up your important data just in case things get out of hand.

g0del · Aug 4, 2013

I hope not. Not sure how I'd go about testing for that (outside of buying another one, which I can't really do right now). Scrounging up another power supply for testing is one thing, but I doubt any of my friends have spare M1015s lying around.

jpaetzel · Aug 4, 2013

There is one bad drive screaming so loudly that it is disrupting the communication of the others. You can offline your drives one at a time and see if the errors go away. Make sure to resilver the pool after you online the drive. You'll want to try offlining drives while the errors are happening so you can tell which drive it is that fixes it.

Keep in mind you are intentionally degrading your pool if you try this, so make backups.

I've added a tool to the FreeNAS development branch to ease this sort of diagnosis in the future.

g0del · Aug 4, 2013

jpaetzel said:
There is one bad drive screaming so loudly that it is disrupting the communication of the others. You can offline your drives one at a time and see if the errors go away. Make sure to resilver the pool after you online the drive. You'll want to try offlining drives while the errors are happening so you can tell which drive it is that fixes it.

Keep in mind you are intentionally degrading your pool if you try this, so make backups.

I've added a tool to the FreeNAS development branch to ease this sort of diagnosis in the future.

Doing things while errors are occurring should be pretty easy, just about any drive access seems to set them off right now. The resilvering process will start automatically when a drive is onlined, correct? And should I do this through the gui, or is it better to work command line for this?

And out of curiosity, what is the tool you've added and how does it work?

g0del · Aug 4, 2013

Update: I tried offlining a drive as suggested by jpaetzel. I started with da5 because it seemed to show up the most in the logs, and it was the only one with CKSUM errors in zpool status. Before offlining it, I was seeing bursts of errors in the logs on 4 drives (da1, da2, da3, and da5) and new bursts were showing up a couple of times a minute if there was any load on the drives at all. Since I offlined da5 I've only seen bursts of errors twice in the last hour - all on da1. Is this multiple drives going bad, or did I offline a drive that wasn't actually causing the issue?

g0del · Aug 5, 2013

Well, now I've got a new issue. The computer ran overnight with da5 offlined without any more errors. So I assumed that was the problem drive, and turned off the machine to replace it. I replaced it with a new drive, turned the machine back on, and before I could start the zpool replacement/resilver process, I was seeing errors again on da1, da2, and da3.

So I shut down the machine again, and put the old drive back, thinking I would follow jpaetzel's advice and online it, then try offlining a different drive. But I can't get it to go online. Here's zpool status:

Code:

        NAME                                            STATE     READ WRITE CKSUM
        tank1                                           DEGRADED     0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/3f020426-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
            gptid/3f971fdb-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
            gptid/401fc5d6-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
            gptid/40afcc74-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
            gptid/413d75d2-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
            gptid/41cbea5f-9995-11e1-b589-6805ca036cf9  ONLINE       0     0     0
          raidz2-1                                      DEGRADED     0     0     0
            gptid/2998084c-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
            gptid/2a432559-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
            gptid/ef08a1d3-9ba6-11e2-a61a-f46d04752957  ONLINE       0     0     0
            gptid/2ba827e5-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
            gptid/2c51ca5d-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
            gptid/2cf84266-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
            18160692339625570720                        OFFLINE      0     0     0  was /dev/dsk/gptid/2da6bd36-77ef-11e2-b6c3-f46d04752957
            gptid/2e557e78-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0

and trying to bring 18160692339625570720 online just leaves it unavailable. Glabel status:

Code:

                                      Name  Status  Components
gptid/3f020426-9995-11e1-b589-6805ca036cf9     N/A  ada0p2
gptid/3f971fdb-9995-11e1-b589-6805ca036cf9     N/A  ada1p2
gptid/401fc5d6-9995-11e1-b589-6805ca036cf9     N/A  ada2p2
gptid/40afcc74-9995-11e1-b589-6805ca036cf9     N/A  ada3p2
gptid/413d75d2-9995-11e1-b589-6805ca036cf9     N/A  ada4p2
gptid/41cbea5f-9995-11e1-b589-6805ca036cf9     N/A  ada5p2
gptid/2998084c-77ef-11e2-b6c3-f46d04752957     N/A  da0p2
gptid/2a432559-77ef-11e2-b6c3-f46d04752957     N/A  da1p2
gptid/2ba827e5-77ef-11e2-b6c3-f46d04752957     N/A  da2p2
gptid/2c51ca5d-77ef-11e2-b6c3-f46d04752957     N/A  da3p2
gptid/2cf84266-77ef-11e2-b6c3-f46d04752957     N/A  da4p2
gptid/2e557e78-77ef-11e2-b6c3-f46d04752957     N/A  da6p2
gptid/ef08a1d3-9ba6-11e2-a61a-f46d04752957     N/A  da7p2
                             ufs/FreeNASs3     N/A  da8s3
                             ufs/FreeNASs4     N/A  da8s4
                            ufs/FreeNASs1a     N/A  da8s1a
                    ufsid/5009cfdd91a55783     N/A  da8s2a
                            ufs/FreeNASs2a     N/A  da8s2a
gptid/98db47d3-fded-11e2-b958-f46d04752957     N/A  da5p2
gptid/98bcb0aa-fded-11e2-b958-f46d04752957     N/A  da5p1

shows that the gptid has changed, and I don't know how to bring it online like that. I'd rather not force it as a replacement, since I'm pretty sure that would trigger a full resilver. I'd appreciate any further help or suggestions.
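
The mismatch described above can be spotted mechanically by comparing the gptids the pool expects against the labels that currently exist. Below is a minimal Python sketch of that comparison; `pool_gptids`, `label_map`, and `compare` are hypothetical helper names, and the sample strings are a handful of lines copied from the two listings rather than the full output.

```python
import re

GPTID_RE = re.compile(
    r"gptid/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)

ZPOOL_SAMPLE = """\
gptid/2cf84266-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
18160692339625570720                        OFFLINE      0     0     0  was /dev/dsk/gptid/2da6bd36-77ef-11e2-b6c3-f46d04752957
gptid/2e557e78-77ef-11e2-b6c3-f46d04752957  ONLINE       0     0     0
"""

GLABEL_SAMPLE = """\
gptid/2cf84266-77ef-11e2-b6c3-f46d04752957     N/A  da4p2
gptid/2e557e78-77ef-11e2-b6c3-f46d04752957     N/A  da6p2
gptid/98db47d3-fded-11e2-b958-f46d04752957     N/A  da5p2
gptid/98bcb0aa-fded-11e2-b958-f46d04752957     N/A  da5p1
"""


def pool_gptids(zpool_status):
    """All gptids the pool references, including 'was ...' entries."""
    return set(GPTID_RE.findall(zpool_status))


def label_map(glabel_status):
    """Map gptid -> partition for every gptid/ row of glabel status."""
    labels = {}
    for line in glabel_status.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0].startswith("gptid/"):
            labels[parts[0][len("gptid/"):]] = parts[2]
    return labels


def compare(zpool_status, glabel_status):
    """Return (gptids the pool wants but no label provides,
               labels present on disk that the pool does not reference)."""
    wanted = pool_gptids(zpool_status)
    present = label_map(glabel_status)
    missing = wanted - set(present)
    unused = {g: p for g, p in present.items() if g not in wanted}
    return missing, unused


if __name__ == "__main__":
    missing, unused = compare(ZPOOL_SAMPLE, GLABEL_SAMPLE)
    print("missing from glabel:", sorted(missing))
    print("not in pool:", sorted(unused.items()))
```

On the sample lines, the pool still expects gptid 2da6bd36-..., which no partition carries any more, while da5p1 and da5p2 carry new gptids that the pool does not reference.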

cyberjock · Aug 5, 2013

Hmm.. that's so freakin' bizarre. I'm wondering if this is somehow related to forums.freenas.org/threads/cannot-import-a-zpool-one-disk-offline-cannot-be-put-online.14104/

That thread still boggles my mind. I can't wrap my head around that thread and what actually went wrong. It looks like user error, but some of the command outputs show something more nefarious.

g0del · Aug 5, 2013

I'll fully admit that I might have messed things up, but I don't know where. I offlined a drive as suggested by a member of the core team. When that seemed to fix the problem, I turned off the machine and physically replaced that drive with a new one, but never added the new drive to the zpool as a replacement. I simply booted the server back up, noted that I was still receiving errors in the logs, and then turned it off and put the old drive back in. Somehow doing that gave it a new gptid and I can't get it to go online.

I'm not planning to run any more commands without advice from someone else, since I'm now down to only one drive of parity on that volume (and three drives throwing SCSI sense errors in the logs). The irreplaceable data on the server has been backed up elsewhere, but I really don't want to lose everything on it.

cyberjock · Aug 5, 2013

I'd wait until we get a response from a core member (or jpaetzel). I can understand your reservation with his feedback, but he's the owner of the FreeNAS project. It doesn't go any higher than him. :P

I'd say you have 2 options right now. You can continue with the disk replacement and let it resilver, or wait for someone to respond. I'm starting to wonder if something is just "not quite right" with this. You're the 3rd person I can think of to attempt a disk replacement with 9.1 and have problems that may not be "user error" related. This time the gptid seems to have changed. I'm confused because the gptid is generally created when the partition is created, so I want to assume that you didn't put the original disk back in. But someone else lost their p2 (but still had a p1) and someone else saw multiple disks go offline with no logical reason.

Of course, it's really hard to narrow down some problems because some people aren't very diligent with tracking their disks, they do things that don't make sense in FreeBSD (especially uninformed Windows users), and then there's the general human-error factor.

Is it possible that you have confused your "original" with another disk?

g0del · Aug 5, 2013

I'm not on 9.1, but otherwise the issues seem similar. As for confusing it with another disk, I doubt it. I don't have tons of 3 TB drives running around to get mixed up. I have 8 in the server, and one spare that was in a box. I offlined da5, and when I went to physically replace it I got the serial number for da5 from the FreeNAS GUI. I pulled the drive with that serial number from the server and put in the new one. While I didn't perform any zpool replacement activities with the new drive, I did run a zpool status while it was in. It showed one drive (da5) offline, nothing else. If I had pulled the wrong drive, it would have shown two drives off. It did not.

I then put back the original da5. It's da5 again, and the serial number as reported by smartctl matches the serial number which is on the drive and matches the serial number of da5 from the freenas gui before I offlined it. But the gptid has changed.

Wait - I did just think of something that might have caused it, maybe? When I offlined it, I did it through the gui. When I put it back in, I went to the gui and didn't see an online button, just a replace button. I used replace, and selected the only open drive (da5, the one that was originally in the pool). The gui replacement failed, with an error message that faded before I could read all of it, but I did catch something about it couldn't be used as a replacement because it already belonged to a zpool (yes, the one I was trying to put it in) and to use a force option if I really wanted to put it in. I did not use the force option. Could that have been enough to change the gptid?

cyberjock · Aug 5, 2013

That would explain your new gptid.

Once you click the "replace" button and choose the disk, it creates new partitions and gets to work.

What you should do now is put the "new" disk in (the one that you were planning to put in the zpool) and follow the replacement procedure for that.

g0del · Aug 5, 2013

So, it errors out because the replacement drive was part of another pool, but not until after it's modified that drive enough so that the drive can't be used in the original pool? Mea culpa on hitting replace before I tried to online it through the command line, but it seems to me that if it's going to require a -force option to keep the user from doing something dumb, it probably shouldn't destructively modify that drive until the -force option has actually been given.

I'll wait a bit to see if any core members have any other thoughts, and if not I'll throw in the new drive tonight and pray that everything survives the resilver.

cyberjock · Aug 5, 2013

g0del said:
So, it errors out because the replacement drive was part of another pool, but not until after it's modified that drive enough so that the drive can't be used in the original pool? Mea culpa on hitting replace before I tried to online it through the command line, but it seems to me that if it's going to require a -force option to keep the user from doing something dumb, it probably shouldn't destructively modify that drive until the -force option has actually been given.

I'll wait a bit to see if any core members have any other thoughts, and if not I'll throw in the new drive tonight and pray that everything survives the resilver.

I agree. It's a little backwards. I might try to test it later in a VM to see if it actually does that. If it does I'll put in a ticket. :P
