zpool cpu utilization @ 100%, no scrubs running

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I have a FreeNAS box running at 100% CPU utilization. A quick "top" shows that the top process is "zpool":


last pid: 56013; load averages: 1.11, 1.14, 1.10 up 6+23:30:58 08:55:22
41 processes: 2 running, 39 sleeping
CPU: % user, % nice, % system, % interrupt, % idle
Mem: 880M Active, 148M Inact, 12G Wired, 1168K Cache, 671M Buf, 2725M Free
ARC: 11G Total, 1939M MFU, 9534M MRU, 1861K Anon, 66M Header, 29M Other
Swap: 22G Total, 22G Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
55677 root 1 103 0 781M 744M CPU1 1 11:34 100.00% zpool
2005 root 6 20 0 9900K 1580K rpcsvc 1 96:17 0.00% nfsd
2713 root 7 20 0 114M 14604K uwait 1 7:07 0.00% collectd



However, "zpool status" does not show a running scrub:


[root@freenas3] /var/log# zpool status
  pool: tank2
 state: ONLINE
  scan: scrub repaired 0 in 0h24m with 0 errors on Sun Oct 27 00:24:42 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank2                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/bf73d8ab-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
            gptid/c19ce3f3-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/c3d10d8f-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
            gptid/c6077b53-fb40-11e2-8281-00505693005d  ONLINE       0     0     0

errors: No known data errors

  pool: tank3
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank3                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e7c6599b-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/e8c9af47-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/e99596de-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/ea625d4a-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/eb8f3c06-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/ecc008e6-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
        spares
          gptid/edb17fe6-39f0-11e3-98ad-00505693005d    AVAIL

errors: No known data errors
[root@freenas3] /var/log#



If I kill the "zpool" process, it just starts up again.


The "gstat" command shows each drive in a mostly idle state (green across the board).
I think this has happened once before, and I ended up rebooting the box. What other things can I check? Happy to file a bug if no one has any ideas.
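In case it helps, this is roughly how I've been poking at the process itself so far (PID taken from the top output above; procstat is just my guess at a useful way to see where it's stuck):

[root@freenas3] ~# ps -o pid,ppid,state,command -p 55677   # full command line of the stuck zpool process
[root@freenas3] ~# procstat -kk 55677                      # kernel stack, to see what it's waiting on
[root@freenas3] ~# procstat -f 55677                       # open files/descriptors, in case it's holding a device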
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you post your hardware config?
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
Hardware:
ASUS X79 motherboard
Intel i7-3930K
2x LSI 1068E cards
One pool has Seagate 1.5TB drives, the other has Western Digital 1.0TB drives.

This is an ESXi 5.5 host, and I'm passing the LSI 1068E cards through directly to a FreeNAS 9.1.1 x64 RELEASE VM.
The FreeNAS VM is provisioned with 2 CPUs and 16GB RAM.

Thanks!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You're not...deduping...are you?
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
You're not...deduping...are you?

No deduping, no snapshots, no replication, no compression - pretty straightforward NFS usage. It started about 36 hours ago: according to the system load graphs in Reporting, at 00:10 it went from almost no utilization straight to 100%.
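For what it's worth, this is roughly how I double-checked those settings (pool names are the ones from the zpool status output above):

[root@freenas3] ~# zfs get dedup,compression tank2 tank3   # expecting both to report "off"
[root@freenas3] ~# zfs list -t snapshot                    # expecting no snapshots listed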
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What happens if you remove the spare? Spares aren't hot-swapped into a pool when a disk dies, so there is no reason at present to add spares to a pool.
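Something like this should do it, if I'm reading your zpool status output right (double-check the gptid against your own output before running it):

zpool status tank3                                             # note the exact gptid listed under "spares"
zpool remove tank3 gptid/edb17fe6-39f0-11e3-98ad-00505693005d  # detach the spare from the pool
zpool status tank3                                             # confirm the spares section is gone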
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I did a "zpool remove tank3 gptid/(spare label)" and now "zpool status" shows the 3 mirror groups w/o the spare. Still stuck at 100% though. I'm happy to run over to the closet and physically yank the drive if you think it would help.

Thanks again for the suggestions. Keep 'em coming. I'm trying to think what thing(s) have changed recently. I've had this configuration for ~6months and it's been pretty solid. I realize that I'm walking a fine line (support-wise) by doing PCI-passthrough with the VMware layer in there. I recently upgraded to ESXi 5.5, but that was about a month ago. I have a 2nd server with a similar config and it's chugging along just fine.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Maybe reboot the FreeNAS VM and/or the ESXi server? I will warn you that we try to steer people away from virtualizing because it is still somewhat risky for FreeNAS. It could be that you are having a problem related to the virtualization.

As for putting in a ticket, it's kind of useless unless you can point to the exact cause. I definitely won't stop you from filing one, but I'd bet there isn't much value unless you can provide enough information for the developers to narrow down the problem.

If you do gstat, does it show any disks with heavy activity? If you do zpool iostat 1 and watch it, is the pool doing a lot? Have you checked the SMART data on your disks? Are all of the attributes as expected?
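Roughly what I have in mind for the SMART check, as a sketch (kern.disks will also list your boot device, which you can skip):

camcontrol devlist                       # map the disks behind the LSI cards to daX names
for d in $(sysctl -n kern.disks); do
  echo "=== $d ==="
  smartctl -A /dev/$d | egrep "Reallocated|Pending|Uncorrect"   # the attributes that usually matter
  smartctl -l error /dev/$d | grep -i "error count"             # prints nothing if no errors are logged
done
zpool iostat -v 1                        # per-vdev view, in case a single disk is dragging a mirror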
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
gstat shows all drives green, no heavy usage. zpool iostat 1 is very quiet too.

[root@freenas3] ~# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0     46  82.3K   412K
tank3       1.01T  1.71T      3     78   354K  3.46M
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0      0      0      0
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0      0   127K      0
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0    319      0  7.88M
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

I'll dig into the SMART attributes and see if I can find anything interesting as well.
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I'll dig into the SMART attributes and see if I can find anything interesting as well.

There it is! smartctl -a /dev/da11 shows this:

[root@freenas3] ~# smartctl -a /dev/da11 | grep Error
(snip)
ATA Error Count: 6 (device log contains only the most recent five errors)
ER = Error register [HEX]
Error 6 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 5 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 4 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 3 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 2 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT

"(1 days + 13 hours)" puts it right on top of the timing for the original CPU spike ~36 hours ago. I'll go yank that drive. Thanks for your help!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You would have received an automatic email about this from the SMART system. Did you not configure that? It's in the configuration under Services / S.M.A.R.T.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Error 2 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT

"(1 days + 13 hours)" puts it right on top of the timing for the original CPU spike ~36 hours ago. I'll go yank that drive. Thanks for your help!
Unless the drives are brand new, this is not related. Those errors happened at 37 hours into the lifetime of the drive (e.g. if you run your system 24/7, it happened 1 day 13 hours after you first powered the drive on). Depending on the drive manufacturer, you can run smartctl -a /dev/da11 | grep Power_On_Hours to get the current power-on hours. Example for one of my drives:
[root@hostname] ~# smartctl -a /dev/ada1 | grep Power_On_Hours
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2260
This means my drive has already accumulated 2260 power-on hours.
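If you want to check all of your drives at once, a loop along these lines should work (just a sketch; kern.disks may include devices without SMART, which you can ignore):

for d in $(sysctl -n kern.disks); do
  printf "%s: " $d
  smartctl -A /dev/$d | awk '/Power_On_Hours/ {print $10}'   # print the raw power-on hours for each disk
done

Then compare each drive's current hours against the hour recorded in its error log to see which logged errors are actually recent.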
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
Unless the drives are brand new, this is not related. Those errors happened at 37 hours into the lifetime of the drive (e.g. if you run your system 24/7, it happened 1 day 13 hours after you first powered the drive on). Depending on the drive manufacturer, you can run smartctl -a /dev/da11 | grep Power_On_Hours to get the current power-on hours. Example for one of my drives:
[root@hostname] ~# smartctl -a /dev/ada1 | grep Power_On_Hours
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2260
This means my drive has already accumulated 2260 power-on hours.

Ahh... good catch. I'll let this resilver complete and then re-run through all of the SMART data to see if anything else lines up with the start of the spike. (Note to self: save console output for easy after-the-fact reference.)
 