zpool cpu utilization @ 100%, no scrubs running

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I have a FreeNAS box running at 100% CPU utilization. A quick "top" shows that the top process is "zpool":


last pid: 56013; load averages: 1.11, 1.14, 1.10 up 6+23:30:58 08:55:22
41 processes: 2 running, 39 sleeping
CPU: % user, % nice, % system, % interrupt, % idle
Mem: 880M Active, 148M Inact, 12G Wired, 1168K Cache, 671M Buf, 2725M Free
ARC: 11G Total, 1939M MFU, 9534M MRU, 1861K Anon, 66M Header, 29M Other
Swap: 22G Total, 22G Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
55677 root 1 103 0 781M 744M CPU1 1 11:34 100.00% zpool
2005 root 6 20 0 9900K 1580K rpcsvc 1 96:17 0.00% nfsd
2713 root 7 20 0 114M 14604K uwait 1 7:07 0.00% collectd



However, "zpool status" does not show a running scrub:


[root@freenas3] /var/log# zpool status
  pool: tank2
 state: ONLINE
  scan: scrub repaired 0 in 0h24m with 0 errors on Sun Oct 27 00:24:42 2013
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank2                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/bf73d8ab-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
            gptid/c19ce3f3-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/c3d10d8f-fb40-11e2-8281-00505693005d  ONLINE       0     0     0
            gptid/c6077b53-fb40-11e2-8281-00505693005d  ONLINE       0     0     0

errors: No known data errors

  pool: tank3
 state: ONLINE
  scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        tank3                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/e7c6599b-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/e8c9af47-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/e99596de-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/ea625d4a-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
          mirror-2                                      ONLINE       0     0     0
            gptid/eb8f3c06-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
            gptid/ecc008e6-39f0-11e3-98ad-00505693005d  ONLINE       0     0     0
        spares
          gptid/edb17fe6-39f0-11e3-98ad-00505693005d    AVAIL

errors: No known data errors
[root@freenas3] /var/log#



If I kill the "zpool" process, it just starts up again.


The "gstat" command shows each drive in a mostly idle state (green across the board).
I think this has happened once before, and I ended up rebooting the box. What other things can I check? Happy to file a bug if no one has any ideas.
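In case it helps, this is roughly how I've been poking at the process itself so far (PID taken from the top output above; procstat is just my guess at a useful way to see where it's stuck):

[root@freenas3] ~# ps -o pid,ppid,state,command -p 55677   # full command line of the stuck zpool process
[root@freenas3] ~# procstat -kk 55677                      # kernel stack, to see what it's waiting on
[root@freenas3] ~# procstat -f 55677                       # open files/descriptors, in case it's holding a device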
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Can you post your hardware config?
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
Hardware:
ASUS X79 motherboard
Intel i7-3930K
2x LSI 1068E cards
One pool has Seagate 1.5TB drives, the other has Western Digital 1.0TB drives.

This is an ESXi 5.5 host, and I'm passing the LSI 1068E cards through directly to a FreeNAS 9.1.1 x64 RELEASE VM.
The FreeNAS VM is provisioned with 2 CPUs and 16GB RAM.

Thanks!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You're not...deduping...are you?
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
You're not...deduping...are you?

No deduping, no snapshots, no replication, no compression - pretty straightforward NFS usage. It started about 36 hours ago: according to the system load graphs in Reporting, at 00:10 it went from almost no utilization straight to 100%.
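For what it's worth, this is roughly how I double-checked those settings (pool names are the ones from the zpool status output above):

[root@freenas3] ~# zfs get dedup,compression tank2 tank3   # expecting both to report "off"
[root@freenas3] ~# zfs list -t snapshot                    # expecting no snapshots listed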
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
What happens if you remove the spare? Spares aren't hot-swapped into a pool when a disk dies, so there is no reason at present to add spares to a pool.
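Something like this should do it, if I'm reading your zpool status output right (double-check the gptid against your own output before running it):

zpool status tank3                                             # note the exact gptid listed under "spares"
zpool remove tank3 gptid/edb17fe6-39f0-11e3-98ad-00505693005d  # detach the spare from the pool
zpool status tank3                                             # confirm the spares section is gone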
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I did a "zpool remove tank3 gptid/(spare label)" and now "zpool status" shows the 3 mirror groups w/o the spare. Still stuck at 100% though. I'm happy to run over to the closet and physically yank the drive if you think it would help.

Thanks again for the suggestions. Keep 'em coming. I'm trying to think what thing(s) have changed recently. I've had this configuration for ~6months and it's been pretty solid. I realize that I'm walking a fine line (support-wise) by doing PCI-passthrough with the VMware layer in there. I recently upgraded to ESXi 5.5, but that was about a month ago. I have a 2nd server with a similar config and it's chugging along just fine.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Maybe reboot the FreeNAS VM and/or the ESXi server? I will warn you that we try to steer people away from virtualizing because it is still somewhat risky for FreeNAS. It could be that you are having a problem related to the virtualization.

As for putting in a ticket, it's kind of useless unless you can point to the exact cause. I definitely won't stop you from filing one, but I'd bet there isn't much value unless you can provide enough information for the developers to narrow down the problem.

If you do gstat, does it show any disks with heavy activity? If you do zpool iostat 1 and watch it, is the pool doing a lot? Have you checked the SMART data on your disks? Are all of the attributes as expected?
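Roughly what I have in mind for the SMART check, as a sketch (kern.disks will also list your boot device, which you can skip):

camcontrol devlist                       # map the disks behind the LSI cards to daX names
for d in $(sysctl -n kern.disks); do
  echo "=== $d ==="
  smartctl -A /dev/$d | egrep "Reallocated|Pending|Uncorrect"   # the attributes that usually matter
  smartctl -l error /dev/$d | grep -i "error count"             # prints nothing if no errors are logged
done
zpool iostat -v 1                        # per-vdev view, in case a single disk is dragging a mirror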
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
gstat shows all drives green, no heavy usage. zpool iostat 1 is very quiet too.

[root@freenas3] ~# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0     46  82.3K   412K
tank3       1.01T  1.71T      3     78   354K  3.46M
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0      0      0      0
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0      0   127K      0
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
tank2       28.6G   431G      0    319      0  7.88M
tank3       1.01T  1.71T      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

I'll dig into the SMART attributes and see if I can find anything interesting as well.
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
I'll dig into the SMART attributes and see if I can find anything interesting as well.

There it is! smartctl -a /dev/da11 shows this:

[root@freenas3] ~# smartctl -a /dev/da11 | grep Error
(snip)
ATA Error Count: 6 (device log contains only the most recent five errors)
ER = Error register [HEX]
Error 6 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 5 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 4 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 3 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT
Error 2 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT

"(1 days + 13 hours)" puts it right on top of the timing for the original CPU spike ~36 hours ago. I'll go yank that drive. Thanks for your help!
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
You would have received an automatic email about this from the SMART system. Did you not configure that? It's in the configuration under Services / S.M.A.R.T.
 

Dusan

Guru
Joined
Jan 29, 2013
Messages
1,165
Error 2 occurred at disk power-on lifetime: 37 hours (1 days + 13 hours)
04 51 01 00 00 00 a0 Error: ABRT

"(1 days + 13 hours)" puts it right on top of the timing for the original CPU spike ~36 hours ago. I'll go yank that drive. Thanks for your help!
Unless the drives are brand new, this is not related. Those errors happened at 37 hours into the lifetime of the drive (e.g. if you run your system 24/7, it happened 1 day 13 hours after you first powered the drive on). Depending on the drive manufacturer, you can run smartctl -a /dev/da11 | grep Power_On_Hours to get the current power-on hours. Example for one of my drives:
[root@hostname] ~# smartctl -a /dev/ada1 | grep Power_On_Hours
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2260
This means my drive has already accumulated 2260 power-on hours.
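If you want to check all of your drives at once, a loop along these lines should work (just a sketch; kern.disks may include devices without SMART, which you can ignore):

for d in $(sysctl -n kern.disks); do
  printf "%s: " $d
  smartctl -A /dev/$d | awk '/Power_On_Hours/ {print $10}'   # print the raw power-on hours for each disk
done

Then compare each drive's current hours against the hour recorded in its error log to see which logged errors are actually recent.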
 

hidden72

Dabbler
Joined
Aug 8, 2011
Messages
22
Unless the drives are brand new, this is not related. Those errors happened at 37 hours into the lifetime of the drive (e.g. if you run your system 24/7, it happened 1 day 13 hours after you first powered the drive on). Depending on the drive manufacturer, you can run smartctl -a /dev/da11 | grep Power_On_Hours to get the current power-on hours. Example for one of my drives:
[root@hostname] ~# smartctl -a /dev/ada1 | grep Power_On_Hours
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2260
This means my drive has already accumulated 2260 power-on hours.

Ahh... good catch. I'll let this resilver complete and then re-run through all of the SMART data to see if anything else lines up with the start of the spike. (Note to self: save console output for easy after-the-fact reference.)
 