mts1472
Cadet
Joined: Jul 29, 2023
Messages: 2
Hello,
Looking for some guidance on an unusual issue.
We've had a TrueNAS 13.0-U5.2 system running in production for a few months, and we decided that adding a SLOG on M.2 would be a good move to improve our write performance.
We installed 2 x 500GB Crucial P5 Plus M.2 SSDs in an M.2 PCIe adapter (https://www.startech.com/en-us/hdd/pex8m2e2).
These were configured as a mirrored LOG vdev with 15 GiB of overprovisioning set.
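For reference, the resulting layout is equivalent to adding the SSDs as a mirrored log vdev along these lines (pool and device names are illustrative, not our exact ones):
Code:
# add the two NVMe SSDs as a mirrored SLOG (example pool/device names)
zpool add tank log mirror /dev/nvd0 /dev/nvd1
# confirm the log vdev is attached
zpool status tank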
After a few days we noticed periodic 'stalls' in IO. After testing, we found that under moderate write load (two concurrent VM migrations from another NAS), the NFS share would become inaccessible to Proxmox, with the migrations appearing to continue at a crawling speed. While monitoring zpool iostat, all pool activity appears to cease. Log entries like the following appear throughout the 'stall'.
Code:
Jul 28 16:33:18 truenas nvme1: Resetting controller due to a timeout. Jul 28 16:33:18 truenas nvme1: RECOVERY_WAITING Jul 28 16:33:18 truenas nvme1: resetting controller Jul 28 16:33:18 truenas nvme1: aborting outstanding i/o Jul 28 16:34:42 truenas nvme0: RECOVERY_START 61662237802418 vs 61661980108095 Jul 28 16:34:42 truenas nvme0: timeout with nothing complete, resetting Jul 28 16:34:42 truenas nvme0: Resetting controller due to a timeout. Jul 28 16:34:42 truenas nvme0: RECOVERY_WAITING Jul 28 16:34:42 truenas nvme0: resetting controller Jul 28 16:34:42 truenas nvme0: aborting outstanding i/o Jul 28 16:35:13 truenas nvme0: RECOVERY_START 61795648062625 vs 61793861364921 Jul 28 16:35:13 truenas nvme0: timeout with nothing complete, resetting Jul 28 16:35:13 truenas nvme0: Resetting controller due to a timeout. Jul 28 16:35:13 truenas nvme0: RECOVERY_WAITING Jul 28 16:35:13 truenas nvme0: resetting controller Jul 28 16:35:13 truenas nvme0: aborting outstanding i/o Jul 28 16:36:54 truenas nvme1: RECOVERY_START 62229113312952 vs 62227429693240 Jul 28 16:36:54 truenas nvme1: timeout with nothing complete, resetting Jul 28 16:36:54 truenas nvme1: Resetting controller due to a timeout. Jul 28 16:36:54 truenas nvme1: RECOVERY_WAITING Jul 28 16:36:54 truenas nvme1: resetting controller Jul 28 17:18:07 truenas nvme1: aborting outstanding i/o Jul 28 17:18:07 truenas nvme1: RECOVERY_START 72853744986354 vs 72851855207762 Jul 28 17:18:07 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:18:07 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:18:07 truenas nvme1: RECOVERY_WAITING Jul 28 17:18:07 truenas nvme1: resetting controller Jul 28 17:18:07 truenas nvme1: aborting outstanding i/o Jul 28 17:19:16 truenas nvme1: RECOVERY_START 73150535787716 vs 73148701852485 Jul 28 17:19:16 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:19:16 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:19:16 truenas nvme1: RECOVERY_WAITING Jul 28 17:19:16 truenas nvme1: resetting controller Jul 28 17:19:16 truenas nvme1: aborting outstanding i/o Jul 28 17:19:47 truenas nvme1: RECOVERY_START 73282232361163 vs 73282159356784 Jul 28 17:19:47 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:19:47 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:19:47 truenas nvme1: RECOVERY_WAITING Jul 28 17:19:47 truenas nvme1: resetting controller Jul 28 17:20:18 truenas nvme1: aborting outstanding i/o Jul 28 17:20:18 truenas nvme1: RECOVERY_START 73416046354348 vs 73413916058414 Jul 28 17:20:18 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:20:18 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:20:18 truenas nvme1: RECOVERY_WAITING Jul 28 17:20:18 truenas nvme1: resetting controller Jul 28 17:20:18 truenas nvme1: aborting outstanding i/o Jul 28 17:21:31 truenas nvme1: RECOVERY_START 73730463714189 vs 73729282600882 Jul 28 17:21:31 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:21:31 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:21:31 truenas nvme1: RECOVERY_WAITING Jul 28 17:21:31 truenas nvme1: resetting controller Jul 28 17:22:29 truenas nvme1: aborting outstanding i/o Jul 28 17:22:29 truenas nvme1: RECOVERY_START 73978347730675 vs 73976402117990 Jul 28 17:22:29 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:22:29 truenas nvme1: Resetting controller due to a timeout. 
Jul 28 17:22:29 truenas nvme1: RECOVERY_WAITING Jul 28 17:22:29 truenas nvme1: resetting controller Jul 28 17:23:00 truenas nvme1: aborting outstanding i/o Jul 28 17:23:00 truenas nvme1: RECOVERY_START 74111757999608 vs 74109975590062 Jul 28 17:23:00 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:23:00 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:23:00 truenas nvme1: RECOVERY_WAITING Jul 28 17:23:00 truenas nvme1: resetting controller Jul 28 17:23:55 truenas nvme1: aborting outstanding i/o Jul 28 17:23:55 truenas nvme1: RECOVERY_START 74347886689994 vs 74347830863198 Jul 28 17:23:55 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:23:55 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:23:55 truenas nvme1: RECOVERY_WAITING Jul 28 17:23:55 truenas nvme1: resetting controller Jul 28 17:27:34 truenas nvme1: aborting outstanding i/o Jul 28 17:27:34 truenas nvme1: RECOVERY_START 75288282605573 vs 75286667700742 Jul 28 17:27:34 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:27:34 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:27:34 truenas nvme1: RECOVERY_WAITING Jul 28 17:27:34 truenas nvme1: resetting controller Jul 28 17:28:05 truenas nvme1: aborting outstanding i/o Jul 28 17:28:05 truenas nvme1[4139]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on truenas.local Jul 28 17:28:05 truenas nvme1: RECOVERY_START 75421692862375 vs 75419897576202 Jul 28 17:28:05 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:28:05 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:28:05 truenas nvme1: RECOVERY_WAITING Jul 28 17:28:05 truenas nvme1: resetting controller Jul 28 17:28:36 truenas nvme1: aborting outstanding i/o Jul 28 17:28:36 truenas nvme1[4139]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on truenas.local Jul 28 17:28:36 truenas nvme1: RECOVERY_START 75555451022231 vs 75553445279732 Jul 28 17:28:36 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:28:36 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:28:36 truenas nvme1: RECOVERY_WAITING Jul 28 17:28:36 truenas nvme1: resetting controller Jul 28 17:28:36 truenas nvme1: aborting outstanding i/o Jul 28 17:29:36 truenas nvme1[4139]: Last message 'aborting outstanding' repeated 1 times, suppressed by syslog-ng on truenas.local Jul 28 17:29:51 truenas nvme1: RECOVERY_START 75877367388117 vs 75877182710912 Jul 28 17:29:51 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:29:51 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:29:51 truenas nvme1: RECOVERY_WAITING Jul 28 17:29:51 truenas nvme1: resetting controller Jul 28 17:29:51 truenas nvme1: aborting outstanding i/o Jul 28 17:30:22 truenas nvme1: RECOVERY_START 76009454802683 vs 76008995250277 Jul 28 17:30:22 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:30:22 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:30:22 truenas nvme1: RECOVERY_WAITING Jul 28 17:30:22 truenas nvme1: resetting controller Jul 28 17:30:53 truenas nvme1: aborting outstanding i/o Jul 28 17:30:53 truenas nvme1: RECOVERY_START 76142912315219 vs 76141138498917 Jul 28 17:30:53 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:30:53 truenas nvme1: Resetting controller due to a timeout. 
Jul 28 17:30:53 truenas nvme1: RECOVERY_WAITING Jul 28 17:30:53 truenas nvme1: resetting controller Jul 28 17:31:24 truenas nvme1: aborting outstanding i/o Jul 28 17:31:24 truenas nvme1: RECOVERY_START 76276322576058 vs 76274535879361 Jul 28 17:31:24 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:31:24 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:31:24 truenas nvme1: RECOVERY_WAITING Jul 28 17:31:24 truenas nvme1: resetting controller Jul 28 17:31:55 truenas nvme1: aborting outstanding i/o Jul 28 17:31:55 truenas nvme1: RECOVERY_START 76409865985873 vs 76407950438907 Jul 28 17:31:55 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:31:55 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:31:55 truenas nvme1: RECOVERY_WAITING Jul 28 17:31:55 truenas nvme1: resetting controller Jul 28 17:31:55 truenas nvme1: aborting outstanding i/o Jul 28 17:31:55 truenas nvme1: WRITE sqid:13 cid:118 nsid:1 lba:4025072 len:256 Jul 28 17:31:55 truenas nvme1: ABORTED - BY REQUEST (00/07) sqid:13 cid:118 cdw0:0 Jul 28 17:33:10 truenas nvme1: RECOVERY_START 76730858942848 vs 76729445936387 Jul 28 17:33:10 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:33:10 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:33:10 truenas nvme1: RECOVERY_WAITING Jul 28 17:33:10 truenas nvme1: resetting controller Jul 28 17:33:41 truenas nvme1: aborting outstanding i/o Jul 28 17:33:41 truenas nvme1: RECOVERY_START 76864269199013 vs 76862491091818 Jul 28 17:33:41 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:33:41 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:33:41 truenas nvme1: RECOVERY_WAITING Jul 28 17:33:41 truenas nvme1: resetting controller Jul 28 17:34:12 truenas nvme1: aborting outstanding i/o Jul 28 17:34:12 truenas nvme1: RECOVERY_START 76997752479287 vs 76995897063047 Jul 28 17:34:12 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:34:12 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:34:12 truenas nvme1: RECOVERY_WAITING Jul 28 17:34:12 truenas nvme1: resetting controller Jul 28 17:34:12 truenas nvme1: aborting outstanding i/o Jul 28 17:35:45 truenas nvme1: RECOVERY_START 77395827207347 vs 77394951049849 Jul 28 17:35:45 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:35:45 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:35:45 truenas nvme1: RECOVERY_WAITING Jul 28 17:35:45 truenas nvme1: resetting controller Jul 28 17:36:31 truenas nvme1: aborting outstanding i/o Jul 28 17:36:31 truenas nvme1: RECOVERY_START 77593348439420 vs 77591213848072 Jul 28 17:36:31 truenas nvme1: timeout with nothing complete, resetting Jul 28 17:36:31 truenas nvme1: Resetting controller due to a timeout. Jul 28 17:36:31 truenas nvme1: RECOVERY_WAITING Jul 28 17:36:31 truenas nvme1: resetting controller Jul 28 17:46:31 truenas nvme1: aborting outstanding i/o
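For completeness, pool activity during the stall was being watched with something like the following (pool name illustrative):
Code:
# per-vdev I/O statistics, refreshed every 5 seconds
zpool iostat -v tank 5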
Once we stop the migration, the resets continue for another few minutes, after which the pool resumes as if nothing happened.
Suspecting faulty hardware, we removed the LOG vdev and created a new pool with the two SSDs in a mirror.
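The swap was roughly as follows (the production pool name and the log vdev name as reported by zpool status are illustrative here):
Code:
# detach the mirrored SLOG from the production pool
zpool remove tank mirror-1
# build a throwaway test pool directly on the two SSDs
zpool create NVMETest mirror /dev/nvd0 /dev/nvd1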
When placed under load, the SSDs performed flawlessly, without errors or resets.
Code:
root@truenas[/mnt/NVMETest]# fio --name=test --size=100g --rw=write --ioengine=posixaio --direct=1 --iodepth 32 --bs=4k
test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=209MiB/s][w=53.4k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=6247: Fri Jul 28 14:46:33 2023
  write: IOPS=74.1k, BW=290MiB/s (304MB/s)(100GiB/353706msec); 0 zone resets
    slat (nsec): min=696, max=41826k, avg=3078.58, stdev=8679.47
    clat (usec): min=19, max=205080, avg=401.36, stdev=622.44
     lat (usec): min=21, max=205085, avg=404.44, stdev=622.67
    clat percentiles (usec):
     |  1.00th=[   91],  5.00th=[  129], 10.00th=[  149], 20.00th=[  180],
     | 30.00th=[  204], 40.00th=[  223], 50.00th=[  243], 60.00th=[  265],
     | 70.00th=[  293], 80.00th=[  338], 90.00th=[  562], 95.00th=[ 1844],
     | 99.00th=[ 3163], 99.50th=[ 3589], 99.90th=[ 4424], 99.95th=[ 5080],
     | 99.99th=[ 6521]
   bw (  KiB/s): min=50075, max=505109, per=100.00%, avg=296630.89, stdev=165476.46, samples=704
   iops        : min=12518, max=126277, avg=74157.41, stdev=41369.13, samples=704
  lat (usec)   : 20=0.01%, 50=0.04%, 100=1.50%, 250=51.57%, 500=35.68%
  lat (usec)   : 750=3.03%, 1000=0.81%
  lat (msec)   : 2=3.51%, 4=3.65%, 10=0.21%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=11.13%, sys=39.66%, ctx=1452026, majf=0, minf=1
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=7.4%, 16=83.7%, 32=8.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=94.1%, 8=0.8%, 16=2.8%, 32=2.3%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,26214400,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=290MiB/s (304MB/s), 290MiB/s-290MiB/s (304MB/s-304MB/s), io=100GiB (107GB), run=353706-353706msec
We then removed the overprovisioning value, reset the controllers with nvmecontrol, and re-added the SSDs as a mirrored LOG vdev - the issue persisted.
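That step looked roughly like this (pool and device names illustrative):
Code:
# reset both NVMe controllers after clearing the overprovisioning
nvmecontrol reset nvme0
nvmecontrol reset nvme1
# re-add the SSDs to the production pool as a mirrored SLOG
zpool add tank log mirror /dev/nvd0 /dev/nvd1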
SMART tests and nvmecontrol selftest all return without error.
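The checks were along these lines (flags shown as we understand the FreeBSD 13 tooling; worth double-checking against the man pages):
Code:
# NVMe SMART / health information log page
nvmecontrol logpage -p 2 nvme0
# start a short device self-test
nvmecontrol selftest -c 1 nvme0
# smartmontools view of the same data
smartctl -a /dev/nvme0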
No firmware updates are available for the controller or the system board.
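For what it's worth, the installed SSD firmware revision can be read with:
Code:
# report the controller identify data, including the firmware version
nvmecontrol identify nvme0 | grep -i firmware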
Happy to provide any other info that may be useful - I'd be very grateful for any advice!
Many thanks!