OK folks, here's the update, and it's not good:
We replaced the controllers and cables with LSI 9300-4e4i cards that are in IT mode. Array came right back up. We started a large data copy. It still fails with the dropped iSCSI and ctl_datamove aborted and WRITE(x) errors with timeouts over 100s. The performance was greatly improved, since it required a lot longer and much heavier "abuse" to trigger it, but the problem persists.
After googling a bit, I decided to change my transaction group timeout to 1s, since many recommended that for iSCSI and block storage setups. Tested again. Still failed, with a marginal improvement.
More googling. Decided to underprovision the SLOG to 10GB. I have 256GB of RAM, and the conventional thinking would be to set the SLOG to 64GB (1/8 of RAM * 2 transaction groups), but I saw other threads that talked about tuning the SLOG based on how fast the drives could write -- which made a lot of sense to me. My math:
- I have 6 drives in a 3-way striped mirror, so I have the write performance of 3 drives.
- 12Gb/s SAS means max transfer of 1.5GB/s per drive
- 1.5GB/s * 3 drives = 4.5GB/s
- 4.5GB in a 1s TXG * 2 TXGs = 9GB, which I rounded up to 10GB
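For anyone who wants to check the arithmetic, here's a minimal Python sketch of the estimate in the list above. The function and parameter names are made up for illustration, and it assumes each drive can write at the raw 12Gb/s link speed with no encoding or protocol overhead, so treat the result as an upper bound rather than a measured figure.

```python
def slog_size_gb(link_gbit_per_s, striped_vdevs, txg_timeout_s, txgs_in_flight=2):
    """Rough upper bound on useful SLOG size in GB (hypothetical helper).

    Assumes every striped vdev can absorb writes at full link speed,
    which real drives are unlikely to sustain.
    """
    # Convert link speed from gigabits to gigabytes per second.
    per_drive_gb_per_s = link_gbit_per_s / 8
    # A stripe across N mirrors writes roughly as fast as N single drives.
    pool_gb_per_s = per_drive_gb_per_s * striped_vdevs
    # Data that can arrive during the transaction groups held at once.
    return pool_gb_per_s * txg_timeout_s * txgs_in_flight


# 12Gb/s SAS, 3 striped mirrors, 1s TXG timeout, 2 TXGs
print(slog_size_gb(12, 3, 1))  # 9.0
```

Plugging in a measured per-drive write speed instead of the link speed would give a smaller and probably more realistic number.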
Tested again. Marginal improvement, still fails.
So I thought, "OK, 12Gb/s is the MAX transfer rate; these drives aren't likely to achieve that in actual write speed," so I dropped the SLOG to 5GB.
At this point, I thought I had it fixed. I beat on the array with a large file copy and lots of IO (read and write) for about 40 minutes and it looked good... So I left my file copy running (I really did need to move that data) and headed home, feeling cautiously optimistic.
When I got home and checked on it -- it had failed again.
I was out of time in my maintenance window to try anything else, and frankly, I was out of ideas.
Management is ready to chuck TrueNAS, and I really don't want to do that. What do I need to do next to track this down? Would moving to NFS from iSCSI change anything (I'm thinking no)?
Are there any logs / config files / other info I can post here to help you guys help me figure this out?
Thanks again for all your help. I really want to make this work.