iSCSI issue: long write times, stopping services

AG2017

Cadet
Joined
Nov 25, 2020
Messages
2
Been seeing a lot of ctl_datamove aborted events.
Not sure how to read these system events. How do you determine the drive from the output, e.g. (4:4:0)?
Nov 20 12:12:54 locutus (4:4:0/1): Tag: 0x8c94a, type 1
Nov 20 12:12:54 locutus (4:4:0/1): ctl_datamove: 571 seconds
Nov 20 12:12:54 locutus ctl_datamove: tag 0x8c94a on (4:4:0) aborted
Nov 20 12:12:54 locutus (4:4:0/1): COMPARE AND WRITE. CDB: 89 00 00 00 00 00 00 00 af 18 00 00 00
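For anyone wanting to tally these events, here is a minimal Python sketch that pulls the fields out of the lines above. It assumes the parenthesised numbers are an initiator:port:LUN/mapped-LUN nexus as printed by the iSCSI target layer, so they identify a session and a LUN rather than a physical disk. That reading is an assumption, not something confirmed in this thread. The helper names `parse_datamove` and `cdb_lba` are made up for illustration. The LBA decode relies on the standard 16-byte COMPARE AND WRITE layout (opcode 0x89, LBA in bytes 2-9); the logged CDB appears truncated, so the block count is not available.

```python
import re

# Matches lines such as:
#   "Nov 20 12:12:54 locutus (4:4:0/1): ctl_datamove: 571 seconds"
# Assumption: the fields are initiator:port:LUN/mapped-LUN, not a disk address.
NEXUS_LINE = re.compile(
    r"\((?P<initiator>\d+):(?P<port>\d+):(?P<lun>\d+)/(?P<mapped_lun>\d+)\): "
    r"ctl_datamove: (?P<seconds>\d+) seconds"
)


def parse_datamove(line):
    """Return the nexus fields and stall time as ints, or None if no match."""
    match = NEXUS_LINE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def cdb_lba(cdb_hex):
    """Read the 8-byte big-endian LBA (bytes 2-9) from a hex CDB dump."""
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    return int.from_bytes(raw[2:10], "big")


sample = "Nov 20 12:12:54 locutus (4:4:0/1): ctl_datamove: 571 seconds"
print(parse_datamove(sample))
print(cdb_lba("89 00 00 00 00 00 00 00 af 18 00 00 00"))
```

Running this over the full log dump and grouping by the `lun` field would show whether the stalls are tied to one LUN or spread across all of them.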
Log dump attached.
Build
HPE DL385 Gen10 Plus chassis
256 GB RAM (DDR4 3600 I think) ECC
2x AMD EPYC 7262 8-core processors
2x HP LSI 2-port (up to 8 drives each) HBA (don’t recall model number, can be found in purchase docs)
Boot Volume 2x KINGSTON SA400S3 480GB SSD in RAID1 (bays 1 and 5)
Data drives (all ZFS):
6x HGST HUH728080AL4200 8TB HDD in 3 vdevs of mirrors (RAID10), bays 2-4 and 6-8
1TB Sabrent SSD as SLOG (NVMe add-in card)
1TB Sabrent SSD as Cache (NVMe add-in card)
6x Sabrent NVMe SSDs in 3 VDevs of Mirrors (RAID10) (NVMe add-in card)

Network
Onboard:
1Gb iLO connection for management (connected to 100Mb switch)
1Gb connection for web interface (connected to 100Mb switch)
Add-in:
2x dual-port 10Gb SFP NICs (not sure of make/model, we switched them around a lot)
 

Attachments

  • log dump.txt
    11.6 KB

AG2017

Our issue is exceptionally long write times at various points throughout the day, which cause services to drop. The log we included shows ctl_datamove errors. Essentially, the entire server stops responding to any read/write requests. SSH and other services on the TrueNAS box stop responding as well.

We think at this point the IOPS are being consumed when the SAN flushes everything cached in RAM to storage. As we understand it, TrueNAS takes 1/8 of your total RAM and uses it as temporary high-speed storage before dumping it to HDD/SSD. In our case, with 256GB installed, that is 32GB of RAM being written to storage. We weren't thinking about this when the machine was configured, or else we would have reduced the RAM or tried to reduce the allocation.
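To put a rough number on that, here is a back-of-the-envelope sketch. The 1/8 fraction is only our understanding and may not match what the system actually does, and the 512 MiB/s sustained write rate is a made-up placeholder, not a measurement from this pool. The function names are purely illustrative.

```python
def write_cache_gib(total_ram_gib, fraction=1 / 8):
    """Size of the in-RAM write buffer, assuming it is a fixed fraction of RAM."""
    return total_ram_gib * fraction


def flush_seconds(cache_gib, throughput_mib_s):
    """Time to drain the buffer at a given sustained write rate (MiB/s)."""
    return cache_gib * 1024 / throughput_mib_s


cache = write_cache_gib(256)           # 256 GiB of RAM in this build
print(cache)                           # buffer size in GiB under the 1/8 assumption
print(flush_seconds(cache, 512))       # hypothetical 512 MiB/s sustained writes
```

Even under these assumptions the drain time comes out at about a minute, well short of the 571 seconds reported in the log, so the RAM flush on its own may not be the whole story.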

Could this be what's consuming all of our I/O and causing the SAN to stop responding when other I/O operations are attempted?

Is there an issue in our baseline hardware config, or in the ZFS or RAID setup?

Is there a best-practice article/guide you could point us at to make sure we set this up properly? This is our first go-around using TrueNAS as a SAN.
 