Hello everyone, new user here, so forgive me if I miss something trivial.
I have been using Linux for many years, but I'm new to TrueNAS and to NAS systems in general.
I recently built myself a new system (hardware specs below) and started playing around with it to check that everything was working properly.
I ran into a strange problem with the NVMe SSD on which I installed SCALE. I think it is software related, but I'm having a hard time figuring out the cause.
In short: the SMART information of the SSD shows a huge number of error-log entries, which keeps increasing over time. Additionally, the 'Data Units Written' value is very high for a new disk that has basically spent most of its time idle. The two problems seem correlated, but not completely.
In detail: I installed the ALPHA version of SCALE shortly before the new BETA came out. I followed this guide to partition the SSD in order to have the boot-pool and an ssd-pool for the apps on the same disk (I know it is not good practice, but it suits my needs). The idea is similar to this.
Everything went smoothly, but after a while there was a system alert that the number of error-log entries of the SSD had increased by a certain amount.
I inspected the smartctl info because I feared a hardware problem with the disk.
Searching around I found this and this, which describe my problem well, so I thought it was a minor issue.
To clarify, the system currently has no HDD connected, I changed basically no settings, and the system spent most of its time idle: I was only checking temperature, power consumption and hardware integrity. The problem seemed to get much worse when I activated the "apps" on the ssd-pool partition. After around a week (and after updating to the BETA version of SCALE, without any significant change), the number of error-log entries was in the order of hundreds of thousands (!!!) and the 'Data Units Written' was over 300 GB, if I recall correctly. This worries me a bit, because in terms of SSD wear, hundreds of GB written on an idle system seems quite strange (correct me if I'm interpreting this number the wrong way).
To further check whether it was software or hardware, I made a fresh install of the stable version of TrueNAS CORE (12.0-U4) on the whole SSD.
In this case the problem was mostly absent: of the order of one to a few error-log entries per day. Unfortunately I didn't check the data written, but I can do it if you think it is necessary or useful.
I then reinstalled the BETA version of SCALE, this time on the whole SSD. Paying closer attention to the smartctl information, I noticed that with the system idle the behaviour seems close to that of CORE, a few errors here and there. But as soon as I try to do something, including actions as simple as refreshing the dashboard, the error-log entries easily increase by tens or hundreds at a time. Additionally, even when the errors are not increasing, I noticed that the 'Data Units Written' increases by around 10 GB over a period of roughly 12 h with the system idle.
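To check that I'm reading the 'Data Units Written' counter correctly, here is a small sketch of the arithmetic. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes, which is consistent with the bracketed GB values smartctl prints in the report below. The function names are just my own helpers.

```python
# One NVMe "data unit" is reported in thousands of 512-byte blocks
# (assumption based on the NVMe SMART/Health log definition).
DATA_UNIT_BYTES = 1000 * 512


def data_units_to_gb(units: int) -> float:
    """Convert a SMART data-unit counter to decimal gigabytes."""
    return units * DATA_UNIT_BYTES / 1e9


def average_write_rate_kb_s(gigabytes: float, hours: float) -> float:
    """Average write rate in kB/s for a given volume written over a period."""
    return gigabytes * 1e9 / (hours * 3600) / 1e3


if __name__ == "__main__":
    # Counters taken from the smartctl report in this post.
    print(f"Written: {data_units_to_gb(727_398):.1f} GB")
    print(f"Read:    {data_units_to_gb(22_691):.1f} GB")
    # Roughly 10 GB in 12 h while idle, as observed above.
    print(f"Idle write rate: {average_write_rate_kb_s(10, 12):.0f} kB/s")
```

So the 10 GB per 12 h I see at idle corresponds to a steady trickle of a couple of hundred kB/s, not to occasional large bursts.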
My conclusion was that, in the previous installation, activating the 'apps' on the other partition was simply magnifying an already existing problem.
I attached the smartctl and nvme-cli output at the bottom.
I do not know whether the problem is specific to my SSD; unfortunately I do not have another one to test.
Let me know what you think, whether I can provide further information, or whether you think it is useful to open a bug report.
Thank you very much for your help!
HARDWARE SPECS:
MB: Asus P11C-I + ASMB9-iKVM
CPU: Intel Core i3-9100
RAM: 1x16GB ECC Samsung M391A2K43BB1-CTD
SSD: Sabrent Rocket Nano NVME 512GB
SMARTCTL REPORT:
Code:
truenas# smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.42+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Sabrent Rocket nano
Serial Number: A5CD07140D9183415843
Firmware Version: RKT301.3
PCI Vendor/Subsystem ID: 0x1987
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 4ae020019f
Local Time is: Tue Jul 13 03:42:14 2021 PDT
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 95 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.16W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0020W       -        -    4  4  4  4     5000   45000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 17 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 22,691 [11.6 GB]
Data Units Written: 727,398 [372 GB]
Host Read Commands: 485,158
Host Write Commands: 40,341,997
Controller Busy Time: 55
Power Cycles: 42
Power On Hours: 237
Unsafe Shutdowns: 18
Media and Data Integrity Errors: 0
Error Information Log Entries: 511,463
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 37 Celsius
Error Information (NVMe Log 0x01, 16 of 16 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 511463 0 0x001b 0x4004 - 0 0 -
1 511462 0 0x001a 0x4004 - 0 0 -
2 511461 0 0x0017 0x4004 - 0 0 -
3 511460 0 0x0016 0x4004 - 0 0 -
4 511459 0 0x0014 0x4004 - 0 0 -
5 511458 0 0x0017 0x4005 - 0 0 -
6 511457 0 0x0015 0x4005 - 0 0 -
7 511456 0 0x0014 0x4005 - 0 0 -
8 511455 0 0x0018 0x4005 - 0 0 -
9 511454 0 0x001b 0x4005 - 0 0 -
10 511453 0 0x0019 0x4005 - 0 0 -
11 511452 0 0x0018 0x4005 - 0 0 -
12 511451 0 0x0016 0x4005 - 0 0 -
13 511450 0 0x0015 0x4005 - 0 0 -
14 511449 0 0x001a 0x4004 - 0 0 -
15 511448 0 0x0019 0x4004 - 0 0 -
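In case it helps anyone reproduce the measurement, this is a minimal sketch of how two saved `smartctl -a` text reports could be compared to see how much the two counters grew in between. It only parses the text format shown above; `read_counter` and `counter_deltas` are hypothetical helper names, not part of smartmontools.

```python
import re

# The two counters discussed in this post, as labelled by smartctl.
LABELS = ("Data Units Written", "Error Information Log Entries")


def read_counter(report: str, label: str) -> int:
    """Extract an integer counter such as '727,398' from a smartctl text report."""
    match = re.search(rf"^\s*{re.escape(label)}:\s*([\d,]+)", report, re.MULTILINE)
    if match is None:
        raise ValueError(f"counter not found in report: {label}")
    return int(match.group(1).replace(",", ""))


def counter_deltas(before: str, after: str) -> dict:
    """Return how much each counter grew between two saved reports."""
    return {
        label: read_counter(after, label) - read_counter(before, label)
        for label in LABELS
    }
```

Usage would be to save the report to a file before and after an action (for example refreshing the dashboard), read both files as strings, and pass them to `counter_deltas`.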
NVME-CLI REPORT:
Code:
truenas# nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:16
.................
Entry[ 0]
.................
error_count : 511463
sqid : 0
cmdid : 0x1b
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 1]
.................
error_count : 511462
sqid : 0
cmdid : 0x1a
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 2]
.................
error_count : 511461
sqid : 0
cmdid : 0x17
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 3]
.................
error_count : 511460
sqid : 0
cmdid : 0x16
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 4]
.................
error_count : 511459
sqid : 0
cmdid : 0x14
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 5]
.................
error_count : 511458
sqid : 0
cmdid : 0x17
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 6]
.................
error_count : 511457
sqid : 0
cmdid : 0x15
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 7]
.................
error_count : 511456
sqid : 0
cmdid : 0x14
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 8]
.................
error_count : 511455
sqid : 0
cmdid : 0x18
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 9]
.................
error_count : 511454
sqid : 0
cmdid : 0x1b
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[10]
.................
error_count : 511453
sqid : 0
cmdid : 0x19
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[11]
.................
error_count : 511452
sqid : 0
cmdid : 0x18
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[12]
.................
error_count : 511451
sqid : 0
cmdid : 0x16
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[13]
.................
error_count : 511450
sqid : 0
cmdid : 0x15
status_field : 0x4005(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[14]
.................
error_count : 511449
sqid : 0
cmdid : 0x1a
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[15]
.................
error_count : 511448
sqid : 0
cmdid : 0x19
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
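To understand why nvme-cli prints the same INVALID_FIELD text for both 0x4004 and 0x4005, I tried decoding the status field by hand. This is a sketch that assumes the layout from the NVMe base specification as I understand it: bit 0 is the phase tag, bits 8:1 the status code, bits 11:9 the status code type, bit 14 "More" and bit 15 "Do Not Retry". The function name is my own.

```python
def decode_status_field(value: int) -> dict:
    """Split a 16-bit NVMe error-log status field into its parts.

    Assumed layout (NVMe base specification): bit 0 phase tag,
    bits 8:1 status code, bits 11:9 status code type,
    bit 14 More, bit 15 Do Not Retry.
    """
    return {
        "phase_tag": value & 0x1,
        "status_code": (value >> 1) & 0xFF,
        "status_code_type": (value >> 9) & 0x7,
        "more": bool((value >> 14) & 0x1),
        "do_not_retry": bool((value >> 15) & 0x1),
    }


if __name__ == "__main__":
    for raw in (0x4004, 0x4005):
        fields = decode_status_field(raw)
        print(f"{raw:#06x}: {fields}")
```

Under that assumption both values decode to generic status code 0x02 (Invalid Field in Command) and differ only in the phase tag bit, so all 16 entries look like the same error. Since sqid is 0 in every entry, the rejected commands were sent on the admin queue rather than an I/O queue.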