Build advice

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
Update on HDD tests.

I ran "smartctl -t long" on all drives; it took about 10 hours. All drives came out without issues except for one, which had 1 error. (This is one of the newly bought drives btw)
SMART overall-health self-assessment test result: PASSED
....

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 80
3 Spin_Up_Time 0x0007 180 180 024 Pre-fail Always - 468 (Average 391)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 119 119 020 Pre-fail Offline - 35
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 64
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 15
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 15
194 Temperature_Celsius 0x0002 125 125 000 Old_age Always - 48 (Min/Max 22/52)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 32834 hours (1368 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 00 00 00 00

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ec 00 00 00 00 00 00 00 00:25:40.465 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:25:40.463 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:25:40.462 IDENTIFY DEVICE
ec 00 00 00 00 00 00 00 00:25:40.460 IDENTIFY DEVICE
ef 10 02 00 00 00 00 00 00:25:40.215 SET FEATURES [Enable SATA feature]

I tried to run the solnet-array-test, but couldn't; it only runs on TrueNAS CORE, not SCALE.

I ran some fio tests. Some gave very weak results while others were ridiculously fast (I think because of caching).

root@truenas[/mnt/baki]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
WRITE: bw=26.9MiB/s (28.2MB/s), 26.9MiB/s-26.9MiB/s (28.2MB/s-28.2MB/s), io=1635MiB (1714MB), run=60693-60693msec
root@truenas[/mnt/baki]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
WRITE: bw=25.8MiB/s (27.0MB/s), 25.8MiB/s-25.8MiB/s (27.0MB/s-27.0MB/s), io=1557MiB (1632MB), run=60438-60438msec
root@truenas[/mnt/baki]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
WRITE: bw=222MiB/s (233MB/s), 13.6MiB/s-16.6MiB/s (14.3MB/s-17.4MB/s), io=16.8GiB (18.0GB), run=64412-77161msec
root@truenas[/mnt/baki]# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1
Run status group 0 (all jobs):
WRITE: bw=358MiB/s (375MB/s), 358MiB/s-358MiB/s (375MB/s-375MB/s), io=24.1GiB (25.9GB), run=68886-68886msec
root@truenas[/mnt/baki]# fio --iodepth=64 --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --size=2G --numjobs=64 --runtime=20 --group_reporting --name=test-rand-write
Run status group 0 (all jobs):
WRITE: bw=28.0MiB/s (29.4MB/s), 28.0MiB/s-28.0MiB/s (29.4MB/s-29.4MB/s), io=561MiB (588MB), run=20009-20009msec
root@truenas[/mnt/baki]# fio --iodepth=64 --ioengine=libaio --direct=1 --rw=write --bs=1m --size=2g --numjobs=4 --runtime=20 --group_reporting --name=test-write
Run status group 0 (all jobs):
WRITE: bw=476MiB/s (500MB/s), 476MiB/s-476MiB/s (500MB/s-500MB/s), io=8192MiB (8590MB), run=17196-17196msec

I ran this one multiple times back to back, and I think it was hitting the cache.
root@truenas[/mnt/baki]# fio --iodepth=64 --ioengine=libaio --direct=1 --rw=read --bs=1m --size=2g --numjobs=4 --runtime=10 --group_reporting --name=test-read
Run status group 0 (all jobs):
READ: bw=11.3GiB/s (12.1GB/s), 11.3GiB/s-11.3GiB/s (12.1GB/s-12.1GB/s), io=8192MiB (8590MB), run=711-711msec

Run status group 0 (all jobs):
READ: bw=1451MiB/s (1522MB/s), 1451MiB/s-1451MiB/s (1522MB/s-1522MB/s), io=8192MiB (8590MB), run=5645-5645msec

Run status group 0 (all jobs):
READ: bw=7056MiB/s (7399MB/s), 7056MiB/s-7056MiB/s (7399MB/s-7399MB/s), io=8192MiB (8590MB), run=1161-1161msec

Run status group 0 (all jobs):
READ: bw=7564MiB/s (7932MB/s), 7564MiB/s-7564MiB/s (7932MB/s-7932MB/s), io=8192MiB (8590MB), run=1083-1083msec

For this test I set primarycache=none on the dataset before running.
root@truenas[/mnt/baki]# fio --directory=/baki/fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=1M --size=256M --numjobs=32 --time_based --runtime=300
Run status group 0 (all jobs):
READ: bw=177MiB/s (185MB/s), 4466KiB/s-7787KiB/s (4573kB/s-7974kB/s), io=51.8GiB (55.7GB), run=300008-300234msec
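Since several of these numbers look cache-inflated, here are a couple of ways to take the ZFS ARC out of the read path while benchmarking. This is a sketch only: the dataset name and sizes are illustrative, and `--direct=1` behavior on ZFS depends on the OpenZFS version.

```shell
# Sketch only: reduce ARC influence on fio read results.
# "baki/fio" is an example dataset name; adjust to your pool layout.

# Option 1: disable data caching on the test dataset, then restore it.
zfs set primarycache=metadata baki/fio
# ... run the fio read tests here ...
zfs set primarycache=all baki/fio

# Option 2: use a working set much larger than RAM so the ARC cannot
# hold it between runs (64g is only an example; size it to your system).
fio --name=randread --directory=/mnt/baki/fio --ioengine=posixaio \
    --rw=randread --bs=1M --size=64g --numjobs=4 --iodepth=16 \
    --runtime=300 --time_based --group_reporting
```

Note that `primarycache` applies per dataset, so running the tests against a throwaway dataset keeps the rest of the pool cached normally.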
 

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
I didn't even look at the temperature. I wanted to run some stress tests on them first and then see, but I guess the SMART test was enough to show some results. SDA, SDB and SDC are all in the cage furthest away from the fan; the other three (SDD, SDE, SDF) are right up against the fan. At least that's what I think the setup is; I'm assuming they map to SATA ports 0-5 on the motherboard. SDE and SDF for sure are, because those were installed later and I specifically mounted them close to the fan. SDE and SDF are from the previous system. Do these results show the min/max over the whole history of the drives or just the latest test run? In any case, the data on the other newly bought drives indicates that the current build has temp issues.

Now, looking at the possible fan configurations of the case, there are a total of 5 fan locations for the HDD compartment:
2 in the front, 2 on top and 1 in the back. Right now the one in the back is pulling air in and blowing it on the drives.
I'm thinking of swapping that one for a 140mm and leaving it as an intake, making the 2 top ones 140mm intakes blowing on the drives as well, and using the 2 up front as 120mm exhausts to draw the air out. Should I worry about the balance of too much air in and not enough air out? What would be the best setup?

EDIT: Or would it be better to have the 2 on top exhaust air, since hot air rises?

/dev/sda
194 Temperature_Celsius 0x0002 136 136 000 Old_age Always - 44 (Min/Max 22/49)
/dev/sdb
194 Temperature_Celsius 0x0002 130 130 000 Old_age Always - 46 (Min/Max 22/52)
/dev/sdc
194 Temperature_Celsius 0x0002 127 127 000 Old_age Always - 47 (Min/Max 15/53)
/dev/sdd
194 Temperature_Celsius 0x0002 142 142 000 Old_age Always - 42 (Min/Max 15/47)
/dev/sde
194 Temperature_Celsius 0x0002 139 139 000 Old_age Always - 43 (Min/Max 25/47)
/dev/sdf
194 Temperature_Celsius 0x0002 136 136 000 Old_age Always - 44 (Min/Max 25/49)
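As an aside, pulling just the current temperature (the raw value, 10th field of smartctl's attribute table) out of lines like the ones above is a one-line awk job. A minimal sketch; the field positions assume output shaped exactly like the tables in this thread, and the device glob in the commented-out loop is an assumption:

```shell
#!/bin/sh
# Print the current temperature (raw value, field 10) from a smartctl
# attribute table. Field layout matches the tables quoted above.
temp_of() {
    awk '$2 == "Temperature_Celsius" { print $10 }'
}

# Live usage would be something like:
# for d in /dev/sd[a-f]; do
#     printf '%s: %sC\n' "$d" "$(smartctl -A "$d" | temp_of)"
# done

# Demonstration on a line captured from this thread; prints 44:
echo '194 Temperature_Celsius 0x0002 136 136 000 Old_age Always - 44 (Min/Max 22/49)' | temp_of
```

Run hourly from cron, this gives a cheap temperature log without any extra tooling.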
 
Last edited:

awasb

Patron
Joined
Jan 11, 2021
Messages
415
[...]

Now, looking at the possible fan configurations of the case, there are a total of 5 fan locations for the HDD compartment.
2 in the front, 2 on top and 1 in the back. Right now the one in the back is pulling air in and blowing on the drives.
I'm thinking, changing that one for a 140mm and leave it as an intake.

[...]
It's better to get a flow right through the case. If I were you, I'd put two 140mm Noctuas or be quiet! Silent Wings in the front as intakes and one or two smaller fans (again, I prefer Noctuas or Silent Wings; it's religion) in the back as exhausts. One could be enough, since the PSU is pulling air out as well.

If You get that flow right (playing with the revs), there will be no hot air rising.

Concerning stress tests: if a decent test script only exists for CORE, I'd install CORE and beat the hell out of every part in that machine before thinking about the (then hopefully) running OS.
 
Last edited:

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
So I did a bit more research on fans and the case.
Turns out that I can't mount any fans on top where the HDD cages are, so in my case the top is out.

On the subject of which fans: I was doing some reading on AF (airflow) and SP (static pressure) fans, and the suggestion is to go with SP fans if there are any obstacles in front of the fan or if it sits against a mesh panel. In my case they are up against the case mesh.

So I'm thinking, for intake, the Noctua NF-F12, which is optimized for static pressure, or the NF-A12x25, which balances airflow and static pressure.
In the back, for exhaust, the Noctua NF-A14, which is optimized for airflow.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
So I did a bit more research on fans and the case.
Turns out that I can't mount any fans on top where the HDD cages are, so in my case the top is out.
You could probably mount them outside the case if needed (as ugly as it is).
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I can still add more fans, but will analyze and then add if needed.
I agree with you on this. One thing I'm cautious about is airflow direction. The air needs to come into the case, flow across all the drives, and exit with very little turbulence. The fans do not need to run at full speed normally; a 7VDC level will drive most fans, do what is required, and stay quiet.

(This is one of the newly bought drives btw)

9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 64
Error 1 occurred at disk power-on lifetime: 32834 hours (1368 days + 2 hours)
That does not make sense to me. Which is it, 64 hours or 32834 hours?

And you have already heard that your hard drive temps are a bit high, which brings you back to the larger fan, which it appears you are working on.

I'm starting to like that case, but I'd like to see some drive temps just letting the system Idle and then after the fan upgrade the same thing, after Idle and then after a long test.
 

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
I'm starting to like that case, but I'd like to see some drive temps just letting the system Idle and then after the fan upgrade the same thing, after Idle and then after a long test.
The results I sent previously were from when the machine was idle, after the long test. I didn't think to check the temps while the test was running.
I checked the temps again just now. The system is idle.

Far from fan
194 Temperature_Celsius 0x0002 130 130 000 Old_age Always - 46 (Min/Max 22/49)
194 Temperature_Celsius 0x0002 127 127 000 Old_age Always - 47 (Min/Max 22/52)
194 Temperature_Celsius 0x0002 122 122 000 Old_age Always - 49 (Min/Max 15/53)

Near fan
194 Temperature_Celsius 0x0002 136 136 000 Old_age Always - 44 (Min/Max 15/47)
194 Temperature_Celsius 0x0002 136 136 000 Old_age Always - 44 (Min/Max 25/47)
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 25/49)
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I would aim for something below 40C. I remember a study from Backblaze (or Google?) where drives above that temperature had a hugely increased failure rate after 4-5 years.
 

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
I would aim for something below 40C. I remember a study from Backblaze (or Google?) where drives above that temperature had a hugely increased failure rate after 4-5 years.
I will try to find this data. But below 40C at idle or under stress?
 

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
But 40 - 45 C doesn't seem bad either. It looks like there are other factors that affect the results in the latter range.
If you are at 45 in idle, your drives go near 50 when in use (i.e. during a scrub). That's uncomfortably hot.
 

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
If you are at 45 in idle, your drives go near 50 when in use (i.e. during a scrub). That's uncomfortably hot.
When I read that range I think 40C idle and 45C in use. So I'm aiming to reach 40C at idle.

Also, in the future (maybe in a year) the system will be in a cooled environment. And where it currently sits might change to a less hot environment as well.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I think the study was done by Backblaze a while ago. My opinion is that the active drive temperature should not exceed 45C at any given time, meaning that the idle drive temperature would be well below 45C. That, in my mind, would be the ideal setup. In my setup the drives do not exceed 45C; they typically run around 38C to 41C, and during a scrub up to 42C to 44C. My fans run at a constant 7VDC and promote good airflow. The server is in the second story of my house, which is about 78F during the summer months in the daytime and drops to 74F at night.

So try the larger fan and make sure you know the airflow within the case. I hope you are able to drop the temps more, but those drives just produce heat. If you do not need the faster speed, then when you replace these drives look for slower, cooler ones.

Best of luck to you.

Edit: P.S. My drives are 7200 RPM as well, but not because I needed faster drives; they were on sale. These drives all have 35008 power-on (constantly spinning) hours, the warranty expired September 2020, and I have zero errors on them all. They have been power-cycled 1679 times. Most of that was right after I got them, doing a lot of testing on sleeping them (it was only tests; I never stayed with it, or the drives would likely be dead by now), then when I had to move to another house, plus I power the system off periodically to clean the case out with compressed air and inspect my fans. My drive temps have ranged from 25C to 43C since the last time I powered on the system, but nominally run between 38C and 41C as previously mentioned. So my point here is that I think you can bring your drive temps down without a lot of trouble. Replace the 120mm fan with a 140mm fan; if you can control the speed, set it to low and watch the drive temps. If you use smartctl -x, your drive should display the min/max temps since power-on so you can see how it's performing. Oh, and this should be obvious, but do not stick the server in a closet or a place without any ventilation; it will just cook in there. Yes, people do that, as crazy as it sounds. I've done it myself, but that was a low-power, single-hard-drive system that could tolerate it, and it wasn't a NAS.
 
Last edited:

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
The server is in the second story of my house, which is about 78F during the summer months in the daytime and drops to 74F at night.
In my case it's a bit more extreme. Outside right now at 6PM it's 30C (86F), and inside where the server is currently sitting it's 34C (93F). At noon today the temp inside was 32C (90F), and it peaked at 35C (95F) at 5PM.
Definitely gonna find a better spot for the server. I've left the case panel on the HDD side open to help with the temps in the meantime.

I noticed that when I check the temps at random times, there are always two random drives that are operating much hotter than the rest. Is this ZFS RAIDZ2? I disabled all services, even SMART; only SSH is running. I can hear the HDDs making what sounds like writing-to-disk noise every couple of seconds.

I have ordered the 3 fans, but it will take 1-2 weeks for them to get here. I think I'm gonna load TrueNAS CORE to run the HDD stress test in the meantime.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
I noticed that when I check the temps at random times, there are always two random drives that are operating much hotter than the rest. Is this ZFS RAIDZ2? I disabled all services, even SMART; only SSH is running. I can hear the HDDs making what sounds like writing-to-disk noise every couple of seconds.
Could be your system dataset causing the noise.
As for the temps, it should be position related.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
There are always two random drives that are operating much hotter than the rest. Is this ZFS RAIDZ2?
You need to be specific, which two drives? Use the serial numbers of the drives to physically locate them; then I would place a bet they are both farther from the main airflow.

I've left the case panel on the HDD side open to help with the temps in the meantime.
While I do understand this, in a case with proper airflow leaving the side panel off would actually cause the drives to heat up.

So my point that I'm driving home here is proper airflow. In your situation you have a small gap between drives that produce a lot of heat. You just need to do whatever it takes to move air across those drives.

I can hear the HDDs making what seems like writing to disk noise every couple of seconds.
As @Davvo mentioned, probably the System Dataset, but this will not cause the drives to heat up. The drives are spinning and the head motion is very minimal.
Outside right now at 6PM it's 30C (86F), and inside where the server is currently sitting it's 34C (93F). At noon today the temp inside was 32C (90F), and it peaked at 35C (95F) at 5PM.
That is just too hot a location for a server to run in. I really think you will have hard drive failures due to excessive heat. The temps you are running will not void the warranty, but they will generally reduce the life of the drives.
I have ordered the 3 fans, but it will take 1-2 weeks for them to get here. I think I'm gonna load TrueNAS CORE to run the HDD stress test in the meantime.
Monitor the drive temps; you do not want to exceed the maximum limit and void your warranty. Or just wait until you get the new fans.
 

awasb

Patron
Joined
Jan 11, 2021
Messages
415
[...]

So my point that I'm driving home here is proper airflow. In your situation you have a small gap between drives that produce a lot of heat. You just need to do whatever it takes to move air across those drives.

[...]

!

Btw.: My "test appliance" sits right under a sloping roof, in a tiny corner. No external cooling. Just one single intake, one single exhaust, one single CPU, and one single PSU fan. In summer it's up to 30°C/86°F up there. Nevertheless, the ambient delta is enough to keep the drives under 45°C/113°F, even with the fan RPMs reduced, when running stress tests (i.e. running offsite replication tasks via SSH with plzip, while scrubbing, while doing backups to the NAS itself). All numbers from smartctl.
 
Last edited:

gdarends

Explorer
Joined
Jan 20, 2015
Messages
72
You need to be specific, which two drives?
It's random. I'm using the dev names in this case. So previously SDB and SDC were hotter, then SDB and SDF, and then SDB and SDC again. SDA, SDB and SDC are in the cage farther from the fan; SDD, SDE and SDF are right in front of it.
While I do understand this, in a case with proper airflow leaving the side panel off would actually cause the drives to heat up.
The temperatures actually went down after I did this. So for now, until I get the other fans, I'll keep it open.
That is just too hot a location for a server to run in. I really think you will have hard drive failures due to excessive heat. The temps you are running will not void the warranty, but they will generally reduce the life of the drives.
Yes, I never actually checked the temps before. I just put the old system there, and it's been there for a couple of years without issues, so I never really thought about it. But now I'm concerned and I'll be looking for a better location.
Monitor the drive temps; you do not want to exceed the maximum limit and void your warranty. Or just wait until you get the new fans.
Just checked the HDD datasheet; the warranty won't be a problem.
Environmental (operating) ambient temperature: 5C to 60C
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It's random. I'm using the dev names in this case. So previously SDB and SDC were hotter, then SDB and SDF, and then SDB and SDC again. SDA, SDB and SDC are in the cage farther from the fan; SDD, SDE and SDF are right in front of it.
Do not rely on the dev names; these are assigned by whichever drive happens to be ready first and can swap positions, even if that doesn't happen often. It's how the OS works. So always use the drive serial number when troubleshooting a drive. Of course, if you are saying the drive high temps move around after TrueNAS has been booted and is continuously running, that is a different story. I would just have to speculate what the issue is then, and I dislike guessing when I could otherwise give an accurate answer.
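To that point, mapping dev names to serial numbers takes one smartctl call per drive. A minimal sketch; it assumes `smartctl -i` prints a `Serial Number:` line (as it does for ATA drives), the device glob is an assumption, and the sample serial below is made up:

```shell
#!/bin/sh
# Extract the serial number from `smartctl -i` output so a drive can be
# located physically regardless of its current sdX name.
serial_of() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Live usage would be something like:
# for d in /dev/sd[a-f]; do
#     printf '%s  %s\n' "$d" "$(smartctl -i "$d" | serial_of)"
# done

# Demonstration on a sample line (the serial is made up):
echo 'Serial Number:    WD-EXAMPLE1234' | serial_of
```

Writing the serial of each drive on its cage slot once makes every later "which drive is hot" question trivial.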

If you would like an easy way to track and record drive failure and temperature data, look into the script I just finished writing (with help from several others here). You can set up two cron jobs: one to send you a nice chart of the data periodically (I do this once a day at 2AM) and a second cron job once an hour (I do mine at 5 minutes past the hour) to just collect drive data in CSV format (a spreadsheet). This will help you track your data automatically. Look under Resources -> Multi-Report. Use the -s parameter to just record statistics; use -h for help.
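For reference, the two cron entries might look roughly like this. The script path is a hypothetical install location, and only the `-s` statistics flag is taken from the post above, so check the Multi-Report resource page for the real invocation:

```
# Daily report at 2AM; hourly statistics-only collection at 5 past the hour.
# /root/multi_report.sh is a hypothetical path to the script.
0 2 * * * /root/multi_report.sh
5 * * * * /root/multi_report.sh -s
```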

Good Luck and hope the fans come in early.
 