Help! My FreeNAS has gone south

Status
Not open for further replies.

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
Yesterday, I returned home to find that all of my programs were crashing on the various computers hooked up to my network. From experience, I know that usually means that my FreeNAS box (which holds the home directories for all machines) has failed.

Usually, when this happens, I reboot FreeNAS by issuing the command "shutdown -r now", or, if it is not taking keyboard input, I resort to the power switch. Whichever of the two methods I use, the NAS usually comes back to life and all is well. Not yesterday.

I am thinking one of the disks and/or the controller has failed, but here's the kicker: when I disconnect the (suspected) disk from the (suspected) controller, FreeNAS will boot but will not mount the volume, since part of it is missing. If I leave it connected, FreeNAS will not boot at all. I get dozens of messages such as:

mps0: (0:4:0) terminated ioc 804b scsi 0 state 0 xfer 0

This line repeats about 10 or 12 times, then I get four lines each starting with (da0:mps0:0:4:0), the last ending with (unrecovered read error). Then the line above repeats 10 or 12 times again.

Any idea where to begin?

Thanks

Brian
 

ben

FreeNAS GUI Developer
Joined
May 24, 2011
Messages
373
FreeNAS version and volume config would help us give you a better answer.
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
I think my FreeNAS version is 8.0.2 (or 8.2?).

My volume (ZFS) is made up of 8 disks: three 2TB and five 3TB. I don't know what details you need or how to find them without my machine running. There is no RAID configured; it is JBOD.

Thanks

Brian
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
You can log in to the web interface and get the exact version from there. The web GUI should have an "Alert" button at the top right you can click, which will give you a brief status of your volumes. If you click specific storage volumes, you should get messages telling you whether one or more drives have failed.

I get daily reports on my system's health emailed to me. Each day at 3:01 a.m. EST, two emails with subject lines "freenas.local daily run output" and "freenas.local security run output" land in my inbox, and I get a good picture of the system's overall health. You can set this up for yourself, too.
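
Those reports come from FreeBSD's periodic(8) jobs. If you want to confirm mail delivery is working without waiting for 3 a.m., you can kick one off by hand (output normally goes to the email address configured for root):

Code:
# Run the daily periodic jobs immediately instead of on schedule.
periodic daily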

Bob Cochran
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
OK. I restarted my FreeNAS and went out. It finally booted at some point while I was away. I can now log into the web interface (something I could not do previously -- see my OP) and find the following:

Hostname: john.local
FreeNAS Build: FreeNAS-8.0.2-RELEASE-amd64 (8288)
Platform: Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz
Memory: 16314MB

Under the Storage tab, I have

Active Volumes: 1
Volume: Data
path: /mnt/Data
...
Status: HEALTHY

And when I click the Alert button, I get a pop-up showing

OK: The volume Data (ZFS) status is HEALTHY

I can view the disks in the volume. All are showing online; no errors.

I can see the zpool status. Everything is showing online.

The only indication that anything is wrong (aside from the fact that nothing will connect to the NAS) is that when I try to scrub the volume, I get the message "An error occurred!" and nothing more.

Help!

Brian
 

ben

FreeNAS GUI Developer
Joined
May 24, 2011
Messages
373
Okay, for starters, 8.0.2-RELEASE is quite old now by FreeNAS standards, and thus both insecure and bug-ridden. The last release from the 8.0.x branch was 8.0.4-p3 (and even that's getting a bit long in the tooth now). If at all possible, you should upgrade to 8.2.0-RELEASE-p1 (though, given that your array may be faulted, now is not the best time for that).

Secondly: You configured all those disks in a STRIPE? That's incredibly risky - with no redundancy but data spread across every disk, each disk is a single point of failure for the entire pool. Additionally, all a scrub can possibly do is identify that bit rot has occurred - with no parity, it can't be repaired. It sounds like that may be what has happened here, though it's odd that the GUI would show HEALTHY in that case.

Could you SSH in, run zpool status there, and then post the complete results?
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
OK, here is the output of zpool status -v

Code:
[root@john] /# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 28h53m, 0.06% done, 49528h5m to go
config:

	NAME                                          STATE     READ WRITE CKSUM
	Data                                          ONLINE      41     0   850
	  ada0p2                                      ONLINE       0     0     0
	  gptid/88dd5c98-0373-11e1-8d95-f46d0462dffd  ONLINE       0     0     0
	  gptid/66cff5ff-049e-11e1-ae22-f46d0462dffd  ONLINE       0     0     0
	  gptid/2cfcf8a2-189b-11e1-ace9-f46d0462dffd  ONLINE       0     0     0
	  gptid/fc223aa5-4ce6-11e1-b137-14dae9f4daf2  ONLINE       0     0     0
	  gptid/67497e0b-871b-11e1-8a58-14dae9f4daf2  ONLINE       0     0     0
	  da0p2                                       ONLINE      44     0 1.75K  337K repaired
	  da1p2                                       ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Data/HOME/mythtv/recordings/1061_20121005162900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20120917185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20120917185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20120917192900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20120917192900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121007205900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20121005185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20121005185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121007212900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121007215900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121007222900.mpg
        /mnt/Data/HOME/mythtv/videos/Jeopardy!/Jeopardy!_20121003_2.ts
        /mnt/Data/HOME/mythtv/videos/Jeopardy!/Jeopardy!_20121004_2.ts
        /mnt/Data/HOME/mythtv/videos/Jeopardy!/Jeopardy!_20121005_2.ts
        /mnt/Data/HOME/mythtv/recordings/1571_20120917235900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20121006185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20121006185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20121006192900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20121006192900.mpg
        /mnt/Data/HOME/mythtv/recordings/1061_20121002162900.mpg
        /mnt/Data/MAIL/.thunderbird/1ezjnr4m.default/global-messages-db.sqlite.old
        /mnt/Data/HOME/mythtv/recordings/1061_20121004162900.mpg
        /mnt/Data/HOME/mythtv/recordings/1041_20121004185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1111_20121004185900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121006225900.mpg
        /mnt/Data/HOME/mythtv/recordings/1022_20121006232900.mpg
        /mnt/Data/HOME/user1/My_Documents/Downloads/VRDTVSH264-4-20-7-641a.exe
        /mnt/Data/HOME/mythtv/recordings/1022_20121006225900.mpg.png
        /mnt/Data/HOME/mythtv/recordings/1022_20121006235900.mpg



One thing I noticed is the suggestion to run zpool clear. Will this help or hurt in my current situation?

Thanks for all the help.

Brian
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
One thing I noticed is the suggestion to run zpool clear. Will this help or hurt in my current situation?
Neither. The disk da0 is, at the least, suffering from unreadable (bad) sectors, and possibly worse. The files it lists are corrupted and should be deleted. First, I would cancel the scrub and view the SMART info for da0.
Code:
# stop the scrub currently running on the pool
zpool scrub -s Data

# full SMART report for da0 (serial number omitted from the output)
smartctl -q noserial -a /dev/da0
Throw the output inside of some [code][/code] tags as it will preserve the formatting and keep my eyes from crossing.
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
How is this?

Code:
[root@john] ~# smartctl -q noserial -a /dev/da0
smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM001-9YN166
Firmware Version: CC4B
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Oct 12 19:27:07 2012 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   080   079   006    Pre-fail  Always       -       15954433
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   093   093   020    Old_age   Always       -       7515
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       5854692
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2068
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1461
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   046   035   045    Old_age   Always   In_the_past 54 (35 194 54 21)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       7515
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       13842
194 Temperature_Celsius     0x0022   054   065   000    Old_age   Always       -       54 (0 21 0 0)
197 Current_Pending_Sector  0x0012   092   092   000    Old_age   Always       -       1408
198 Offline_Uncorrectable   0x0010   092   092   000    Old_age   Offline      -       1408
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       200257145144468
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       77277275797540
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       127751673115967

SMART Error Log Version: 1
ATA Error Count: 1472 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1472 occurred at disk power-on lifetime: 2036 hours (84 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 48 ff ff ff 4f 00      00:24:58.339  WRITE FPDMA QUEUED
  60 00 80 ff ff ff 4f 00      00:24:58.182  READ FPDMA QUEUED
  ea 00 00 00 00 00 00 00      00:24:58.181  FLUSH CACHE EXT
  ef 10 02 00 00 00 00 00      00:24:58.082  SET FEATURES [Reserved for Serial ATA]
  ef 02 00 00 00 00 00 00      00:24:58.082  SET FEATURES [Enable write cache]

Error 1471 occurred at disk power-on lifetime: 2036 hours (84 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 48 ff ff ff 4f 00      00:24:54.181  WRITE FPDMA QUEUED
  60 00 80 ff ff ff 4f 00      00:24:54.180  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 00      00:24:53.955  SET FEATURES [Reserved for Serial ATA]
  ef 02 00 00 00 00 00 00      00:24:53.955  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 00 00      00:24:53.955  SET FEATURES [Enable read look-ahead]

Error 1470 occurred at disk power-on lifetime: 2036 hours (84 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 48 ff ff ff 4f 00      00:24:51.041  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00      00:24:50.976  WRITE FPDMA QUEUED
  61 00 10 ff ff ff 4f 00      00:24:50.976  WRITE FPDMA QUEUED
  61 00 10 90 02 40 40 00      00:24:50.976  WRITE FPDMA QUEUED
  60 00 80 ff ff ff 4f 00      00:24:50.975  READ FPDMA QUEUED

Error 1469 occurred at disk power-on lifetime: 2036 hours (84 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 ff ff ff 4f 00      00:24:46.856  READ FPDMA QUEUED
  60 00 10 ff ff ff 4f 00      00:24:46.855  READ FPDMA QUEUED
  60 00 10 90 02 40 40 00      00:24:46.855  READ FPDMA QUEUED
  61 00 80 ff ff ff 4f 00      00:24:46.854  WRITE FPDMA QUEUED
  61 00 80 ff ff ff 4f 00      00:24:46.854  WRITE FPDMA QUEUED

Error 1468 occurred at disk power-on lifetime: 2036 hours (84 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 80 ff ff ff 4f 00      00:24:43.970  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00      00:24:43.970  READ FPDMA QUEUED
  60 00 80 ff ff ff 4f 00      00:24:43.969  READ FPDMA QUEUED
  ea 00 00 00 00 00 00 00      00:24:43.933  FLUSH CACHE EXT
  2f 00 01 10 00 00 00 00      00:24:43.813  READ LOG EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Oh, and I forgot to add this:
Code:
[root@john] ~# zpool scrub -s Data
cannot scrub Data: pool I/O is currently suspended


Thanks

Brian
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
How is this?
Bad.

The disk is failing. The counts for 197 Current_Pending_Sector and 198 Offline_Uncorrectable (1,408 each) are far too high - one of the classic signs of impending disk failure. This disk has also exceeded its rated operating temperature in the past, which can cause damage, and is very close to exceeding it right now.

My recommendation is to get some airflow across the disk to cool it down and copy as much of the data off the pool as you can. Do not run any scrubs, write to the pool, or do any unnecessary reads, as these will only hasten complete disk failure.
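
A minimal sketch of that triage copy, assuming a destination with free space is reachable from the NAS (the /mnt/rescue paths are hypothetical):

Code:
# Copy the irreplaceable data first; rsync can be re-run safely if the
# failing disk drops out partway through.
rsync -av /mnt/Data/MAIL/ /mnt/rescue/MAIL/
rsync -av /mnt/Data/HOME/user1/ /mnt/rescue/HOME/user1/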
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
Ay, there's the rub.

I have as much airflow as is possible. Since I cannot use the NAS for anything in the state it's in, I have shut it down. Hopefully, that will get it cool enough for starters.

There is no way I can get anything big enough to hold the 15TB or so of data on that volume. Is it not possible to replace just that one drive (for now) in the pool and at least get back up and running?

And that brings me to the all-important question: What is the best way to provide at least some redundancy when the data far exceeds the budget? I think I can come up with enough money for another 3TB disk, but then (after hopefully replacing the failed drive under warranty), I still only have 7TB free for those 15TB.

Thanks again for your help.

Brian
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
It'll heat up again in less than five minutes. You're in emergency mode now: Take the case off, spread the drives out if they're all packed together, point a desk fan at them.

The configuration you have chosen ("JBOD", striped across all disks) means that if any one of those disks goes terminally offline, you've lost the whole lot. Everything. The whole 15 TB.

If you can't provide disk space to copy all the data off, you're simply going to have to pick the data that you want to keep, and copy that off.

And when you rebuild, make sure that you understand two things:
1) "Striped"/JBOD/RAID0 means you will lose the data.
2) Having no backup means you will lose the data.

Best to plan for a RAIDZ1 setup (see the sketch at the end of this post) and some method of backing up the stuff that's actually important to you.

Oh, and a third thing:
3) If your NAS crashes a lot, there's probably something wrong with it and you should find out what before it's too late.

It's a harsh lesson but one we all go through - hopefully exactly once.
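
For illustration, a RAIDZ1 pool is a one-liner at the command line - a sketch with made-up device names, where one disk's worth of capacity goes to parity and any single disk can fail without data loss:

Code:
# Hypothetical example: five disks in a single RAIDZ1 vdev.
zpool create tank raidz1 da2 da3 da4 da5 da6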
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
I can address 3) fairly easily: Before I added that particular drive to the array, FreeNAS did not crash once.

Thanks

Brian
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
I realized I haven't yet said how stupid your zpool configuration is. I had commented on a different person's stupid config yesterday.

There is no way that I can get anything big enough to copy the 15TB or so of data on that volume to. Is it not possible to replace just that one drive (for now) in the pool and at least get back up and running?
Yes, you screwed yourself pretty good. At this point there is no easy way out. Don't you only have 8TB of data? Only!

And that brings me to the all-important question: What is the best way to provide at least some redundancy when the data far exceeds the budget? I think I can come up with enough money for another 3TB disk, but then (after hopefully replacing the failed drive under warranty), I still only have 7TB free for those 15TB.
If you buy another 3TB disk, you can set up 2 pools with redundancy - 4TB raw & 12TB raw respectively, for a total of 16TB raw.
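
One possible reading of that layout, as a sketch only (device names are made up, and the RAIDZ levels are my assumption, chosen to match the 4TB + 12TB figures):

Code:
# Three 2TB disks in RAIDZ1 (~4TB usable), and six 3TB disks (the five
# existing plus the new one) in RAIDZ2 (~12TB usable).
zpool create tank1 raidz1 ada1 ada2 ada3
zpool create tank2 raidz2 da2 da3 da4 da5 da6 da7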

How long ago and, more importantly, how much data have you copied to the pool since you added da0? Was da1 also added at the same time? I'm trying to get a sense of how much data is actually on da0. Depending on how much we determine there is, I have 2 different suggestions. You will still need to decide on what data you are saving.

It'll heat up again in less than five minutes. You're in emergency mode now: Take the case off, spread the drives out if they're all packed together, point a desk fan at them.
+1

I can address 3) fairly easily: Before I added that particular drive to the array, FreeNAS did not crash once.
Drive or drives? You may need a better PSU for the number of drives you have. You clearly need a better case/setup.
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
I realized I haven't yet said how stupid your zpool configuration is. I had commented on a different person's stupid config yesterday.
While I thank you for any help you can provide, I don't think saying how stupid the configuration is is very helpful. The main reason I pooled all of these drives was to consolidate free space; at the time, it seemed like a good idea. I also do not remember reading anywhere that you could not replace a failed drive in a pool unless there was some redundancy. My plan all along (and it turns out not to be possible) was to add drives as I could afford to and replace the smaller drives with larger ones as I went. I knew I was taking a risk of losing data through lack of redundancy, but as I said in an earlier post, you can't just stop everything and wait until you have the budget to buy more. The data already exists, so my choices are to hold on to it in hopes that the media it's on doesn't fail, or to delete it in case I might lose it. It is really a no-brainer which of those two choices I would take.

Yes, you screwed yourself pretty good. At this point there is no easy way out. Don't you only have 8TB of data? Only!
No. As I said in the post you quoted, I have 15TB of data on a combined 21TB (5 × 3TB + 3 × 2TB) pool with 4TB free.

If you buy another 3TB disk you can setup 2 pools with redundancy, 4TB raw & 12TB raw respectively for a total of 16TB raw.
This is at least some glimmer of hope, although I will need some details, as I'm not seeing it.

How long ago &, more importantly, how much data have you copied to the pool since you added da0? Was da1 also added at the same time? I'm trying to get a sense for how much data is actually on da0. Depending on how much we decide I have 2 different suggestions. You will still need to decide on what data you are saving.
As far as I can tell, I added that disk in late June. (I thought it was more recently than that.) da1 was added in late August, and seeing as there is 4TB free on the pool, I would suspect that da0 is about (at the very least?) 2/3 filled and da1 has lots more free space than da0.

Drive or drives? You may need a better PSU for the number of drives you have. You clearly need a better case/setup.
A new case and PSU are both on the wish list, but I believe that the case with 12 drive bays and lots of fans hasn't been lacking. The PSU is 800W and I have not used all available power connectors, so I wouldn't think that it would be problematic (yet). A UPS is also on the wish list, as some of the problems have been caused by the number of power outages in my building, which, in turn, are mostly due to vandalism and mischief. I could outrun the budget on every step of the process without batting an eye.

Brian
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
Brian, you can overload your power supply without using up all its connectors. You need to work up a wattage budget for your power supply and determine whether it needs replacing. For example, each stick of RAM needs about 15 watts; if you have 4 sticks, that is 60 watts right there. Each drive needs a certain number of watts as well -- you can find the figures for your drives by checking their labels. Each adapter card in the system draws power too, as do the processor and each cooling fan. Then there may be other devices I have missed; add those in. The total should be comfortably less than the maximum rating of your power supply. Otherwise, get a higher-rated power supply -- and get a good-quality one.
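
As a rough illustration only -- every per-component number below is a ballpark assumption, not a measurement of Brian's hardware:

Code:
CPU (i3-2120, 65 W TDP)             ~65 W
4 RAM sticks @ ~15 W each           ~60 W
8 drives @ ~25 W peak at spin-up   ~200 W
Motherboard, HBA, fans              ~50 W
                                   -------
Estimated peak draw                ~375 W   (against an 800 W PSU)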

A UPS is just about mandatory -- I would not dream of connecting my own box to utility power because of surges and sags in the normal, everyday power at the plug. You really need a UPS.

If you search on eBay, you should be able to find plenty of enclosures that can hold 20 drives or even more. These could be dedicated drive bays, or you can get a chassis with room for an entire computer plus plenty of drive bays. For example, you can get a Norco chassis that fits an extended-ATX motherboard with room for 20 drives and a couple of SSDs, and these have backplanes and fans to cool the drives down. I really like the Norcos, even if they do eat up a lot of space. I haven't checked, but I think you can use cable locks on these for added security.

You are having a difficult experience, and I think the other posters on this thread have good, solid advice for you. I don't really enjoy buying all those extra drives for RAIDing, but it means I can sleep at night. I don't even own the data I built the NAS for, so I'd rather spend the extra money.

I really hope we will see the day when SSD drives in multi-terabyte capacities become both available and cheap.

Bob

 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
Yes, Bob, I know. I have my eyes on a Norco RPC-4020. But it has to wait until I have the funds. I think buying up a few 3TB drives is probably more important at this juncture.

And I didn't mean to imply that I think that just because I haven't used all of my connectors, I am in no danger of overloading my PSU. It was just an example as to where I am. I know I need a PSU. I know I need a new case. I know I need a UPS. I know I need new drives. I know I need redundancy. I also know that funds are tight and I can't fix all these issues in one fell swoop.

So... hypothetically, when the time comes that I can afford to put together a pool of 18 drives (the 4020 has room for 20; my controllers will take 18 SATA III (6Gb/s) and 4 SATA II (3Gb/s)), what should my configuration be? I am happy with being able to survive one drive failure with no data loss. I don't think I'd need more redundancy than that, although I know that more redundancy is better.

Thank you all for your input.

Brian
 

donairb

Dabbler
Joined
Jan 5, 2012
Messages
19
The other question I might ask: is it possible to add redundancy to a stripe without destroying and recreating the pool? If I am not mistaken, I could theoretically add another 8 drives as mirrors of the current ones, but is it possible to do the same sort of thing with fewer drives?
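
For reference, the mechanism being described is zpool attach, which pairs a new disk with an existing single-disk vdev and turns that vdev into a two-way mirror. A sketch with made-up device names:

Code:
# Attach a new disk (da8, hypothetical) to one member of the stripe.
# Repeated for all 8 vdevs this mirrors the whole pool; with fewer new
# disks, only the attached vdevs gain redundancy.
zpool attach Data da0p2 da8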

Thanks

Brian
 

BobCochran

Contributor
Joined
Aug 5, 2011
Messages
184
"Classic" hard drives can break in many ways. I guess the level of redundancy you should target really depends on the value of the data, and the costs and benefits and risks that you perceive.

I think the FreeNAS manual has a great explanation of the various levels of redundancy you can get. Why not check that? Also, Oracle publishes excellent documentation on ZFS. I think ZFS has so much to offer and I would like to get the most out of it -- exploit it as much as I can. I myself have a lot to learn.

http://docs.oracle.com/cd/E19963-01/html/821-1448/zfsover-1.html

Bob

 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,402
While I thank you for any help you can provide, I don't think saying how stupid the configuration is is very helpful.
Then consider it the cost of my help. There are a rather select few stupider configurations, perhaps an even wider RAID0 stripe?

The main reason I pooled all of these drives was to consolidate free space. At the time it seemed like a good idea. I also do not remember reading anywhere that you could not replace a failed drive in a pool unless there was some redundancy.
I take it that it seems like a bad idea now? I'm curious how you could replace a failed drive if the pool has no redundancy.

My plan all along (and it turns out not to be possible) was to add drives as I could afford to and replace the smaller drives with larger ones as I went.
Properly planned, with certain caveats, you can do exactly that.
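
For what it's worth, the usual route for that is zpool replace plus the pool's autoexpand property - a rough sketch with hypothetical device names (autoexpand needs a reasonably recent ZFS version):

Code:
# Swap one disk in a redundant vdev for a larger one and let it resilver,
# then repeat for each disk. Once every member of the vdev is the larger
# size, the extra capacity becomes available.
zpool replace Data ada1 ada9
zpool set autoexpand=on Data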

I knew I was taking a risk of losing data by lack of redundancy, but as I said in an earlier post, you can't just stop everything and wait until you have the budget to buy more.
Dead horse, moving on.

The data already exists and so my choices are hold on to it in hopes that the media it's on doesn't fail, or delete it in case I might lose it. It is really a no-brainer as to which of those two choices I might take.
There are other options besides those two, and with the first one you will lose the data sooner rather than later, assuming you make it past the current little issue. You have an 8× greater chance of pool failure versus a single drive. On top of which, at least some of those drives are in a rather challenging thermal environment, which further increases the chance of failure.
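
To spell out the arithmetic behind that figure: if each disk independently fails with probability p over some period, the striped pool survives only if all eight disks do, so

Code:
P(pool loss) = 1 - (1 - p)^8 ≈ 8p    (for small p)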

This is at least some glimmer of hope, although I will need some details, as I'm not seeing it.
Not much hope, at least not in the way you are hoping for. Back up what you can, destroy the original pool, and create new pools with the existing disks.

The PSU is 800W and I have not used all available power connectors, so I wouldn't think that it would be problematic (yet).
The PSU may also be a bit flaky. Of course this isn't the only possible cause. It's just the first thing that came to mind.

A UPS is also on the wish list, as some of the problems have been caused by the number of power outages in my building, which, in turn, are mostly due to vandalism and mischief.
You're the one who decides if you care about your data or not.
 