Updating from Nexenta to a new, upgraded FreeNAS build

Jatrabari · Sep 5, 2018

Thanks for the replies.

My project has been on hold for a while but now I am getting back into it with HDD validation using the amazing scripts provided here.

One strange issue has risen, I can't establish a pool with some of my HDDs because of this

I read some posts that dealt with this problem and I think they all said that it is because the HDDs have some old stuff on them but my HDDs are new and unused until now and it is affecting only 2 of 10 disks. So why is this happening to new disks as well? I thinking of using the sysctl - dd method to erase GPT soon if I can't find a better solution.

Jatrabari · Sep 5, 2018

FreeNAS version is 11.1 U6

Chris Moore · Sep 5, 2018

Jatrabari said:
Thanks for the replies.

My project has been on hold for a while but now I am getting back into it with HDD validation using the amazing scripts provided here.

One strange issue has risen, I can't establish a pool with some of my HDDs because of this

I read some posts that dealt with this problem and I think they all said that it is because the HDDs have some old stuff on them but my HDDs are new and unused until now and it is affecting only 2 of 10 disks. So why is this happening to new disks as well? I thinking of using the sysctl - dd method to erase GPT soon if I can't find a better solution.

Depending on where you ordered the disks from, it is possible they may have been installed into a system before you received them and that system may have written some data to the drive that is causing issues.
The way to clear a disk that is linked would be a way to ensure the disk is fully cleared of data.
Did you do any badblocks testing?

Jatrabari · Sep 5, 2018

Ok, thanks for the reply, Chris. Discs were delivered in sealed antistatic bags but if the method clears the data for good and everything starts working it doesn't matter...

Badblocks and SMART tests would have been next on the agenda before I encountered this and got stuck resolving it.

Chris Moore · Sep 5, 2018

Jatrabari said:
Ok, thanks for the reply, Chris. Discs were delivered in sealed antistatic bags but if the method clears the data for good and everything starts working it doesn't matter...

Badblocks and SMART tests would have been next on the agenda before I encountered this and got stuck resolving it.

I usually do the badblocks testing first. Because the badblocks utility writes to the entire disk surface for testing, if there is any data on the disk, even just junk that came from the disk manufacturer, it will be erased by the testing. Then the disk is in a known state to create the pool.

Are you using the script from here:

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

I used this for the last batch of disks I prepared and it writes and verifies the entire disk surface four times.

Jatrabari · Sep 6, 2018

That is the script I will be using too. Do you remember the duration for a X TB disc?

How can I determine what is keeping one of my disc busy as I can't unmount or gpart destroy the content?

Chris Moore · Sep 6, 2018

Jatrabari said:
That is the script I will be using too. Do you remember the duration for a X TB disc?

It took about 4 days to run on my 6TB drives. It is a lengthy test because it writes the whole drive, then reads it, and it repeats the process for a total of four passes.

Jatrabari said:
How can I determine what is keeping one of my disc busy as I can't unmount or gpart destroy the content?

What disk controller is that connected to? It could be some firmware grabbed the disk. Or, if you created a pool, that disk could have a swap partition on it now.
If you have a Windows computer handy, you could connect the drive to it and use the command line utility called diskpart to do a clean and it should work. I have used that method a few times. If that doesn't work, there is always Darik's Boot And Nuke (DBAN) which I have used many times and it has only failed when the drive was defective.

Jatrabari · Sep 6, 2018

Chris Moore said:
What disk controller is that connected to?

All HDDs are connected to a single LSI HBA.

Chris Moore said:
If you have a Windows computer handy, you could connect the drive to it and use the command line utility called diskpart to do a clean and it should work. I have used that method a few times. If that doesn't work, there is always Darik's Boot And Nuke (DBAN) which I have used many times and it has only failed when the drive was defective.

I gather badblocks does the same thing?

Chris Moore · Sep 6, 2018

Jatrabari said:
I gather badblocks does the same thing?

DBAN runs in a version of Linux and uses a tool to generate random data that it writes to the drive to ensure a secure erasure. I have used it to do burn-in testing because it can be set to do a verify read after each write so it has a similar functionality and it will tell you the write speed of each drive while writing and tell you if a drive fails, but not the reason why. You have to run the SMART long test using some other OS though because the DBAN software doesn't really give you many options.

Jatrabari · Sep 6, 2018

No, I meant that the end result is the same with both that all data will be overwritten and then that drive will also be empty and can be unmounted.

Chris Moore · Sep 6, 2018

Jatrabari said:
No, I meant that the end result is the same with both that all data will be overwritten and then that drive will also be empty and can be unmounted.

The advantage of DBAN being that it is specifically designed to blank the drive, so it doesn't look to see if there is anything on it before it tries to clean it.
If you are having problems getting FreeNAS to let you clean the drive, using something else to do the job might make an easier task.

Jatrabari · Sep 25, 2018

Jatrabari said:
Is there a way to record motherboard component high/low temps during CPU stress test? I didn't find anything more from FreeNAS GUI.

Would this do the trick using spincheck script for CPU stress and badblocks temperature monitoring and logging? Can it interfere with SMART tests or badblocks?

Jatrabari · Sep 30, 2018

Project is moving along.

Badblocks tests are complete and all 10 drives except one showed 0/0/0 errors. The spincheck script above was a great help and I was able to monitor very well HDD temperatures and test my cooling on FULL mode.

For future reference and searches about badblocks and burn-in information. I ran disk-burnin script from Resources and all test ran in parallel using tmux.

For WD Red 4 TB (7 drives) badblocks took about 64 hours and the whole script with SMART tests about 80 hours. For WD Red 6 TB (3 drives) same numbers were 95 and 118 hours, respectively.

So I have one problem drive from these tests. I got this information from badblocks and SMART tests

I aborted badblocks test for da7 after 4,5 hours when I noticed this

Code:

Finished SMART extended test on drive /dev/da7: Wed Sep 26 06:56:05 EEST 2018															  
+-----------------------------------------------------------------------------															  
+ Run badblocks test on drive /dev/da7: Wed Sep 26 06:56:05 EEST 2018  
+-----------------------------------------------------------------------------															  
Checking for bad blocks in read-write mode							
From block 0 to 976754645		  
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device
done																  
Reading and comparing: ›  0.00% done, 4:34:41 elapsed. (6469/976587754/0 errors)

If I remember correctly the first value in badblocks is read, then write and last is compare. So as Badblocks had written the pattern I got 9 GB worth of badblocks log file. Also the read value kept climbing and the percentage didn't rise from 0 before I aborted the test. I then let the script run the other SMART tests to completion and got this information

Code:

+-----------------------------------------------------------------------------
+ Run SMART extended test on drive /dev/da7: Wed Sep 26 11:35:01 EEST 2018
+-----------------------------------------------------------------------------
Extended test started, sleeping 28440 seconds until it finishes
SMART self-test failed
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed: read failure	   90%	   218		 1515136
# 2  Short offline	   Completed: read failure	   60%	   218		 1515136
# 3  Extended offline	Completed without error	   00%	   214		 -
# 4  Short offline	   Completed without error	   00%	   206		 -

Finished SMART extended test on drive /dev/da7: Wed Sep 26 19:29:01 EEST 2018

Code:

ID# ATTRIBUTE_NAME		  FLAGS	VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate	 POSR-K   100   253   051	-	0
  3 Spin_Up_Time			POS--K   181   167   021	-	5908
  4 Start_Stop_Count		-O--CK   100   100   000	-	101
  5 Reallocated_Sector_Ct   PO--CK   200   200   140	-	0
  7 Seek_Error_Rate		 -OSR-K   200   198   000	-	0
  9 Power_On_Hours		  -O--CK   100   100   000	-	226
 10 Spin_Retry_Count		-O--CK   100   100   000	-	0
 11 Calibration_Retry_Count -O--CK   100   100   000	-	0
 12 Power_Cycle_Count	   -O--CK   100   100   000	-	101
192 Power-Off_Retract_Count -O--CK   200   200   000	-	100
193 Load_Cycle_Count		-O--CK   200   200   000	-	32
194 Temperature_Celsius	 -O---K   122   118   000	-	28
196 Reallocated_Event_Count -O--CK   200   200   000	-	0
197 Current_Pending_Sector  -O--CK   200   200   000	-	0
198 Offline_Uncorrectable   ----CK   100   253   000	-	0
199 UDMA_CRC_Error_Count	-O--CK   200   200   000	-	0
200 Multi_Zone_Error_Rate   ---R--   200   200   000	-	1

5 errors like these also were in the logs:

Code:

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 5
		CR	 = Command Register
		FEATR  = Features Register
		COUNT  = Count (was: Sector Count) Register
		LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
		LH	 = LBA High (was: Cylinder High) Register	]   LBA
		LM	 = LBA Mid (was: Cylinder Low) Register	  ] Register
		LL	 = LBA Low (was: Sector Number) Register	 ]
		DV	 = Device (was: Device/Head) Register
		DC	 = Device Control Register
		ER	 = Error register
		ST	 = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 [4] occurred at disk power-on lifetime: 214 hours (8 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 16 37 18 40 00  Error: IDNF at LBA = 0x00163718 = 1455896

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 00 16 37 18 40 00	 09:21:18.880  WRITE FPDMA QUEUED
  b0 00 d0 00 01 00 00 00 c2 4f 00 00 00	 09:21:13.476  SMART READ DATA
  61 00 08 00 00 00 00 00 16 37 10 40 00	 09:21:13.476  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 37 08 40 00	 09:21:13.476  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 37 00 40 00	 09:21:13.476  WRITE FPDMA QUEUED

Error 4 [3] occurred at disk power-on lifetime: 214 hours (8 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 51 00 00 00 00 00 16 13 c0 40 00  Error: IDNF at LBA = 0x001613c0 = 1446848

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  61 00 08 00 00 00 00 00 16 13 c0 40 00	 09:20:48.939  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 13 b8 40 00	 09:20:48.729  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 13 b0 40 00	 09:20:48.729  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 13 a8 40 00	 09:20:48.729  WRITE FPDMA QUEUED
  61 00 08 00 00 00 00 00 16 13 a0 40 00	 09:20:48.706  WRITE FPDMA QUEUED

I am now starting to find out what could be the problem or should I just RMA the drive as it is new. What I see now is that besides the fact that SMART reports a read error and LBA of first error, the multi zone error rate has gone up to 1.

If someone has some insight to what could be the problem with this drive I welcome any information.

Ericloewe · Oct 1, 2018

If it's failing SMART tests, that's more than enough to RMA it, in my book.

Jatrabari · Dec 9, 2018

Update to my project.

Returned to original drive back to store and the second WD Recertified drive replacement passed the SMART and badblocks tests with no errors.

The last memory test which was as long as the first one went thru with out a hitch. Also CPU stress test showed that my cooling is more than adequate even for heavier jobs if needed (planning on installing BOINC to crunch some numbers during idle time)

Now the next step is planning data migration from the old server and I have a question about that.

What would be the most efficient way to migrate and verify with checksums my data over a home 1 GB LAN? I have been thinking FTP, basic ssh session transfer, rsync...

Chris Moore · Dec 10, 2018

Jatrabari said:
Update to my project.

- snip -

Now the next step is planning data migration from the old server and I have a question about that.

What would be the most efficient way to migrate and verify with checksums my data over a home 1 GB LAN? I have been thinking FTP, basic ssh session transfer, rsync...

The fastest way is to connect all the drives (temporarily) to a single system. It is then possible to locally do a zfs send | zfs receive from one pool to the other within the same system and it goes as fast as the drives are capable of doing.

Ericloewe · Dec 10, 2018

I believe that the pipe involves two context switches, so it's possible to be CPU-limited in that transfer, though probably only with SSDs.

Chris Moore · Dec 10, 2018

Ericloewe said:
I believe that the pipe involves two context switches, so it's possible to be CPU-limited in that transfer, though probably only with SSDs.

I got very high transfer rate on a single socket system but a dual socket system actually went slower, so that is probably the reason.

Chris Moore · Dec 10, 2018

Jatrabari said:
What would be the most efficient way to migrate and verify with checksums my data over a home 1 GB LAN? I have been thinking FTP, basic ssh session transfer, rsync...

If I am wrong, please let me know, but I was under the impression that Nexenta uses ZFS, so even sending the data over the network, send and receive should be the fastest way to go.

rvassar · Dec 10, 2018

Chris Moore said:
If I am wrong, please let me know, but I was under the impression that Nexenta uses ZFS, so even sending the data over the network, send and receive should be the fastest way to go.

Nexenta derives from Open Solaris and hasn't been updated since October 2012. Has anyone tested ZFS interoperability that far back?

Important Announcement for the TrueNAS Community.

Updating from Nexenta to a new, upgraded FreeNAS build

Contributor

Contributor

Hall of Famer

Contributor

Hall of Famer

Contributor

Hall of Famer

Contributor

Hall of Famer

Contributor

Hall of Famer

Contributor

Contributor

Server Wrangler

Contributor

Hall of Famer

Server Wrangler

Hall of Famer

Hall of Famer

Guru

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Updating from Nexenta to a new, upgraded FreeNAS build"

Similar threads