SOLVED Resilvering issue/rebooting and HDD read error

Azdimi

Dabbler
Joined
Jun 24, 2014
Messages
27
Hello,

I've got a NAS running FreeNAS 11.3-U5 with one RAIDZ1 pool of 6 * 4TB WD Red (CMR) HDDs, all with 63K+ running hours. 15 days ago ada0p2 changed to /dev/ada0p2 with read/write failures. I ordered a new 4TB WD Red (CMR) HDD and am trying to rebuild my NAS.
FreeNAS started the resilvering 10 days ago, and every day the resilvering restarts when approaching 20%. It never ends. In my own research I found some people talking about rebooting the system (properly). However, while looking for a solution I discovered that my ada3 has some read issues...
1st question: can the resilvering issue come from it?
2nd question: can I buy a second 4TB WD Red and replace ada3 while the resilvering is still in progress?
3rd question: if the 2nd is not possible, can I offline ada3 while resilvering and replace it later when the resilvering finishes, without losing data?

Here is the zpool status result:

root@Pure:~ # zpool status
pool: PureStorage
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Mar 28 21:24:26 2022
895G scanned at 532M/s, 122G issued at 629M/s, 3.66T total
20.2G resilvered, 3.24% done, 0 days 01:38:28 to go
config:

NAME                                            STATE     READ WRITE CKSUM
PureStorage                                     DEGRADED     0     0   459
  raidz1-0                                      DEGRADED     0     0   921
    gptid/57ad2629-a6ea-11ec-a933-bc5ff4b964d3  ONLINE       0     0    22
    gptid/ebdbcda4-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/eca7762e-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED   443     0     0  too many errors
    gptid/ee4edc0a-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/ef1f6ebe-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors

errors: 3 data errors, use '-v' for a list

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0 days 00:00:19 with 0 errors on Mon Mar 28 03:45:19 2022
config:

The following segment is constantly updated throughout the day
scan: resilver in progress since Mon Mar 28 21:24:26 2022
895G scanned at 532M/s, 122G issued at 629M/s, 3.66T total
20.2G resilvered, 3.24% done, 0 days 01:38:28 to go
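
One way to confirm that the resilver really is restarting, rather than just progressing slowly, is to record the scan section periodically and compare the "since" timestamp between snapshots. The following is only a minimal sketch in Python: the function names `parse_scan` and `has_restarted` are made up for illustration, and the parsing assumes the scan section looks exactly like the text pasted above.

```python
import re
from datetime import datetime

# Scan section copied from the zpool status output in this post.
SCAN_TEXT = """\
scan: resilver in progress since Mon Mar 28 21:24:26 2022
895G scanned at 532M/s, 122G issued at 629M/s, 3.66T total
20.2G resilvered, 3.24% done, 0 days 01:38:28 to go
"""

def parse_scan(text):
    """Extract the resilver start time and percent done from a scan section."""
    since = re.search(r"resilver in progress since (.+)", text)
    done = re.search(r"([\d.]+)% done", text)
    if not since or not done:
        return None  # no resilver in progress, or an unexpected format
    started = datetime.strptime(since.group(1).strip(), "%a %b %d %H:%M:%S %Y")
    return {"started": started, "percent": float(done.group(1))}

def has_restarted(previous, current):
    """A changed start timestamp means the resilver began again from scratch."""
    return current["started"] != previous["started"]

snapshot = parse_scan(SCAN_TEXT)
print(snapshot["started"].isoformat(), snapshot["percent"])
```

Saving one parsed snapshot per hour and calling `has_restarted` on consecutive pairs would show exactly when each restart happens, which can then be lined up with the console errors.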

When I plug a display into the NAS I see this error repeated:
(ada3:ahcich4:0:0:0): CAM status: ATA Status Error
(ada3:ahcich4:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
(ada3:ahcich4:0:0:0): RES: 51 40 00 fa 6b 4b 0b 00 00 38 00
(ada3:ahcich4:0:0:0): Retrying command
(ada3:ahcich4:0:0:0): READ_DMA. ACB: c8 00 00 fa 6b 4b 00 00 00 00 38 00
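
For anyone reading these console lines later: the two hex values are bit fields, and the names in parentheses are simply the bits that are set. The sketch below decodes them in Python. The bit-to-name tables follow the ATA status and error register layout as I understand it, using the short names seen in this log, so treat them as an assumption to check against the FreeBSD source rather than as a reference.

```python
# Assumed bit names for the ATA status register (high bit first).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
               0x08: "DRQ", 0x04: "CORR", 0x02: "INDEX", 0x01: "ERR"}

# Assumed bit names for the ATA error register (high bit first).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI"}

def decode(value, names):
    """Return the names of all bits that are set in value."""
    return [name for bit, name in names.items() if value & bit]

status = decode(0x51, STATUS_BITS)   # "ATA status: 51"
error = decode(0x40, ERROR_BITS)     # "error: 40"
print(status, error)
```

The UNC bit stands for an uncorrectable data error: the drive could not read the requested sectors, which points at the disk itself rather than at ZFS.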

Thank you very much ;)
 

Azdimi

[UPDATE]
Almost 20 days since the resilvering started and the problem persists...
However, in the Reporting / Disk tab I found that the disk I/O activity pattern is always the same during the resilvering, from each start to the next restart. My ada0 (the replacement disk) shows a write activity curve that exactly matches the read curve of the other ada disks. Does that mean my resilvering is on the right track despite the restarts, and I just have to give it the time it needs?

Thank you
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
6 * 4TB Z1 has a single disk of parity. It should take no more than a day or so to resilver (depending on how full it is).
Looking at your zpool status, all but one of your disks are reporting errors - so to be blunt, you have a problem. You also have data corruption that ZFS cannot fix.
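
As a rough sanity check on the "a day or so" figure, the remaining time can be recomputed from the numbers in the zpool status output posted above. This is a back-of-the-envelope sketch in Python, assuming the G/T/M suffixes are binary multiples and that the issue rate stays constant, which it rarely does once the resilver reaches damaged areas.

```python
GIB = 1024 ** 3  # assumption: zpool prints binary units (G = GiB, T = TiB)

def eta_seconds(total_bytes, issued_bytes, rate_bytes_per_s):
    """Remaining time if the issue rate stayed constant."""
    return (total_bytes - issued_bytes) / rate_bytes_per_s

total = 3.66 * 1024 * GIB   # "3.66T total"
issued = 122 * GIB          # "122G issued"
rate = 629 * 1024 ** 2      # "629M/s"

seconds = eta_seconds(total, issued, rate)
hours, rem = divmod(int(seconds), 3600)
minutes, secs = divmod(rem, 60)
print(f"{hours:02d}:{minutes:02d}:{secs:02d}")
```

The result lands within a few seconds of the "0 days 01:38:28 to go" that ZFS itself reported, so a healthy pool of this size should finish in hours. A resilver that runs for days and keeps restarting is therefore not just slow.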

Do you have a backup?

Please post your complete hardware spec as per forum rules
 

Azdimi

Hello, thanks for the answer. I don't have any backup, but my data is still available. Available space is 14TB (17% used). I run FreeNAS on a mini-ITX platform with an i5-4430 and 8GB of RAM. The system is installed on a separate SSD.
I know all disks show "too many errors", but it seems that only ada3 has errors, and I discovered that after I started the resilvering for ada0...

I don't have a backup, but if I had one, what should I do next?
After this resilvering I plan to replace my ada3, and I have a free SATA port. Do you think it's possible to plug in a new 4TB HDD and fix the errors by replacing ada3 without offlining and unplugging it?

Thanks
 

NugentS

I repeat
"Please post your complete hardware spec as per forum rules"
 

Azdimi

Updated my signature
 

NugentS

OK, a few comments
8GB non-ECC - minimum memory (possibly even sub-minimum)
Some weird PCIe card which may be causing you issues and which you don't seem to need, unless there is a weird motherboard lane issue.
At least the network card is Intel, although it's one of its gaming chipsets I think - still, if it works then it ought to be OK

I assume the resilver is still going - your pool is borked - ZFS is complaining about every disk

The first thing you need to do is make an off-machine backup of anything you can't live without
Then backup anything else you would like to keep, if you can

Remove the weird PCIe card (why are you using the PCIe card? You don't seem to need it. Boot from mSATA and put the 6 disks on the 6 motherboard connectors - keep it simple)

Memtest the PC for 24 hours (seriously consider 16GB instead of the 8)

Then dismantle the pool and thoroughly test every disk using both WD diagnostics (first) and something like badblocks to stress each disk (see tmux to be able to run badblocks on all disks at the same time - not the mSATA). This will take a few days
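
To make the tmux idea concrete, here is a small Python sketch that only builds the command lines for one badblocks run per disk, each in its own tmux window. It does not execute anything. The session name, the helper name `badblocks_commands` and the device names are placeholders, and the `-w` write test is destructive, so it only makes sense after the pool has been dismantled and backed up.

```python
import shlex

def badblocks_commands(disks, session="burnin"):
    """Build tmux commands that would start one badblocks run per disk."""
    commands = []
    for index, disk in enumerate(disks):
        # -w: destructive write test, -s: show progress, -b 4096: block size
        test = shlex.quote(f"badblocks -b 4096 -ws /dev/{disk}")
        if index == 0:
            # The first disk creates the detached session.
            commands.append(f"tmux new-session -d -s {session} -n {disk} {test}")
        else:
            # Every other disk gets its own window in that session.
            commands.append(f"tmux new-window -t {session}: -n {disk} {test}")
    return commands

# Hypothetical device names; check the real ones and leave out the boot device.
for line in badblocks_commands(["ada0", "ada1", "ada2"]):
    print(line)
```

Printing the commands first and pasting them by hand keeps a human in the loop, which seems sensible for a test that wipes every disk it touches.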

Seriously consider 16GB RAM instead of the 8

Only then rebuild (with V12 U8+), set up the NAS again and reload the data. Keep the backups for a while, just in case
 

Azdimi

OK, I will first upgrade to 16GB, that's not a problem. In 2013 I bought the ITX mobo because of the number of SATA + mSATA ports, but I discovered that the 4th SATA port and the mSATA share the same link and you can't use them at the same time.
First I will upgrade to 16GB with a brand-new pair of DDR3 RAM sticks. I can't use ECC because of the i5-4430.
I will back up all my important stuff, and if the upgrade doesn't fix my problem I will dismantle the pool and redo it.

Thank you
 

NugentS

If you can't use SATA 4 and the mSATA then I suggest the following concept.
Get an SSD to USB, or M.2 SSD to USB or NVMe to USB adapter and use that to boot from. You already have the mSATA - so an mSATA to USB adapter would be good. Use that as the boot device, do not use the onboard mSATA, and use all 6 motherboard ports for the data pool. This is probably your most reliable way forward unless you have motherboard problems. If you need more SATA ports then use an LSI HBA (in IT Mode), not one of the cheap PCIe SATA expansion cards which are usually junk.

Amazon.co.uk link as an example. My QNAS uses two similar adapters velcro'd to the outside of the case as a mirrored boot pool. It's not elegant, tidy or pretty, but it seems to work. I used Optane 16GB NVMe drives as they are / were very very cheap and have good endurance. The adapters were more expensive than the drives themselves

You could test with a decent flash drive (USB 2 rather than USB 3) initially - just don't leave that running long term, and make sure you move the system dataset to the HDDs

ECC would be better, but whilst advised it's not gonna work for you in the current setup
 

Azdimi

Thank you for your advice. I will try all of this, but after your previous message I'm thinking about a similar solution for the system, where I'd put all 6 of my drives on the mobo's SATA ports. Can I reorganize all my disks without losing the RAIDZ1 and the pool?

If none of the solutions does the job, and before redoing the pool, do you think it's possible to add a fresh new disk (which I'll buy soon) as a 7th disk to replace ada3 (which shows read errors) without offlining it first, just by replacing it? Would that be good practice, or can I not do anything else as long as the resilvering is not finished?
 

NugentS

You can move the disks around onto different sata ports. TN recognises the disk not by its /dev/da'x' but by its GPTID which means you can move it to a different port without any issues.

Until your resilver completes I don't believe you can remove a disk from the pool - that's a recipe for losing data / the pool

Seriously - make a backup

Can you post again please:
zpool status -v
glabel status
 

Azdimi

Hi, I'm going to upgrade the RAM to 16GB, but here are the command results

#zpool status
pool: PureStorage
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Apr 5 20:42:16 2022
905G scanned at 2.07G/s, 87.6G issued at 205M/s, 3.64T total
14.5G resilvered, 2.35% done, 0 days 05:02:56 to go
config:

NAME                                            STATE     READ WRITE CKSUM
PureStorage                                     DEGRADED     0     0 14.3K
  raidz1-0                                      DEGRADED     0     0 28.6K
    gptid/57ad2629-a6ea-11ec-a933-bc5ff4b964d3  ONLINE       0     0    22
    gptid/ebdbcda4-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/eca7762e-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED 14.3K     0     0  too many errors
    gptid/ee4edc0a-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors
    gptid/ef1f6ebe-48c5-11e3-abd5-bc5ff4b964d3  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

/mnt/PureStorage/Perso/001.jpg

pool: freenas-boot
state: ONLINE
scan: scrub repaired 0 in 0 days 00:00:21 with 0 errors on Sun Apr 3 03:48:32 2022
config:

NAME                                            STATE     READ WRITE CKSUM
freenas-boot                                    ONLINE       0     0     0
  gptid/8c4cfb9b-9c88-11e5-8a52-bc5ff4b964d3    ONLINE       0     0     0

errors: No known data errors

#glabel status

Name                                        Status  Components
gptid/57ad2629-a6ea-11ec-a933-bc5ff4b964d3  N/A     ada0p2
gptid/ebdbcda4-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada1p2
gptid/ef1f6ebe-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada2p2
gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada3p2
gptid/ee4edc0a-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada4p2
gptid/8c4a34cf-9c88-11e5-8a52-bc5ff4b964d3  N/A     ada5p1
gptid/8c4cfb9b-9c88-11e5-8a52-bc5ff4b964d3  N/A     ada5p2
gptid/eca7762e-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada6p2
gptid/ef0d1dd9-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada2p1
gptid/ed68e3f1-48c5-11e3-abd5-bc5ff4b964d3  N/A     ada3p1
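
Since zpool status names the pool members by gptid and the console errors name them by device, the two outputs have to be joined to see which physical disk is reporting read errors. A minimal Python sketch of that join follows, using rows copied from the two outputs above (only the five original disks and their data partitions are included). The helper names are made up for illustration.

```python
# Member rows for the five original disks, from zpool status above.
ZPOOL_ROWS = """\
gptid/ebdbcda4-48c5-11e3-abd5-bc5ff4b964d3 DEGRADED 0 0 0 too many errors
gptid/eca7762e-48c5-11e3-abd5-bc5ff4b964d3 DEGRADED 0 0 0 too many errors
gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3 DEGRADED 14.3K 0 0 too many errors
gptid/ee4edc0a-48c5-11e3-abd5-bc5ff4b964d3 DEGRADED 0 0 0 too many errors
gptid/ef1f6ebe-48c5-11e3-abd5-bc5ff4b964d3 DEGRADED 0 0 0 too many errors
"""

# Matching rows from glabel status above.
GLABEL_ROWS = """\
gptid/ebdbcda4-48c5-11e3-abd5-bc5ff4b964d3 N/A ada1p2
gptid/ef1f6ebe-48c5-11e3-abd5-bc5ff4b964d3 N/A ada2p2
gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3 N/A ada3p2
gptid/ee4edc0a-48c5-11e3-abd5-bc5ff4b964d3 N/A ada4p2
gptid/eca7762e-48c5-11e3-abd5-bc5ff4b964d3 N/A ada6p2
"""

def label_map(text):
    """Map each gptid label to its device partition."""
    return {fields[0]: fields[2]
            for fields in (line.split() for line in text.splitlines())
            if len(fields) >= 3}

def members_with_read_errors(text):
    """Return the labels whose READ column is not zero."""
    return [fields[0]
            for fields in (line.split() for line in text.splitlines())
            if len(fields) >= 5 and fields[2] != "0"]

devices = label_map(GLABEL_ROWS)
failing = [(label, devices[label]) for label in members_with_read_errors(ZPOOL_ROWS)]
print(failing)
```

The only member with a non-zero READ count maps to ada3p2, the same ada3 that shows the UNC errors on the console.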
 

NugentS

Thanks for the info

What is surprising is that the pool is working at all and it's only detected 1 permanent error.
I don't think the RAM upgrade will fix anything - but it's good practice.

I did notice that on March 28th disk gptid/ed79b7e1-48c5-11e3-abd5-bc5ff4b964d3 (ada3) had 443 errors
Today it has 14.3 thousand errors - no wonder it won't resilver
That's an impressive jump. But it's the only one with errors - which makes me think (see below)

[What I would do - warning: it's just as likely to kill the pool as fix it, or do nothing]
Once you have a backup I would give up on the resilver, shut down, pull that disk, insert a new disk and power on. Add the new disk to the pool and resilver (assuming that you still have a pool in the first place) - sit back and watch what happens.
Please ensure that the new disk is stress tested first - to make sure it's OK

IF it recovers then reorganise the server, get rid of the expansion card and get an mSATA to USB adapter. Rebuild TN on the USB adapter with your saved config file, plugging the HDDs into the motherboard SATA ports. Then try to upgrade to V12

If it doesn't recover, take the opportunity to rebuild the server with a new pool, an mSATA to USB adapter, disks plugged into the motherboard etc. Make sure you thoroughly test the disks first and memtest the memory for at least 24 hours
 

Azdimi

Thank you,

You mean the pool is not working at all? (Sorry, I'm French and sometimes need to concentrate to understand.)
The pool is just degraded; I can access my data and Plex is working. If I delete the file with the error, another one appears.

In fact, the RAM upgrade only let the resilvering get as far as 31% before it restarted... At first I can see all drives ONLINE, then they become degraded one by one until all are degraded. The RAM upgrade was the first step because I had 16GB to put in, but I'll follow your advice, make a backup and buy a new HDD this week to try the ada3 replacement. I hope it will resilver, but I don't have much faith in it.

Thank you
 

NugentS

No - I mean that I am surprised that you haven't lost all the data already
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Make sure that you run Memtest86 on your RAM, I'd run it for at least 24 hours to make sure that you do not have any stability issues. Know that a RAM test does not just check the RAM, it also tests the motherboard and the CPU interfaces to the RAM. If you want to go the extra mile, run a CPU stress test too. These two tests help validate your system is solid.

As for 8GB of RAM, that is the bare minimum to operate and you may find out that your system is using SWAP space to make up for the lack of RAM. The goal is to have SWAP usage always remain at 0, but a periodic small amount of maybe 100k is not a reason to add RAM unless it happens all the time. I think the 16GB of RAM is the right direction as well, very good advice.

I too am surprised you only have a single file that is corrupt right now. You really need to back up all your data; it should be your first priority. At least back up the data you really want to keep. You might find out that some of that data is corrupt too.

Good Luck, I think you will need it.
 

Azdimi

Hello,

I'll give you some news as soon as possible. I ordered a new 4TB HDD this afternoon (it arrives in 2 days) and I found a way to make a full backup this evening. Finally, I'll replace my mSATA (system) SSD too, but would you advise replacing it with an NVMe drive on a USB 3 adapter, or can a USB stick do the job for years? I ask because I realized that my mSATA SSD is as old as my HDDs, and they are all 63K+ hours old.
I won't push my luck any longer. Thank you for your advice ;)
 

NugentS

The boot-pool doesn't get much use, other than just occasionally booting.
Also, it's disposable - as in you can always rebuild and upload the config file (for which you do have a copy somewhere safe, don't you?)

I wouldn't use a USB stick, although some do and get away with it. Some form of SSD / NVMe / mSATA to USB adapter is ideal and it needn't cost the earth either.

As an aside, those drives all have 7 years of runtime on them - I would be ordering more than 1 new disk, or at least making sure I know where to buy more from very quickly. You may get several failures if you stress test all the disks in the pool.
 

Azdimi

NugentS said:
"The boot-pool doesn't get much use, other than just occasionally booting.
Also, it's disposable - as in you can always rebuild and upload the config file (for which you do have a copy somewhere safe, don't you?)"

Now that all my data is backed up, I'm exporting and verifying that I have all the files I need. If I understand correctly, I just have to go to System / General, click Save Config, and restore it on the new install to recover my configuration.

Is this sufficient to retrieve all my datasets and pool configuration, jails/plugins and services?
What will I lose and what can I restore?

To be honest, once the backup finishes and before reinstalling anything, I would like to try cloning the mSATA onto the new NVMe. Is this recommended or not? Perhaps I should start a new thread to ask?

Thank you
 