need urgent help - disk not recognized after replacement of faulty drive.

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
OK, so da7 has part of swap3.

We can stop that with:

swapoff /dev/mirror/swap3

If that works, we can go back to the process.
Quick question: it says that it is missing an argument. Since I am a bit limited in my knowledge of disk operations, is that argument -v?
swapoff.png
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Still "invalid argument". One question: regarding mirror/swap3, I noticed that four drives share the same swap3, so if I remove it, how will that affect the other three drives which are up and running? Will they be affected in some way?
sharing-swap.png
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Quick question: it says that it is missing an argument. Since I am a bit limited in my knowledge of disk operations, is that argument -v?
Actually it should be swapoff /dev/mirror/swap3.eli

I was referring to the mirror name incorrectly.
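For anyone following along: on TrueNAS CORE the swap partitions are assembled into gmirror devices and geli-encrypted, which is why the active swap device carries the .eli suffix. A minimal sketch of checking and disabling it, assuming the names from this thread (swap3, da7); this is hardware-dependent and only meant as a guide:

```shell
# List the active swap devices; look for /dev/mirror/swap3.eli
swapinfo

# Disable that swap device so the mirror can be taken apart safely
swapoff /dev/mirror/swap3.eli

# Show which disk partitions (typically each disk's p1 slice) back the mirror
gmirror status swap3
```

gmirror status lets you confirm that da7 is actually a member of swap3 before touching the disk; the other members of the mirror keep serving swap for their own pools.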
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Actually it should be swapoff /dev/mirror/swap3.eli

I was referring to the mirror name incorrectly.
OK, this storage is getting even more weird.

After executing your command I get this
swap1.png


I thought I had gotten some garbled console output, so I executed it again since it refers to the same da7 disk, but I got different output:
swap2.png


Since I was quite confused, I executed it once more to see if the output would change, and again I got something different:
swap3.png


What does this drive have to do with the other devices?!? Is this a bug?
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Not certain if this is related, but with "zpool clear ZFS da7p2" I got the disk online... (even though I have no clue why it is labeled as da7p2 under zpool status, or as /dev/da7p2 in the GUI...)

zpool-clear.png


clearzpool.png


Now I only have one "offline" disk left, and I need to check why that is. Currently I am replicating from the main storage to the secondary (I found out that I had overlooked a few replications that did not complete properly and only created the parent directory with no data...). After this completes (probably not until Friday, as it is 40TB with an insane number of small files), I will try to bring that disk online as well and see if I can find a way to fix this.

I will update as soon as I get something useful :smile:
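The clear-and-verify step above can be sketched as follows; the pool name ZFS and the member da7p2 are taken from this thread:

```shell
# Note the errored vdev member and its counters
zpool status -v ZFS

# Reset the error counters on that member and let ZFS retry it
zpool clear ZFS da7p2

# Confirm the member now shows ONLINE with zeroed counters
zpool status -v ZFS
```

Keep in mind that zpool clear only resets the counters; if the underlying device keeps producing errors, the counts will climb again, so it is worth re-checking status after the next scrub or resilver.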
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Ok, in regards to da7: it seems that clearing the pool errors fixed that disk. At least according to my colleague, who works with ZFS quite a lot, the data on it is equal to the rest of the disks, so this disk is indeed in use. Still, that weird labeling from TrueNAS itself is what is raising my eyebrows.

working-disk.png


Next week, since I am pulling replication from the original primary storage, we will try to bring that other disk online and see whether it also syncs and clears all the messages stating that it is removed or degraded (again, that is a brand-new disk, the second one I used, so the chances that two disks are dead on arrival are very small).

I will keep you updated about progress and hopefully this will someday help others troubleshooting their issues.

By the way, the command my colleague used to check whether a disk is active or not is:

"zpool iostat -v ZFS 10"
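For reference, that command prints per-vdev I/O statistics for the pool every 10 seconds; a member that never shows read/write activity while the rest of its vdev is busy is the suspect one. A sketch:

```shell
# Per-vdev ops and bandwidth for pool "ZFS", refreshed every 10 seconds.
# Press Ctrl+C to stop; compare activity across members of the same vdev.
zpool iostat -v ZFS 10
```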

Brgds
 

Alecmascot

Guru
Joined
Mar 18, 2014
Messages
1,177
Still, that weird labeling from TrueNAS itself is what is raising my eyebrows.
When things are stable, you should offline da7, do a short wipe, and then do a replace, all from the GUI. This will rebuild the gptid on da7.
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Perhaps I was too quick with my comments: the disk is online, but the checksum error count is 15. This is a bit strange, as this is the second disk and it is brand new.
zpool.png


Funny that TrueNAS did not remove that disk or declare it degraded or offline...

Also, a self-test is currently running and has been sitting at 91% for probably a few hours now, almost like it is stuck...

smart.png


I have ordered two additional new disks and will replace it once again when I get them, but if TrueNAS again shows that the new disk is bad, then either this system is somehow killing disks or it has bugs, as this would be the third disk marked as faulty...

By the way, can someone write out the proper procedure for replacing a disk via the CLI? The web GUI appears to have some issues.
My understanding is that I first need to:
1. Offline the current disk which has issues: zpool offline ZFS /dev/da7p2
2. Remove the disk from the bay and insert the new one.
3. Issue the replace command: zpool replace ZFS /dev/da7p2

(Please correct me if I am wrong.)
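A hedged sketch of that CLI flow, assuming TrueNAS CORE (FreeBSD) and the device names from this thread. Note that step 3 as written only works if the new disk comes back under the same device name, and that the GUI is still the recommended path because it also recreates the swap slice and gptid labels. The gpart sizes and alignment below are assumptions mirroring typical TrueNAS defaults:

```shell
# 1. Take the failing member offline
zpool offline ZFS /dev/da7p2

# 2. Physically swap the drive; assume it reappears as da7

# 3. Partition the new disk the way the GUI would (2 GiB swap + ZFS data):
gpart create -s gpt da7
gpart add -t freebsd-swap -s 2g -a 4k da7   # creates da7p1 (swap slice)
gpart add -t freebsd-zfs -a 4k da7          # creates da7p2 (ZFS data slice)

# 4. Replace the old member with the freshly created data partition,
#    then monitor the resilver:
zpool replace ZFS /dev/da7p2
zpool status ZFS
```

If the new disk appears under a different name (say da14), pass both devices instead: zpool replace ZFS /dev/da7p2 /dev/da14p2.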
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
So in short: da14 was set as "Ready" on the controller, and that is why it was not visible in TrueNAS. After it was converted to non-RAID I could use the GUI, but I had to reboot the machine first. So the brand-new da14 is now working, even though under checksum I can still see a count of 140, which is bizarre, as it is now the third brand-new disk; either it is some data correction and not a hardware issue, or I have no explanation. While resilvering, I also got some checksum error counts on da7 and da3. Again, either ZFS is repairing something and incrementing the count, or there is some bug, as I really feel it is not possible that four disks in total have hardware failures.

I have reset the error counts using zpool clear ZFS <da-name>, and we will see, but for now I think I have solved the problem.

Brgds
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
Chksum is often the result of a cabling issue rather than an HDD failure
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Chksum is often the result of a cabling issue rather than an HDD failure
Sure, but what is strange is that for two years this server had no issues whatsoever and was never touched, so how could the cabling come loose? The vibrations are way too low to cause a connection issue in any part...
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
The disk was replaced and no other errors were shown after resetting the counters via: zpool clear <pool-name> <disk>
- The disk was resilvered.
- No errors have been reported by the system (so far).

IMO, there was definitely a disk error and not some cabling connection on the backplane (as I said, for two years that storage was running fine and was never "physically" touched by me or anyone else...).
A week after this incident/case which I reported here, I had a similar issue on our older storage, a Dell R730xd with an MD1400 attached.

Basically, three disks had issues: two showed predictive failure in iDRAC and one failed completely. I assume this is somehow linked to a Dell SAS firmware release. Many disks have failed, five in total across two storage systems within just three weeks, and I noticed that it started after I patched the systems.

Two disks on the relatively new R740xd2 and three on the R730xd + MD1400.

As I said, it looks very suspicious (almost like a pattern), and it seems that Dell has QA problems before releasing their firmware. :-(
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Just a short update. I found an article where someone reported that after a firmware update on the backplane there were power issues, which resulted in disks being disconnected from the pool. I have now reinstalled TrueNAS and, instead of using all the disks, placed five on standby (sacrificing several TB of storage) in order to check whether this is true, and somehow I do not have any errors or issues with the disks. Same disks, same setup; only a few of them are now on standby instead of running, so no power is given to them. So it does seem to be related to the firmware and power, which is evidently causing the issues.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
So it seems that it is somehow related with firmware and power which is obviously causing the issues.
Power supplies will lose capacity over time, so it's possible that replacing with a newer (or more powerful) one would resolve it. (maybe you should still address the firmware thing too in any case)
 

vtravalja

Dabbler
Joined
Apr 25, 2022
Messages
27
Power supplies will lose capacity over time, so it's possible that replacing with a newer (or more powerful) one would resolve it. (maybe you should still address the firmware thing too in any case)
Hi again,

for now I placed the disks into standby so they are not spinning, meaning less power is used. Changing the power supply would require me to buy one, and from Dell's point of view they won't do anything via NBD unless it is completely dead. I also reported the potential firmware problem, yet it fell on deaf ears. Their first response was: "There are thousands of customers running that firmware and no one has reported an issue." (Sure, but no one has the same setup, nor do they run the same system.) So this was quite a weird response from them, to be honest. Almost like they don't give a damn.
 