email: ZFS data corruption

Status
Not open for further replies.

wheezer

Dabbler
Joined
Feb 18, 2013
Messages
15
** Relative newbie to FREENAS, so "be kind" y'all ...**
Hardware installed:
M/B: Supermicro X8DTL-iF M/B
Proc: Dual quad-core XEON E5620 @ 2.40 GHz
RAM: 24 GB DDR3-RDIMM - 10600
STORAGE: 3 X Seagate Ironwolf 4 TB ZFS
boot: 2X 16GB Corsair flash (failover)
P/S: EVGA NEX750

Received an email this am telling me:
state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.

Performed #zpool status -x and #zpool status -xv and found that the only files affected are some w/s backup files that are totally freaking unbelievably 'old', which puzzled me immensely since this only went into service in March of this year (2018). So suspecting that I knew what happened, I did some investigating and found that the owner/operator of this system has been copying his *** @#$%ed OLD Windows backups from his USB-connected external drives to the 'backup' share - which I KNOW I told him to absolutely forget about in the first place (ie: those old freaking backups) - but, never mind I guess.. It is what it is...
So - I'm going to presume I can simply delete these old POS backups and allow his PC to continue to backup to this share location and not have to be concerned about this error message. I see no other indications of any hardware issues, but I wanted to touch base and make certain what I THINK I can do is what I should do... And NATURALLY, I also find they've disconnected (unplugged) the old NAS from the network which was used to perform backups, so a full pool restore is a 'no-go'.... ("...we thought it wasn't needed any more..")

Is there an emoticon for "gun to my head"..?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
I would try to find the source of that data error regardless of what you do with the corrupt files. If ZFS detected an error, it's because the data didn't match its checksum, and you need to figure out why.

Start by checking all hardware for problems. I'd start with the memory.
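To illustrate what "didn't match its checksum" means, here is a minimal sketch in Python. It is not ZFS's actual code: SHA-256 from hashlib stands in for whatever checksum the pool is configured to use, and the helper names store and verify are made up for the example.

```python
import hashlib
from typing import Tuple


def store(block: bytes) -> Tuple[bytes, str]:
    """Return the block together with the checksum recorded at write time."""
    return block, hashlib.sha256(block).hexdigest()


def verify(block: bytes, recorded: str) -> bool:
    """True if the block read back still matches the recorded checksum."""
    return hashlib.sha256(block).hexdigest() == recorded


data, checksum = store(b"old windows backup chunk")

# Simulate a single flipped bit somewhere between write and read
# (bad RAM, a flaky cable, an interrupted write, and so on).
flipped = bytes([data[0] ^ 0x01]) + data[1:]

print(verify(data, checksum))     # intact block
print(verify(flipped, checksum))  # damaged block
```

The point is that the mismatch only tells you the data changed after the checksum was recorded. It does not tell you which component changed it, which is why the hardware still needs checking.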
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
The only way a file should be picked up by ZFS as corrupted is if it became corrupted after it was stored in the ZFS pool. You should not have a file become corrupted in a ZFS pool unless there is something wrong, WRONG, with the system.
Please post (in code tags) the output of zpool status without the -x or -v. It would also help if we had more complete information on the system configuration.
 

wheezer

Dabbler
Joined
Feb 18, 2013
Messages
15
Well, lots of testing later with no 'culprits' found (RAM 100%, disks 100%, disk controller 100%, power supply 100%, etc.) what I HAVE found is that there was absolutely NO power protection for this system (UPS NOT CONNECTED), and due to "environmental power failures" to the building, there were a SERIES of rapid 'power drops' that apparently corrupted the data. Or so it appears.. So I backed up ALL the data that was possible (some files were lost, but this consisted of only the aforementioned 'old backups'), then destroyed the pool and recreated it and restored all 2.8 TB of data ... And after 8 days of intense 'bench testing' have had no issues since. From this I have to assume there were no 'hardware issues' but more likely just the 'catastrophic' power failures.
So now my next effort is in properly connecting and configuring this brand new Cyberpower 1500PFCLCD 'true sine wave' UPS... And not having a lot of luck, in the process.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
due to "environmental power failures" to the building, there were a SERIES of rapid 'power drops' that apparently corrupted the data.
That will definitely do it.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
what I HAVE found is that there was absolutely NO power protection for this system (UPS NOT CONNECTED), and due to "environmental power failures" to the building, there were a SERIES of rapid 'power drops'
That is why I always say, no matter how reliable you think your power is, you should always have a UPS. The building I work in has one of those massive 18 wheeler sized backup generators that can carry the whole building for two weeks without even needing to be refueled, but we have UPS units to carry us during any transient power events.
So now my next effort is in properly connecting and configuring this brand new Cyberpower 1500PFCLCD 'true sine wave' UPS... And not having a lot of luck, in the process.
What kind of problem are you having?
 

wheezer

Dabbler
Joined
Feb 18, 2013
Messages
15
Hmmm... Yeah... UPS not connected.. No-brainer...
But my issue with the connection for new UPS is when I test access, I get the return: "connection refused" - which obviously means I don't have it 'pointed' to that port correctly.
But I've definitely determined it is plugged into USB 1 (ugen1.2) because after I performed:

[root@****** ~]# dmesg | grep ugen
I get:
uhub3: ugen0.1: <Intel UHCI root HUB> at usbus0
ugen1.1: <Intel UHCI root HUB> at usbus1
ugen1.2: <CPS CP1500PFCLCD> at usbus1

And yet I THINK I've got this setup properly in UPS service:

upload_2018-6-11_17-53-53.png

upload_2018-6-11_17-57-53.png


And correct me if I'm wrong, but the monitor password should be altered from the default (fixmepass) to root PW - right? Which is what I did ...

but when I do: upsc ups
I get:
[root@r******* ~]# upsc ups
Error: Connection failure: Connection refused

And I've tried a LOT of the drivers that appear to be for similar Cyberpower UPSs, but get the same results, so I'm not doing something right, no doubt.. or maybe 'left something out' somewhere... And of course my Freenas / Linux "guru" is in California for the week and won't answer his @#$#%$ phone.. (NOT that I blame him)

Hints?
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
And correct me if I'm wrong, but the monitor password should be altered from the default (fixmepass) to root PW - right? Which is what I did ...
Did you turn the service on?

upload_2018-6-11_19-49-44.png
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
I have the same UPS. Only difference for me is that I have always understood that the pw should indeed be "fixmepass" and that's what mine is. And mine does work.

To determine the port setting, I used the technique described in the manual: "For USB devices, the easiest way to determine the correct device name is to check the box Show console messages in System → Advanced. Plug in the USB device and look for a /dev/ugen or /dev/uhid device name in the console messages." On my install, it does not give the same apparent result as the method you used. However, I have also heard that it doesn't matter what value you select for that box, all should work... but I never tested that suggestion.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
And correct me if I'm wrong, but the monitor password should be altered from the default (fixmepass) to root PW - right? Which is what I did ...
The username and password are all about another system being able to get the status of your UPS via a network connection to this server.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
However, I have also heard that it doesn't matter what value you select for that box, all should work... but I never tested that suggestion.
If you change the port setting in your UPS to something wrong, you will get this error in your log:
Code:
Jun 11 20:07:13 Emily-NAS upsmon[57715]: upsmon parent: read
Jun 11 20:08:33 Emily-NAS upsd[61254]: mainloop: Interrupted system call
Jun 11 20:08:33 Emily-NAS root: /usr/local/etc/rc.d/nut: WARNING: $nut_upsshut is not set properly - see rc.conf(5).
Jun 11 20:08:33 Emily-NAS upsmon[61277]: upsmon parent: read
Jun 11 20:09:13 Emily-NAS ugen1.3: <American Power Conversion Back-UPS XS 1500G FW866.L8 .D USB FWL8> at usbus1 (disconnected)
Jun 11 20:09:16 Emily-NAS upsd[62474]: Data for UPS [ups] is stale - check driver
Jun 11 20:09:19 Emily-NAS upsmon[62498]: Poll UPS [ups] failed - Data stale
Jun 11 20:09:19 Emily-NAS upsmon[62498]: Communications with UPS ups lost
The gist of that is, you are telling the system what USB port to look at to get updates from the UPS and if you are looking at the wrong port, the data from the UPS is, "stale"...
When you have the UPS monitor pointed at the correct port, your log will update, like this:
Code:
Jun 11 20:09:20 Emily-NAS upsd[62474]: UPS [ups] data is no longer stale
Jun 11 20:09:20 Emily-NAS ugen1.3: <American Power Conversion Back-UPS XS 1500G FW866.L8 .D USB FWL8> at usbus1
Jun 11 20:09:20 Emily-NAS ums0 on uhub2
Jun 11 20:09:20 Emily-NAS ums0: <Winbond Electronics Corp Hermon USB hidmouse Device, class 0/0, rev 1.10/0.01, addr 3> on usbus0
Jun 11 20:09:20 Emily-NAS ums0: 3 buttons and [Z] coordinates ID=0
Jun 11 20:09:24 Emily-NAS upsmon[62498]: Communications with UPS ups established
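The two log excerpts above can be told apart mechanically. Here is a rough Python sketch, assuming the upsd and upsmon messages on your system are worded like the ones quoted above (the function name ups_link_state and the marker tuples are made up for the example, not part of NUT):

```python
# Message fragments copied from the upsd/upsmon lines quoted above.
OK_MARKERS = ("data is no longer stale", "Communications with UPS ups established")
LOST_MARKERS = ("is stale", "Data stale", "Communications with UPS ups lost")


def ups_link_state(lines):
    """Return 'ok', 'lost' or 'unknown' based on the last relevant log line."""
    state = "unknown"
    for line in lines:
        # Only upsd and upsmon messages say anything about the UPS link.
        if "upsd[" not in line and "upsmon[" not in line:
            continue
        if any(marker in line for marker in OK_MARKERS):
            state = "ok"
        elif any(marker in line for marker in LOST_MARKERS):
            state = "lost"
    return state


WRONG_PORT = [
    "Jun 11 20:09:16 Emily-NAS upsd[62474]: Data for UPS [ups] is stale - check driver",
    "Jun 11 20:09:19 Emily-NAS upsmon[62498]: Poll UPS [ups] failed - Data stale",
    "Jun 11 20:09:19 Emily-NAS upsmon[62498]: Communications with UPS ups lost",
]
RIGHT_PORT = [
    "Jun 11 20:09:20 Emily-NAS upsd[62474]: UPS [ups] data is no longer stale",
    "Jun 11 20:09:24 Emily-NAS upsmon[62498]: Communications with UPS ups established",
]

print(ups_link_state(WRONG_PORT))
print(ups_link_state(WRONG_PORT + RIGHT_PORT))
```

Only the most recent matching line decides the result, so a log that goes stale and then recovers is reported as recovered.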
 

wheezer

Dabbler
Joined
Feb 18, 2013
Messages
15
Oh for ... Jeez.. uh.. heh yeah, ya dipsoid.. Ya gotta turn it on for cryin' out loud.. Isn't that amazing
Heh-heh.. Cripes... guess it works when UPS service is 'on'...

I'll be alright in the morning, I know it...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Oh for ... Jeez.. uh.. heh yeah, ya dipsoid.. Ya gotta turn it on for cryin' out loud.. Isn't that amazing
Heh-heh.. Cripes... guess it works when UPS service is 'on'...

I'll be all right in the morning, I know it...
I guess that means it is working now? ;)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Regardless of the power issues, data shouldn’t have corrupted.

Do you have redundancy on the ZFS pool?

Are you running any hardware raid controllers?

Small power glitch and no ECC could’ve resulted in corrupt data though.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
If you change the port setting in your UPS to something wrong, you will get this error in your log:
Code:
Jun 11 20:07:13 Emily-NAS upsmon[57715]: upsmon parent: read
Jun 11 20:08:33 Emily-NAS upsd[61254]: mainloop: Interrupted system call
Jun 11 20:08:33 Emily-NAS root: /usr/local/etc/rc.d/nut: WARNING: $nut_upsshut is not set properly - see rc.conf(5).
Jun 11 20:08:33 Emily-NAS upsmon[61277]: upsmon parent: read
Jun 11 20:09:13 Emily-NAS ugen1.3: <American Power Conversion Back-UPS XS 1500G FW866.L8 .D USB FWL8> at usbus1 (disconnected)
Jun 11 20:09:16 Emily-NAS upsd[62474]: Data for UPS [ups] is stale - check driver
Jun 11 20:09:19 Emily-NAS upsmon[62498]: Poll UPS [ups] failed - Data stale
Jun 11 20:09:19 Emily-NAS upsmon[62498]: Communications with UPS ups lost
The gist of that is, you are telling the system what USB port to look at to get updates from the UPS and if you are looking at the wrong port, the data from the UPS is, "stale"...
When you have the UPS monitor pointed at the correct port, your log will update, like this:
Code:
Jun 11 20:09:20 Emily-NAS upsd[62474]: UPS [ups] data is no longer stale
Jun 11 20:09:20 Emily-NAS ugen1.3: <American Power Conversion Back-UPS XS 1500G FW866.L8 .D USB FWL8> at usbus1
Jun 11 20:09:20 Emily-NAS ums0 on uhub2
Jun 11 20:09:20 Emily-NAS ums0: <Winbond Electronics Corp Hermon USB hidmouse Device, class 0/0, rev 1.10/0.01, addr 3> on usbus0
Jun 11 20:09:20 Emily-NAS ums0: 3 buttons and [Z] coordinates ID=0
Jun 11 20:09:24 Emily-NAS upsmon[62498]: Communications with UPS ups established
Thanks for the explanation. Helpful.
My "in-head" info came from posts 10,11 and 13 in this thread https://forums.freenas.org/index.php?threads/ups-connection-not-working.60387/#post-429120
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Thanks for the explanation. Helpful.
My "in-head" info came from posts 10,11 and 13 in this thread https://forums.freenas.org/index.php?threads/ups-connection-not-working.60387/#post-429120
I looked back over that thread and I am not sure that the problem there wasn't the same as this. Nobody asked him if he turned the service on and that is the error if it is off:
Code:
[root@r******* ~]# upsc ups
Error: Connection failure: Connection refused
The connection might time out if the service is looking at the wrong port, but to be refused makes me think the service isn't enabled.
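A quick way to tell "refused" from "timed out" without going through upsc is a plain TCP probe. This is a minimal Python sketch, assuming upsd is on its default TCP port 3493 on the local machine. The function name probe_upsd is made up for the example, and the probe only shows whether anything is listening, not whether the UPS driver behind it is healthy.

```python
import socket


def probe_upsd(host="127.0.0.1", port=3493, timeout=3.0):
    """Classify a TCP connection attempt to the given host and port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # Nothing is bound to the port, e.g. the service is not running.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable host, name resolution, ...).
        return "error: {}".format(exc)


if __name__ == "__main__":
    print(probe_upsd())
```

A result of "refused" on the NAS itself is consistent with the UPS service simply being off.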
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Regardless of the power issues, data shouldn’t have corrupted.

Do you have redundancy on the ZFS pool?

Are you running any hardware raid controllers?

Small power glitch and no ECC could’ve resulted in corrupt data though.
He said that it went on and off repeatedly and with the server not on a UPS. You don't think that could be an explanation?
 

wheezer

Dabbler
Joined
Feb 18, 2013
Messages
15
As per post #15, I believe I inferred that this was ZFS1 because of the 3 X 4TB Seagate Ironwolf disks. And I believe I also dealt with the point regarding ECC RAM by stating it was 24 GB RDIMM (registered memory).
They specifically asked for "RAID-5" and IMHO that is the situation I gave them.

Also did:
[root@r****** ~]# upsmon -c fsd

and shutdown occurred as specified in UPS service (180 seconds). And I only chose this arbitrary time limit because I JUST had the electrician add this particular circuit to the backup generator panel box, and because I have seen some backup generators not 'settle' their voltage for as much as 90 seconds. And I'm not taking chances, now... Which is also why I specified a 'true sine wave' UPS of this capacity ...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I believe I inferred that this was ZFS1 because of the 3 X 4TB Seagate Ironwolf disks.
Not trying to be too picky here, but ZFS1 threw me. I didn't know what you were talking about until I spent a moment thinking and then it occurred to me, you mean RAIDz1. It is a little thing, but it makes reading so much easier with the correct terms.
They specifically asked for "RAID-5" and IMHO that is the situation I gave them.
When a customer asks for that, you should try to educate them that double parity (RAIDz2) is a much better solution, especially for large drives. On the forum they have been suggesting that RAIDz1 not be used on drives larger than 1TB (if I recall correctly) from the time I started lurking back in 2011. I have seen the guidance that RAID-5 is dead elsewhere on the internet, and for the same reason.
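For a rough feel of the trade-off, here is a back-of-the-envelope Python sketch. It assumes usable space is simply (disks - parity) x disk size and ignores ZFS metadata, padding and the TB vs TiB difference, so treat the numbers as approximate. The function name raidz_usable_tb is made up for the example.

```python
def raidz_usable_tb(disks, size_tb, parity):
    """Approximate usable capacity of one RAIDZ vdev, in TB."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * size_tb


# The pool in this thread: 3 x 4 TB in RAIDz1, survives one disk failure.
print(raidz_usable_tb(3, 4, 1))

# Same usable space with double parity takes one more disk,
# but then survives any two disk failures.
print(raidz_usable_tb(4, 4, 2))
```

In other words, going from RAIDz1 to RAIDz2 at this size costs one extra 4 TB drive for the same usable space.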
 