Files copied to server being corrupted?!

Status
Not open for further replies.

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Sorry about the "?!" but that's about how I feel :confused:

I've just been testing my FreeNAS install. It's on good hardware (SuperMicro, Xeon, 96 GB ECC, Chelsio+Intel), was clean installed on 9.10.2 and upgraded to the latest 11.0 via the GUI, and has never been configured or modded other than via the GUI. Not much functionality is in use at the moment either - one share type, SSH, no VMs or extensions/plugins/jails. It's sharing the data store via Samba to a Windows network.

The permissions/ACLs and Samba setup do have issues which I can't quite fix as I'm still hazy on a bunch of perms+ACL stuff, but that's a whole different problem, and wouldn't cause this problem (they would affect file access not file content, meaning perhaps the share or some files/dirs couldn't be browsed/traversed/read/written as desired).

So anyhow, as part of testing the NAS, which I do periodically, I copied a directory of about 3000 photos and files (13 GB) from Windows 8.1 to the server and back. Of the 3000, about 500 had different hashes when I compared the originals to the round trip. I've never seen that before. Things I've done to try to figure out what's going on:

  • Does the corruption occur on saving to FreeNAS, reading from FreeNAS, or both? Checked by hashing the original data and final data on the workstation, and the intermediate copy on the server using "find . -exec sha1 {} \; >> hashes.txt". Result - the corruption seems to be on the workstation -> file server trip, when FreeNAS receives and writes the files. (Workstation original != Server data; Server data == Workstation copy)
  • Is the corruption always on the same files or the same changes? Copied the same dataset 3 times successively from client to server to 3 dirs in the same parent dir on FreeNAS. Dirs named SHARE_ROOT/dir1 through SHARE_ROOT/dir3 to ensure no effect due to casing, parent dir, or dir name. Result - different files were affected each time, and the data in the 3 copies didn't always have the same hashes as each other.
  • Is the workstation at fault? Repeated using 2 other networked Windows machines. Same kinds of results.
  • Are there any reported integrity errors on the FreeNAS server? Checked server integrity using GUI, and ran a scrub on the boot/system volume. Result - both passed, no errors detected.
  • Is it related to long filenames, long paths, casing, or unusable characters? Seems not: the photos have standard ASCII filenames < 30 chars long and folders not nested deeply. As the folders are newly created the files won't have multiple copies with different cases, and the count of files and their names in each dataset copy is identical.
  • Is it related in any way to file system metadata or ADS on the Windows side? Unlikely - if this were a problem then the hashes would still all be the same on the server, because they're computed the same way from static data at rest, even if the server-side hash was different from the hashes I got on Windows. (I've also used hash checking for years and never found metadata or ADS to be included in the hash that's computed on Windows, whatever software I use).
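The hash comparison in the first bullet can also be scripted, so that exactly the same code computes the hashes on the Windows client and on the server. A minimal sketch in Python, assuming Python 3 is available on both ends; the function names are made up for illustration, SHA-1 is chosen only to match the `sha1` tool mentioned above, and this is not the command that was actually run:

```python
import hashlib
import os

def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of one file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file path relative to root (using '/' separators) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = hash_file(full)
    return manifest

def compare_manifests(left, right):
    """Return (mismatched, only_in_left, only_in_right) as sorted lists of paths."""
    mismatched = sorted(p for p in left.keys() & right.keys() if left[p] != right[p])
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    return mismatched, only_left, only_right
```

A manifest built on each machine can be saved (for example with `json.dump`) and the two compared in one place. Only file content is read, so permissions, ACLs and alternate data streams play no part in the result.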
Not tested yet:
  • Network reliable? Not entirely sure how best to check this. I can't see any obvious signs of errors, and SSH integrity-checks every packet, which would be sensitive to network issues I guess. Any suggestions on how to test the network end-to-end, in case of something weird, would be welcome.
  • Does the issue seem to be linked to a specific file transfer/share mechanism? Repeat using (for example) WinSCP or sharing via NFS instead of Samba, and see if the data is reliable or also corrupted? I should do this, and will add it above when done.
  • Any consistency to the corruption when it happens? For example, single bytes, truncation, always a similar block or anything else in common, when checked with a hex editor? Not done yet, can do if needed.
  • Clean install FreeNAS and import the data drives + config? Can do. A bit apprehensive though.
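For the "any consistency to the corruption" check, a script can do the job of the hex editor by listing where an original and its corrupted copy differ. A rough sketch, assuming both files can be read from one machine (for example the original locally and the copy over the share); `diff_runs` is a hypothetical helper name, not an existing tool:

```python
import os

def diff_runs(path_a, path_b, chunk_size=1024 * 1024):
    """Return (runs, size_a, size_b); runs lists (offset, length) spans that differ.

    Only the part both files have in common is compared byte by byte;
    truncation or extra data shows up as a difference between the two sizes.
    """
    runs = []
    run_start = None  # offset where the current differing span began
    offset = 0        # number of bytes compared so far
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            common = min(len(a), len(b))
            if common == 0:
                break
            if a[:common] != b[:common]:
                for i in range(common):
                    if a[i] != b[i]:
                        if run_start is None:
                            run_start = offset + i
                    elif run_start is not None:
                        runs.append((run_start, offset + i - run_start))
                        run_start = None
            elif run_start is not None:
                # The span that was open at the previous chunk boundary ends here.
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += common
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs, os.path.getsize(path_a), os.path.getsize(path_b)
```

Single-byte runs would point towards bit flips, while runs whose offsets or lengths line up with a fixed block or packet size would hint at something being dropped or duplicated in transit.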

I've never heard of FreeNAS (or indeed anything FreeBSD) having an issue like this. The data on it seems safe, but right now I'm scared to trust my NAS until this is resolved. What's going on?
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Did you run a scrub on the data volume? If that scrub returns clean, it means that FreeNAS wrote the data that was received, which would indicate the problem was on the Windows end.

Run a memtest on the Windows machine.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Also consider what your network adapters are on each end, and any switches in between.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
What are these files? You should test with a simple file created in FreeNAS via dd. Then hash it and copy it to your Windows host, then hash it again. This is probably a Windows issue; it may be EAs getting dropped.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
Thanks all. Some more info and quick replies:

I just noticed that the copies from different clients aren't equal in their errors. One client (3 x folder = 9500 files) had just 3 mismatches on copying to the server. The copy from the other client (identical data) had over 500. If anything that's more confusing. If it's the networking gear it should affect both equally, if both route through it, or one not at all ("all or none"), shouldn't it? But if it's the client why are multiple clients both doing this? Two issues conflating maybe, and both giving same symptoms?

Tomorrow I will try to use a different server adapter or even set up a temporary FreeNAS on a spare board just to exclude the server, then try on a few more clients and swap out the networking gear.


wblock said: "Did you run a scrub on the data volume? If that scrub returns clean, it means that FreeNAS wrote the data that was received, which would indicate the problem was on the Windows end. Run a memtest on the Windows machine."
Ran a scrub - clean, no errors. I copied onto a newly created "test" pool to ensure clean, so the scrub was quick. But it's now 3 Windows machines all doing the same, so it seems unlikely to be simultaneous h/w issues on all. Will check this shortly anyway but not hopeful.

rs225 said: "Also consider what your network adapters are on each end, and any switches in between."
Indeed, I'll check this tomorrow. But shouldn't TCP/IP pick up if the data stream is being corrupted in flight? Isn't that what TCP/IP is there for? If it is a networking gear issue, I'm unclear how it wouldn't just cause a networking error. That's partly why I discounted this as a problem. Wouldn't one expect networking errors rather than reports of successful file copies if the adapters, cables or switch weren't faithfully transmitting data?

Other protocols such as SSH and MS Remote Desktop that rely on good connections seem to be stable across the same switch and cables and adapters, except for the final adapter being in the server rather than in my laptop. I can move the cable to a different server adapter tomorrow.

SweetAndLow said: "What are these files? You should test with a simple file created in FreeNAS via dd. Then hash it and copy it to your Windows host, then hash it again. This is probably a Windows issue; it may be EAs getting dropped."
The files are all JPEG/PNG images - mostly photos around 5 - 12MB x 3000 files in a flat folder. I tried your idea and just copied a single 31 GB backup from a client to the server. Indeed, the SHA1 on the client and for the received file copied to the server differ.

Do any of these mean anything? What can be going on that could cause it?
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
The network should catch corruption, but it is possible for it to fail:
http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
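One concrete reason the TCP checksum can miss damage: it is a 16-bit ones' complement sum, so any change that leaves the sum unchanged goes unnoticed. The sketch below implements that arithmetic in the style of RFC 1071 and shows two such changes. It is an illustration only: the byte strings are invented, the real TCP checksum also covers a pseudo-header, and nothing here says which kind of damage occurred in this thread.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

original = bytes.fromhex("123456789abc")
swapped = bytes.fromhex("567812349abc")      # first two 16-bit words exchanged
compensated = bytes.fromhex("123556779abc")  # one word +1, another word -1

for label, payload in (("swapped", swapped), ("compensated", compensated)):
    unchanged = internet_checksum(payload) == internet_checksum(original)
    print(label, "| data differs:", payload != original, "| checksum unchanged:", unchanged)
```

A cryptographic hash or MAC over the same bytes would change in both cases, which is why a file hash comparison or an SSH session can notice damage that a plain TCP transfer reports as successful.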

I would guess it is in the server. One possibility is that your ethernet adapter does checksum offloading, and is then corrupting the data between the NIC and system RAM.

If you do ifconfig eth0 (or whatever your adapter is), options= may show you RXCSUM and TXCSUM, along with many similar. This would mean the adapter is handling the checksums, not the FreeBSD kernel. TSO is another possibility. Some drivers do let you disable these.
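To check several interfaces quickly, the `options=` line can be picked apart with a few lines of Python. This is only a sketch: it assumes the FreeBSD `ifconfig` layout of `options=<hex><NAME,NAME,...>`, the set of capability names it looks for is my own selection, and `enabled_offloads` is a hypothetical helper.

```python
import re

# Capability names treated as "offload" here; this selection is an assumption.
OFFLOAD_FLAGS = {
    "RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6", "TSO4", "TSO6", "LRO",
}

def enabled_offloads(ifconfig_output):
    """Return the offload capability names listed on the interface's options= line."""
    # Anchor at line start so the separate "nd6 options=" line is not matched.
    match = re.search(r"^\s*options=[0-9a-fA-F]+<([^>]*)>", ifconfig_output, re.MULTILINE)
    if not match:
        return set()
    return set(match.group(1).split(",")) & OFFLOAD_FLAGS
```

The input would be the text printed by `ifconfig` for one interface, for example captured with `subprocess.run(["ifconfig", name], capture_output=True, text=True).stdout`. Where the driver allows it, FreeBSD's `ifconfig` accepts `-rxcsum`, `-txcsum`, `-tso` and `-lro` to switch these features off for a test run.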
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Given one FreeBSD-based machine with server-class hardware, including ECC RAM, and three Windows machines of any hardware... my suspicions would first land on the three Windows machines all being infected with the same virus.

What has still not been mentioned is whether these data files are actually corrupted. Yes, the hash is different. But does the file have actual visible errors? Is it Windows messing with archive bits or metadata somehow?
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
wblock said: "Given one FreeBSD-based machine with server-class hardware, including ECC RAM, and three Windows machines of any hardware... my suspicions would first land on the three Windows machines all being infected with the same virus. What has still not been mentioned is whether these data files are actually corrupted. Yes, the hash is different. But does the file have actual visible errors? Is it Windows messing with archive bits or metadata somehow?"

Well, @Stilez could always use the Windows PowerShell cmdlet Get-FileHash on both files. It works on local and remotely stored files. I believe it excludes metadata stored as an ADS, and ACLs / attributes.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
How are you checking these hashes? Show us the exact command; I want to replicate it.
 

ovizii

Patron
Joined
Jun 30, 2014
Messages
435
Are you using SMB shares when you copy from the Windows workstations to the server and back?
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
I had a quick reply in draft, but since then I've pinned down the issue (I think - still can't believe it, so if I'm wrong.....). I'm posting what I did and what I found, for anyone else who gets hit by a similar issue and finds this post.

Short version - it's something to do with either the 10G switch or the FreeNAS server 10G card. I can narrow it down from there easily so the problem is, with luck, solved. This is what's gone on:

  • I tried using SCP (client: WinSCP) instead of Windows Explorer. My aim was to exclude Samba by using a different transfer protocol. I then repeated the same test as before - 3 copies of the same data sent by client A, 3 copies from client B, then check the hashes on the server. I didn't get to the last part because one of the clients could hardly hold a stable connection and kept getting "Server has closed connection" errors. Even on maximum debug, WinSCP couldn't tell me more. The other client did connect, but was very slow over the usual SFTP, so I switched to SCP, which was much faster (I wasn't after security, just verification of the result). The 2nd client had the occasional disconnection, but much less often.
  • Not being sure where to look for a reason for the disconnections, I tried every server log written in the past hour, and unexpectedly found a big hint in /var/log/auth.log - a bunch of errors that were much more specific:

root@svr-1:/var/log # egrep -i "(corrupt|fatal|timeout|connect)" auth.log
Aug 18 15:47:21 svr-1 sshd[86517]: Connection reset by 192.168.1.99 port 57748 (preauth)
Aug 26 07:32:17 svr-1 sshd[22720]: Corrupted MAC on input.
Aug 26 07:32:17 svr-1 sshd[22720]: ssh_dispatch_run_fatal: Connection from 192.168.1.99 port 50865: message authentication code incorrect
Aug 26 07:32:39 svr-1 sshd[22765]: Corrupted MAC on input.
Aug 26 07:32:39 svr-1 sshd[22765]: ssh_dispatch_run_fatal: Connection from 192.168.1.99 port 50869: message authentication code incorrect
Aug 26 07:32:48 svr-1 sshd[22782]: Connection closed by 192.168.1.99 port 50873 (preauth)
Oct 3 16:17:20 svr-1 sshd[31709]: Timeout, client not responding.
Oct 3 17:48:27 svr-1 sshd[36473]: Timeout, client not responding.
Oct 3 18:17:27 svr-1 sshd[81056]: Timeout, client not responding.
Oct 4 10:23:31 svr-1 sshd[26493]: Corrupted MAC on input.
Oct 4 10:23:31 svr-1 sshd[26493]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58492: message authentication code incorrect
Oct 4 10:23:46 svr-1 sshd[27224]: Corrupted MAC on input.
Oct 4 10:23:46 svr-1 sshd[27224]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58510: message authentication code incorrect
Oct 4 10:29:30 svr-1 sshd[27494]: Corrupted MAC on input.
Oct 4 10:29:30 svr-1 sshd[27494]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58565: message authentication code incorrect
Oct 4 11:00:28 svr-1 sshd[32202]: Corrupted MAC on input.
Oct 4 11:00:28 svr-1 sshd[32202]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58705: message authentication code incorrect
Oct 4 11:06:09 svr-1 sshd[33133]: Corrupted MAC on input.
Oct 4 11:06:09 svr-1 sshd[33133]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58740: message authentication code incorrect
Oct 4 11:46:17 svr-1 sshd[16644]: Timeout, client not responding.
Oct 4 15:47:17 svr-1 sshd[91497]: Corrupted MAC on input.
Oct 4 15:47:17 svr-1 sshd[91497]: ssh_dispatch_run_fatal: Connection from 192.168.1.22 port 49307: message authentication code incorrect
Oct 4 15:49:46 svr-1 sshd[31891]: Corrupted MAC on input.
Oct 4 15:49:46 svr-1 sshd[31891]: ssh_dispatch_run_fatal: Connection from 192.168.1.22 port 49273: message authentication code incorrect
  • There was obviously a network data corruption issue of some kind. The networking topology is split between "domestic" and "enterprise" sides, so data flow for the tests is (clients ->1G Intel adapters -> 'dumb' desktop 1G switch -> 1G/10G switch -> FreeNAS 10G Chelsio adapter). I reallocated the Chelsio's IP to the onboard 1G Intel adapter and plugged the server directly into the 1G desktop switch, completely bypassing the 1G/10G switch and 10G card, to take those two out of the loop. (I had a spare 1G switch to swap if that had failed).
  • With the server NIC changed from 10G Chelsio to 1G Intel and the 10G switch and card out of the data path, the network was immediately happier. I repeated the above - multiple copies of the data via SCP from both clients to the FreeNAS box - and instead of occasional or frequent disconnects, it did 5 copies from each box without issue. The data rate was stable for the hour or whatever it took, at 115 MB/s, which is about the limit for a 1G link. So I gave it 4 more from each box, for a total of 9 copies of the data from each of 2 machines.
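To see at a glance which clients were hit, the MAC failures in auth.log can be tallied per source address. A small sketch, assuming the sshd message format shown in the excerpt above; the sample lines are copied from that excerpt, and `mac_failures_by_client` is a hypothetical helper:

```python
import re
from collections import Counter

# Matches the fatal line that sshd logs when a packet fails its integrity check.
MAC_FAILURE = re.compile(
    r"sshd\[\d+\]: ssh_dispatch_run_fatal: Connection from (\S+) port \d+: "
    r"message authentication code incorrect"
)

def mac_failures_by_client(log_lines):
    """Count 'message authentication code incorrect' events per client address."""
    counts = Counter()
    for line in log_lines:
        match = MAC_FAILURE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

SAMPLE = """\
Aug 26 07:32:17 svr-1 sshd[22720]: Corrupted MAC on input.
Aug 26 07:32:17 svr-1 sshd[22720]: ssh_dispatch_run_fatal: Connection from 192.168.1.99 port 50865: message authentication code incorrect
Oct 3 16:17:20 svr-1 sshd[31709]: Timeout, client not responding.
Oct 4 10:23:31 svr-1 sshd[26493]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58492: message authentication code incorrect
Oct 4 10:23:46 svr-1 sshd[27224]: ssh_dispatch_run_fatal: Connection from 192.168.1.29 port 58510: message authentication code incorrect
""".splitlines()

for address, count in sorted(mac_failures_by_client(SAMPLE).items()):
    print(address, count)
```

Fed the whole file (for example `open("/var/log/auth.log")`), the same function gives a per-client count that can be compared before and after a hardware swap.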

Result - not a single disconnect in either WinSCP or auth.log, and according to Excel, the hashes on the server for all 18 copies - whether from machine A or machine B - are now identical in every case. Meaning, 18 copies of each of 3154 files (90k or so of 5-10MB files). Ditto with a further 5 copies from each PC using Samba, when I tried that again. I'm going to let diff check this on the server to be absolutely sure. If the copies are identical according to diff, then I plan to find which component failed and donate it to a flamethrower testing service or the local axe-wielding display team ;). But I'm hopeful that this one's solved. (If not, I might reopen it.)

Thank you for the help and support, and ideas of things to check. Especially, thank you @rs225 for the tip that TCP/IP might *not* detect corruption in flight in some cases, I hadn't known that and it was crucial. Turns out that was exactly what it was.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Don't destroy the component unless there is something physically damaged. If it's a driver or software problem, maybe we can fix it.
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
wblock said: "Don't destroy the component unless there is something physically damaged. If it's a driver or software problem, maybe we can fix it."
Thanks @wblock - I might not be able to pin it down for a few days, maybe a week. But I will come back shortly bearing a broken desolate component in my teeth :eek:
 

Stilez

Guru
Joined
Apr 8, 2016
Messages
529
There's an interesting update to this. I asked if anyone knew of some tool suitable for testing network "in flight" corruption on Stack Exchange - I had in mind a tool somewhat like iperf but that tested the data itself for fidelity on receipt, and not just speed+jitter. The maintainer (I think) of iperf2 posted that he had seen a similar issue and might consider adding it to iperf2 if it was justified to do so. What a completely unexpected twist ending!
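The kind of check meant here can be roughed out with plain sockets: the sender attaches a SHA-256 digest to every block and the receiver counts blocks whose digest no longer matches. This is a sketch of the idea only, not iperf and not the tool discussed on Stack Exchange; the framing, block size and function names are all invented for the example.

```python
import hashlib
import os
import socket
import struct

BLOCK_SIZE = 64 * 1024
HEADER = struct.Struct("!I")  # payload length in network byte order
DIGEST_SIZE = hashlib.sha256().digest_size

def recv_exact(conn, count):
    """Read exactly count bytes from the socket or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < count:
        chunk = conn.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)

def send_blocks(host, port, blocks):
    """Send random blocks, each framed as length + payload + SHA-256 digest."""
    with socket.create_connection((host, port)) as conn:
        for _ in range(blocks):
            payload = os.urandom(BLOCK_SIZE)
            digest = hashlib.sha256(payload).digest()
            conn.sendall(HEADER.pack(len(payload)) + payload + digest)
        conn.sendall(HEADER.pack(0))  # a zero length marks the end of the run

def receive_blocks(listener):
    """Accept one connection and return (good, bad) block counts."""
    good = bad = 0
    conn, _addr = listener.accept()
    with conn:
        while True:
            (length,) = HEADER.unpack(recv_exact(conn, HEADER.size))
            if length == 0:
                return good, bad
            payload = recv_exact(conn, length)
            digest = recv_exact(conn, DIGEST_SIZE)
            if hashlib.sha256(payload).digest() == digest:
                good += 1
            else:
                bad += 1
```

The receiving side would create the listener with `socket.create_server((address, port))` and call `receive_blocks`, while the other machine calls `send_blocks`. A damaged length field would desynchronise this simple framing, so a real tool would need something sturdier, and a loopback run only exercises the code, not the network hardware.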

I'm still trying to pin down which exact hardware is at fault. It's slow - if I copy a 14 GB file across the LAN 400 times, 20 copies in parallel, then about 1/3 of the time 1-2 of the copies will be corrupt and the other 398-399 will be fine. So it takes a huge amount of time to narrow down as I swap components in and out. But I'll get there. For one thing I can let it run overnight. For now I sidestepped it by using completely different hardware, and it's fine, but I'll have to come back to this and track down the exact component at fault eventually.
 