@Constantin, you're a great member who contributes a lot of fun stuff, so let me state this isn't directed at you:
Also to be clear for new members, smartctl issues commands to the hard drive controller, which is part of the hard drive unit; it's the circuitry you see stuck to the bottom of the hard drive. For simplicity I'm going to refer to everything collectively as "the hard drive."
--all will report the S.M.A.R.T. information the drive knows about (this information is stored on the HDD controller stuck to the drive).
--xall will also print non-standard vendor-specific information.
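For anyone who hasn't run these yet, the two report commands look roughly like this; /dev/ada0 is just a placeholder, so substitute whatever device name your drive actually shows up as (and you'll generally need root):

smartctl --all /dev/ada0    # standard S.M.A.R.T. report (device name is only an example)
smartctl --xall /dev/ada0   # same report plus non-standard vendor-specific attributes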
--test=long tells the drive to run a short test as a sanity check, then read the entire disk and verify the data against the checksum data. If a block doesn't check out, the controller reads the Error Correction Information stored in a different set of blocks on that sector and tries to correct the error. If it can, the controller will write the data back to the original block, wait a bit, then read it back to see if it stored properly. If the previous magnetic pattern "got weak" from age (bit rot) then things should be fine. If not, the controller will write the data elsewhere on the drive, verify it, generate an ECC code and write that, update its bad-block table, and record where the bad block is now mapped. If the controller could not correct the error because the block is too corrupted, it reports the block as unrecoverable and the data as lost. With RAID-5 the controller will then reconstruct the data from parity data. Since an error occurred, the drive controller will update the S.M.A.R.T. data; TrueNAS will see what happened and report it, and a user can read the full report via smartctl --xall.
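In practice that looks something like the sketch below (again, /dev/ada0 is a placeholder). The long test runs in the background on the drive itself, so you come back and check on it later:

smartctl --test=long /dev/ada0    # tell the drive to start its long self-test
smartctl -l selftest /dev/ada0    # check the self-test log after the estimated runtime
smartctl --xall /dev/ada0         # full report, including the error log, if something turned up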
The drive will also record information on its own when things go wrong, like when the user is streaming Sex and the City and a block doesn't match the checksum, and it goes through the process mentioned previously.
In summary, and this is the point I'm trying to make: all of this happens on the drive, and if a long test isn't run the drive can only report what it already knows about, which is limited to what it remembers; if it blacks out due to an unexpected power issue, its memory might be a bit fuzzy.
---
If there is a problem external to the drive, like a bad connection, flaky HBA (because you zapped it repeatedly with high voltage), mainboard issue (don't get me started), corrupt cache, etc., the drive probably doesn't know about it. That's where badblocks comes in.
badblocks -v will cause the Operating System to try to read the whole drive. If there is a cabling or some other issue, you have a much better clue where the problem is located, because smartctl --test=long confirmed the drive seems to be fine (hopefully that was the result) but badblocks is reporting errors. Figuring out exactly what the problem is can be a bit dicey, as a non-ECC RAM problem can look like a controller issue. On that note, if your mainboard HDD controller has a cache and the cache is corrupt (a common problem on some HP servers), a full mainboard reset is needed to clear everything out. This again is why I buy extra drives and have an old system set up for testing, because it helps quickly isolate which component is causing the issue.

(If possible I actually have an exact copy of the Production server(s) running in the Test environment to swap out parts, and a second exact copy to put into the Production environment while a production server is down, but that's not always in the budget, which is why I often fudge my expense report. Well, that and other reasons, but that's a bit off-topic. Anyway, if the second backup server is an exact copy of the problematic server, then the second backup server simply replaces the downed production server and stays in production; what used to be the production server is tested, repaired, and becomes the backup server. This is why I have time to thoroughly test the failed server's equipment while still fraternizing with anything on two legs that doesn't identify as the same gender as myself.)
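As for the command itself, a minimal sketch; by default badblocks does a read-only pass, so it won't touch your data, and /dev/ada0 is again just a placeholder (the actual device name depends on your platform):

badblocks -v /dev/ada0     # read the entire drive through the OS/HBA path and report failing blocks
badblocks -vs /dev/ada0    # same thing, with a progress indicator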
---
Full Disclosure: There were two times in my career when the "second exact copy" server failed in short order while in production, so I had to scramble to replace it with the "first exact copy," then fix the "second exact copy," which in both cases was quick and easy, as I had a spare set of parts from the failed production server and the problems didn't overlap 100%. In both cases the "first exact copy" continued to run fine, although the hardware was a bit older and slower one of those times, so when both the systems sitting in the Test Environment burned in successfully I swapped the original "production server" back into Production. Admittedly, that was a bit nerve-racking, as I briefly had no backup system in those two cases, but women love to see a man in full action getting the job done against what seem like impossible odds. (The guys did too, but my gender identity precludes me from capitalizing on that admiration in quite the same way.)