First Story:
My first personal hard drive ever was a massive 40 MB monster bought for an Apple IIgs. I think I spent somewhere around $2500 for it, in late-1980s dollars. Fantastic piece of kit. Yes, the operating system was smart enough that it could map out bad blocks. Unfortunately, that early version was susceptible to a problem: if one of the key blocks of the hierarchical file system (ProDOS 16) went bad early in the disk, it wasn't able to self-recover. The effect was that the majority of my data appeared to be gone. No backups yet, because I hadn't yet bought my first tape drive. No budget to get another disk.
Don't panic: I grab the (really excellent) paper manuals that describe in detail how the filesystem is laid out on disk. I boot from other media, bring up a byte editor, and start hand-transcribing key/directory blocks to paper, figuring out where all the pointers go. I locate a good unallocated block elsewhere on the disk. I build a new directory block by hand, checking the math and the layout multiple times. Take a deep breath and modify the parent block to point to my new copy rather than the one that is throwing errors. Update the bad-blocks bitmap to include the problem block. Cross my fingers and reboot to the hard disk. Success; all the data has returned. That disk served me well for years afterward, and I gained religion on backups.
(Edit: Actually, the first, second, and nth unallocated blocks I found were also damaged. I recall performing the above procedure about a half-dozen times, essentially doing a binary search through the critical first part of the disk before I found a block that would work.)
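(For anyone curious what that repair looks like in concrete terms, here is a rough Python sketch of the same idea against a raw disk image, rather than by hand in a byte editor on the live drive. The 512-byte block size matches ProDOS, but the pointer offset, bitmap location, bitmap encoding, and example block numbers below are illustrative assumptions; the real numbers came out of those paper manuals.)

    BLOCK_SIZE = 512          # ProDOS uses 512-byte blocks
    VOLUME_BITMAP_BLOCK = 6   # hypothetical location of the free-space bitmap

    def read_block(img, n):
        img.seek(n * BLOCK_SIZE)
        return bytearray(img.read(BLOCK_SIZE))

    def write_block(img, n, data):
        img.seek(n * BLOCK_SIZE)
        img.write(data)

    def mark_block_unavailable(img, blk):
        # Clear the "free" bit for blk in the volume bitmap (assuming 1 = free,
        # most-significant bit first) so the bad block is never allocated again.
        bitmap = read_block(img, VOLUME_BITMAP_BLOCK)
        byte, bit = blk // 8, 7 - (blk % 8)
        bitmap[byte] &= ~(1 << bit) & 0xFF
        write_block(img, VOLUME_BITMAP_BLOCK, bitmap)

    def relocate_directory_block(img, bad_blk, free_blk, parent_blk, ptr_offset):
        # 1. Recover what is still readable from the failing directory block
        #    (by hand this was a reconstruction on paper, not a straight copy).
        rebuilt = read_block(img, bad_blk)
        # 2. Write the rebuilt block into a known-good, unallocated block.
        write_block(img, free_blk, rebuilt)
        # 3. Patch the parent's pointer (a hypothetical 2-byte little-endian
        #    field at ptr_offset) so it references the copy instead.
        parent = read_block(img, parent_blk)
        parent[ptr_offset:ptr_offset + 2] = free_blk.to_bytes(2, "little")
        write_block(img, parent_blk, parent)
        # 4. Retire the bad block in the bitmap, then reboot and hope.
        mark_block_unavailable(img, bad_blk)

    with open("iigs-disk.img", "r+b") as img:
        relocate_directory_block(img, bad_blk=0x123, free_blk=0x456,
                                 parent_blk=0x010, ptr_offset=0x02)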
Second Story:
After working as a software developer for a few years, I joined a local Sun Microsystems VAR to get experience on the systems side of the house. A co-worker and I go out to a client site, a small multinational. They have a development database server they want upgraded: more memory, the disks replaced, and a fresh copy of Solaris thrown on (it's getting burned down and they're starting from scratch). In the server room, while my co-worker preps the new hardware, I log into the local admin workstation, telnet to the DB server's ALOM, and shut down the OS with a power-off. Tell my co-worker he's good to go. He says, "That's weird," because he can still hear the fans going. He wanders over to my console and verifies that it's sitting at the PROM prompt and he can see the OS has shut down. He shrugs, goes back, kills the physical power switch, and starts pulling disks.
My cell phone goes off and it's the client's CTO. He tells me that he doesn't know what happened, but their production database server just died, halfway through month-end payroll. Chills go up my spine, and my co-worker's face goes white. I tell him, "OK, let me look into it and I'll get back to you shortly." I look back through the admin console history and note that the machine name is n32502p instead of n32502d. We check the labels on the machine whose parts are now spread across the anti-static mat on the floor. Answer: I shut down production (which is in a different city) rather than development, but it was the dev DB that was dismantled.
I quickly get back onto the ALOM for the production DB server and perform a remote boot, monitoring it for sanity on the way up. It was a controlled shutdown, so it comes up cleanly (it just takes a while, something like 20 minutes, which was not unusual at the time for such a beast). I call up the client, and before I can get a word in edgewise the CTO says, "I don't know what you did, but it all just came back on its own and we can see that payroll is now continuing to process where it left off," and he thanks me profusely. I proceed to tell him exactly what happened and whose fault it is (mine), and offer my apologies. He pauses and then says, "See, THAT's one reason why I like using your company. Not only do you know your stuff, but you don't try to bullsh*t people; <competitor> would have claimed ignorance as to the cause or made up a story." We remained the preferred vendor to that multinational for many years after that.
Lots of lessons were learned that day, some of which are:
- When doing privileged operations, double-check everything, even if you are not working in prod. (A rough sketch of the kind of pre-flight check I mean follows this list.)
- The difference between development and production hostnames should really be more than just a trailing 'd' vs 'p'. (This was not actually under our control, in this particular case.)
- Do not use the same credentials for development and production. Besides the usual opsec reasons, separate credentials also act as a sanity check and slow you down, making you think when you've started to touch the wrong system. (This, also, was not under our control.)
- No, don't share ALOM / BMC creds between development and production, either.
- If something doesn't make sense (in this case, the fans still running when the machine was supposed to be down), figure it out before you continue. On UNIX there is always an underlying reason for a behavior, even if you don't see it at first. (... bites tongue about Windows ...)
- It pays to be honest with your clients. If you make a mistake, own it, and then do your best to address it.
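As for the pre-flight check mentioned in the first lesson: nothing exotic is needed. A tiny guard, run on the target host before anything destructive, would have caught the 'd' vs 'p' mix-up. This is just a sketch of the idea; the trailing 'd'/'p' convention and the example hostname come from the story, the rest is an assumption, not something we actually ran.

    #!/usr/bin/env python3
    # Pre-flight guard: confirm you are on the host you think you are on
    # before doing anything destructive.
    import socket
    import sys

    def confirm_target(expected_host):
        actual = socket.gethostname().split(".")[0]
        if actual != expected_host:
            sys.exit("ABORT: you are on '%s', not '%s'." % (actual, expected_host))
        if actual.endswith("p"):  # production, by this site's naming convention
            reply = input("'%s' looks like PRODUCTION. Type the hostname to continue: " % actual)
            if reply.strip() != actual:
                sys.exit("ABORT: confirmation did not match.")

    if __name__ == "__main__":
        # e.g. ./preflight.py n32502d && <your destructive command>
        confirm_target(sys.argv[1] if len(sys.argv) > 1 else "n32502d")
        print("Host confirmed; proceed.")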