First Story:
My first personal hard drive ever was a massive 40 MB monster bought for an Apple IIgs. I think I spent somewhere around $2500 for it, in late-1980s dollars. Fantastic piece of kit. Yes, the operating system was smart enough that it could map out bad blocks. Unfortunately, that early version was susceptible to a problem: if one of the key blocks of the hierarchical file system (ProDOS 16) went bad early in the disk, it wasn't able to self-recover. The effect was that the majority of my data appeared to be gone. No backups yet, because I hadn't yet bought my first tape drive. No budget to get another disk.
Don't panic: I grab the (really excellent) paper manuals that describe in detail how the filesystem is laid out on disk. I boot from other media, bring up a byte editor, and start hand-transcribing key/directory blocks to paper, figuring out where all the pointers go. I locate a good unallocated block elsewhere on the disk. I build a new directory block by hand, checking the math and the layout multiple times. Take a deep breath and modify the parent block to point to my new copy rather than the one that is throwing errors. Update the bad-blocks bitmap to include the problem block. Cross my fingers and reboot to the hard disk. Success; all the data has returned. That disk served me well for years afterward, and I gained religion on backups.
(Edit: Actually, the first, second, and nth unallocated blocks I found were also damaged. I recall performing the above procedure about a half-dozen times, essentially doing a binary search through the critical first part of the disk before I found a block that would work.)
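(For anyone curious what that repair looks like in concrete terms, here is a rough Python sketch of the same idea against a raw disk image, rather than by hand in a byte editor on the live drive. The 512-byte block size matches ProDOS, but the pointer offset, bitmap location, bitmap encoding, and example block numbers below are illustrative assumptions; the real numbers came out of those paper manuals.)

    BLOCK_SIZE = 512          # ProDOS uses 512-byte blocks
    VOLUME_BITMAP_BLOCK = 6   # hypothetical location of the free-space bitmap

    def read_block(img, n):
        img.seek(n * BLOCK_SIZE)
        return bytearray(img.read(BLOCK_SIZE))

    def write_block(img, n, data):
        img.seek(n * BLOCK_SIZE)
        img.write(data)

    def mark_block_unavailable(img, blk):
        # Clear the "free" bit for blk in the volume bitmap (assuming 1 = free,
        # most-significant bit first) so the bad block is never allocated again.
        bitmap = read_block(img, VOLUME_BITMAP_BLOCK)
        byte, bit = blk // 8, 7 - (blk % 8)
        bitmap[byte] &= ~(1 << bit) & 0xFF
        write_block(img, VOLUME_BITMAP_BLOCK, bitmap)

    def relocate_directory_block(img, bad_blk, free_blk, parent_blk, ptr_offset):
        # 1. Recover what is still readable from the failing directory block
        #    (by hand this was a reconstruction on paper, not a straight copy).
        rebuilt = read_block(img, bad_blk)
        # 2. Write the rebuilt block into a known-good, unallocated block.
        write_block(img, free_blk, rebuilt)
        # 3. Patch the parent's pointer (a hypothetical 2-byte little-endian
        #    field at ptr_offset) so it references the copy instead.
        parent = read_block(img, parent_blk)
        parent[ptr_offset:ptr_offset + 2] = free_blk.to_bytes(2, "little")
        write_block(img, parent_blk, parent)
        # 4. Retire the bad block in the bitmap, then reboot and hope.
        mark_block_unavailable(img, bad_blk)

    with open("iigs-disk.img", "r+b") as img:
        relocate_directory_block(img, bad_blk=0x123, free_blk=0x456,
                                 parent_blk=0x010, ptr_offset=0x02)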
Second Story:
After working as a software developer for a few years, I joined a local Sun Microsystems VAR to get experience on the systems side of the house. A co-worker and I go out to a client site, a small multinational. They have a development database server they want upgraded: more memory, the disks replaced, and a fresh copy of Solaris thrown on (it's getting burned down and they're starting from scratch). In the server room, while my co-worker preps the new hardware, I log into the local admin workstation, telnet to the DB server's ALOM, and shut down the OS with a power-off. Tell my co-worker he's good to go. He says, "That's weird," because he can still hear the fans going. He wanders over to my console and verifies that it's sitting at the PROM prompt and he can see the OS has shut down. He shrugs, goes back, kills the physical power switch, and starts pulling disks.
My cell phone goes off and it's the client's CTO. He tells me that he doesn't know what happened, but their production database server just died, halfway through month-end payroll. Chills go up my spine, and my co-worker's face goes white. I tell him, "OK, let me look into it and I'll get back to you shortly." I look back through the admin console history and note that the machine name is n32502p instead of n32502d. We check the labels on the machine whose parts are now spread across the anti-static mat on the floor. Answer: I shut down production (which is in a different city) rather than development, but it was the dev DB that was dismantled.
I quickly get back onto the ALOM for the production DB server and perform a remote boot, monitoring it for sanity on the way up. It was a controlled shutdown, so it comes up cleanly (it just takes a while, something like 20 minutes, which was not unusual at the time for such a beast). I call up the client, and before I can get a word in edgewise the CTO says, "I don't know what you did, but it all just came back on its own and we can see that payroll is now continuing to process where it left off," and he thanks me profusely. I proceed to tell him exactly what happened and whose fault it is (mine), and offer my apologies. He pauses and then says, "See, THAT's one reason why I like using your company. Not only do you know your stuff, but you don't try to bullsh*t people; <competitor> would have claimed ignorance as to the cause or made up a story." We remained the preferred vendor to that multinational for many years after that.
Lots of lessons were learned that day, some of which are:
- When doing privileged operations, double-check everything, even if you are not working in prod. (A rough sketch of the kind of pre-flight check I mean follows this list.)
- The difference between development and production hostnames should really be more than just a trailing 'd' vs 'p'. (This was not actually under our control, in this particular case.)
- Do not use the same credentials for development and production. Besides the usual opsec reasons, separate credentials also act as a sanity check and slow you down, making you think when you've started to touch the wrong system. (This, also, was not under our control.)
- No, don't share ALOM / BMC creds between development and production, either.
- If something doesn't make sense (in this case, the fans still running when the machine was supposed to be down), figure it out before you continue. On UNIX there is always an underlying reason for a behavior, even if you don't see it at first. (... bites tongue about Windows ...)
- It pays to be honest with your clients. If you make a mistake, own it, and then do your best to address it.
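As for the pre-flight check mentioned in the first lesson: nothing exotic is needed. A tiny guard, run on the target host before anything destructive, would have caught the 'd' vs 'p' mix-up. This is just a sketch of the idea; the trailing 'd'/'p' convention and the example hostname come from the story, the rest is an assumption, not something we actually ran.

    #!/usr/bin/env python3
    # Pre-flight guard: confirm you are on the host you think you are on
    # before doing anything destructive.
    import socket
    import sys

    def confirm_target(expected_host):
        actual = socket.gethostname().split(".")[0]
        if actual != expected_host:
            sys.exit("ABORT: you are on '%s', not '%s'." % (actual, expected_host))
        if actual.endswith("p"):  # production, by this site's naming convention
            reply = input("'%s' looks like PRODUCTION. Type the hostname to continue: " % actual)
            if reply.strip() != actual:
                sys.exit("ABORT: confirmation did not match.")

    if __name__ == "__main__":
        # e.g. ./preflight.py n32502d && <your destructive command>
        confirm_target(sys.argv[1] if len(sys.argv) > 1 else "n32502d")
        print("Host confirmed; proceed.")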