A Tale of Sadness, Frustration, and Data Loss

It started on a Saturday evening with my wife asking why our DVR suddenly stopped playing a show she was watching. I told her it was probably just some glitch, but I’d take a look. I walked into the family room, and the error basically stated that the underlying disk was no longer available. Not good! This was the start of my three-day horror story…

A little background

My DVR is actually just specialized software (SageTV for those that are curious) running on a PC. The software is very flexible and lets you separate out all the various aspects of it. I have a separate machine for centralized control, scheduling, and recording, separate machines for playback, and the star of this story, a separate machine for storage. For storage I use a Linux file server, utilizing LVM (Logical Volume Manager) for aggregating many separate, non-identical drives into one large (~6TB at present) logical drive that the operating system sees. Since backing up multiple TB of data is impractical, and since said data is “just” TV shows, my backup philosophy for this has always been to just not care. Until recent events, this philosophy had not been tested by a real-world event.
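For anyone who hasn’t used LVM, the basic idea is that each physical drive becomes a “physical volume,” the physical volumes are pooled into a “volume group,” and one big “logical volume” is carved out of the pool. A rough sketch of that kind of setup is below; the device names, the vg_media/lv_recordings names, and the ext4 filesystem are illustrative placeholders rather than my actual configuration.

    # Label each raw drive as an LVM physical volume (PV)
    pvcreate /dev/sdb /dev/sdc /dev/sdd

    # Pool the PVs into a single volume group (VG)
    vgcreate vg_media /dev/sdb /dev/sdc /dev/sdd

    # Carve one logical volume (LV) out of all the free space in the pool
    lvcreate -l 100%FREE -n lv_recordings vg_media

    # Format and mount it like any ordinary disk
    mkfs.ext4 /dev/vg_media/lv_recordings
    mount /dev/vg_media/lv_recordings /mnt/recordings

    # A new, non-identical drive can be folded in later to grow the pool
    pvcreate /dev/sde
    vgextend vg_media /dev/sde
    lvextend -l +100%FREE /dev/vg_media/lv_recordings
    resize2fs /dev/vg_media/lv_recordings
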

Attempting To Recover The Data

Upon seeing the error on the DVR, I immediately start looking at the storage server. The filesystem is incredibly sluggish and slow to respond, so I query LVM about the state of the physical drives underlying its logical volume. After a long delay, it comes up and says a 750 GB drive is missing. Uh-oh! I reboot the server and amazingly, the drive comes back. I issue a pvmove command to automatically migrate all the data off that drive, but it fails at less than 2% complete.
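For the curious, the query and the migration attempt were along these lines, with a placeholder device name standing in for the failing 750 GB drive:

    # List the physical volumes in the pool, with their size and free space;
    # a failing or vanished drive shows up here as missing/unknown
    pvs

    # Try to migrate every allocated extent off the suspect drive onto the
    # remaining drives in the volume group
    pvmove /dev/sdX
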
Faced with a drive that is being very uncooperative about reading its data, but at least shows up in the BIOS, I turn to my favorite drive recovery tool, Spinrite. Although Spinrite normally boots from removable media, years ago I set up network booting at my house for various utilities so I didn’t have to worry about keeping track of any media. Normally I just connect to my network, select boot from network, and I have a variety of tools at my disposal to fix many problems. The problem is that the machine that makes all this magic work is the same machine that’s currently down.
No big deal, I say, I’ll just boot from a Spinrite CD. Except a couple of years ago the optical drive on my file server gave up the ghost, and at the time I decided that since I never use optical media in that machine, I didn’t need to replace it. Not to worry, I told myself, I’ll just take the optical drive out of my main computer. I power off my main computer and take out the optical drive. Then I look for my Spinrite boot CD. Can’t find it! We moved into a new house a few months ago, so everything is in a bit of disarray. I figure I’ll just burn a new copy, but I can’t even find any blank optical media! On to the next plan: a bootable flash drive! After a few minutes on Google to refresh my memory, I have a bootable Spinrite flash drive.
I boot my Linux box off that and launch Spinrite. The computer freezes up and seems to crash. Seeking to eliminate variables, I move the bad drive off the PCI-e expansion card and plug it directly into the motherboard. Now Spinrite launches fine, but takes ages and ages to enumerate the connected drives. I systematically unplug every other drive except the bad one, but it never does finish enumerating no matter how long I wait. On to the next plan!
I take the drive out of my Linux box, connect it to my main computer, and boot from my shiny new Spinrite flash drive. Spinrite launches and sees the drive immediately, and I tell it to start recovering data, satisfied that I’m finally making some progress. I go back to check on it after perhaps 10 minutes, and there is an error on the screen: the drive has once again disappeared. Frustrated, I try a few more times, telling Spinrite to start at various portions of the drive, but get the same result each time. It seems this isn’t going to help me after all.
In a fit of irrational hope, I put the drive back in my Linux box and power it up. To my amazement, the drive shows up and LVM brings everything active. Pressing my luck further, I issue another pvmove command to try to move the data off the drive again. Early on, I see error messages about not being able to read from the drive, but amazingly, the pvmove continues to make progress, getting closer and closer to 100% complete. A mixture of confusion, relief, and excitement washes over me. Am I going to get out of this unscathed? Sadly, the last thing LVM does under the covers to cleanly finish a pvmove is to write updated volume group metadata to all the drives under its control. This of course fails when it tries to write to the bad drive, and so it aborts the whole process. Defeat snatched from the jaws of victory once again!
I dive back into Google and discover it’s possible to control how much data the pvmove command moves instead of moving ALL of the data in one shot. I experiment with this and have good success moving a tiny portion of my data at a time. I get greedy and the drive disappears a few times, but it always comes back after a power cycle of the computer. Theorizing that perhaps only certain portions of the drive are bad, I start jumping around instead of working from the beginning of the drive. After a few iterations of this, I have all but 40 GB of the 750 GB safely moved off the drive. The remaining 40 GB refused to move no matter what I tried. It was now Sunday evening and I was exhausted, so I decided to go to bed and tackle the problem again the next day.
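The trick I stumbled onto, roughly, is that pvmove accepts a range of physical extents instead of a whole device, so you can pick away at the drive a slice at a time and skip over regions that refuse to read. The device name and extent numbers here are just for illustration:

    # Show how many physical extents (PEs) the drive holds
    pvdisplay /dev/sdX

    # Move only the first 1000 extents instead of the entire PV
    pvmove /dev/sdX:0-999

    # Skip ahead and work on a later slice, leaving the stubborn region for last
    pvmove /dev/sdX:5000-5999
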
The following day, after some sleep and the first half of my day at work, I decide to just bite the bullet, since I don’t really care about the last 40 GB of recorded TV shows, and set about removing the drive from my LVM configuration. I’ve done this many times before, so it goes quite smoothly. Next on the cleanup list is repairing the hole in the middle of the filesystem. I figure that with only 40 GB missing instead of 750 GB it can’t be too bad, right? Wrong! After the repair, I have 900 GB more free space than before the ordeal started, which means the repair threw away far more than the 40 GB I couldn’t recover, and that stings quite a bit. Oh well, I tell myself, it was just TV anyway. My DVR is finally functional again after its three-day hiatus, and I can at last stop thinking about this with every spare brain cycle.
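For reference, removing a drive from LVM the normal way, plus the filesystem repair, looks roughly like this. The names are the same placeholders as before, the ext4 fsck is illustrative since I haven’t said which filesystem the volume uses, and a drive that still holds unreadable extents takes more coaxing than a clean removal does.

    # Normal removal: once a PV has no allocated extents left (after pvmove),
    # drop it from the volume group and wipe its LVM label
    vgreduce vg_media /dev/sdX
    pvremove /dev/sdX

    # If the drive has vanished entirely, this removes the missing PV instead
    # vgreduce --removemissing vg_media

    # Then repair the filesystem on the logical volume (unmount it first)
    umount /mnt/recordings
    e2fsck -f /dev/vg_media/lv_recordings
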

Lessons Learned

So what did I learn from all this? First, I should have done a better job of focusing on what really mattered. This happened a few weeks ago, and in that time I haven’t even missed any of the TV content that disappeared. I do, however, regret preventing myself, and more importantly my family, from being able to use the TV for three days, and putting myself in high-stress crisis mode for those same three days. If I had given up on recovering the data at the beginning, everything would have been working again in about an hour, not three days. I know all too well that most of the time our data is precious, but in this situation it was not.
Secondly, if your data really is precious, and 99% of the time it truly is, you need to protect it! Back up your data; there are no excuses. For my data that is irreplaceable, like the thousands of pictures of my son I have on my computer, I make sure to keep it in no fewer than three places, one of which is a cloud backup provider. As for the DVR storage, I still don’t think it’s practical to back it up to the cloud, but with the price of drives these days I have no excuse not to have it protected by RAID, and that’s just what I’m going to do. When I first set up my storage pool years ago, I think it took me 10 drives or more to get to multiple TB of capacity. I just checked prices, and you can now purchase a 3 TB drive for well under $100. I simply have no excuse for leaving my data unprotected, and if a data loss like this happens to me again, it will truly be my own fault.
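For what it’s worth, LVM can add that redundancy in place: an existing logical volume can be converted to a mirrored (RAID1) layout, provided there is enough free space on other drives in the group to hold the second copy. A rough sketch, using the same placeholder names as above and just one of several possible RAID layouts:

    # Convert the existing linear LV to RAID1 with one mirror copy;
    # this needs roughly the LV's size again in free space on other PVs
    lvconvert --type raid1 -m 1 vg_media/lv_recordings

    # Watch the mirror synchronize in the background
    lvs -a -o name,copy_percent,devices vg_media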
