Tonight, during an emerge -e system (to complete a GCC 4.2 transition), Enterprise’s hard disk started failing on me. /usr got unmounted while merging xcb-util back into the live fs, and quite some fiddling didn’t bring anything useful.
After running xfs_repair from SystemRescueCD, I ended up without /usr/lib64. The files are there, in lost+found, but the directory hierarchy is long gone.
This means that Enterprise, my main box, is now offline. Luckily /home is safe under a software RAID1 that should cover me even if one of the disks decides to give up. And in /home I keep basically everything but part of the PAM documentation (which is in /var anyway, so it's also unaffected by tonight's failure).
Thanks to Javier (Paya) I now have a plan to restore the data tomorrow: I'll check the MD5 of each file against the Portage database, and then put every file where it belongs. The problem with this is that I first have to make sure that the disks aren't dead at the hardware level, and I'm not sure how to do that.
Suggestions about that are welcome.
And as Joshua said... what is it with my luck? Farragut's disk died last week, this week it's Enterprise's, even though I run them with a LOT of fans to keep them cool. I suppose I should start NOT working during summer, and keep the computers offline.
Anyway, the bottom line is that you can't expect stuff coming from me until I can do a surface check of the two disks and recover my data. And if one of the disks is faulty, I'm afraid you'll have quite some time to wait till I get back: for a series of reasons, I don't want to run my main box counting on just one disk (and tonight has been a good demonstration of why). And for what it's worth, I'm in a pinch even with Farragut, as I now don't have a way to back up its content on another box.
This is what I hate about computers: maintenance costs.
Me too... the motherboard and CPU in my desktop died this summer and had to be replaced. (i.e. pretty much a new computer, just the same RAM and such.)
And I recently went through a similar endeavor. Could you please post instructions on how to use the MD5 hashes in the Portage DB to place the files where they belong? That would make a great “Tips and Tricks” entry for GWN.
Is your power good? Last time I had a bunch of failures close together, it was because of power dips (on some of my machines that didn’t warrant UPSs).
Ian, I hope you can reuse your RAM; if I had to change computer now, I'd have to change RAM too :( As for the use of MD5 checksums, I have a couple of ideas, but as usual I'll have to test them on the fly to make sure they work. And yeah, the power is fine. Well, the power sucks here, but as it already killed two and a half boxes of mine, I have all of them behind a UPS (SmartUPS right now), so I wouldn't expect that to be the problem. More likely it's heat or a kernel bug (dmesg reports internal errors in XFS, although what caused them is the main question).
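A rough, untested sketch of the MD5 idea, for whoever wants to experiment. It assumes the Portage database lives in /var/db/pkg and that each package's CONTENTS file lists regular files as "obj <path> <md5> <mtime>". The function names, the /usr/lost+found location and the /usr/lib64/ prefix filter are made up for illustration. It only prints where each orphan appears to belong and moves nothing.

```python
import hashlib
import os
from collections import defaultdict


def parse_contents(contents_path):
    """Yield (path, md5) for every regular-file ('obj') entry in a CONTENTS file."""
    with open(contents_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.startswith("obj "):
                continue  # 'dir' and 'sym' entries carry no checksum
            # Assumed format: obj <path> <md5> <mtime>. The path may contain
            # spaces, so peel the two trailing fields off from the right.
            try:
                path, md5, _mtime = line[4:].rsplit(" ", 2)
            except ValueError:
                continue  # malformed line, ignore it
            yield path, md5.lower()


def build_index(vdb_root="/var/db/pkg"):
    """Map each MD5 digest to the set of installed paths that carry it."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(vdb_root):
        if "CONTENTS" in files:
            for path, md5 in parse_contents(os.path.join(dirpath, "CONTENTS")):
                index[md5].add(path)
    return index


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large libraries do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_restore(lost_found, index, prefix="/usr/lib64/"):
    """Return (orphan, candidate_paths) pairs for orphans found in the index."""
    plan = []
    for dirpath, _dirs, files in os.walk(lost_found):
        for name in files:
            orphan = os.path.join(dirpath, name)
            if os.path.islink(orphan) or not os.path.isfile(orphan):
                continue  # only regular files have an MD5 in CONTENTS
            digest = md5_of(orphan)
            targets = sorted(p for p in index.get(digest, ()) if p.startswith(prefix))
            if targets:
                plan.append((orphan, targets))
    return plan


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the rescue system mounts things.
    for orphan, targets in plan_restore("/usr/lost+found", build_index()):
        print(orphan, "->", ", ".join(targets))
```

Orphans with more than one candidate path (identical files installed in several places) need a decision by hand, and directories and symlinks have to be recreated separately from the "dir" and "sym" entries, since those have no checksum to match against.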
You could check the SMART attributes of your hard drive.
too bad that the SystemRescueCD that I have here does not support smartctl over SATA 🙁
You tried with -d ata ?
Tried, but to no avail for now. I'm running badblocks over the first disk now; when the two badblocks runs are done, I'll boot a new copy of SystemRescueCD and see if that helps. It might not be a problem with SATA per se, but rather the hard disks themselves not being in the database for some tweak; I just remember older smartctl versions having problems with them.
I would probably use the drive vendor's software to check the disks; it usually comes as a boot floppy/CD. Hitachi has DFT, which is pretty good, Seagate also has something, and probably others do too.