If you have the luxury of migrating your Linux installation to a new hard disk before the old one packs up entirely, it’s quite easily done with standard tools.
Recently on my Linux machine at work I started getting some concerning emails from
smartd which looked like this:
Device: /dev/sda [SAT], 7 Offline uncorrectable sectors
Any errors from
smartd are a cause for concern, but particularly this one. To
explain, a short digression — anybody familiar with SMART and hard drives can
skip the next three paragraphs.
Hard disks are split into sectors which are the smallest units which can be addressed[1]. Each sector corresponds to a tiny portion of the physical disk surface and any damage to the surface, such as scratches or particles of dust, may render one or more of these sectors inaccessible; these are often called bad sectors. This has been a problem since the earliest days of hard drives, so operating systems have been designed to cope with sectors that the disk reports as bad, avoiding their use for files.
Modern hard disk manufacturers have more or less accepted that some proportion of drives will have minor defects, so they reserve a small area of the disk for reallocated sectors, in addition to the stated capacity of the drive. When bad sectors are found, the drive’s firmware quietly relocates them to some of this spare space. The amount of space “wasted” by this approach is a tiny proportion of the space of the drive and saves manufacturers from having to deal with a steady stream of customers RMAing drives with a tiny proportion of bad sectors. There is a limit to the scope of this relocation, however, and when the spare space is exhausted the drive has no choice but to report the failures directly to the system[2].
The net result of all this is that by the time your system is reporting bad sectors to you, your hard disk has probably already had quite a few physical defects crop up. The way hard drives work, this often means the drive may be starting to degrade and may suffer a catastrophic failure soon — this was confirmed by a large-scale study by Google a few years ago. So, by the time your operating system starts reporting disk errors, it may be just about too late to practically do anything about it. This is where SMART comes in — it’s a method of querying information from hard disks, including such items as the number of sectors which the drive has quietly reallocated for you.
The smartd daemon uses SMART to monitor your disks and watch for changes
in the counters which may indicate a problem. Increases in the reallocated
sector count should be watched carefully — occasionally these might be isolated
instances, but if you see this number change continuously over a few days or
weeks then you should assume the worst and plan for your hard disk to fail at
any moment. The counter I mentioned above, the offline uncorrectable sector
count, is even worse — this means that the drive encountered an error it
couldn’t solve when reading or writing part of the disk. This is also a strong
indicator of failure.
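If you want to look at these counters yourself rather than wait for an email, a minimal sketch along these lines may help. It assumes the smartmontools package is installed and that your drive reports attributes under the common names Reallocated_Sector_Ct and Offline_Uncorrectable (names vary between vendors); smart_counters is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: filter `smartctl -A` output (read from stdin) down to
# the raw values of the two counters discussed above. The raw value is the
# tenth whitespace-separated column of the attribute table.
smart_counters() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Usage sketch (needs root; /dev/sda is an example device):
#   smartctl -A /dev/sda | smart_counters
```

Running this every few days and comparing the numbers is roughly what smartd does for you automatically.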
So, I know my hard disk is about to fail — what can I do about it? The instructions below cover my experiences on an Ubuntu 12.04 system, but the process should be similar for other distributions. Note that this is quite a low-level process which assumes a fair degree of confidence with Linux and is designed to duplicate exactly the same environment. You may find it easier to simply back up your important files and reinstall on to a fresh disk.
Since I use Linux, it turns out to be comparatively easy to migrate over to a
new drive. The first step is to obtain a replacement hard disk, then power the
system off, connect it up to a spare SATA socket and boot up again. At this
point, you should be able to partition it with
fdisk, presumably in the same
way as your current drive but the only requirement is that each partition is at
least big enough to hold all the current files in that part of your system.
Once you’ve partitioned it, format the partitions with, for example,
mkfs.ext4 and mkswap as appropriate. At this point, mount the non-swap partitions in
the way that they’ll be mounted in the final system but under some root — for
example, if you had just / and
/home partitions then you might do:
sudo mkdir /mnt/newhdd
sudo mount /dev/sdb1 /mnt/newhdd
sudo mkdir /mnt/newhdd/home
sudo mount /dev/sdb5 /mnt/newhdd/home
Important: make sure you replace
/dev/sdb with the actual device of your
new disk. You can find this out using:
sudo lshw -class disk
… and looking at the
logical name fields of the devices which are listed.
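With more than a couple of partitions the order matters, because a parent such as / has to be mounted before anything beneath it. As a rough sketch, the helper below (print_mount_plan is a hypothetical name) reads "device mountpoint" pairs and only prints the mkdir and mount commands in a workable order, so you can review them before running anything. The device names are the example ones from above.

```shell
# Hypothetical helper: reads "device mountpoint" pairs on stdin and prints the
# mkdir/mount commands needed to recreate that layout under the directory
# given as $1. It prints commands only; it does not run them.
print_mount_plan() {
    staging=$1
    # Sorting by mount point puts parents (/) before children (/home).
    LC_ALL=C sort -k2,2 | while read -r dev mnt; do
        # Strip a trailing slash so that "/" maps to the staging root itself.
        target="$staging${mnt%/}"
        printf 'mkdir -p %s\n' "$target"
        printf 'mount %s %s\n' "$dev" "$target"
    done
}

# Usage sketch:
#   printf '/dev/sdb5 /home\n/dev/sdb1 /\n' | print_mount_plan /mnt/newhdd
```

Once the printed commands look right they can be run with sudo, one at a time.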
At this point you’re ready to start copying files over to the new hard disk.
You can do this simply with
rsync, but you have to provide the appropriate
options to copy special files across and avoid copying external media and pseudo-filesystems:
rsync -aHAXvP --exclude="/mnt" --exclude="/lost+found" --exclude="/sys" \
    --exclude="/proc" --exclude="/run/shm" --exclude="/run/lock" / /mnt/newhdd/
You may wish to exclude other directories too. I suggest checking the output of
a command such as mount and excluding anything else which is mounted with
tmpfs, for example.
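One possible way to find those extra candidates is to read the kernel's mount table, sketched below. It assumes the /proc/mounts layout of "device mountpoint fstype options ..." per line, and tmpfs_excludes is a hypothetical helper name. Mount points containing spaces are escaped in that file and would need handling by hand.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin and print one
# rsync --exclude option for every filesystem of type tmpfs.
tmpfs_excludes() {
    awk '$3 == "tmpfs" { print "--exclude=" $2 }'
}

# Usage sketch on the running system:
#   tmpfs_excludes < /proc/mounts
```

The printed options can then be added to the rsync command line alongside the excludes already shown.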
You can leave this running in the background. It’s likely to take quite a long
time for a system which has been running for a while, and it might also impact
your system’s performance somewhat. It’s quite safe to abort and re-run another
time, since
rsync with those parameters will carry on where it left off.
While this is going on, make sure you have an up-to-date rescue disk available,
which you’ll need as part of the process. I happened to use the Ubuntu Rescue
Remix CD, but any reasonable rescue or live CD is likely to work.
It needs to have the
blkid utility and a text editor
available and be able to mount all your filesystem types.
Once that command has finished, you then need to wait for a time where you’re
ready to do the switch. Make sure you won’t need to be interrupted or use the
PC for anything for at least half an hour. If you had to abandon the process
and come back to it later, make sure you re-run the above
rsync command just
prior to doing the following — the idea is to make sure the two systems are as
closely synchronised as possible.
When you’re ready to proceed, shut down the system, open it up and swap the two
disks over. Strictly speaking you probably don’t need to swap them, but I like
to keep my system disk as
/dev/sda so it’s easier to remember. Just make sure
you remember that they’re swapped now!
Now boot the system into the rescue CD you created earlier. Get to a shell
prompt and mount your drives. Let’s say that the new disk is now /dev/sda
and the old one is
/dev/sdb, continuing the two partition example from
earlier, then you’d do something like this:
mkdir /mnt/newhdd /mnt/oldhdd
mount /dev/sda1 /mnt/newhdd
mount /dev/sdb1 /mnt/oldhdd
mount /dev/sda5 /mnt/newhdd/home
mount /dev/sdb5 /mnt/oldhdd/home
I’m assuming you’re already logged in as
root — if not, you’ll need to use
su as appropriate. This varies between rescue systems.
As you can see, the principle is to mount both old and new systems in the same
way as they will be used. At this point you can then invoke something similar
to the rsync from earlier, except with the source changed slightly. Note that
you don’t need all those
--exclude options any more because the only things
mounted should be the ones you’ve manually mounted yourself, which are all the
partitions you actually want to copy:
rsync -aHAXvP /mnt/oldhdd/ /mnt/newhdd/
Once this final
rsync has finished, you’ll need to tweak a few things on your
target drive before you can boot into it. After this point do not run rsync
again or you will undo the changes you’re about to make.
First, you need to update
/mnt/newhdd/etc/fstab to reflect your new hard
disk. If you take a look, you’ll probably find that the lines for the standard
partitions start with a UUID= entry identifying a partition on the old disk.
What you need to do is replace these UUIDs with the ones from your new drive.
You can find this out by running
blkid which should output something like this:
/dev/sda1: UUID="b7299d50-8918-459f-9168-2a743f462658" TYPE="swap"
/dev/sda2: LABEL="/" UUID="43f6065c-d141-4a64-afda-3e0763bbbc9a" TYPE="ext4"
/dev/sdb1: UUID="8affb32a-bb25-8fa2-8473-2adc443d1900" TYPE="swap"
/dev/sdb2: LABEL="/" UUID="d3964aa8-f237-4b34-814b-7176719b2e42" TYPE="ext4"
What you want is to copy the UUID fields for your new disk over
the top of the old ones in fstab. Be careful not to accidentally copy a stray quote or similar.
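If you would rather not copy and paste UUIDs by hand, the substitution can be scripted. The following is only a sketch under some assumptions: uuid_of and replace_uuid are hypothetical helper names, the blkid output looks like the sample above, and the device names are the example ones used in this post.

```shell
# Hypothetical helper: print the UUID that blkid output (on stdin) reports
# for one device, e.g. /dev/sda2.
uuid_of() {
    dev=$1
    # "|" is the sed delimiter so the slashes in $dev need no escaping.
    # The space before UUID= avoids matching fields such as PARTUUID=.
    sed -n "s|^$dev:.* UUID=\"\([^\"]*\)\".*|\1|p"
}

# Hypothetical helper: copy stdin to stdout, swapping one UUID for another.
# UUIDs contain only hex digits and hyphens, so they are safe inside the
# sed expression.
replace_uuid() {
    old=$1
    new=$2
    sed "s/$old/$new/g"
}

# Usage sketch (as root in the rescue system; repeat for each partition):
#   old=$(blkid | uuid_of /dev/sdb2)
#   new=$(blkid | uuid_of /dev/sda2)
#   cp /mnt/newhdd/etc/fstab /mnt/newhdd/etc/fstab.orig
#   replace_uuid "$old" "$new" < /mnt/newhdd/etc/fstab.orig > /mnt/newhdd/etc/fstab
```

Keeping the fstab.orig copy means you can compare the two files afterwards and confirm that only the UUIDs changed.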
The other thing you need to do is change the
grub.cfg file to refer to the
new IDs. Typically this file is auto-generated, but you can use a simple
search and replace to update the IDs in the old file. First
grep the file for
the UUID of the old root partition to make sure you’re changing the right thing:
grep d3964aa8-f237-4b34-814b-7176719b2e42 /mnt/newhdd/boot/grub/grub.cfg
Then replace it with the new one, with something like this:
cp /mnt/newhdd/boot/grub/grub.cfg /mnt/newhdd/boot/grub/grub.cfg.orig
sed 's/d3964aa8-f237-4b34-814b-7176719b2e42/43f6065c-d141-4a64-afda-3e0763bbbc9a/g' \
    /mnt/newhdd/boot/grub/grub.cfg.orig > /mnt/newhdd/boot/grub/grub.cfg
As an aside, there’s probably a more graceful way of using a tool such as
update-grub to re-write the new configuration, but I found it a lot easier just to do the
search and replace like this.
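Before moving on it may be worth confirming that nothing on the new disk still refers to the old root partition. A small sketch, assuming the two files edited above are the only ones that matter; stale_uuid_files is a hypothetical helper name.

```shell
# Hypothetical check: print the name of each given file that still mentions
# the old UUID. Prints nothing, and returns non-zero as grep does, when none
# of the files mention it.
stale_uuid_files() {
    old=$1
    shift
    grep -l -- "$old" "$@"
}

# Usage sketch with the example UUID of the old root partition:
#   stale_uuid_files d3964aa8-f237-4b34-814b-7176719b2e42 \
#       /mnt/newhdd/etc/fstab /mnt/newhdd/boot/grub/grub.cfg
```

No output here is the result you want; any file name printed still needs editing.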
At this point you should install the
grub bootloader on to the new disk’s boot sector:
grub-install --recheck --no-floppy --root-directory=/mnt/newhdd /dev/sda
Finally, you should be ready to reboot. Cross your fingers!
If your system doesn’t come back up then I suggest you use the rescue CD to fix things. Also, since you haven’t actually written anything to the old disk, you should always be able to swap the disks back and try to figure out what went wrong.
At this point you should shut your system down again, remove the old disk entirely and try booting up again. If your system came back up before and fails now then it was probably booting off the old disk, which probably indicates the boot sector didn’t install on the new disk properly.
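If you want to check which disk a running system actually booted from before pulling the old one, one rough approach is to look at what the kernel reports as mounted on /. This sketch assumes the /proc/mounts layout of "device mountpoint fstype ..." per line; root_source is a hypothetical helper name.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin and print the
# source of the filesystem mounted at "/". The last match wins, because the
# kernel may list an initial "rootfs" entry before the real root.
root_source() {
    awk '$2 == "/" { src = $1 } END { if (src != "") print src }'
}

# Usage sketch on the freshly booted system:
#   root_source < /proc/mounts
#   sudo blkid    # compare the result against the UUIDs of the new disk
```

Depending on the system the source is shown either as a device such as /dev/sda2 or as a /dev/disk/by-uuid/ path, and either form can be matched against the blkid output.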
Hopefully that’s been of some help to someone — by all means leave a comment if you have any issues with it or you think I’ve made a mistake. Good luck!
[2] Strictly speaking there’s also a small performance penalty when accessing a reallocated sector on a drive, so they’re also bad news in performance-critical servers, though typically this isn’t relevant to most people.