Linux Software RAID and SATA Hot Swap
Why Software RAID?
I know there are a million pages online about Linux Software RAID, but I wanted to record my own experience with it.
My home server has a lot of storage:
- a 160GB RAID1 array, for my boot volume, on one of the motherboard's RAID controllers
- a 500GB RAID1 array, for backups and pictures, again on a RAID controller
- a 2TB RAID1 array, for home directories and virtual machines, on a RAID controller
- a 3TB RAID1 array, for other stuff, using software RAID.
- a single 3TB drive, for daily backups of my 3TB array
My motherboard is a number of years old now, and the onboard controllers could not do RAID for 3TB drives, as they only recognized them as 873GB. So I left these as standard drives, and set them up in software RAID.
My goal for this endeavor was to convert my 500GB and 2TB over to software RAID. The reasons being:
- Actually getting notifications regarding any issues
- Control over rebuilds, being able to add/remove disks
- Not being tied to a specific RAID controller with a specific firmware version. If the motherboard were to die, I can easily move the drives.
- No reboots required to work with the drives
- Linux can do SATA hot swap, so I don't need to power down to swap a disk
The minor performance hit isn't an issue, so the pros far outweigh the cons.
Fiasco #1: The 500GB Array
I decided to do the 500GB array first, since it was small and quick to work with.
I moved the data off the drive, rebooted the server to get into the BIOS, deleted the array, then booted the server back up. Then I (not showing any of these steps, you'll see why...):
- partitioned the drives using fdisk
- created the RAID1 array and waited for it to sync
- formatted it
- mounted the drive
- put all my files back on
Then I rebooted the server, and what do I get? NO OPERATING SYSTEM FOUND
I shut down the server and unplugged the two 500GB drives, and it found the operating system just fine. The 3TB array is using software RAID, but didn't trigger the same issue. Why? To have drives >2.2TB, you need a GUID Partition Table (GPT) [1] on the drive, not the standard msdos partition table. My motherboard won't attempt to boot from a GPT drive.
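The 2.2TB figure comes from the msdos partition table storing sector addresses as 32-bit numbers. A quick check of the arithmetic, assuming 512-byte logical sectors (which is what parted reports for these drives); the function name is only for illustration:

```shell
# Largest size addressable by an msdos (MBR) partition table:
# 2^32 sectors multiplied by the logical sector size in bytes.
mbr_limit_bytes() {
    sector_size=${1:-512}
    echo $(( 4294967296 * sector_size ))
}

mbr_limit_bytes 512   # prints 2199023255552, i.e. about 2.2TB
```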
Now to rebuild the array using GPT drives...
Rebuilding The Array
I could not boot the server with the drives plugged in, and running them on USB to SATA converters is just horrible. What to do? Linux supports SATA hot swap! I booted the server up, then just plugged the drives in. They were instantly recognized by the system and added as sd[x] devices.
- Reenable the array
mdadm -A /dev/md1
- Mount the drive
mount /dev/md1 /mnt/500GB-array
- Move all the data off the drive
- Stop the array, zero the superblocks and remove the array
mdadm -S /dev/md1
mdadm --zero-superblock /dev/sdi1
mdadm --zero-superblock /dev/sdh1
mdadm --remove /dev/md1
rm /dev/md1
- Create a new partition table and partitions using parted on the first drive
[root@fileserver dev]# parted sdi
GNU Parted 1.8.1
Using /dev/sdi
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ATA ST3500630AS (scsi)
Disk /dev/sdi: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End    Size   Type     File system  Flags
 1      32.3kB  500GB  500GB  primary               raid
(parted) rm 1
(parted) print
Model: ATA ST3500630AS (scsi)
Disk /dev/sdi: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number  Start   End    Size   Type     File system  Flags
(parted) mklabel
Warning: The existing disk label on /dev/sdi will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? yes
New disk label type? [msdos]? gpt
(parted) unit GB
(parted) mkpart primary 0.00GB 500.0GB
(parted) print
Model: ATA ST3500630AS (scsi)
Disk /dev/sdi: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Number  Start   End    Size   File system  Name     Flags
 1      0.00GB  500GB  500GB               primary
(parted) quit
Update: This can be done with one CLI command:
parted /dev/sda mklabel gpt unit TB mkpart primary 0.00TB 2.00TB
- Do the same thing on the second drive
- Create the array
[root@fileserver dev]# mdadm --create /dev/md1 --level=1 --metadata=1.2 --raid-devices=2 /dev/sdh1 /dev/sdi1
mdadm: metadata format 1.02 unknown, ignored.
mdadm: metadata format 1.02 unknown, ignored.
mdadm: array /dev/md1 started.
[root@fileserver dev]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md1 : active raid1 sdi1[1] sdh1[0]
      488386414 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.1% (783360/488386414) finish=103.7min speed=78336K/sec
md0 : active raid1 sdc1[0] sdd1[1]
      5860532736 blocks super 1.2 [2/2] [UU]
unused devices: <none>
- At any point during the resync, you can format the drive, mount it, and start using it. Obviously, it will not have working redundancy until it has fully synced.
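To keep an eye on the sync without reading the whole of /proc/mdstat each time, the percentage can be pulled out with sed. This is a minimal sketch: md_progress is a made-up helper name, and it simply reports the first resync, recovery or reshape it finds rather than a specific array.

```shell
# Print the progress percentage of the first resync, recovery or reshape
# found in /proc/mdstat-style text on stdin; print nothing if none is running.
md_progress() {
    sed -n -E 's/.*(resync|recovery|reshape) *= *([0-9.]+)%.*/\2/p' | head -n 1
}

# Demo using the mdstat output shown earlier; prints 0.1
md_progress <<'EOF'
md1 : active raid1 sdi1[1] sdh1[0]
      488386414 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.1% (783360/488386414) finish=103.7min speed=78336K/sec
EOF
```

On a live system the call would be md_progress < /proc/mdstat.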
To have the array detected during bootup, save the config like this (note that > overwrites the file, so use >> instead if it already lists other arrays):
mdadm -Q --detail --brief /dev/md1 > /etc/mdadm.conf
It's also a good idea to add this line to your /etc/mdadm.conf file:
MAILADDR you@yourdomain.com
Obviously, put your email address in there. Then you will get notifications of any RAID events.
By default, ext2/3/4 filesystems reserve 5% of their blocks for the root user. This really adds up as drives get bigger. For example, 5% of 3TB is 150GB. You can adjust this using tune2fs.
tune2fs -m 0.5 /dev/sda1
The 0.5 is the percentage of the drive. It can be any integer or decimal.
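To see what a given percentage costs before running tune2fs, the arithmetic is easy to script. A small sketch, using decimal GB and a hypothetical helper name:

```shell
# Space held back for root, in GB, on a filesystem of SIZE_GB
# with PERCENT of its blocks reserved.
reserved_gb() {
    awk -v size="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", size * pct / 100 }'
}

reserved_gb 3000 5      # default reservation on a 3TB drive: 150.0
reserved_gb 3000 0.5    # after tune2fs -m 0.5: 15.0
```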
Fiasco #2: The 2TB Array
Next on the agenda was to make the 2TB RAID1 array into a software array.
One of my motivators behind this project was to correct the 37 bad sectors that showed up on one of my 2TB drives. I figured I would work that in between deleting the hardware array and creating the software array. I was going to use the Linux program badblocks to verify and fix the drive.
Using badblocks
Badblocks is a handy program, similar to Spinrite. In destructive write mode (-w), it makes 4 passes over the drive by default:
- The first pass fills every byte on the drive with the bit pattern "10101010" (0xaa), then reads it back to verify
- The second pass fills every byte with the pattern "01010101" (0x55), then verifies
- The third pass writes all ones to the drive, and verifies.
- The final pass writes all zeroes to the drive, and verifies.
If there are no issues with the drive, it takes about 24 hours to run. And at the end, you have a completely zeroed drive.
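The four patterns are the hex values badblocks prints while it runs (0xaa, 0x55, 0xff, 0x00). A short sketch that expands each one into its bits, to show how they line up with the passes described above; byte_to_bits is just an illustrative helper:

```shell
# Print an 8-bit value (decimal or 0x-prefixed hex) as a string of bits.
byte_to_bits() {
    value=$(( $1 ))
    bits=""
    i=7
    while [ "$i" -ge 0 ]; do
        bits="$bits$(( (value >> i) & 1 ))"
        i=$(( i - 1 ))
    done
    echo "$bits"
}

for pattern in 0xaa 0x55 0xff 0x00; do
    echo "$pattern = $(byte_to_bits "$pattern")"
done
```

Between them, the first two patterns set and clear every bit once with the opposite value in its neighbors, and the last two do the same with matching neighbors.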
When it's successful, it looks like this:
[root@fileserver /]# badblocks -wvs /dev/sda
Checking for bad blocks in read-write mode
From block 0 to 390711384
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
To my dismay, my drive had more than 37 bad sectors. A lot more. The output looked like this:
[root@fileserver mnt]# badblocks -wvs /dev/sda
Checking for bad blocks in read-write mode
From block 0 to 1953514584
Testing with pattern 0xaa: done
Reading and comparing:
1113920 1113920/ 1953514584
1115128 1115128/ 1953514584
1115248 1115248/ 1953514584
1116392 1116392/ 1953514584
1118944 1118944/ 1953514584
2340632 2340632/ 1953514584
2350736 2350736/ 1953514584
2356936 2356936/ 1953514584
2362000 2362000/ 1953514584
2399560 2399560/ 1953514584
2413312 2413312/ 1953514584
2430776 2430776/ 1953514584
Badblocks only got 5% of the way through its first pass in the span of five days. I decided to pull that drive and run the remaining 2TB drive as a standalone, without RAID. Getting a replacement drive for it was cost prohibitive, so I needed to find a new solution for redundant storage for my virtual machines.
SATA Hot Swap
Linux makes it really easy to hot swap SATA drives:
- Make sure that a drive is not mounted or part of an active array
- Use this command (change sda to the appropriate drive):
echo 1 > /sys/block/sda/device/delete
- Unplug the drive
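The first step is the one that bites if skipped, so it is worth scripting. The sketch below checks a disk against the mounts table and mdstat before anything is deleted. It assumes plain sdX naming, does not know about swap or LVM, and disk_in_use is a made-up helper; the file arguments exist only so it can be tried against sample files.

```shell
# Return 0 if DISK (e.g. sda) or one of its partitions is mounted or is a
# member of an md array, 1 otherwise. The two file arguments default to the
# live /proc files.
disk_in_use() {
    disk=$1
    mounts=${2:-/proc/mounts}
    mdstat=${3:-/proc/mdstat}
    # Mounted: the first field of a mounts line is /dev/<disk> or /dev/<disk><N>.
    if awk -v d="/dev/$disk" '$1 ~ ("^" d "[0-9]*$") { found = 1 } END { exit !found }' "$mounts"; then
        return 0
    fi
    # Array member: mdstat lists members as <disk><N>[slot].
    if grep -Eq "(^| )${disk}[0-9]*\[" "$mdstat"; then
        return 0
    fi
    return 1
}

# Intended use, left commented out because it detaches a real drive:
# disk_in_use sda || echo 1 > /sys/block/sda/device/delete
```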
Fiasco #3: The Surprise Array
I decided to repurpose the standalone 3TB drive, and upgrade my 3TB RAID1 array to a 6TB RAID5 array.
The first issue I ran into was that when I went to delete the partitions and recreate them, it would always show the original file system that was on the disk. So to wipe them out, I used:
dd if=/dev/zero of=/dev/sdg bs=512 count=100000 conv=notrunc
That zeroed out the first 50MB of the drive, and allowed me to create the partition and filesystem from scratch.
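The same dd invocation can be tried safely on a scratch file first. In this sketch the file stands in for the drive and zero_head is a hypothetical wrapper. On a regular file, conv=notrunc is what stops dd from cutting the file off after the zeroed region; on a block device it makes no difference.

```shell
# Zero the first COUNT 512-byte sectors of FILE, leaving the rest intact.
zero_head() {
    dd if=/dev/zero of="$1" bs=512 count="$2" conv=notrunc 2>/dev/null
}

# Scratch "drive": 1MB of non-zero bytes.
img=$(mktemp)
head -c 1048576 /dev/zero | tr '\0' 'x' > "$img"

zero_head "$img" 100

wc -c < "$img"                               # total size, unchanged
head -c 51200 "$img" | tr -d '\0' | wc -c    # non-zero bytes left in the wiped region
rm -f "$img"
```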
Afterwards, the procedure was largely the same as making a RAID1 array. The only differences were:
- In parted, I set the unit as TB instead of GB, as I was working with 3TB drives.
- To create the array, I used this command:
mdadm --create --verbose /dev/md0 --level=5 --metadata=1.2 --raid-devices=3 /dev/sdc1 /dev/sdd1 /dev/sdg1
Update: To prevent the email mentioned below:
mdadm --create --verbose /dev/md0 --level=5 --metadata=1.2 --raid-devices=3 --spare-devices=0 /dev/sdc1 /dev/sdd1 /dev/sdg1
- I added it to the mdadm.conf file like this:
mdadm -Q --detail --brief /dev/md0 >> /etc/mdadm.conf
- I changed the reserved space to 0.2% (12GB)
tune2fs -m 0.2 /dev/md0
Afterwards, I got this email:
This is an automatically generated mail message from mdadm running on fileserver.home
A SparesMissing event had been detected on md device /dev/md0.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[0] sdg1[3] sdd1[1]
      5860532736 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
To fix that, I edited /etc/mdadm.conf and changed this line:
ARRAY /dev/md0 level=raid5 num-devices=3 metadata=1.02 spares=1 name=0 UUID=0036f465:a16a0eb0:2c35cd8e:079dfacb
to this:
ARRAY /dev/md0 level=raid5 num-devices=3 metadata=1.02 spares=0 name=0 UUID=0036f465:a16a0eb0:2c35cd8e:079dfacb
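The edit can also be scripted. A minimal sketch using awk, which prints the corrected config to stdout instead of touching the file, so the result can be checked before replacing /etc/mdadm.conf; clear_spares is a made-up name:

```shell
# Rewrite spares=N to spares=0 on the ARRAY line for one md device.
# Prints the edited config to stdout; the input file is not modified.
clear_spares() {
    awk -v a="$2" '$1 == "ARRAY" && $2 == a { sub(/spares=[0-9]+/, "spares=0") } { print }' "$1"
}

# Demo on a scratch copy of the line from above.
conf=$(mktemp)
echo 'ARRAY /dev/md0 level=raid5 num-devices=3 metadata=1.02 spares=1 name=0 UUID=0036f465:a16a0eb0:2c35cd8e:079dfacb' > "$conf"
clear_spares "$conf" /dev/md0
rm -f "$conf"
```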
Dénouement
Everything is finally built, synced, and data stored where it should be. I stuck an extra 400GB drive in, to make up for some of the space I lost by not having a 3TB backup drive.
My home server now looks like this:
- a 160GB RAID1 array, for my boot volume, using hardware RAID
- a 500GB RAID1 array, for backups and pictures, using software RAID
- a 6TB RAID5 array, for home directories, virtual machines, and other stuff, using software RAID.
- a 2TB standalone drive, for daily backups from the 6TB array
- a 400GB standalone drive, for daily backups of the 6TB array
Update: Repairing A Faulty RAID5 Array
I had one of the 3TB disks suddenly discover 46,000 bad sectors. Luckily, I had an extra matching disk.
Mdadm had not yet noticed the drive was faulty, so I faulted it manually:
mdadm --manage /dev/md124 -f /dev/sdb1
[root@fileserver done]# mdadm --detail /dev/md124
/dev/md124:
        Version : 1.2
  Creation Time : Sun May 8 16:05:48 2016
     Raid Level : raid5
     Array Size : 5860268032 (5588.79 GiB 6000.91 GB)
  Used Dev Size : 2930134016 (2794.39 GiB 3000.46 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Wed Mar 15 22:27:51 2017
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 512K
           Name : fileserver.local:124 (local to host fileserver.local)
           UUID : a84efac6:0742aae1:46f6f79b:bff11d2c
         Events : 20570
    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       33        1      active sync /dev/sdc1
       3       8       49        2      active sync /dev/sdd1
       0       8       17        -      faulty /dev/sdb1
Then I removed it from the array:
mdadm --manage /dev/md124 -r /dev/sdb1
Then I soft disconnected the drive:
echo 1 > /sys/block/sdb/device/delete
Then I removed the physical disk from the server and plugged in the replacement.
I then set up the GPT partition table and single partition, using the procedure above.
Then I added the new partition into the array:
mdadm --manage -a /dev/md124 /dev/sdb1
[root@fileserver done]# mdadm --detail /dev/md124
/dev/md124:
        Version : 1.2
  Creation Time : Sun May 8 16:05:48 2016
     Raid Level : raid5
     Array Size : 5860268032 (5588.79 GiB 6000.91 GB)
  Used Dev Size : 2930134016 (2794.39 GiB 3000.46 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Wed Mar 15 22:51:25 2017
          State : clean, degraded, recovering
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 512K
 Rebuild Status : 0% complete
           Name : fileserver.local:124 (local to host fileserver.local)
           UUID : a84efac6:0742aae1:46f6f79b:bff11d2c
         Events : 20586
    Number   Major   Minor   RaidDevice State
       4       8       17        0      spare rebuilding /dev/sdb1
       1       8       33        1      active sync /dev/sdc1
       3       8       49        2      active sync /dev/sdd1
Now to wait five hours until it rebuilds...
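While waiting, the interesting lines of mdadm --detail are State and Rebuild Status. A small sketch for pulling a single field out of that output; md_detail_field is a hypothetical helper, and it relies on the "Name : value" layout shown above:

```shell
# Print the value of one "Name : value" field from mdadm --detail output on stdin.
md_detail_field() {
    awk -F ' : ' -v key="$1" '
        { k = $1; gsub(/^ +| +$/, "", k) }
        k == key { print $2; exit }'
}

# Demo on a few lines from the output above; prints: clean, degraded, recovering
md_detail_field State <<'EOF'
     Raid Level : raid5
          State : clean, degraded, recovering
 Rebuild Status : 0% complete
EOF
```

On the server itself the call would be mdadm --detail /dev/md124 | md_detail_field "Rebuild Status".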
Update: Expanding A RAID5 Array
This is to add another drive to the RAID5 array to expand capacity.
Use parted as shown above.
Add the drive to the array. Note the (S) after the newly added drive (sdf1). That indicates it is a spare, which is only used when one of the other three drives fails.
[root@fileserver ~]# mdadm --add /dev/md124 /dev/sdf1
mdadm: added /dev/sdf1
[root@fileserver ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md124 : active raid5 sdf1[5](S) sdc1[4] sdd1[1] sde1[3]
      5860268032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk
unused devices: <none>
Grow the array to use the additional disk. Note that it is now reshaping.
[root@fileserver ~]# mdadm --grow --raid-devices=4 /dev/md124
[root@fileserver ~]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md124 : active raid5 sdf1[5] sdc1[4] sdd1[1] sde1[3]
      5860268032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [>....................]  reshape =  0.0% (10240/2930134016) finish=9523.2min speed=5120K/sec
      bitmap: 0/22 pages [0KB], 65536KB chunk
unused devices: <none>
To improve the reshape speed, do the following:
[root@fileserver ~]# echo 60000 > /proc/sys/dev/raid/speed_limit_min
[root@fileserver ~]# echo 0 > /proc/sys/dev/raid/speed_limit_max
[root@fileserver ~]# echo 32768 > /sys/block/md124/md/stripe_cache_size
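To get a feel for how much the speed matters, the remaining time is just the blocks left divided by the rate. A rough sketch with a made-up helper name; the kernel averages its speed differently, so this will not match the finish= value in mdstat exactly, and whether the disks actually sustain the speed_limit_min rate is another matter.

```shell
# Estimate minutes remaining for a resync or reshape:
# (TOTAL_KB - FINISHED_KB) at SPEED_KB per second.
eta_minutes() {
    awk -v total="$1" -v finished="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", (total - finished) / speed / 60 }'
}

eta_minutes 2930134016 10240 5120     # at the 5120K/sec shown in mdstat
eta_minutes 2930134016 10240 60000    # if the reshape ran at 60000K/sec
```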
Next, once the reshape has finished, grow the filesystem on the md device to use the new space. Since I'm using XFS, the syntax is as follows:
[root@fileserver ~]# xfs_growfs /dev/md124
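As a sanity check on the result: RAID5 gives up one member's worth of space to parity, so the usable size is (devices - 1) times the "Used Dev Size" that mdadm --detail reports. A short sketch of that arithmetic, with an illustrative function name:

```shell
# Usable size of a RAID5 array: one member's worth of space goes to parity.
raid5_capacity() {
    devices=$1
    member_size=$2
    echo $(( (devices - 1) * member_size ))
}

raid5_capacity 3 2930134016   # 5860268032, the Array Size before the grow
raid5_capacity 4 2930134016   # 8790402048, the size to expect afterwards
```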
Update: Reusing Disks From A Hardware RAID Controller
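Disks that used to sit behind a motherboard RAID controller still carry that controller's metadata. To erase it:
dmraid -r -E /dev/sdc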
If that returns errors, use this:
dmraid -r
dd if=/dev/zero of=/dev/sdc bs=512 seek=$(( $(blockdev --getsz /dev/sdc) - 1024 )) count=1024
dmraid -r
[root@fileserver ~]# dmraid -r -E /dev/sdg
Do you really want to erase "ddf1" ondisk metadata on /dev/sdg ? [y/n] :y
ERROR: ddf1: seeking device "/dev/sdg" to 512104901378048
ERROR: writing metadata to /dev/sdg, offset 1000204885504 sectors, size 0 bytes returned 0
ERROR: erasing ondisk metadata on /dev/sdg
[root@fileserver ~]# dmraid -r
ERROR: ddf1: reading /dev/sde[No such file or directory]
/dev/sdi: ddf1, ".ddf1_disks", GROUP, unknown, 1952448512 sectors, data@ 0
/dev/sdh: ddf1, ".ddf1_disks", GROUP, ok, 1952448512 sectors, data@ 0
/dev/sdg: ddf1, ".ddf1_disks", GROUP, ok, 1952448512 sectors, data@ 0
/dev/sdd: ddf1, ".ddf1_disks", GROUP, ok, 1952448512 sectors, data@ 0
/dev/sda: ddf1, ".ddf1_disks", GROUP, nosync, 1952448512 sectors, data@ 0
[root@fileserver ~]# dd if=/dev/zero of=/dev/sdg bs=512 seek=$(( $(blockdev --getsz /dev/sdg) - 1024 )) count=1024
1024+0 records in
1024+0 records out
524288 bytes (524 kB) copied, 0.00419548 s, 125 MB/s
[root@fileserver ~]# dmraid -r
no raid disks
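The dd fallback works because the metadata lives at the very end of the disk, so the seek offset is the sector count minus the number of sectors to wipe. The offset arithmetic can be rehearsed on a scratch file. In this sketch the sector count comes from the file size, since blockdev --getsz only works on block devices, and zero_tail is a hypothetical helper:

```shell
# Zero the last COUNT 512-byte sectors of FILE and leave everything before
# them untouched. FILE is a regular file standing in for the drive.
zero_tail() {
    file=$1
    count=$2
    sectors=$(( $(wc -c < "$file") / 512 ))
    dd if=/dev/zero of="$file" bs=512 seek=$(( sectors - count )) \
        count="$count" conv=notrunc 2>/dev/null
}

img=$(mktemp)
head -c 1048576 /dev/zero | tr '\0' 'x' > "$img"
zero_tail "$img" 1024
tail -c 524288 "$img" | tr -d '\0' | wc -c   # non-zero bytes left in the last 1024 sectors
rm -f "$img"
```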
Update: Identifying The Serial Number Of A Drive
udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
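The grep above matches both ID_SERIAL and ID_SERIAL_SHORT. To get just one value, for matching against the label on the drive, the "E: KEY=value" lines that udevadm prints can be filtered with sed. A minimal sketch; udev_property is a made-up name and the serial in the demo is a placeholder, not a real one.

```shell
# Print the value of one udev property (default ID_SERIAL_SHORT) from
# "udevadm info --query=all" output on stdin.
udev_property() {
    sed -n "s/^E: ${1:-ID_SERIAL_SHORT}=//p"
}

# Demo with placeholder values; prints EXAMPLE123
udev_property <<'EOF'
E: ID_MODEL=ST3500630AS
E: ID_SERIAL=ST3500630AS_EXAMPLE123
E: ID_SERIAL_SHORT=EXAMPLE123
EOF
```

On a real system: udevadm info --query=all --name=/dev/sda | udev_property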