Buggy external storage
Saturday, October 8th, 2005
I’m a strong believer in the potential for Firewire 800 to make large disk arrays feasible on a budget. There’s enough bandwidth there (plus the ability to chain drives) to run a 1 TB RAID5 system at high-end consumer prices. All the pieces are there: cheap disks, multi-bay Firewire enclosures, the hotplug support in Firewire, decent software RAID support in Linux, and the Logical Volume Manager. Once you figure out where all the pieces go, it’s actually not too hard…
Having tried doing this for 6 months and dealing with the problems, I’m amazed at how much can go wrong. So far LVM2 and RAID have been flawless. I haven’t lost a single byte of data. But I find the communication between the PC and the Firewire devices to be unbelievably fragile. If you try doing too much I/O, the Firewire devices jam up, causing the SCSI subsystem to slow to a crawl, throw a bunch of errors, and sometimes bomb out. I had these problems earlier, but thankfully by telling the sbp2 driver (which implements the Firewire protocol to access block devices) to be very conservative and talk to only one device at a time, I was able to get things running more or less correctly. Any stalls I experienced under high load were not permanent, and only needed some time to reset and clear out.
However, for the last two days, I’ve been converting two RAID1 pairs into a combined RAID5 set. I moved all the data off the RAID1 pairs, then wiped their settings and had mdadm create a new RAID5 set out of them. Much like a new RAID1 set, a new RAID5 set needs many hours of I/O to get all the disks into a consistent state. Unlike before, though, something about the increased load (now talking to 4 drives at a time instead of 2) caused the Firewire system to lose contact with drives within a few minutes. I tried a bunch of different combinations, eventually ending up with sbp2 serialize_io=1 (like I had before), elevator=cfq (specifying an alternative I/O scheduler), and finally using the -f option with mdadm to make it stop being clever.
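For anyone trying to reproduce this, here is a rough sketch of the two commands involved, built as argument lists in Python so nothing is actually run. The array name /dev/md0 and the member partitions /dev/sda1 through /dev/sdd1 are placeholders I made up for illustration, not my real device names, and the helper functions are hypothetical. The elevator=cfq setting goes on the kernel command line at boot, so it is not shown here.

```python
import shlex


def sbp2_modprobe_cmd(serialize_io=True):
    """Command to load the sbp2 driver with conservative, serialized I/O."""
    return ["modprobe", "sbp2", f"serialize_io={1 if serialize_io else 0}"]


def raid5_create_cmd(array, members, force=True):
    """Command to create a RAID5 array from the given member devices.

    With force=True, mdadm's -f/--force option is added, which stops it
    from building the array as a degraded set plus a spare.
    """
    cmd = ["mdadm", "--create", array, "--level=5",
           f"--raid-devices={len(members)}"]
    if force:
        cmd.append("--force")
    return cmd + list(members)


if __name__ == "__main__":
    # Placeholder device names; substitute your own before running anything.
    members = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
    print(shlex.join(sbp2_modprobe_cmd()))
    print(shlex.join(raid5_create_cmd("/dev/md0", members)))
```

Printing the commands rather than executing them is deliberate: creating an array destroys whatever is on the member disks, so it is worth reading the line over before pasting it into a root shell.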
I’m still trying to understand the “clever” part. By default, when mdadm makes a new RAID5 array out of N disks, it creates a new array of N-1 good disks, 1 (nonexistent) faulty disk, and one “hot spare” (really your last disk). It then proceeds to resync the array as if the last disk had just been swapped in. The manpage claims this is faster than the normal resync process. My best guess is that the fast method only reads from the first N-1 drives and only writes to the last drive. The “normal” method (which I assume just reads normal blocks and writes parity blocks) would require reading and writing to all drives, since the parity blocks are spread everywhere in RAID5. It seems that the fast method was somehow beating on the last drive too hard and making the Firewire communication fail. Switching to the normal method avoided this problem, and it appears to be going just as fast as the “fast” method did in the 60 seconds before locking up the drives.
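To make that guess concrete, here is a toy model in Python that counts block reads and writes per disk under each method. It is only a sketch of my guess, not how the md driver is actually written: the disks are lists of byte blocks in memory, the parity rotation (stripe number modulo disk count) is a simplification I chose, and every function name is made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def make_disks(n_disks, n_stripes, block_size=4):
    """Build n_disks in-memory 'disks' filled with arbitrary block contents."""
    return [[bytes((d * 37 + s * 11 + i) % 256 for i in range(block_size))
             for s in range(n_stripes)]
            for d in range(n_disks)]


def rebuild_last_disk(disks):
    """'Fast' method: treat the last disk as a freshly swapped-in spare.

    Each stripe reads the first N-1 disks and writes their XOR to the last.
    """
    n = len(disks)
    reads, writes = [0] * n, [0] * n
    for s in range(len(disks[0])):
        sources = []
        for d in range(n - 1):
            sources.append(disks[d][s])
            reads[d] += 1
        disks[n - 1][s] = xor_blocks(sources)
        writes[n - 1] += 1
    return reads, writes


def resync_parity(disks):
    """'Normal' method: recompute the parity block wherever it lives."""
    n = len(disks)
    reads, writes = [0] * n, [0] * n
    for s in range(len(disks[0])):
        p = s % n  # assumed rotation: parity moves one disk per stripe
        sources = []
        for d in range(n):
            if d != p:
                sources.append(disks[d][s])
                reads[d] += 1
        disks[p][s] = xor_blocks(sources)
        writes[p] += 1
    return reads, writes


if __name__ == "__main__":
    print("rebuild:", rebuild_last_disk(make_disks(4, 8)))
    print("resync: ", resync_parity(make_disks(4, 8)))
```

With four disks and eight stripes, the rebuild reads 8 blocks from each of the first three disks and puts all 8 writes on the last one, while the resync reads 6 blocks and writes 2 on every disk. Both leave every stripe consistent (the XOR of all its blocks is zero), but the first concentrates every write on a single drive, which fits the idea that the last drive was the one being hammered.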
(Update, Oct 9: I spoke too soon. I have been unable to find any arrangement of cables, sbp2 module options, or disk configuration that will allow me to reliably write to a RAID5 disk set for any extended period of time without a drive or two getting stuck. I even reflashed the firmware on my Initio 2430L to a newer version just in case there had been stability fixes. I have accepted my fate and reverted to a RAID1 configuration.)
Still, the question remains: why is this possible? Why is it that I have to be so careful about the read/write usage patterns of my disks, lest I lose communication with them? I have no good answers. It has to be either hardware or software, and I’m starting to wonder if the problem is that there are a lot of really crappy Firewire-to-IDE bridge chipsets out there. I know that with USB enclosures, I’ve had nothing but problems, partly due to really poor error handling in the usb-storage drivers, but partly due to buggy USB hardware. I had assumed that Firewire would be better, due to a slightly more professional audience. (I’ve seen Firewire drives used in a lot of digital video labs and such.)
Perhaps that faith was misplaced. It’s not hard to imagine that low-end PC hardware has abysmal quality control, but I’ve never been able to tell how much more upscale you have to go to get stuff that works. Is it only when a manufacturer knows that you can call them at 3am and they have to fix your problem (like if you purchase high-end hardware with service agreements) that they will bother doing comprehensive testing of hardware?

