ZFS on Linux (not quite)
Monday, October 16th, 2006
The more I read about ZFS, the more I become convinced it is the future of filesystems. And not surprisingly, I very much want to try it out on my 750 GB of storage spread over 3 pairs of disks. Progress on zfs-fuse for Linux is not moving as fast as my zealotry, though. A testable filesystem is probably months away, not weeks. Since the initial announcement, I haven’t seen any further information about ZFS on FreeBSD, either. And Apple, aside from reserving an identifier for ZFS in the Leopard preview, has said nothing about ZFS on OS X.
Thus, for the time being, the only way to get ZFS is to use the Solaris kernel. That basically means Solaris 10 6/06, the OpenSolaris preview release (called “Nevada”), or Nexenta, a fusion of OpenSolaris and Ubuntu. I tried out Nexenta via Parallels on my Mac, and was impressed by how good it looked (thanks to the Ubuntu) and how totally frustrating it is to administer for someone who doesn’t know Solaris. Don’t be fooled by the pretty GUI! This is still Solaris under the hood, so most of your Linux admin experience is no help. The dmesg command is mostly useless, ps is different, and you ain’t gonna find anything in /proc but processes. There was no way I was going to replace my nicely working Gentoo install on the server with Solaris just for ZFS, as great as it looked.
But these days, with enough RAM and disk space, you don’t need to choose between operating systems. You can use several at once!
Failure to reach Xen
My first thought when I decided to go hybrid on my server was to use Xen. The lightweight hypervisor imposes a minimal performance penalty, and (as I found later) Xen is much better at giving guest operating systems (domU in their terminology) access to physical disks. Xen support in Linux is a no-brainer, and there are domU images of OpenSolaris available. Unfortunately, I was never able to boot the Xen hypervisor on my Opteron, most likely due to my dodgy motherboard, which generates machine check exceptions (thus the inspiration for the blog name) if I boot Linux without the “nomce” option. So, until I care enough to replace my motherboard, Xen is out.
VMware Server
I’ve been a long-time user of VMware Workstation and have been very happy with it. VMware Server is now free, so I figured I could use it to run some Solaris-based OS and give the virtual machine access to a few of my hard drives. Getting VMware Server going was pretty easy (though there was a little confusion, as I needed to clear out the VMware Workstation kernel modules first), and installing Nexenta on a small virtual disk wasn’t too bad.
But the next step was quite annoying. VMware gives you the option of using physical disks or virtual disks. Virtual disks are just large file(s) on the host filesystem, and are the normal way to give the guest access to disk. Physical disks are actual block devices, but VMware’s support for physical disks basically sucks. They have hard-coded the software to only accept /dev/hd* and /dev/sd* devices (the * can include a slash, thankfully), and the devices must be IDE or SCSI block devices. You cannot use a logical volume from LVM2, for example, as a disk. I had planned to do just this, with one logical volume fitting on each physical disk, in order to make the device name independent of the /dev/sd[a-f] name. That device name depends on the order of your disks on the SATA controller, and is very likely to change if you modify your system at all. But VMware will not accept a logical volume, even if you symlink it to a /dev/sd* name, because the logical volume does not support certain ioctls. There is an LD_PRELOAD hack out there called vmware-bdwrapper which will fake out VMware, but I was never able to get it to work.
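To make the naming rule concrete, here is a minimal Python sketch of a filter that accepts only /dev/hd* and /dev/sd* paths, with the * allowed to span a slash. The function name and the pattern list are made up for illustration; this is my reading of the behavior described above, not VMware's actual code, and it does not model the ioctl check at all.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns mirroring the naming rule described in the post.
# They are an assumption for illustration, not taken from VMware itself.
ACCEPTED_PATTERNS = ("/dev/hd*", "/dev/sd*")


def name_passes_filter(path: str) -> bool:
    """Return True if the path matches /dev/hd* or /dev/sd*.

    fnmatchcase lets '*' match any characters, including '/',
    which corresponds to the "can include a slash" behavior.
    """
    return any(fnmatchcase(path, pattern) for pattern in ACCEPTED_PATTERNS)


if __name__ == "__main__":
    candidates = (
        "/dev/sda",                 # plain SCSI/SATA disk name
        "/dev/hdb1",                # IDE partition name
        "/dev/sd/archive",          # slash after the prefix still matches
        "/dev/mapper/vg0-archive",  # typical LVM2 device-mapper path
        "/dev/vg0/archive",         # typical LVM2 volume group path
    )
    for candidate in candidates:
        print(candidate, name_passes_filter(candidate))
```

Passing a name filter like this is only the first hurdle. A symlink named /dev/sd-something that points at a logical volume would get through it, yet VMware still rejects the device once the ioctls fail.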
Having been defeated in my attempt to use LVM2, I caved in and used the disk directly. This pretty much worked (once I figured out Solaris disk numbering) until I stopped the virtual machine and started it again. Then VMware complained that the physical disk geometry had changed and that I would have to re-add the device. Of course, when I re-added the device and rebooted Nexenta, the old zpool on the disk was unreadable. I recreated it, and when I quit and restarted, VMware gave me the same error again. So no matter what, my data was pretty much guaranteed to be scrambled any time I stopped and restarted the virtual machine.
So finally, I did the dumbest setup possible: I formatted the disk with JFS (which has pretty good performance on very large files with low CPU overhead), and told VMware to make a 280 GB virtual disk on it as one ENORMOUS file. Yes, this is insane, but it worked. I’ve moved one of my 6 disks over to ZFS and am now populating it with a copy of my files. There is no way I’m going to put all my archived stuff on just ZFS with this crazy setup, but I will host my backup copy there. I want to at least get familiar with how to administer ZFS so someday when I get Xen working I will be comfortable with the tools.
The next question is how I access my data from the Linux side, which I’ll say more about in another post.
