Archive for December, 2006

ZFS on Linux: zfs-fuse 0.4.0_alpha1

Friday, December 29th, 2006

As a slightly late Christmas present, Ricardo Correia posted the first version of zfs-fuse with write support, which is a major advance in bringing ZFS to Linux. The idea with zfs-fuse is to port ZFS by reusing as much of the OpenSolaris code as possible, but talking to the Linux kernel through the FUSE interface. FUSE allows you to implement a filesystem in user space (rather than kernel space), which has a variety of pros and cons. On the pro side, since the code you are writing runs in the user memory space, coding errors will not bring down the entire system. On the con side, a user space implementation will almost certainly be slower than a kernel space implementation. And on the legal side, until OpenSolaris is released with a GPL-compatible license, it would be difficult for anyone to distribute a port which used the CDDL-licensed ZFS code in the Linux kernel. By pushing that code into user space, you avoid the entire license issue without having to reimplement ZFS from scratch.

At this point, zfs-fuse has huge warnings about performance and stability, but curious to see how slow it really was, I took it for a spin on one of our AMD64 systems. This is a dual core Athlon 64 5000+ w/ 2 GB of memory running Scientific Linux 4.4. Since I didn’t have a spare disk for testing, I had to create a zpool using an 8 GB file on one of the existing partitions instead. In order to be fair, I created a similarly sized ext2 filesystem in another 8 GB file and mounted it via the loopback device. During all tests, one unrelated CPU-intensive job was running.

Here are the results when I run bonnie++:

ext2

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           25557   5 15637   3           47999   4 118.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6161  98 +++++ +++ +++++ +++  6573  98 +++++ +++ 19256  99

zfs

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           23899   1 13453   2           40544   2 122.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6173   6 17566   9  5813   6  3718   3 18367   9  6847   6

zfs (compression=on)

Then for fun, I turned zfs compression on and tried bonnie++ again:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           51171   2 37265   4           120310   5  1796   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5653   6 18051  11  3204   3  2959   5 19936   7  5261   4

The first time I did this test, the zfs-fuse daemon process died with a failed assertion. I had to manually unmount the zpool tree, start the process back up, and then zfs mount the partitions back. When I tried the second time, everything worked. The results are highly skewed in favor of compression because bonnie++ writes highly non-random data. The compression ratio for the test files was 28x! With this enormous compression factor, most of the data fit into the disk cache, and the test went very fast. (Don’t take these results seriously, of course.)
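To illustrate why non-random test data flatters a compressed filesystem, here is a minimal sketch comparing how well repetitive and random buffers compress. It uses Python's zlib purely as a stand-in (ZFS uses its own compressor, not zlib), and the pattern and buffer size are arbitrary choices, so the ratios will not match the 28x figure above.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

SIZE = 1_000_000                      # arbitrary buffer size for illustration
pattern = b"bonnie++ test data "      # hypothetical repeating content

repetitive = pattern * (SIZE // len(pattern))   # highly non-random
random_data = os.urandom(SIZE)                  # effectively incompressible

print(f"repetitive: {compression_ratio(repetitive):.1f}x")
print(f"random:     {compression_ratio(random_data):.2f}x")
```

The repetitive buffer shrinks dramatically while the random one does not shrink at all, which is why a benchmark that writes repetitive data says little about compressed performance on real files.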

zfs (checksum=off)

Finally, I disabled the checksum option to see if this had any visible impact on performance. Turning off checksums in ZFS only disables them for the data blocks, but the metadata blocks are always checksummed.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           24985   1 14295   2           32105   3 116.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4615   6  5023   3  2439   3  1802   2 18812  11  7768   7

The read speed is actually slower!? I don’t understand this at all. Clearly something strange is going on.

Conclusions

Given the handicap of using both filesystems effectively in loopback mode, the results are actually pretty promising. In the tables above, zfs-fuse block writes are about 6% slower than ext2, rewrites about 14% slower, and block reads about 16% slower. That’s not bad considering how unoptimized zfs-fuse is at this stage, and all the non-performance benefits that zfs has to offer. Hopefully in a few weeks I can give zfs-fuse a whirl on some real disks and test a more realistic use case.
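As a quick check of those percentages, the following sketch recomputes the slowdown from the block write, rewrite, and block read columns of the ext2 and plain zfs tables above. The dictionary keys and the helper name are my own labels, not bonnie++ terminology.

```python
# Throughput in K/sec, copied from the ext2 and zfs bonnie++ tables above.
ext2 = {"block write": 25557, "rewrite": 15637, "block read": 47999}
zfs = {"block write": 23899, "rewrite": 13453, "block read": 40544}

def percent_slower(baseline: float, measured: float) -> float:
    """How much slower `measured` is than `baseline`, as a percentage."""
    return (baseline - measured) / baseline * 100

slowdown = {name: percent_slower(ext2[name], zfs[name]) for name in ext2}
for name, pct in slowdown.items():
    print(f"{name}: zfs-fuse is {pct:.1f}% slower than ext2")
```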

Why ZFS Matters to Mac Users

Wednesday, December 20th, 2006

This morning I read a summary at Think Secret of the leaked news that ZFS will be a supported filesystem in OS X 10.5:

New to build 9A321 is support for Sun’s ZFS file system, a 128-bit open source file system introduced with Solaris 10 that offers support for vastly larger drives and arrays than 64-bit file systems. ZFS also delivers additional options for administrators.

This description totally misses the point of ZFS, focusing on a number that means less than megahertz for CPUs. I imagine that most Mac users who haven’t been following ZFS development incorrectly assume that ZFS probably doesn’t matter to them unless they are running some sort of Xserve server farm. Nothing could be further from the truth.

Why ZFS Does Not Matter

Let me begin with why you should not care about ZFS. ZFS is often described as a “128 bit filesystem.” This is mostly true, but in day-to-day use, completely irrelevant. The upper bound for most personal and small business filesystems is on the terabyte scale, which current filesystems can contain with their puny 64-bit block pointers. In 5-10 years, we might care about petabytes or exabytes, so the ZFS developers were smart to future-proof the filesystem format and avoid the growing pains that systems like ext2/3 have had to endure. The ZFS team lead, Jeff Bonwick, famously noted that populating a storage pool with 2^128 blocks of information would require as much energy as would be needed to boil the world’s oceans.

With that bit of vivid imagery, Bonwick is basically saying that the capacity problem is solved for the long term. Overcoming capacity limits is really not all that interesting though, since the solution is obvious: use bigger numbers. The real genius of ZFS is in all of its other design decisions.

Why ZFS Matters to Laptop/Desktop Users

People with iBooks, MacBooks, Powerbooks, Mac Minis, and iMacs all have generally the same storage setup: a single hard disk with capacity ranging from 40-500 GB. A lot of the magic of ZFS does not become manifest until you have several disks, but even with one, you can benefit in several ways:

Filesystems can be compressed. Unlike a compressed disk image, a compressed ZFS filesystem is read/write. Moreover, the compression flag can be turned on and off on the fly. New data will be compressed (or not) as per the flag, and old data will be left as is. Compressed filesystems are great for data that you don’t access very often, or data that compresses very well.

Filesystems are nested and making them is as easy as making a directory. This in itself is not very interesting for laptop/desktop users, but combined with compression, this means that you can effectively turn on compression for just a subfolder on your drive.

Every block of data on the disk is checksummed so errors can be detected during read operations. Many common hard drive failures are catastrophic, and painfully obvious when they happen. But it is possible for your data to be corrupted on disk in ways that you, and the hard disk, will never notice. While checksumming will not allow you to recover your data, it will let you know when you should go retrieve a file from your backup. (You are backing up, right? Go buy an external Firewire disk and SuperDuper!, and start doing it right now. It is easy, fast, and you’ll thank me later.)

Space-efficient and fast snapshots. A snapshot allows you to see your filesystem as it was some time in the past. ZFS is designed to snapshot a filesystem in constant time, no matter how much data you have, or how frequently you snapshot it. Moreover, the snapshot is very space efficient. Identical blocks are shared between snapshots and the live filesystem until they are written to. The space required for snapshots is therefore mostly a function of how quickly your files change, and not so much how often you make a snapshot. It’s like version control for your entire computer!

Apple’s much discussed Time Machine feature in OS X 10.5 is a great example of the interface possibilities when you have snapshots available. However, Time Machine does not appear to require ZFS, which means that Apple had to bolt snapshots onto HFS+, a complex and awkward task. Snapshots in ZFS are cheap and easy.
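The block sharing behind cheap snapshots can be sketched with a toy copy-on-write store. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, blocks live in a Python dict, and taking a snapshot copies the whole block map, so it is not constant time the way a real ZFS snapshot is. What it does show is that a snapshot costs extra space only for blocks that later change.

```python
class ToyFilesystem:
    """Toy copy-on-write store: block number -> immutable bytes object."""

    def __init__(self):
        self.blocks = {}      # live block map
        self.snapshots = {}   # snapshot name -> frozen block map

    def write(self, block_no: int, data: bytes) -> None:
        # Never modify data in place: point the live map at a new object.
        self.blocks[block_no] = data

    def snapshot(self, name: str) -> None:
        # Copy only the references; the block data itself is shared.
        self.snapshots[name] = dict(self.blocks)

    def unique_snapshot_blocks(self, name: str) -> int:
        """Blocks held only by the snapshot, i.e. its real space cost."""
        snap = self.snapshots[name]
        return sum(1 for no, data in snap.items()
                   if self.blocks.get(no) is not data)


fs = ToyFilesystem()
for i in range(1000):
    fs.write(i, bytes([i % 256]) * 4096)

fs.snapshot("monday")
print(fs.unique_snapshot_blocks("monday"))   # nothing has changed yet

fs.write(7, b"x" * 4096)                     # modify one block
print(fs.unique_snapshot_blocks("monday"))   # only that block diverged
print(fs.snapshots["monday"][7] == bytes([7]) * 4096)  # old contents survive
```

Taking more snapshots of unchanged data adds references but no new block data, which matches the point that space usage tracks how fast files change rather than how often you snapshot.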

Why ZFS Matters to Workstation Users

With that list of features, ZFS already beats most other filesystems out there. But with a workstation like the Mac Pro, you can have up to 4 internal drives (8 if you get creative) and start to explore the multi-drive capabilities of ZFS. Traditionally, there has been a hard separation between the volume manager and the filesystem layer. The volume manager takes your many disks, and makes them look like one disk (with mirroring or striping or whatever) to the filesystem layer. The separation of duties ensures that the volume manager knows nothing about files, and the filesystem knows nothing about disks. ZFS, on the other hand, breaks down the barriers between filesystems and volume managers with some amazing results:

Automatically growing filesystems. Once you add your disks to the storage pool, all of their space is available to all of the filesystems you have. You can reserve space for a filesystem, to guarantee a minimum amount is available when you need it, and you can also set quotas. But these are just flags which are easy to change on the fly. The default for every filesystem is automatically expanding capacity up to the limit of your storage pool. There are no manual volume or filesystem resizing operations, ever.

Dynamic striping of file blocks over all drives in the storage pool. If you throw 2 drives in your storage pool, then files are automatically distributed over both disks, making large reads and writes faster. The disks do not have to be the same size (unlike usual striping configurations) and you can expand the pool whenever you want by installing a new disk. New files will stripe over old and new disks, and the old files will stay where they are. But, when you modify old files, the changed blocks are spread over all the available disks again. After adding a new disk, ZFS will get faster as you use the filesystem!

Software mirroring with automatic error detection and self-healing. ZFS also incorporates features traditionally left to software RAID drivers. You can arrange your disks into mirrored pairs (or triples, etc), which speeds up data reads, and also protects against single disk failure. Moreover, since ZFS checksums all data blocks, if one disk returns bad data, ZFS knows without having to query the other disk every time. Having identified the problem, it can then access the failed block from the other disk(s) in the mirror set and return to you correct data. ZFS then writes the correct data back to the original disk which failed the checksum. If the data error was a fluke due to some correctable problem, perhaps a bad sector (which modern drives can reassign to a new physical location) or just a bad write, then this will solve the problem. If the disk is really dead, then ZFS will take it offline and wait for you to replace it.

Fast resync of mirrors. In the unfortunate circumstance where a drive does die and you replace it, the resync process is faster with ZFS. This is because, unlike many other RAID systems, ZFS knows which blocks on the disk were used, and which blocks were not used. During resynchronization, ZFS only copies blocks with actual filesystem data on them to the new disk. So, if your disk pair was only half-full, then you are back in business twice as fast.

Software parity RAID that actually works. The most popular parity RAID system is by far RAID-5, where for every N-1 data blocks, there is one parity block. The parity block allows you to recover all your data if any one disk fails, much like mirroring, but without as much space penalty. There is a seldom discussed problem with RAID-5, known as the “RAID-5 write hole.” When modifying a single block, you have to rewrite all N blocks (including the parity block). If a power or hardware failure happens in the middle of rewriting these N blocks, then you effectively lose all N blocks of data, with no way to recover them. (Update: As pointed out in the comments, I have incorrectly stated how writes happen in RAID-5. Only the changed block and the parity block need to be updated, rather than all N blocks. Nevertheless, there is still a write hole if a hardware failure happens between the two writes.) You can fix this in hardware with battery backup systems, or RAID controllers with non-volatile write caches. The structure of ZFS is such that you can also solve the problem in software using a variant of the RAID-5 algorithm called RAID-Z. RAID-Z behaves much like RAID-5, but has no write hole. Recent ZFS releases have also added a double parity version of RAID-Z, which allows you to withstand 2 disk failures at once.
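The parity idea and the write hole can be sketched in a few lines. This is a toy model under stated assumptions: one stripe of three 4-byte data blocks plus one XOR parity block, with a hypothetical helper named xor_blocks. It models plain RAID-5 style parity only and does not attempt to show how RAID-Z works.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Normal case: the disk holding data[1] dies; rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == b"BBBB")   # True

# Write hole: data[0] is updated, but power fails before parity is rewritten.
data[0] = b"ZZZZ"           # new data reached the disk
stale_parity = parity       # the parity update never happened

# Later the disk holding data[1] dies; reconstruction now uses stale parity.
corrupt = xor_blocks([data[0], data[2], stale_parity])
print(corrupt == b"BBBB")   # False
```

The second reconstruction returns the wrong bytes without any error, which is the failure the write hole describes: the stripe looked fine until a disk was lost.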

Why ZFS Matters to Server Admins

By now, I’ve hit on nearly all of the neat features of ZFS, but there are a few left that might be of interest to people with Xserve/Xsan clusters:

Easy command line interface. I have no idea how Apple will choose to present ZFS to users, but regardless, they have to include the fantastic zpool and zfs commands. These two commands make it very easy to manage lots of disks and filesystems.

A stream format which allows you to copy snapshots to other systems. This feature is a little hard to explain, but it basically allows you to dump a ZFS filesystem, preserving the snapshot history, and reload it on another system. This could be used for maintaining a backup server, or loading a filesystem into another storage pool.

Highly SMP-friendly design. ZFS is designed to efficiently support many, many processes all accessing a filesystem at the same time.

Nearly unlimited capacity and scalability. We come full circle back to the capacity issue. For servers which need to manage a large number of disks, ZFS scales pretty well up from the single-disk scenario we started with. Sun certainly pushes ZFS on their 48 disk monster, the Sun Fire X4500.

Waiting for Leopard

Hopefully, I’ve got you excited about ZFS coming to Mac OS X. So far, all we’ve seen is a leaked screenshot showing ZFS in the disk image creator. It’s not clear yet how much Apple wants to promote ZFS, via GUI interface tools, or integration with Time Machine, or just marketing. We’ll certainly learn more at Macworld 2007. Until then, take a look at this presentation on ZFS to learn more about it.
