Archive for the ‘ZFS’ Category

Hoping for ZFS…

Sunday, June 10th, 2007

The tech news feeds have all gone crazy over the leak from Sun’s CEO that OS X 10.5 will use ZFS as the standard filesystem. As an unapologetic ZFS bigot, I think this is great news. (Especially if ditto blocks are in there! More on them later.) ZFS is just such a huge leap in both capability AND usability, the latter of which Apple cares very much about.

However, the realist in me suspects this is the usual pre-keynote speculation madness. If Apple is really planning to switch to ZFS in 10.5, rather than just support it as a secondary filesystem, then I would be stunned. HFS+ and the Mac OS have grown up organically together, and replacing HFS+ (which will of course need to be supported for many more years) means that you need a new filesystem which can impersonate all of the quirky features of HFS+, most especially the case insensitivity. I knew that ZFS had support for extended attributes, which could be used as resource forks, but I thought case-sensitivity would be a deal breaker.

But this morning I read about new ZFS features which were fast-tracked to some kind of approval. In particular, they add a case-insensitive option to ZFS, and provide an interface for turning it on and off. I don’t know what stage these features are at, but it is possible they are already in the ZFS code that Apple has been playing with for a while.

So maybe the rumors are closer to true. The evidence suggests that ZFS is being tuned up to work as a reasonable replacement for HFS+. The only question is when. 10.5 or 10.6? We’ll find out in 15 hours…

ZFS on Linux: zfs-fuse 0.4.0_alpha1

Friday, December 29th, 2006

As a slightly late Christmas present, Ricardo Correia posted the first version of zfs-fuse with write support, which is a major advance in bringing ZFS to Linux. The idea with zfs-fuse is to port ZFS by reusing as much of the OpenSolaris code as possible, but talking to the Linux kernel through the FUSE interface. FUSE allows you to implement a filesystem in user space (rather than kernel space), which has a variety of pros and cons. On the pro side, since the code you are writing runs in the user memory space, coding errors will not bring down the entire system. On the con side, a user space implementation will almost certainly be slower than a kernel space implementation. And on the legal side, until OpenSolaris is released with a GPL-compatible license, it would be difficult for anyone to distribute a port which used the CDDL-licensed ZFS code in the Linux kernel. By pushing that code into user space, you avoid the entire license issue without having to reimplement ZFS from scratch.

At this point, zfs-fuse has huge warnings about performance and stability, but curious to see how slow it really was, I took it for a spin on one of our AMD64 systems. This is a dual core Athlon 64 5000+ w/ 2 GB of memory running Scientific Linux 4.4. Since I didn’t have a spare disk for testing, I had to create a zpool using an 8 GB file on one of the existing partitions instead. In order to be fair, I created a similarly sized ext2 filesystem in another 8 GB file and mounted it via the loopback device. During all tests, one unrelated CPU-intensive job was running.

Here are the results when I run bonnie++:

ext2

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           25557   5 15637   3           47999   4 118.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6161  98 +++++ +++ +++++ +++  6573  98 +++++ +++ 19256  99

zfs

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           23899   1 13453   2           40544   2 122.6   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  6173   6 17566   9  5813   6  3718   3 18367   9  6847   6

zfs (compression=on)

Then for fun, I turned zfs compression on and tried bonnie++ again:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           51171   2 37265   4           120310   5  1796   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5653   6 18051  11  3204   3  2959   5 19936   7  5261   4

The first time I did this test, the zfs-fuse daemon process died with a failed assertion. I had to manually unmount the zpool tree, start the process back up, and then zfs mount the partitions back. When I tried the second time, everything worked. The results are highly skewed in favor of compression because bonnie++ writes highly non-random data. The compression ratio for the test files was 28x! With this enormous compression factor, most of the data fit into the disk cache, and the test went very fast. (Don’t take these results seriously, of course.)
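To illustrate why regular test data skews a compression benchmark so badly, here is a minimal sketch that compares the compression ratio of repetitive data against random data. It uses Python's zlib purely as a stand-in (an assumption; it is not the algorithm ZFS uses), and the repeated pattern is a made-up placeholder rather than what bonnie++ actually writes.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

size = 1 << 20                           # about 1 MiB of test data
repetitive = b"bonnie" * (size // 6)     # highly regular, hypothetical pattern
random_data = os.urandom(size)           # effectively incompressible

print(f"repetitive: {compression_ratio(repetitive):.0f}x")
print(f"random:     {compression_ratio(random_data):.2f}x")
```

The repetitive buffer shrinks by orders of magnitude while the random one does not shrink at all, which is why a benchmark on real data would look very different.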

zfs (checksum=off)

Finally, I disabled the checksum option to see if this had any visible impact on performance. Turning off checksums in ZFS only disables them for the data blocks, but the metadata blocks are always checksummed.

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
nubar5.localdoma 4G           24985   1 14295   2           32105   3 116.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4615   6  5023   3  2439   3  1802   2 18812  11  7768   7

The read speed is actually slower!? I don’t understand this at all. Clearly something strange is going on.

Conclusions

Given the handicap of using both filesystems effectively in loopback mode, the results are actually pretty promising. Going by the block-level numbers, zfs-fuse writes are about 6% slower than ext2, rewrites about 14% slower, and reads about 16% slower. That’s not bad considering how unoptimized zfs-fuse is at this stage, and all the non-performance benefits that zfs has to offer. Hopefully in a few weeks I can give zfs-fuse a whirl on some real disks and test a more realistic use case.
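As a rough check on the comparison, this short sketch computes how much slower zfs-fuse was than ext2 on the block-level tests, using the K/sec figures copied from the first two bonnie++ tables above. The dictionary keys and the helper name are my own labels, not bonnie++ terminology.

```python
# Block-level throughput in K/sec, copied from the ext2 and zfs tables above.
ext2 = {"write": 25557, "rewrite": 15637, "read": 47999}
zfs_fuse = {"write": 23899, "rewrite": 13453, "read": 40544}

def percent_slower(baseline: int, measured: int) -> float:
    """Relative slowdown of `measured` against `baseline`, in percent."""
    return 100.0 * (baseline - measured) / baseline

for test in ext2:
    slowdown = percent_slower(ext2[test], zfs_fuse[test])
    print(f"{test:8s} {slowdown:5.1f}% slower")
```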

Why ZFS Matters to Mac Users

Wednesday, December 20th, 2006

This morning I read a summary at Think Secret of the leaked news that ZFS will be a supported filesystem in OS X 10.5:

New to build 9A321 is support for Sun’s ZFS file system, a 128-bit open source file system introduced with Solaris 10 that offers support for vastly larger drives and arrays than 64-bit file systems. ZFS also delivers additional options for administrators.

This description totally misses the point of ZFS, focusing on a number that means less than megahertz for CPUs. I imagine that most Mac users who haven’t been following ZFS development incorrectly assume that ZFS probably doesn’t matter to them unless they are running some sort of Xserve server farm. Nothing could be further from the truth.

Why ZFS Does Not Matter

Let me begin with why you should not care about ZFS. ZFS is often described as a “128 bit filesystem.” This is mostly true, but in day-to-day use, completely irrelevant. The upper bound for most personal and small business filesystems is on the terabyte scale, which current filesystems can contain with their puny 64-bit block pointers. In 5-10 years, we might care about petabytes or exabytes, so the ZFS developers were smart to future-proof the filesystem format and avoid the growing pains that systems like ext2/3 have had to endure. The ZFS team lead, Jeff Bonwick, famously noted that populating a storage pool with 2^128 blocks of information would require as much energy as would be needed to boil the world’s oceans.

With that bit of vivid imagery, Bonwick is basically saying that the capacity problem is solved for the long term. Overcoming capacity limits is really not all that interesting though, since the solution is obvious: use bigger numbers. The real genius of ZFS is in all of its other design decisions.

Why ZFS Matters to Laptop/Desktop Users

People with iBooks, MacBooks, Powerbooks, Mac Minis, and iMacs all have generally the same storage setup: a single hard disk with capacity ranging from 40-500 GB. A lot of the magic of ZFS does not become manifest until you have several disks, but even with one, you can benefit in several ways:

Filesystems can be compressed. Unlike a compressed disk image, a compressed ZFS filesystem is read/write. Moreover, the compression flag can be turned on and off on the fly. New data will be compressed (or not) as per the flag, and old data will be left as is. Compressed filesystems are great for data that you don’t access very often, or data that compresses very well.

Filesystems are nested and making them is as easy as making a directory. This in itself is not very interesting for laptop/desktop users, but combined with compression, this means that you can effectively turn on compression for just a subfolder on your drive.

Every block of data on the disk is checksummed so errors can be detected during read operations. Many common hard drive failures are catastrophic, and painfully obvious when they happen. But it is possible for your data to be corrupted on disk in ways that you, and the hard disk, will never notice. While checksumming will not allow you to recover your data, it will let you know when you should go retrieve a file from your backup. (You are backing up, right? Go buy an external Firewire disk and SuperDuper!, and start doing it right now. It is easy, fast, and you’ll thank me later.)

Space-efficient and fast snapshots. A snapshot allows you to see your filesystem as it was some time in the past. ZFS is designed to snapshot a filesystem in constant time, no matter how much data you have, or how frequently you snapshot it. Moreover, the snapshot is very space efficient. Identical blocks are shared between snapshots and the live filesystem until they are written to. The space required for snapshots is therefore mostly a function of how quickly your files change, and not so much how often you make a snapshot. It’s like version control for your entire computer!

Apple’s much discussed Time Machine feature in OS X 10.5 is a great example of the interface possibilities when you have snapshots available. However, Time Machine does not appear to require ZFS, which means that Apple had to bolt snapshots onto HFS+, a complex and awkward task. Snapshots in ZFS are cheap and easy.

Why ZFS Matters to Workstation Users

With that list of features, ZFS already beats most other filesystems out there. But with a workstation like the Mac Pro, you can have up to 4 internal drives (8 if you get creative) and start to explore the multi-drive capabilities of ZFS. Traditionally, there has been a hard separation between the volume manager and the filesystem layer. The volume manager takes your many disks, and makes them look like one disk (with mirroring or striping or whatever) to the filesystem layer. The separation of duties ensures that the volume manager knows nothing about files, and the filesystem knows nothing about disks. ZFS, on the other hand, breaks down the barriers between filesystems and volume managers with some amazing results:

Automatically growing filesystems. Once you add your disks to the storage pool, all of their space is available to all of the filesystems you have. You can reserve space for a filesystem, to guarantee a minimum amount is available when you need it, and you can also set quotas. But these are just flags which are easy to change on the fly. The default for every filesystem is automatically expanding capacity up to the limit of your storage pool. There are no manual volume or filesystem resizing operations, ever.

Dynamic striping of file blocks over all drives in the storage pool. If you throw 2 drives in your storage pool, then files are automatically distributed over both disks, making large reads and writes faster. The disks do not have to be the same size (unlike usual striping configurations) and you can expand the pool whenever you want by installing a new disk. New files will stripe over old and new disks, and the old files will stay where they are. But, when you modify old files, the changed blocks are spread over all the available disks again. After adding a new disk, ZFS will get faster as you use the filesystem!

Software mirroring with automatic error detection and self-healing. ZFS also incorporates features traditionally left to software RAID drivers. You can arrange your disks into mirrored pairs (or triples, etc), which speeds up data reads, and also protects against single disk failure. Moreover, since ZFS checksums all data blocks, if one disk returns bad data, ZFS knows without having to query the other disk every time. Having identified the problem, it can then access the failed block from the other disk(s) in the mirror set and return to you correct data. ZFS then writes the correct data back to the original disk which failed the checksum. If the data error was a fluke due to some correctable problem, perhaps a bad sector (which modern drives can reassign to a new physical location) or just a bad write, then this will solve the problem. If the disk is really dead, then ZFS will take it offline and wait for you to replace it.

Fast resync of mirrors. In the unfortunate circumstance where a drive does die and you replace it, the resync process is faster with ZFS. This is because, unlike many other RAID systems, ZFS knows which blocks on the disk were used, and which blocks were not used. During resynchronization, ZFS only copies blocks with actual filesystem data on them to the new disk. So, if your disk pair was only half-full, then you are back in business twice as fast.

Software parity RAID that actually works. The most popular parity RAID system is by far RAID-5, where for every N-1 data blocks, there is one parity block. The parity block allows you to recover all your data if any one disk fails, much like mirroring, but without as much space penalty. There is a seldom discussed problem with RAID-5, known as the “RAID-5 write hole.” When modifying a single block, you have to rewrite all N blocks (including the parity block). If a power or hardware failure happens in the middle of rewriting these N blocks, then you effectively lose all N blocks of data, with no way to recover them. (Update: As pointed out in the comments, I have incorrectly stated how writes happen in RAID-5. Only the changed block and the parity block need to be updated, rather than all N blocks. Nevertheless, there is still a write hole if a hardware failure happens between the two writes.) You can fix this in hardware with battery backup systems, or RAID controllers with non-volatile write caches. The structure of ZFS is such that you can also solve the problem in software using a variant of the RAID-5 algorithm called RAID-Z. RAID-Z behaves much like RAID-5, but has no write hole. Recent ZFS releases have also added a double parity version of RAID-Z, which allows you to withstand 2 disk failures at once.
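The write hole is easier to see with a toy model. The sketch below assumes plain XOR parity over one stripe of a hypothetical 4-disk array (3 data blocks plus 1 parity block). It illustrates generic RAID-5 style parity only, not RAID-Z or any real driver, and the block contents are invented.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks and their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# The disk holding block 1 dies: rebuild it from the survivors plus parity.
rebuilt = parity([data[0], data[2], p])
print(rebuilt == data[1])         # True: parity recovers the lost block

# Write hole: block 0 is updated, but power fails before parity is rewritten.
data[0] = b"ZZZZ"
stale_rebuild = parity([data[0], data[2], p])   # p still describes the old stripe
print(stale_rebuild == b"BBBB")   # False: block 1 can no longer be rebuilt
```

Once data and parity disagree, a later disk failure silently reconstructs garbage, which is the failure mode that battery-backed caches or a different on-disk design have to close.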

Why ZFS Matters to Server Admins

By now, I’ve hit on nearly all of the neat features of ZFS, but there are a few left that might be of interest to people with Xserve/Xsan clusters:

Easy command line interface. I have no idea how Apple will choose to present ZFS to users, but regardless, they have to include the fantastic zpool and zfs commands. These two commands make it very easy to manage lots of disks and filesystems.

A stream format which allows you to copy snapshots to other systems. This feature is a little hard to explain, but it basically allows you to dump a ZFS filesystem, preserving the snapshot history, and reload it on another system. This could be used for maintaining a backup server, or loading a filesystem into another storage pool.

Highly SMP-friendly design. ZFS is designed to efficiently support many, many processes all accessing a filesystem at the same time.

Nearly unlimited capacity and scalability. We come full circle back to the capacity issue. For servers which need to manage a large number of disks, ZFS scales pretty well up from the single-disk scenario we started with. Sun certainly pushes ZFS on their 48 disk monster, the Sun Fire X4500.

Waiting for Leopard

Hopefully, I’ve got you excited about ZFS coming to Mac OS X. So far, all we’ve seen is a leaked screenshot showing ZFS in the disk image creator. It’s not clear yet how much Apple wants to promote ZFS, via GUI interface tools, or integration with Time Machine, or just marketing. We’ll certainly learn more at Macworld 2007. Until then, take a look at this presentation on ZFS to learn more about it.

The Last Filesystem You’ll Ever Need (Almost)

Sunday, July 2nd, 2006

After setting up a 1.4 TB ext3 (that’s all Scientific Linux supports) filesystem at work this past week, I was doing a little research to see if there was anything better than my usual LVM2/ReiserFS solution for big filesystems. I discovered that EVMS is really just a frontend to other volume managers and filesystems (like LVM and ext3) and that JFS and XFS filesystems can only be increased in size, not decreased (boo). But, the most important discovery was ZFS, an intriguing filesystem developed at Sun and included in OpenSolaris.

ZFS is both conservative and radical in its approach: conservative in its interface but radical in its implementation. ZFS presents a POSIX view of filesystems to the user. There is still a hierarchy of directories, and files are just a linear stream of bytes, just like every other filesystem we are used to. However, ZFS has some very interesting properties deriving from a few key implementation decisions:

Blow away all previous metadata limits

ZFS is a 128-bit filesystem, and can hold more files, directories, and raw data than will likely be generated in the next 30 years. ZFS will have been long since replaced before we bump into its storage limits. This is something that ext3 has run into, leading to a need for ext4. By jumping right to 128 bits, the ZFS team has ensured they will never have to revisit this issue. (Or at least we will have to boil the oceans to get there.)

Eliminate the strict separation between the volume manager and the filesystem

Traditionally, the volume manager maps physical disk space onto a volume, which appears to be a large, contiguous sequence of blocks. The volume is then used by the filesystem to store directories and files. Usually layers of abstraction are good: a volume manager allows me to merge several disks to make one large filesystem of any format, even though the filesystem designer did not design it with multiple disks in mind.

The disadvantage to an abstraction layer is that it can prevent useful information from crossing between layers. The volume manager does not know where individual file blocks are, which makes data striping and mirroring less flexible than they could be. For example, if a disk in a mirror set fails and you replace it, the low level drivers must copy every single block to the new drive, even if the drive set was half full. This can slow down the replication process, and even leave you vulnerable as the disk wastes time copying unused sectors when it could be restoring useful data to the replacement disk.

The other storage feature provided by volume managers is data striping. Again, a lack of access to the filesystem information means that the volume manager must do this in a very simple-minded way. Physical blocks are mapped into the volume space in a round-robin order. With N stripes, logical block 0 is disk 0/block 0, logical block N-1 is disk N-1/block 0, logical block N is disk 0/block 1, and so on. This approach works because filesystems generally try to allocate files in a contiguous chunk of blocks, since that makes things go faster when your filesystem is on a single disk. On a striped volume, sufficiently large files will naturally be distributed over all the physical disks. So, the striping scheme generally does the right thing, though not due to any direct communication between the volume manager and the filesystem. Instead there is an implicit agreement between the two layers that makes it work. An unfortunate side effect of this round-robin technique is that it is nearly impossible to change the striping scheme of an active volume.
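The round-robin mapping described above fits in one line of arithmetic. This is a minimal sketch with a function name of my own choosing; it also counts how many logical blocks would land somewhere else if a 4-disk stripe were widened to 5 disks, which hints at why restriping a live volume is so painful.

```python
def stripe_location(logical_block: int, n_disks: int) -> tuple[int, int]:
    """Map a logical block to (disk index, block index on that disk)."""
    return logical_block % n_disks, logical_block // n_disks

N = 4
for lb in (0, N - 1, N, N + 1):
    print(lb, stripe_location(lb, N))

# How many of the first 100 logical blocks move when a fifth disk is added?
moved = sum(stripe_location(b, 4) != stripe_location(b, 5) for b in range(100))
print(f"{moved} of 100 blocks would have to be relocated")
```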

ZFS solves both of these problems. It has a disk pool (sounds like a volume manager) and you allocate filesystems from the pool. The difference is that the block allocator at the filesystem layer knows about the physical disks, and can quite easily distribute blocks over all available devices in any pattern it wants. Increasing the number of stripes when a new disk is added happens automatically. Files which are created or modified after the pool is expanded will have their blocks distributed over all disks, including the new one. Similarly, when reconstructing a mirror set, the system walks down the block tree, copying only useful data. Additionally, there is no need to create fixed size volumes to hold your filesystems. By default, all filesystems are dynamic, able to expand to fill the entire pool without any intervention. You can apply a quota to a filesystem to prevent it from expanding too much, and you can reserve a fixed amount of space to guarantee that a certain amount of disk will be available. Otherwise, “filesystems” in ZFS behave more like special directories.

Block checksums and Copy-On-Write

Every block has a 256-bit checksum (not stored in the same place as the block, though), which allows the filesystem to independently verify the integrity of each block as it is read. It’s kind of scary to realize that even in a mirrored disk configuration, your data is vulnerable to a lot of failures that will never be detected. The volume manager only knows something is wrong if the disk reports it (via a sector read error) or if the operating system reports it (the disk disappears completely). Other kinds of bit rot and stray writes are completely undetectable! By validating the checksum on each read, ZFS can immediately identify damaged blocks and go to the other disk in the mirror. Afterwards, it even tries to rewrite the correct block to the original disk, just in case it was a random data corruption.
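Here is a toy model of that read path, assuming a two-way mirror and using SHA-256 as a stand-in 256-bit checksum (the class, its methods, and the choice of hash are all illustrative assumptions, not ZFS internals). The checksums are kept in a separate table, apart from the data blocks.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()   # 256-bit digest

class Mirror:
    """Toy two-way mirror whose checksums live apart from the data blocks."""

    def __init__(self):
        self.disks = [{}, {}]     # each disk maps address -> block
        self.checksums = {}       # address -> expected checksum

    def write(self, addr, block):
        for disk in self.disks:
            disk[addr] = block
        self.checksums[addr] = checksum(block)

    def read(self, addr):
        expected = self.checksums[addr]
        # Take the first copy whose checksum matches.
        for disk in self.disks:
            if checksum(disk[addr]) == expected:
                good = disk[addr]
                break
        else:
            raise IOError(f"block {addr}: every copy failed its checksum")
        # Self-heal: rewrite any copy that does not match.
        for disk in self.disks:
            if checksum(disk[addr]) != expected:
                disk[addr] = good
        return good

m = Mirror()
m.write(7, b"important data")
m.disks[0][7] = b"important dat\x00"   # silent corruption; the disk reports no error
print(m.read(7))                        # served from the healthy copy
print(m.disks[0][7])                    # the damaged copy has been rewritten
```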

The second data integrity feature of ZFS is its Copy-On-Write (COW) semantics. No block is ever modified in place. Instead a modified copy of the block is written to some other location and then the old block is freed. This sounds unusual at first, but then you realize it doesn’t slow things down too much (especially if the write operations are merged together before flushing to disk). Aside from making it much easier to ensure filesystem integrity at all times, it also gives you several neat features almost for free. Filesystem snapshots are now simple: Just don’t free the old block. The snapshot will still point to it, and you can now view the filesystem as it looked back at the time when the snapshot was made. With the right interface, cheap and quick snapshots could really improve the desktop experience. You could browse your files as they were yesterday, or last week, and find that file you deleted by accident. Dump the trashcan!
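The "just don't free the old block" idea can be sketched in a few lines. This is a toy model with invented names, assuming whole files are single blocks. It copies the block map on every snapshot for simplicity, whereas a real copy-on-write tree would presumably only need to hold on to the old root pointer, and it never frees anything.

```python
class CowFilesystem:
    """Toy copy-on-write store: block maps pointing into append-only storage."""

    def __init__(self):
        self.storage = []      # physical blocks, never modified in place
        self.live = {}         # file name -> index into storage
        self.snapshots = {}    # snapshot name -> frozen block map

    def write(self, name, data):
        self.storage.append(data)               # always write to a new location
        self.live[name] = len(self.storage) - 1

    def delete(self, name):
        del self.live[name]                     # old block stays in storage

    def snapshot(self, snap_name):
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name, snap_name=None):
        block_map = self.snapshots[snap_name] if snap_name else self.live
        return self.storage[block_map[name]]

fs = CowFilesystem()
fs.write("notes.txt", b"version 1")
fs.snapshot("yesterday")
fs.write("notes.txt", b"version 2")
fs.delete("notes.txt")                      # oops

print(fs.read("notes.txt", "yesterday"))    # the snapshot still sees version 1
```

Because the snapshot and the live filesystem share every block that has not changed, the extra space a snapshot costs depends on how much is rewritten afterwards, not on how much data exists.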

Conclusions

There is a lot more cool stuff that ZFS makes possible. I would highly suggest everyone take a look at this slide presentation on ZFS. I’m totally impressed. ZFS does everything I’ve ever wanted in a non-networked filesystem except two things:

  • I want to be able to divide a pool into two parts, and split the filesystems cleanly between the parts. The use case I’m imagining is where I have a large disk set and many filesystems. Then I decide I want to move some of those filesystems to another server. It would be nice (at least in theory) to be able to move blocks around so that no filesystem spans the two groups. Then I could remove one group of disks and drop them into the other server without having to use additional disks as intermediaries.
  • Dynamic replication - ZFS’s dynamic striping feature is awesome. I would love to see a similar degree of flexibility in the replication options (mirror and RAID-Z, a RAID-5 replacement). Right now, you have to declare disks to be part of a mirror or RAID-Z set up front. The ZFS developers suggest you not mix replication levels within a storage pool, since your overall fault-tolerance will be that of your disk set with the worst replication settings. The problem here is that replication settings are associated with physical sets of disks rather than filesystems. I want to be able to throw all my disks into one big storage pool, and then decide my home directory is mirrored, my large work directory is RAID-Z, and my local copy of experiment data (backed up elsewhere) is unprotected. Then the ZFS layer would just distribute blocks over the provided disks in order to ensure these requirements were met.

Of course, the next question is “Where do I get it?” Not surprisingly, ZFS is only available in OpenSolaris or recent versions of Solaris 10, as far as I can tell. However, the ZFS team seems to be extremely supportive of efforts to get ZFS supported on other operating systems. There is a Porting Guide, and a full specification of the on-disk format. As part of the 2006 Google Summer of Code, Ricardo Correia is porting ZFS to Linux using the FUSE interface. There has also been interest from Apple to port ZFS to OS X, which is awesome. I look forward to seeing more about ZFS (especially if they implement something on my wish list!) in the future.
