ZFS on Linux: my story and HOWTO you can have it too

by Rudd-O published 2008/06/24 14:45:00 GMT+0, last modified 2016-01-20T00:12:15+00:00
The best filesystem ever invented is now available on Linux. Here's a short story with instructions to set it up on your computer.

Update: I have written a guide that will help you install Fedora atop ZFS on Linux.


Have you heard about ZFS? It's a generation-defining, stable, high-performance, high-end filesystem, created by Jeff Bonwick at Sun and ported over to Mac OS X and the BSD family. Oh, and to Linux, using the FUSE (Filesystem in Userspace) kernel abstraction. Here's my ZFS story.

I'm using Kubuntu Hardy, and my computer has two 400 GB SATA hard disks. Yes, that's all the storage I have at hand; as of three days ago, it was RAIDed using the multiple devices (md) kernel module, split in two LVM volumes: / and /home. Oh, and two same-size byte-aligned swap partitions, one on each disk, swapon'ed pri=0.

I had been salivating over the thought of using ZFS in my workstation because of several killer features:

  • The first one that comes to mind is end-to-end data integrity thanks to checksumming -- I've already had many disks go bad on me, while others corrupted my data silently (which is, believe it or not, the most insidious thing ever, because after you've noticed it, backups won't help you with that -- you've probably already papered over your backups with new, bad data).
  • The second one is compression. Together with tightly packed data, compression promises to increase performance and reduce disk utilization.
  • The third one is the advanced transactional algorithm that yields an always-consistent disk structure. Unlike log-based filesystems, ZFS does copy-on-write and ripples the changes up through the filesystem tree; before the topmost node is updated, the changes don't affect consistency; when the topmost node is updated, the disk is consistent as well. Never fsck again!

"Damn, gotta get me some of that," I thought.

Getting ZFS was actually a piece of cake: I went to the Mercurial repository for the project, selected the tip view, and downloaded a nice tarball. I then installed a couple of dependencies according to the README, and hit scons in a terminal window. Five commands were built:

  • zfs-fuse, the daemon that serves FUSE requests. The FUSE module is an odd beast: applications futzing with a FUSE-mounted filesystem talk to the kernel VFS, which talks to FUSE, which talks to the daemon backing that particular mount. This userspace-kernelspace-userspace-kernelspace-userspace overhead, you will see, is a big deal.
  • zfs and zpool, the main management commands that use IPC to talk to zfs-fuse.
  • two others that neither you nor I will care about.

A cursory inspection of such important system binaries was in order, so I ran ldd on the daemon and the commands.
zfs-fuse links to /usr/lib/libz*.so*. Not good: a chicken-and-egg problem, linking to a library in a filesystem that will not be available before zfs-fuse is running. I rebuilt it using a modified SConstruct file so it statically links that library in.
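
To catch this kind of dependency before it bites, a small filter over ldd output is enough. This is only a sketch: the function name libs_under_usr is made up for illustration, and it assumes the usual "name => path (address)" layout that ldd prints.

```shell
# Print the shared-library paths under /usr found in `ldd` output (stdin).
libs_under_usr() {
    # Typical ldd line: "libz.so.1 => /usr/lib/libz.so.1 (0x...)"
    awk '$2 == "=>" && $3 ~ /^\/usr\// { print $3 }'
}

# Usage, with the binary location used in this article:
#   ldd /sbin/zfs-fuse | libs_under_usr
```

Any path it prints is a library that will be missing until /usr itself is mounted.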

I had decided that my filesystem layout would be:

  • 1 GB swap partition on each disk
  • 1 GB / filesystem, composed of two RAID1 partitions (one on each disk), formatted with ext3 (in case of catastrophe, it's nice to have something the kernel can boot without initial RAM disks)
  • 398 GB ZFS volume, where I planned to drop /usr, /home and /var

But I didn't have extra hard disks to make the switch. No problem, croupier, everything I have on red please -- and spin that wheel! I installed ZFS directly on my running system. How did I do it? Well, if you must know:

  • I offlined the second disk with mdadm.
  • I swapoff'ed its swap partition. At this point the disk is no longer busy.
  • I repartitioned the disk (if the disk is non-busy, the kernel rereads the partition table just fine).
  • From then on, I relied on the first disk the whole time.
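
The steps above could look roughly like the following. Every device name here (/dev/md0, /dev/sdb and its partitions) is a hypothetical placeholder rather than the layout of this machine, and the function only prints the commands so they can be reviewed before anything destructive happens.

```shell
# Hypothetical names: adjust to your own layout before running anything.
ARRAY=/dev/md0     # assumed md array containing the second disk
MEMBER=/dev/sdb1   # assumed array member on the second disk
SWAP=/dev/sdb2     # assumed swap partition on the second disk
DISK=/dev/sdb      # assumed second disk

# Print (not run) the commands that take the second disk out of service.
print_offline_plan() {
    echo "mdadm $ARRAY --fail $MEMBER --remove $MEMBER"
    echo "swapoff $SWAP"
    echo "fdisk $DISK"
    echo "blockdev --rereadpt $DISK"
}
```

The last command asks the kernel to reread the partition table, which only succeeds once nothing on the disk is busy.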

Yes, realtime no-boot filesystem switchover -- or at least I thought it would be that easy (I was very wrong).

Then I mkfs.ext3ed the new 1 GB root filesystem, and mkswap'ed the swap one. A couple of rsyncs later (which I scripted for consistency and repeatability), I had a new, working /. I mounted it and went in it, to remove mdadm.conf and lvm.conf lines that could prove problematic on next boot. At this point I was panicking because it was superstitiously conceivable that, after a reboot, md would want to rebuild the arrays and destroy the second disk.

I then copied the ZFS binaries into /sbin and ran the daemon. A cursory lsof inspection told me that the ZFS socket was on /etc/zfs/zfs_socket.
zpool create quickly gave me the 392 GB of disk space that were previously empty in the second disk, in which I created subvolumes, with adjusted mount points to end up under a temporary tree structure under /newfs. Curiously, after creating a subvolume, it's not mounted, but a zfs mount -a works as you probably would expect.
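
As a sketch of that layout: the pool name vault is the one that shows up later in this article, /newfs is the temporary tree mentioned above, and the function name is invented. The pool itself would come from something like "zpool create vault /dev/sdb3", where the device is a hypothetical placeholder. The function only prints the commands.

```shell
POOL=vault       # pool name as it appears later in the article
STAGING=/newfs   # temporary mount tree from the text

# Print the commands that create one subvolume per argument, each with
# its mount point under the staging tree, then mount everything.
print_dataset_plan() {
    for fs in "$@"; do
        echo "zfs create -o mountpoint=$STAGING/$fs $POOL/$fs"
    done
    echo "zfs mount -a"
}

# Example: print_dataset_plan usr home var
```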

I enabled compression in the root volume (subvolumes inherit attributes) and started rsyncing /var, /usr and /home into each subvolume. Cue the movie 32 hours later to get an idea of how slow it was. It was unbelievably slow -- un-frigging-believable, with both CPUs nearly pegged and regularly hovering at 150% combined user+system. The worst part is, I was seeing disk throughput in the 2-3 MB/s range, using iostat 1 and zpool iostat 1. Keep in mind that performance (high write throughput, low latency and good responsiveness during massive reads) is marketed as a ZFS selling point -- and I don't doubt the Sun guys... on Solaris, not Linux!
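
A minimal sketch of that copy step follows. The rsync flags are an assumption (the exact ones used aren't recorded here), copy_tree is a made-up name, and the RSYNC variable exists only so the command can be swapped out for a dry run.

```shell
# Compression is set once on the pool root and inherited by subvolumes:
#   zfs set compression=on vault
#   zfs get compressratio vault     # check the savings afterwards

# Copy one tree into its staging mount point, preserving metadata.
# -a archive mode, -H keep hard links, -x stay on one filesystem.
copy_tree() {
    src=$1
    dst=$2
    ${RSYNC:-rsync} -aHx --numeric-ids "$src/" "$dst/"
}

# Example: copy_tree /usr /newfs/usr
```

Running the same function a second time only transfers what changed, which is what makes a scripted, repeatable copy practical on a live system.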

During that lengthy process I started finding out several things that would prove crucial later on:

  • FUSE does not support mmap in the Linux kernel that my distribution uses. Many, many applications rely on that feature to work.
  • There was no initscript for ZFS. I would have to write an initscript from scratch. On Kubuntu, where initscripts are (1) being phased out and (2) completely different from the ones on my beloved RPM distros.

At this point I was a bit nervous, if you'll allow me to understate. But I wrote the initscript anyway:

#! /bin/sh
### BEGIN INIT INFO
# Provides:          zfs
# Required-Start:    mountall
# Required-Stop:     sendsigs
# Should-Start:
# Should-Stop:
# Default-Start:
# Default-Stop:
# Short-Description: Enable/disable the ZFS-FUSE subsystem
# Description: Control ZFS-FUSE subsystem
### END INIT INFO

PIDFILE=/var/run/zfs-fuse.pid
LOCKFILE=/var/lock/zfs/zfs_lock

. /lib/init/vars.sh

. /lib/lsb/init-functions
. /lib/init/mount-functions.sh

export PATH=/sbin:/bin
unset LANG
ulimit -v unlimited

do_start() {
	test -x /sbin/zfs-fuse || exit 0
	PID=`cat "$PIDFILE" 2> /dev/null`
	if [ "$PID" != "" ]
	then
		if kill -0 $PID 2> /dev/null
		then
			echo "ZFS-FUSE is already running"
			exit 3
		else
			# pid file is stale, we clean up shit
			log_action_begin_msg "Cleaning up stale ZFS-FUSE PID files"
			rm -f /var/run/sendsigs.omit.d/zfs-fuse "$PIDFILE"
			log_action_end_msg 0
		fi
	fi

	pre_mountall

	log_action_begin_msg "Starting ZFS-FUSE process"
	zfs-fuse -p "$PIDFILE"
	ES_TO_REPORT=$?
	if [ 0 = "$ES_TO_REPORT" ]
	then
		true
	else
		log_action_end_msg 1 "code $ES_TO_REPORT"
		post_mountall
		exit 3
	fi

	for a in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
	do
		# the PID file may not exist yet; stay quiet while waiting for it
		PID=`cat "$PIDFILE" 2> /dev/null`
		[ "$PID" != "" ] && break
		sleep 1
	done

	if [ "$PID" = "" ]
	then
		log_action_end_msg 1 "ZFS-FUSE did not start or create $PIDFILE"
		post_mountall
		exit 3
	else
		log_action_end_msg 0
	fi

	log_action_begin_msg "Immunizing ZFS-FUSE against OOM kills and sendsigs signals"
	mkdir -p /var/run/sendsigs.omit.d
	cp "$PIDFILE" /var/run/sendsigs.omit.d/zfs-fuse
	echo -17 > "/proc/$PID/oom_adj"
	ES_TO_REPORT=$?
	if [ 0 = "$ES_TO_REPORT" ]
	then
		log_action_end_msg 0
	else
		log_action_end_msg 1 "code $ES_TO_REPORT"
		post_mountall
		exit 3
	fi
	
	log_action_begin_msg "Mounting ZFS filesystems"
	
	zfs mount -a
	ES_TO_REPORT=$?
	if [ 0 = "$ES_TO_REPORT" ]
	then
		log_action_end_msg 0
	else
		log_action_end_msg 1 "code $ES_TO_REPORT"
		post_mountall
		exit 3
	fi

	if [ -x /usr/bin/renice ] ; then
		log_action_begin_msg "Increasing ZFS-FUSE priority"
		/usr/bin/renice -15 -g $PID > /dev/null
		ES_TO_REPORT=$?
		if [ 0 = "$ES_TO_REPORT" ]
		then
			log_action_end_msg 0
		else
			log_action_end_msg 1 "code $ES_TO_REPORT"
			post_mountall
			exit 3
		fi
		true
	fi
	
	post_mountall
}

do_stop () {
	test -x /sbin/zfs-fuse || exit 0
	PID=`cat "$PIDFILE" 2> /dev/null`
	if [ "$PID" = "" ] ; then
		# no pid file, we exit
		exit 0
	elif kill -0 $PID 2> /dev/null; then
		# pid file and killable, we continue
		true
	else
		# pid file is stale, we clean up shit
		log_action_begin_msg "Cleaning up stale ZFS-FUSE PID files"
		rm -f /var/run/sendsigs.omit.d/zfs-fuse "$PIDFILE"
		log_action_end_msg 0
		exit 0
	fi

	pre_mountall

	log_action_begin_msg "Syncing disks"
	sync
	log_action_end_msg 0

	log_action_begin_msg "Unmounting ZFS filesystems"
	zfs unmount -a
	ES_TO_REPORT=$?
	if [ 0 = "$ES_TO_REPORT" ]
	then
		log_action_end_msg 0
	else
		log_action_end_msg 1 "code $ES_TO_REPORT"
		post_mountall
		exit 3
	fi
	
	post_mountall # restore /var/lock and /var/run to their right places

	log_action_begin_msg "Terminating ZFS-FUSE process gracefully"
	kill -TERM $PID

	for a in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
	do
		kill -0 $PID 2> /dev/null
		[ "$?" != "0" ] && break
		sleep 1
	done

	if kill -0 $PID 2> /dev/null
	then
		log_action_end_msg 1 "ZFS-FUSE refused to die after 15 seconds"
		post_mountall
		exit 3
	else
		rm -f /var/run/sendsigs.omit.d/zfs-fuse "$PIDFILE"
		log_action_end_msg 0
	fi

	log_action_begin_msg "Syncing disks again"
	sync
	log_action_end_msg 0
}

case "$1" in
  start)
	do_start
	;;
  stop)
	do_stop
	;;
  status)
	PID=`cat "$PIDFILE" 2> /dev/null`
	if [ "$PID" = "" ] ; then
		echo "ZFS-FUSE is not running"
		exit 3
	else
		if kill -0 $PID
		then
			echo "ZFS-FUSE is running, pid $PID"
			zpool status
			exit 0
		else
			echo "ZFS-FUSE died, PID files stale"
			exit 3
		fi
	fi
	;;
  restart|reload|force-reload)
	echo "Error: argument '$1' not supported" >&2
	exit 3
	;;
  *)
	echo "Usage: $0 start|stop|status" >&2
	exit 3
	;;
esac

:

The script should explain itself.
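
One detail that may need adapting: the script writes -17 to /proc/PID/oom_adj, and newer kernels prefer /proc/PID/oom_score_adj, where -1000 means the same thing. Here is a sketch that uses whichever file exists. The function name and the optional second argument (an alternate proc root, there only so it can be tried against a scratch directory) are assumptions for illustration.

```shell
# Exempt a PID from the OOM killer, preferring the newer interface.
# $1 = PID, $2 = proc root (defaults to /proc).
oom_immunize() {
    pid=$1
    root=${2:-/proc}
    if [ -e "$root/$pid/oom_score_adj" ]; then
        printf '%s\n' -1000 > "$root/$pid/oom_score_adj"
    elif [ -e "$root/$pid/oom_adj" ]; then
        printf '%s\n' -17 > "$root/$pid/oom_adj"
    else
        return 1   # no such process, or no OOM controls exposed
    fi
}

# Example: oom_immunize "$(cat /var/run/zfs-fuse.pid)"
```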

There were two problems, though. I derived my script from the NFS one and, in the process, I discovered that NFS was symlinked to be started at slot 31 in runlevels 6 and 0. This means that the initscripts subsystem would call that script with a start argument when, in reality, the action was in the stop block. Since I couldn't figure out what kind of magic the Upstart initscripts compatibility subsystem does to get a stop block to run when a start block is requested by its configuration, I just created two glue scripts: one to start ZFS no matter what, and one to stop ZFS no matter what:

 

-rwxr-xr-x 1 root root 481 2008-06-18 04:09 /etc/init.d/mountzfs
-rwxr-xr-x 1 root root 488 2008-06-18 04:09 /etc/init.d/umountzfs

 

Then I studied the Kubuntu boot sequence very carefully, and used some elbow grease (update-rc.d) to symlink them to get the results I wanted:

 

lrwxrwxrwx 1 root root 19 2008-06-18 03:52 /etc/rc0.d/S35umountzfs -> ../init.d/umountzfs
lrwxrwxrwx 1 root root 19 2008-06-18 03:52 /etc/rc6.d/S35umountzfs -> ../init.d/umountzfs
lrwxrwxrwx 1 root root 18 2008-06-18 03:52 /etc/rcS.d/S36mountzfs -> ../init.d/mountzfs

 

Trust me, writing the script was the easy part -- figuring out how it interacts with the rest of the system was much harder.
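
The three symlinks listed above can also be recreated by hand. The argument syntax of update-rc.d has changed across releases, so this sketch sidesteps it and uses plain ln -s. The function name and its optional root argument (there so it can be tried in a scratch directory first) are assumptions for illustration.

```shell
# Recreate the mount/unmount symlinks under a given root ("/" by default).
link_zfs_initscripts() {
    root=${1:-/}
    ln -s ../init.d/umountzfs "$root/etc/rc0.d/S35umountzfs" &&
    ln -s ../init.d/umountzfs "$root/etc/rc6.d/S35umountzfs" &&
    ln -s ../init.d/mountzfs "$root/etc/rcS.d/S36mountzfs"
}

# Example: link_zfs_initscripts /
```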

Finally, I rebooted to my new root filesystem on the second disk. If you thought that my system booted correctly, you would be very, very wrong indeed. Eighty percent of the boot sequence was red [ fail ]s and sh: command not found errors. In the end, the system dropped me into a recovery console, where I could finally switch the ZFS mount points to their final destinations. Then, just to try it out: zfs mount -a.

/home mounted.
/var couldn't be mounted, because the boot process graciously created incredibly important missing directories in it. And then, deadlock.

Crap, what was wrong?

Alt+SysRq+R. Boot again. What's wrong? No idea. Try strace. The friggin' command is in /usr. Hypotheses ran through my head for two hours. With me in front of a very, very broken system. I tried everything under the Sun that I could get my hands on -- which is not much when you don't have a CD-ROM drive, mind you.

And (summarizing two hours) then, I tried this: zfs set mountpoint=/tmp/usr vault/usr ; mkdir -p /tmp/usr ; zfs mount vault/usr.

Miracle of miracles, it worked. I copied the entire cast of characters of Linux Debugging: The Movie into the very tightly packed /. I strace -ff'ed the hell out of zfs-fuse and I found the problem. The moronic mount.fuse subcommand, which actually connects the kernel and user endpoints, tries to read /usr/lib/locale/locale-archive right in the middle of mounting the filesystem! Instant deadlock that you can only get out of by using the SysRq OOM key (yes, zfs-fuse is actually a great OOM candidate -- 1.5 GB VM size on this 1.0 GB RAM computer; yes, I discovered that on my own before I wrote the OOM immunization code in the initscript).
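
For the record, a hunt like this can be narrowed down mechanically. With -ff and -o, strace writes one trace file per process, named after the prefix plus the PID. The sketch below lists which of those files opened anything under /usr. The trace prefix and the function name are hypothetical.

```shell
# Collect one trace file per process (hypothetical prefix):
#   strace -ff -o /tmp/zfs-fuse.trace zfs-fuse

# Print the per-process trace files that opened a path under /usr.
# $1 = the prefix that was given to strace -o.
traces_touching_usr() {
    prefix=$1
    grep -lE 'open(at)?\(.*"/usr/' "$prefix".*
}

# Example: traces_touching_usr /tmp/zfs-fuse.trace
```

Any file it names belongs to a process that reached into /usr while /usr was still being mounted.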

I then discovered two things: zfs-fuse didn't deadlock when started from the recovery console, but it did lock up when started from the initscript. What you can't see is that the version of the initscript that I initially wrote was sourcing the LANG variable from a configuration script in /etc. OK, so how do you solve locale problems? Instant fixup: unset LANG before running the command.
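
A slightly more defensive variant of that fixup clears the other locale variables as well. This goes beyond what the initscript does (it only unsets LANG), so treat the extra variables as an assumption. The function name is made up.

```shell
# Drop every common locale variable so that nothing started afterwards
# goes looking for locale data under /usr.
clear_locale() {
    unset LANG LANGUAGE LC_ALL LC_CTYPE LC_MESSAGES
}

# Example, early in an initscript:
#   clear_locale
#   zfs-fuse -p "$PIDFILE"
```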

OK, so do I have a booting system now, or what? Wrong again. Some processes get started before the actual mounting of filesystems, and the ZFS subsystem can't actually be started earlier in the boot process without creating an initramfs dependency or another, different, chicken-and-egg problem. So I moved what I could move from the ZFS volume's /var into the /var directory of the / filesystem. I ended up with this structure backed up by ZFS (and the rest, you can safely assume, in a very tightly crammed ext3 filesystem):

 

zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
vault             294G  69,8G    18K  none
vault/home        290G  69,8G   290G  /home
vault/usr        3,36G  69,8G  3,36G  /usr
vault/var         842M  69,8G    18K  none
vault/var/cache   515M  69,8G   515M  /var/cache
vault/var/lib     282M  69,8G   282M  /var/lib
vault/var/tmp    44,5M  69,8G  44,5M  /var/tmp

 

Boot again. Oh, yeah, I'm enjoying the 3-minute boot time on this formerly-a-screamer machine. D-Bus fails to start. D-Bus is actually very much required for many things in Kubuntu, but I manage to start a GUI session up, if only to Google up what was wrong with it. That was probably not the best moment to find out that just starting the KDE 3.5 session took over ten minutes. All of this with less than 1 MB/s from the disk, according to iostat, and 160% CPU usage, according to top.

Then I discovered the zfs-fuse Google group. It's a fantastic place where everyone (including Ricardo Correia) received me very well and had lots of tips. Only there did I find out what was wrong with D-Bus -- a bug that manifests itself only with FUSE filesystems, for which a patch exists and works.

At this point I'm extremely exhausted from this marathon session, so I basically just try to backport the patch into the dbus source package for my distribution. You've probably heard that Debian (and, by extension, Ubuntu) has a fantastic build system -- it failed on me. Not only was apt not working (remember the mmap issue?), but dpkg-source also failed while trying to apply the patches for the source package. Oh, yes, I manage to solve this problem by learning, on-the-spot, how the apt build "system" actually works, and manually replicating the entire process that should be automated. Many thanks to the gents at #debian in Freenode for their kind responses to my questions.

Bam, built dbus (it's yours if you want it). Installed it. Started it. And the chain of daemons that were depending on it just start up and come to life. Neat trick, Upstart!

Back to performance questions and ZFS. Do you know what the real performance killer is? You'll never guess it...

...icons! While GTK+ applications take marginally more time to start under a ZFS regime, KDE applications take an order of magnitude more. Before, on a warm working set, a KDE application took about 2 seconds to start. Today, Kmail takes in excess of five minutes to start. Why? Here's why -- multiply that by fifty thousand and you'll get the idea. Each icon that the application requests results in thousands and thousands of access() and stat() calls. FUSE doesn't use a kernel cache by default (there are several reasons for that), so the only cache that backs those requests up is the ARC cache, which is an impressive caching regime and technical achievement but, in this case, it's very much like caching your car keys somewhere in Europe, because of the transatlantic userspace-kernelspace-userspace-kernelspace-userspace barrier. Per-call. When this is taking place, the CPUs remain pegged at 190%, eaten by ZFS alive, and the 12 case fans jump to 11,000 RPM.
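
To put a number on that lookup storm, the access() and stat() family can be counted in a saved strace log. A rough sketch, assuming a log produced with "strace -f -o FILE" (which prefixes each line with a PID); the log path and the function name are hypothetical.

```shell
# Record a trace of file-related syscalls (hypothetical log path):
#   strace -f -e trace=file -o /tmp/kmail.trace kmail

# Count access()/stat()-style lookups in a strace log.
# The optional leading number is the PID that strace -f adds.
count_path_lookups() {
    grep -cE '^([0-9]+ +)?(access|l?stat(64)?)\(' "$1"
}

# Example: count_path_lookups /tmp/kmail.trace
```

Each one of those lookups is a full round trip through FUSE to the daemon and back.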

The zfs-fuse Google groups guys came up with a couple of suggestions (all documented in the list, which I'm too lazy to link to again). These all are compile-time options, so a ZFS rebuild is in order for every one of them:

  • scons debug=0. A very slight CPU usage decrease.
  • Increasing the ARC cache. I doubled it from 128 to 256 MB. Turns out it's not a caching problem and it doesn't help at all.
  • Mount option big_writes for FUSE filesystems. Here's what I did about that:

Recompiled ZFS, this time enabling a FUSE mount option named big_writes that I've read about in the Google group. Yes, the daemon needs to be recompiled, and it's not fast. No, I'm not actually jumping to the part where I actually compiled ZFS with big_writes first, then booted, only to find out that I needed a new kernel. Oh, wait, I just did. Fortunately, I did back zfs-fuse up.

Next up? Latest 2.6.26-rc6 kernel, because of:

  • Hey, writable mmap is there for FUSE filesystems! Yeah! Now I can have apt-get back!
  • big_writes.

When was the last time a kernel compile took four hours for you? Mine was yesterday. But it's actually fun -- the process hasn't changed that much from 1998, and the distro already comes with a nice .config that you can reuse with make oldconfig. And, this time, you get to do out-of-tree kernel builds! Yay!

Well, I ticked the wrong option in make menuconfig anyway, because my kernel modules don't fit my puny /, now at 400 MB free. Jeez, four hours. Google some more. Turns out I turned a debugging option on.
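
Which debugging option it was isn't recorded here. CONFIG_DEBUG_INFO is a common cause of bloated modules, so the sketch below checks for that one as a guess. The paths in the usage comments are hypothetical, and the function name is invented.

```shell
# Succeeds if the given kernel .config builds with debug info, which
# makes every module much larger on disk.
has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y' "$1"
}

# Out-of-tree build, reusing the distro config (hypothetical paths):
#   cp /boot/config-"$(uname -r)" /path/to/build/.config
#   make -C /path/to/linux O=/path/to/build oldconfig
#   has_debug_info /path/to/build/.config && echo "expect huge modules"
```

Checking before the build starts is a lot cheaper than finding out four hours later.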

After this, FUSE userspace itself was due for a recompile. Another odyssey, whose fruits you can reap here (warning: CVS checkout).

OK, redo the initial RAM disk, adjust GRUB configuration, reboot with the latest kernel. It's all good. More surprisingly, I'm actually getting some of my performance back. Some of it. As in "Kmail no longer takes five minutes to start -- only three".

And, most importantly, applications that depend on mmap now work correctly. My boot process isn't an epic [ fail ] anymore -- and that's incredibly reassuring.

This is the point where my journey turns into smooth sailing. I zpool scrubbed my new baby. After five hours, with the solid guarantee that my data was OK and nothing'd been lost or corrupted during the rsync, I nuked my first disk and replicated the new partition structure on it. A nice RAID1 array for the final /. A short rsync for the / filesystem. A quick mkswap for the new swap partition. A fast adjustment in /etc/fstab and another one in mdadm.conf for the new array. Reinstall and reconfigure GRUB on the first disk. And, finally, I saved the best for last:

zpool attach vault /dev/disk/by-id/second-disk-huge-partition /dev/disk/by-id/first-disk-huge-partition

Man, that rocked. It was unbelievably fast -- like, disk-platter fast, around 40 to 50 MB per second, and the system didn't get that much slower while it was resilvering the first disk. Which kind of makes lots of sense, because zfs-fuse is now crossing the userspace-kernelspace barrier just once per operation. How do I know this? Well, strace: I know that what zfs-fuse does is, it opens the disk partition in direct I/O mode and then manages it for itself, responding to FUSE requests -- but the resilvering process doesn't involve FUSE at all, it's just the two disks practically chatting with each other through zfs-fuse. Now I know for sure that ZFS will give me platter speeds. It's just a matter of time (and maybe me pestering Ricardo Correia to collaborate with me on this same issue).
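
Before wiping the only other copy of the data, it's worth gating the next step on pool health. A minimal sketch, assuming that the zpool shipped with zfs-fuse prints the same "all pools are healthy" line for "zpool status -x" as other ZFS ports do; the function name is made up.

```shell
# Succeeds when `zpool status -x` output (read from stdin) reports
# that nothing is wrong with any pool.
pools_healthy() {
    grep -q 'all pools are healthy'
}

# Example:
#   zpool status -x | pools_healthy && echo "safe to continue"
```

The same check works after the resilver finishes, to confirm that the new mirror is complete.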

Questions that I haven't solved yet? Sure, there are a lot. Four that haunt me:

  • No root filesystem on ZFS. Others on the Google group have managed it. Me? I didn't want to mess with /etc/zfs inside the initramfs, thank you very much.
  • I know this for sure: the only active cache now is the userspace ARC cache from ZFS; I read the FUSE kernel code, and it clearly flushes files from the cache when programs open() them. Honestly, if I could wish for something to just become true overnight, I'd wish for the ARC to be moved into the kernel and to have it replace the page cache, but that won't happen anytime soon. There's a FUSE kernel_cache option, but I'm wary of enabling it. When I have been sufficiently reassured that the option won't corrupt my precious data, I will enable it. That will be a couple of hours of reading someone else's code, so I'm inclined to defer it for a few days. But, in theory, this should give me platter speeds instead of giving my 12 case fans 'speed'. At the hefty cost of RAM for two redundant caches.
  • Do filesystem readahead and Linux disk scheduler algorithms mess up in some way with ZFS' control of the platter? The data integrity question is closed, because the writes are submitted with barriers, but I'm worried that the Linux I/O scheduler is second-guessing the decisions of ZFS' one.
  • The /etc/init.d/sendsigs omit.d protocol I'm using on the initscript plain fails. I had to shunt the script with an exit 0 right before the killall5 in sendsigs because killall5 plain hung instead of ignoring ZFS as it should have done -- and it needs to ignore ZFS because ZFS is unmounted later. This won't be a problem once we get our own kernelspace ZFS implementation.

OK, that was my journey. I'm on ZFS now, my machine's rock-solid (if a bit CPU-tired) and my data's never been so safe. I also got compression, which saved me about 6 GB. Furthermore, I've given you the initscript, the steps and the software (except ZFS, but you can compile that yourself).

Go wild.