ZFS Tuning: L2ARC

A Note from the CTO:

This series of articles provides technical tips on how to construct, tune, and manage ZFS storage systems.

WARP Mechanics offers a wide range of commercially-supported Enterprise-class ZFS appliances. These have already been set up and tuned optimally at the factory. Therefore it is rarely necessary for WARP customers to get into this level of detail.

However, WARP believes in giving back to the community, and recognizes that WARP appliances may not be right for everyone. Documents such as this may be useful for those who need a ZFS system other than the types WARP offers, and therefore need to build their own.

However, setting up an enterprise ZFS storage operating system is far more complex than can be explained in any short/simple “how to” document. WARP, for example, has been continually tuning the WARPos stack for six years and counting.

In short: no warranty or support can be offered for non-WARP appliances. And customers who do have supported appliances are encouraged to call WARP rather than attempting to change low-level parameters on their own.


 

Almost all WARP systems these days include some amount of “Level 2 Adaptive Replacement Cache” (L2ARC). The L2ARC is one or more large, read-optimized SSDs that receive data evicted from the primary ARC when RAM is needed for other data.

A typical WARP NAS system might have 5x to 10x as much L2ARC as RAM. So a WARPnas appliance with 64GB of RAM might have roughly 0.5TB to 1TB of L2ARC.

A system set up for high-performance computing (HPC) might have 20x or more L2ARC relative to RAM. So a WARPhpc controller with 256GB of RAM might have 5TB, 10TB, or even more L2ARC.

The benefit of a large L2ARC is that recently or frequently accessed data can be served from a read-optimized SSD, instead of having to move HDD heads around to retrieve it.

There are, however, some trade-offs.

For each TB of L2ARC SSD, you consume roughly 1GB to 3GB of RAM just to track what is stored on the cache device. So more L2ARC means less RAM left for the primary ARC: a 10TB L2ARC, for example, could tie up 10GB to 30GB of RAM in bookkeeping alone.

Also, it takes CPU time and SAS/SATA bus bandwidth to move data out of RAM and onto the L2ARC. In other words, configuring a large L2ARC can improve read performance while also reducing write performance.
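To judge whether that trade-off is paying off on a given workload, watch the L2ARC statistics. The commands below are a sketch for ZFS on Linux; the kstat path and tooling differ on other platforms, and “poolname” stands in for your actual pool:

# L2ARC hit, miss, and size counters (ZFS on Linux kstats)
grep '^l2_' /proc/spl/kstat/zfs/arcstats

# Per-vdev view, including how full the cache devices are (5-second interval)
zpool iostat -v poolname 5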

ZFS has a few related tuning options that are useful and safe to try out:

  • L2ARC fill rate
  • L2ARC behavior related to sequential IO
  • L2ARC data types to cache per filesystem

 

The first item above relates to how aggressively ZFS should try to write data onto the SSD.

The default values made perfect sense back when ZFS was created, because SSDs were slow on writes, small in capacity, and burned out quickly if overused.

But with modern SSDs, the default values are almost always wrong. L2ARC SSDs now have far more capacity, so the low default fill rate leaves the SSD mostly empty most of the time, holding very little of the recently accessed data.

The second item tells ZFS whether or not sequential (sometimes called “streaming”) data is cached. By default, the L2ARC does not cache sequential data. This may be fine for some applications. However, WARP has found that far more cache hits occur when the default is changed and ZFS is told to always cache sequential data.

The first two items are controlled by module parameters set in /etc/modprobe.d. You should have a file there called zfs.conf, containing an “options” line.

Something like this:

options zfs l2arc_write_max=16777216 l2arc_noprefetch=0

 

…would tell ZFS not to “feed” more than 16MB to L2ARC per “feed cycle”. (The big number above is 16 times 1024 squared, i.e. 16MB.) It would also set l2arc_noprefetch to 0, switching off the “don’t cache prefetched data” behavior. This awkwardly named parameter, when disabled, really means “do store sequential data in L2ARC”.
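These values take effect when the zfs module loads. On most ZFS-on-Linux systems the same parameters can also be inspected and changed at runtime through sysfs, which is handy for experimenting before committing a value to zfs.conf; treat this as a sketch, since exact behavior can vary by ZFS version:

# show the current values
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_noprefetch

# apply the same settings on the fly (as root)
echo 16777216 > /sys/module/zfs/parameters/l2arc_write_max
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch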

The final tunable in the list above tells each individual ZFS filesystem whether to use L2ARC to cache (a) all types of data, (b) metadata only, or (c) no data.

Let’s say you have a single pool of HDDs, with attached L2ARC SSDs.

This pool is presented as three separate filesystems: poolname/nas, poolname/vdi, and poolname/archive.

The NAS filesystem is mounted by users as home directories. It holds a large amount of unstructured data, which isn’t terribly performance-sensitive.

VDI is mounted by a hypervisor server and contains users’ virtual desktop OS images. When users log in first thing in the morning, this filesystem will get very busy, very quickly.

The Archive filesystem is mounted by another, higher-performance NAS filer, and is used to “tier off” files as their business relevance diminishes over time (using HSM software or manual processes).
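For reference, a layout like this could be created roughly as follows. The vdev layout and device names (sda through sdf, nvme0n1, nvme1n1) are purely hypothetical placeholders; substitute whatever your hardware actually provides:

# HDD pool with two read-optimized SSDs added as L2ARC cache devices (hypothetical devices)
zpool create poolname raidz2 sda sdb sdc sdd sde sdf
zpool add poolname cache nvme0n1 nvme1n1

# the three filesystems described above
zfs create poolname/nas
zfs create poolname/vdi
zfs create poolname/archive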

For NAS, we might want ZFS to aggressively cache metadata. If a user types “find /home/myname -ls | grep pattern” or similar, this will dramatically increase performance. But it would generally not be productive to try to cache the actual contents of files onto SSDs, simply because users’ home directories tend to contain a very high percentage of files that are not read frequently enough to justify it.

For VDI, aggressively caching everything is the way to go. The VDI filesystem is likely to be very small on disk, since it will use deduplication and compression. If you have 1,000 VDI images built from the same base OS, they will compress and deduplicate at roughly 1,000:1, so an “apparent” data store of 100TB might only take 100GB of disk. A relatively small L2ARC device might therefore be able to provide close to 100% cache hits, which will improve startup performance dramatically.

The Archive filesystem, however, is 99% writes and 1% reads. Using a read cache there would add cost, reduce write performance, and provide no benefit whatsoever.

To change which data types the L2ARC caches for each filesystem, use the “zfs” command line utility, like this:

zfs set secondarycache=metadata  poolname/nas
zfs set secondarycache=all       poolname/vdi
zfs set secondarycache=none      poolname/archive
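To confirm the settings took effect, read the property back (the dataset names here match the example above):

zfs get secondarycache poolname/nas poolname/vdi poolname/archive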

 

L2ARC has additional tunable parameters. You can define how long a “feed cycle” lasts, how aggressively to feed right after a system reboot, whether or not to allow reading from the L2ARC while writing to it at the same time, and so on.
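If you want to review what is available before changing anything, ZFS on Linux exposes the full set of L2ARC module parameters under sysfs; listing them is read-only and safe (exact parameter names vary somewhat between ZFS versions):

# print each l2arc_* parameter along with its current value
grep . /sys/module/zfs/parameters/l2arc_*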

But those other parameters are more likely to reduce performance if changed. The three categories mentioned in this article should allow proper tuning for almost all cases.