ZFS Tuning: ZIL / SLOG

A Note from the CTO:

This series of articles provides technical tips on how to construct, tune, and manage ZFS storage systems.

WARP Mechanics offers a wide range of commercially supported, enterprise-class ZFS appliances. These have already been set up and tuned optimally at the factory. Therefore it is rarely necessary for WARP customers to get into this level of detail.

However, WARP believes in giving back to the community, and recognizes that WARP appliances may not be right for everyone. Documents such as this may be useful for those who need a ZFS system other than the types offered by WARP, and therefore need to build their own.

That said, setting up an enterprise ZFS storage operating system is far more complex than can be explained in any short, simple “how to” document. WARP, for example, has been continually tuning the WARPos stack for six years and counting.

In short: no warranty or support can be offered for non-WARP appliances. And customers who do have supported appliances are encouraged to call WARP rather than attempting to change low-level parameters on their own.



All ZFS filesystems have an “intent log”. This is (intuitively) called the “ZFS Intent Log”, or “ZIL” for short.

It’s possible to turn ZIL off on a per-filesystem basis.

But you should never (ever) do that, outside of niche “don’t try this at home” corner cases.
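
For reference, the switch in question is the per-dataset “sync” property in OpenZFS: setting it to “disabled” tells ZFS to treat every sync write as async, which effectively turns ZIL off for that filesystem. A minimal sketch with a placeholder dataset name, shown so you can recognize it, not so you can use it:

    # Check the current setting (the default is "standard"):
    zfs get sync tank/scratch

    # The switch described above. Shown for recognition only;
    # acknowledged sync writes can be lost in a crash with this set.
    zfs set sync=disabled tank/scratch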

The benefit of ZIL is that it allows ZFS to recover consistently from a system crash, in a POSIX-compliant manner, without needing to perform a filesystem check à la “fsck”.

Because of the size of modern WARP ZFS storage systems (up to multiple petabytes on a single server), an fsck-style utility could potentially take weeks to run. And not performing some kind of consistency check would result in (potentially catastrophic) data corruption. With ZIL, you don’t need the (very long) fsck, and you don’t get the (very dangerous) corruption.

No brainer, right? ZIL is good.

But like all good things, there are some trade-offs.

When you write small files or data chunks to a ZFS filesystem, and your application issues a sync after each write, the ZIL architecture can cause write performance to plummet. Each time you sync, ZFS has to stop everything, make sure the data is written to the correct location on disk, update the ZIL to make a note of the completed write, then get back to whatever it was trying to do when you issued the sync. Lots of head seeking on spinning magnetic disks = very bad performance.

That does not mean, “don’t issue sync commands,” by the way. Your application may be doing a ton of small IOs with syncs after each one for a very good reason.
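
If you want to see the effect for yourself on a test system, one hedged sketch is to compare small writes with and without synchronized IO; the target paths and counts below are placeholders, and the actual numbers will vary wildly with hardware:

    # Small (4 KiB) writes, each forced to stable storage (oflag=sync).
    # On a spinning-disk pool with no SLOG, expect throughput to collapse.
    dd if=/dev/zero of=/tank/testfs/sync_test bs=4k count=10000 oflag=sync

    # The same amount of data written asynchronously, for comparison:
    dd if=/dev/zero of=/tank/testfs/async_test bs=4k count=10000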

So, if you have an application which issues a large number of small sync writes, ZFS provides another solution for keeping ZIL from “clobbering” performance: the “Separate Intent Log Device”.

The (unfortunate) shorthand for the ZFS Separate Intent Log is “SLOG”.

With a SLOG, you designate a different device, or multiple devices, to host the log, so you don’t have to keep interrupting head motion on the primary storage disks. The primary storage can keep doing large sequential writes, and servicing read requests as they come in, while the SLOG deals with the small sync writes.
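
Mechanically, attaching a SLOG is a single pool-level operation. A minimal sketch, assuming a pool named “tank” and two placeholder SSD paths (in practice, use stable /dev/disk/by-id/ names):

    # Add a mirrored log (SLOG) vdev to the pool:
    zpool add tank log mirror /dev/disk/by-id/ssd-slog-a /dev/disk/by-id/ssd-slog-b

    # A "logs" section should now appear in the pool layout:
    zpool status tank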

ZFS will plop the data payload and metadata for small sync writes onto the SLOG, and ack the sync’ing application immediately… even though the data is not on the primary storage yet.

Then, ZFS can write the payload to primary storage at leisure, as (very much more efficient) large sequential async IO.

Only after the payload is placed in its final location will ZFS update the intent log to indicate that the write has been cleanly completed. But in the meantime, the sync’ing application has been allowed to go about its business. And no other applications were impacted, because the sync did not require interrupting other operations on the primary storage disks. No “seek storm” will have occurred.

If the system crashes before the payload is cleanly written to its final location, that’s fine: ZFS can “replay” the intent log off the SLOG device upon reboot, thus honoring the POSIX write sequence.

All that sounds great. So all ZFS systems should have a SLOG, right?

Not so much.

With most IO profiles, neither ZIL nor SLOG does anything for you at all, and a SLOG adds (lots of) cost.

E.g., if you have a system that’s 99% reads – for data analytics, VDI, web server backends, etc. – then a SLOG will add cost yet provide no benefit. SLOG only applies to a portion of writes, and not (ever) to reads. For read-intensive systems, you need ARC and L2ARC instead of SLOG.
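
For comparison, an L2ARC device is attached at the pool level much like a SLOG, but as a “cache” vdev; a sketch with a placeholder NVMe path (ARC itself is just main memory, so there is nothing to attach for it beyond buying more RAM):

    # Add a read cache (L2ARC) device to the pool:
    zpool add tank cache /dev/disk/by-id/nvme-l2arc-a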

Something similar applies for most archive systems: If you write large sequential async streams to the ZFS filesystem, SLOG will show 0% utilization. ZFS is smart enough to not use SLOG to turn large sequential async streams into… large sequential async streams. It’s more efficient to use ZFS’s normal async write process. So archives don’t need SLOG, or L2ARC. They just need big disks.
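
You can verify this on a live system by watching per-vdev statistics; under a large sequential async workload, the “logs” section of the output should sit at or near zero. A sketch with a placeholder pool name:

    # Per-vdev IO statistics, refreshed every 5 seconds:
    zpool iostat -v archive_pool 5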

Any system that executes small writes, but does so asynchronously, will also not benefit from SLOG. Those writes will be coalesced into large sequential writes out of ARC. For these situations, you just need more RAM.
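
Before concluding that “more RAM” is the fix, it is worth checking how much memory ARC is actually using and how much it is allowed to use. The paths below assume Linux/OpenZFS and may differ on other platforms:

    # Current ARC size and target maximum:
    grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

    # The module-level cap (0 means the platform default, roughly half of RAM on Linux):
    cat /sys/module/zfs/parameters/zfs_arc_max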

If an application sends small sync writes, but does so rarely… again, SLOG won’t help.

Yes, the occasional small IO will go through more efficiently. But a larger performance benefit would have been obtained by using the same budget to buy more primary disks, or more RAM for ARC. In order to do anything useful, SLOG has to be placed on enterprise-grade SSDs, and to be reliable, it requires 2x or more of them for a given pool. So you can get a lot of additional RAM or HDDs if you don’t buy a SLOG solution.

SLOG may also not work with some upper layer filesystems. E.g., Lustre over ZFS may not trigger SLOG IO even if clients send small sync IOs to the Lustre layer. It isn’t that Lustre “breaks” SLOG. It just doesn’t need it. Lustre itself has a provision for accelerating IO in this case, which is actually even faster than SLOG.

So right off the bat, you should not use SLOG for anything except applications which send massive numbers of small sync writes, and don’t use Lustre. Else, SLOG will do nothing at all, or will do so little that the budget would have been better spent some other way.

The first “tuning” tip for SLOG is therefore “don’t use SLOG most of the time”.

But if you do have an application which sends large numbers of small sync writes, and don’t use Lustre over ZFS, then the SLOG should be architected as follows:

  1. Use low-capacity, low-latency, high-write-endurance SSDs. SLOG experiences 100% writes under almost all conditions. ZFS only reads the SLOG in the event of a system crash.
  2. They must have “power loss data protection” or a similar feature. You cannot use SSDs which accelerate writes by using unprotected RAM.
  3. If you have HA controllers, all SLOG SSDs must be visible to all controllers in the protection group.
  4. Use 2x SLOG SSDs per zpool, and mirror them. If an unmirrored SLOG fails across a crash or unclean reboot, you can get data corruption – and the whole point of SLOG was to prevent corruption. If you were OK with potential corruption, you should have set “sync=disabled” and saved the money.
  5. Each SSD should be sized as follows: the approximate max write performance of the system in GB per second, times 10 (i.e., roughly ten seconds’ worth of inbound writes). So if a server has 2x 40GbE interfaces, you would say ~4GB/s * 2 * 10 = 80GB. A pair of 80GB SSDs, mirrored, would be indicated.
  6. Only attach the SLOG mirrors to the particular pools which need them. E.g., you might have a filesystem called “archive_pool/nas” and another called “database_pool/mysql”. The archive pool would not need a SLOG attached to it, whereas the database pool might. (See the example commands after this list.)
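
Putting steps 4 through 6 together for the example pools above, the commands might look like the following sketch. The device paths are placeholders, and the ~80GB size comes from the worked example in step 5:

    # database_pool gets a mirrored pair of small, power-loss-protected,
    # high-endurance SSDs as its log vdev:
    zpool add database_pool log mirror \
        /dev/disk/by-id/slog-ssd-a /dev/disk/by-id/slog-ssd-b

    # archive_pool gets no log vdev at all.

    # Confirm the mirrored "logs" section:
    zpool status database_pool

If a SLOG later turns out to be unnecessary, the log vdev can be detached again with “zpool remove”, so this is not a one-way decision.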