Getting Started: Lustre over ZFS, in an hour or less

Tutorial: By WARP Mechanics’ CTO.

Summary

This post will show you how to get a basic Lustre + ZFS on Linux system working in a VM.

It is not intended to be a guide for installing a production system. If you are thinking about a deployment at that scale, I strongly recommend that you contact us for more comprehensive help.

It will provide a fast way to get a working test environment, to get you familiar with the basics.

ZFS on Linux (also known as “ZOL”; http://zfsonlinux.org) in general, and ZFS+Lustre integration in particular, are primarily developed by Lawrence Livermore National Laboratory (www.llnl.gov). They also provided assistance in creating the procedures in this guide. A portion of the text in this guide was adapted and updated from the ZOL website.

Lustre itself is community software, with a wide base of commercial and government contributors. If you find core defects in Lustre (and are certain they really are bugs!) then the best way to report them is through the JIRA system that is presently hosted by Intel. (https://jira.hpdd.intel.com)

However, this guide is maintained by WARP Mechanics. WARP is a premier provider of turn-key and/or customized HPC solutions, including appliances that use Lustre/ZFS as a basis. If you find defects in the procedures below, or need help designing a full-scale HPC system, feel free to reach out to lustre@WARPmech.com.

Procedure overview

Installing a Lustre/ZFS test environment consists of the following major steps:

  1. Install the correct operating system and optional packages
  2. Set up the OS, Lustre, and ZFS configuration files
  3. Make a filesystem using loopback devices
  4. Start the services

Installing CentOS 6.4 and Lustre/ZFS packages

To get the correct base OS on a VM:

Create CentOS 6.4 VM

Download and install CentOS 6.4 into a VM. Do not use 6.5 or later. At the time of this writing, the 6.5 kernel does not work properly with the version of Lustre used in this procedure, although that is likely to change soon.

This is not a tutorial on installing Linux. The specific requirements for this procedure are:

  1. Ensure that the VM has at least 32GBytes of HDD space.
  2. For the specific commands in this example, “warpdemo.warpmech.com” is used as the hostname. If you change it, the copy/paste example commands will need to be changed, as the host name is used by the Lustre scripts.
  3. Use “minimal server” as a configuration baseline when installing.
  4. Select “customize now” and add the “development tools” group. Not all of these tools are strictly necessary, but they tend to help with anything you are likely to do post-install.

Depending on what other options you pick when installing the OS, you may need to perform additional steps after completing the process. For instance, you may need to edit /etc/sysconfig/network-scripts/ifcfg-eth0, change the ONBOOT line to “yes”, and change the NM_CONTROLLED (“network manager”) line to “no”. Then execute “service network restart”. This is not a Lustre-specific step; it is a function of the “minimal server” installation method presently in CentOS.
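For reference, the relevant lines of an ifcfg-eth0 along those lines might look like the following sketch (it assumes DHCP; the file your installer generates will contain additional entries such as an HWADDR line):

DEVICE=eth0
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=dhcp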

Install the ZOL repository

The easiest way to get Lustre/ZFS working on RHEL-style systems is to install it from the ZOL repository. Simply install the repo RPM, then use yum to install Lustre as with any other package.

yum localinstall --nogpgcheck http://archive.zfsonlinux.org/epel/zfs-release$(rpm -E %dist).noarch.rpm

yum install lustre

Set up the OS and Lustre configuration files

Configuring Lustre for a large-scale production environment can be arbitrarily complex. However, for this example, it only requires a few steps.

Disable SELinux and iptables

Lustre and ZFS can be integrated with a robust site security policy. Indeed, disabling firewalls and security features is generally not a good plan in a full scale rollout. But for a VM test environment, you should defer security configuration until after the basic system is working. This will make troubleshooting much easier.

Accordingly:

Edit /etc/selinux/config and change the SELINUX line to SELINUX=disabled.
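If you prefer a one-liner to editing the file by hand, something like this should work (a sketch that assumes the stock /etc/selinux/config layout):

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config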

Set the init policy to not start iptables, and turn those services off now:

chkconfig iptables off ; chkconfig ip6tables off

service iptables stop ; service ip6tables stop

Create Lustre Configuration Files

Tell Lustre which network interface to use. This will almost certainly be eth0. This is done by creating a Lustre-specific modprobe file. You can vi the file, or simply echo the correct line into it:

echo "options lnet networks=tcp0(eth0)" >> /etc/modprobe.d/lustre.conf
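If you want to sanity-check the LNet configuration at this point (optional; it assumes the lustre packages put the lctl utility on your path), you can load the module, bring the network up, and list the NIDs:

modprobe lnet

lctl network up

lctl list_nids

The last command should print something like 10.0.0.100@tcp, using your VM’s own IP address.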

Next, tell the Lustre/ZFS startup scripts (contributed by LLNL) which Lustre services you want to start. Don’t worry that you haven’t created these filesystems yet. They will be created shortly.

If you prefer the vi method to echoing, then:

vi /etc/ldev.conf

Either way, make sure ldev.conf contains the following lines:

warpdemo - mgs     zfs:warp-mgt0/mgt0
warpdemo - mdt     zfs:warp-mdt0/mdt0
warpdemo - ost0    zfs:warp-ost0/ost0
warpdemo - ost1    zfs:warp-ost1/ost1

Note: if you set up a fully qualified domain name as the host name, you may have to use that in ldev.conf. That is, if you set up the system as “warpdemo.warpmech.com” then you might need your ldev.conf to look like this:

warpdemo.warpmech.com - mgs     zfs:warp-mgt0/mgt0
warpdemo.warpmech.com - mdt     zfs:warp-mdt0/mdt0
warpdemo.warpmech.com - ost0    zfs:warp-ost0/ost0
warpdemo.warpmech.com - ost1    zfs:warp-ost1/ost1
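If you would rather echo than vi, a heredoc will append the same lines (this sketch uses the short hostname; substitute the fully qualified variant if that is what your system needs):

cat >> /etc/ldev.conf <<EOF
warpdemo - mgs     zfs:warp-mgt0/mgt0
warpdemo - mdt     zfs:warp-mdt0/mdt0
warpdemo - ost0    zfs:warp-ost0/ost0
warpdemo - ost1    zfs:warp-ost1/ost1
EOF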

At this point, reboot your system to remove the existing security policies and allow the Lustre configuration to take effect.

Make the Lustre filesystem

If this were a physical system, we would mkfs Lustre filesystems at this point, using real disks. But since this is a VM, we will create some empty loopback files.

dd if=/dev/zero of=/var/tmp/warp-mgt-disk0 bs=1M count=1 seek=256
dd if=/dev/zero of=/var/tmp/warp-mdt-disk0 bs=1M count=1 seek=256
dd if=/dev/zero of=/var/tmp/warp-ost-disk0 bs=1M count=1 seek=4095
dd if=/dev/zero of=/var/tmp/warp-ost-disk1 bs=1M count=1 seek=4095

This creates two sparse files of roughly 256MB and two of roughly 4GB, which will become the Lustre “disks”.
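If you want to confirm the files really are sparse, ls can show the allocated size alongside the apparent size (optional):

ls -lsh /var/tmp/warp-*disk*

The first column (allocated space) should be tiny compared to the reported file sizes.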

With the traditional ldiskfs-based tools, it would be necessary to create a RAID and then create a filesystem on top of the resulting LUN. But to create Lustre filesystems with ZFS integration, we make each RAID and filesystem object using a single mkfs command.

NOTE: Substitute the IP address of your VM’s eth0 interface for “10.0.0.100” below.

MyIP=10.0.0.100

mkfs.lustre --mgs --backfstype=zfs warp-mgt0/mgt0 \
     /var/tmp/warp-mgt-disk0

mkfs.lustre --mdt --backfstype=zfs --index=0 \
     --mgsnode=${MyIP}@tcp --fsname warpfs \
     warp-mdt0/mdt0 /var/tmp/warp-mdt-disk0

mkfs.lustre --ost --backfstype=zfs --index=0 \
     --mgsnode=${MyIP}@tcp --fsname warpfs \
     warp-ost0/ost0 /var/tmp/warp-ost-disk0

mkfs.lustre --ost --backfstype=zfs --index=1 \
     --mgsnode=${MyIP}@tcp --fsname warpfs \
     warp-ost1/ost1 /var/tmp/warp-ost-disk1

At this point, “zpool list” should show the four pools like this:

NAME        SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
warp-mdt0   252M   192K   252M   0%  1.00x  ONLINE  -
warp-mgt0   252M   198K   252M   0%  1.00x  ONLINE  -
warp-ost0  3.97G   192K  3.97G   0%  1.00x  ONLINE  -
warp-ost1  3.97G   201K  3.97G   0%  1.00x  ONLINE  -

Note: For now, we are using the simple case of one “disk” per Lustre store. If you want to experiment with other configurations later (mirrors/RAIDs), you can create more sparse files and re-mkfs the filesystem. You would insert the type of ZFS RAID being created (“raidz”, “raidz2”, or “mirror”) right after the pool/dataset name, and increase the number of “disks” being supplied accordingly (“warp-ost-disk1a”, “warp-ost-disk1b”, and so on), as sketched below.
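For example, re-making ost1 as a two-way mirror might look like this (a sketch only; the warp-ost-disk1a and warp-ost-disk1b files are hypothetical, and the existing warp-ost1 pool would need to be destroyed with “zpool destroy warp-ost1” before re-running mkfs):

dd if=/dev/zero of=/var/tmp/warp-ost-disk1a bs=1M count=1 seek=4095
dd if=/dev/zero of=/var/tmp/warp-ost-disk1b bs=1M count=1 seek=4095

mkfs.lustre --ost --backfstype=zfs --index=1 \
     --mgsnode=${MyIP}@tcp --fsname warpfs \
     warp-ost1/ost1 mirror \
     /var/tmp/warp-ost-disk1a /var/tmp/warp-ost-disk1b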

Now you will need a place to mount the backing filesystems and the Lustre filesystem itself:

mkdir /mnt/lustre

mkdir /warpfs

Start Lustre and mount the filesystem

Starting Lustre is as simple as starting any other service:

service lustre start

This should produce four lines of output, each saying that it is mounting one of the Lustre services.

Mounting warp-mgt0/mgt0 on /mnt/lustre/local/mgs
Mounting warp-mdt0/mdt0 on /mnt/lustre/local/mdt
Mounting warp-ost0/ost0 on /mnt/lustre/local/ost0
Mounting warp-ost1/ost1 on /mnt/lustre/local/ost1

If it doesn’t produce that output, you probably have a hostname typo (e.g., used “warpdemo” instead of “warpdemo.warpmech.com”) or perhaps skipped a step in the procedure.

If the command completed successfully and produced the expected output, then the backing filesystems should have been mounted in subdirectories of /mnt/lustre. So “mount | grep lustre” would produce something like this:

warp-mgt0/mgt0 on /mnt/lustre/local/mgs type lustre (rw)
warp-mdt0/mdt0 on /mnt/lustre/local/mdt type lustre (rw)
warp-ost0/ost0 on /mnt/lustre/local/ost0 type lustre (rw)
warp-ost1/ost1 on /mnt/lustre/local/ost1 type lustre (rw)
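You can also ask Lustre itself for its local device list (optional; lctl comes with the lustre packages):

lctl dl

The output should include the MGS, the MDT, and both OSTs in the UP state, along with some supporting devices.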

Note: With some hypervisors, we have seen “service lustre start” take a long time to complete the first time it is executed, and fail to mount the filesystems. But executing “service lustre start” a second time succeeds.

Assuming it worked, you can mount the filesystem itself:

mount -t lustre ${MyIP}@tcp:/warpfs /warpfs

Now, using “mount | grep warpfs” will show the filesystem like this:

10.0.0.100@tcp:/warpfs on /warpfs type lustre (rw)
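To confirm the client can see the MDT and both OSTs, lfs can report per-target usage (optional):

lfs df -h

This should list the MDT and both OSTs along with their free space.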

To set Lustre to start automatically on boot:

chkconfig lustre on

Because this process doesn’t use real devices, the above might not always work. The “fake” devices are located in a non-standard location for devices, so ZFS may not find them when starting services. If you find that Lustre is not starting, try:

zpool import -d /var/tmp -a; service lustre start

Tuning Lustre and ZFS

Tuning Lustre and ZFS (or any other scalable storage) for HPC workloads is complex, and beyond the scope of this document. WARP Mechanics will be happy to discuss performance tuning: just send a note to lustre@WARPmech.com and we will start a dialog.

But for this specific VM exercise, it is advised to set the following parameters:

zfs set compression=lz4 warp-mdt0 warp-mgt0 warp-ost0 warp-ost1

zfs set sync=disabled warp-mdt0 warp-mgt0 warp-ost0 warp-ost1

This may not be advisable in a production system. But it will provide better performance and space utilization if you want to try writing some test data to the VM’s filesystem.
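To verify that the properties took effect (optional):

zfs get compression,sync warp-mdt0 warp-mgt0 warp-ost0 warp-ost1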

Configuring a second VM as a client

At this point, the VM is a client of itself. So in a sense you already have a client.

However, you may want to demonstrate a separate “client VM” talking to the “server VM”.

The easiest way to do this is to clone the server VM, disable the Lustre server configuration, and turn it into a client.

  1. Shut down the “server VM” using “umount /warpfs; poweroff” and use your VM manager to clone it
  2. Start the newly cloned “client VM” and change its hostname to “warpclient.warpmech.com” or similar
  3. Use “chkconfig lustre off; service lustre stop” to disable the Lustre server
  4. Depending on how your network is set up, you may need changes to get the Ethernet port working, such as:
    1. Edit /etc/sysconfig/network-scripts/ifcfg-eth0 to update the MAC address
    2. Edit /etc/udev/rules.d/70*net* to remove the old eth0 definition, and change the new one to map to eth0
    3. Reboot
  5. Use “for i in warp-mdt0 warp-mgt0 warp-ost0 warp-ost1; do zpool destroy -f $i; done” to get rid of the backing filesystems on the clone
  6. Re-start the “server VM” and start Lustre, then “mount -t lustre ${MyIP}@tcp:/warpfs /warpfs”
  7. Execute the same mount command on the “client VM”, but using the server’s IP address (see the sketch below)
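Putting the client-side mount together, on the “client VM” it looks something like this (a sketch; 10.0.0.100 stands in for the server VM’s IP address):

mkdir -p /warpfs

mount -t lustre 10.0.0.100@tcp:/warpfs /warpfs

mount | grep warpfs

Since the client was cloned from the server, /warpfs will usually already exist; the mkdir is there only for safety.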