19 June 2017

Hosting a large web-forum on ZFS (a case study)

Hosting a large web-forum on ZFS (a case study)

Over the course of last weekend I worked with a friend on deploying zfs across their infrastructure.

Their infrastructure in this case is a popular website written in php and administering to some 20,000+ users. They, like many gridpp sysadmins use CentOS for their back-end infrastructure. However due to being a regularly high profile target for attacks they have opted to run their systems using the latest kernel installed from the elrepo.
The infrastructure for this website is heavily docker orientated due to the (re)deployment advantages that this offers.
Due to problems with the complex workflow selinux has been set to permissive.

Data for the site was stored within a /data directory which stored both the main database for the site and files which are hosted by the site.
Prior to the use of zfs the storage used for this site was xfs.

The hardware used to run this site is a dedicated 8 intel cores, 32Gb RAM, 2 * 2Tb disks managed by soft-raid(mirror) and partitioned using lvm.

Installing zfs

Initially setting up ZFS couldn't have been easier. Install the correct rpm repo, update, install zfs and reboot:

yum update
yum install 
yum update
yum install zfs

Fixing zfs-dkms

As they are using the latest stable kernel they opted to install zfs using dkms which has pros/cons to the kmod install.

This unfortunately didn't work as it should have done (possibly due to a pending kernel update on reboot). After rebooting the following commands were needed to install the zfs driver:

dkms build spl/
dkms build zfs/
dkms install spl/
dkms install zfs/

This step triggered the rebuild and installation of the spl (solaris porting layer) and the zfs modules.
(Adding this to the initrd shouldn't be required but can probably be done as per usual once this has been build)

Migrating data to ZFS

The initial step was to migrate the storage backend and main database for the site. This storage is approximately 0.5Tb of data which was constructed of numerous files with an average file size close to 1Mb. The SQL database is approximately 50Gb in size containing most of the site data.

mv /data/webroot /data/webroot-bak
mv /data/sqlroot /data/sqlroot-bak
zfs create webrootzfs vgs/webrootzfs
zfs create sqlrootzfs vgs/sqlrootzfs
zfs set mountpoint=/data/webroot webrootzfs
zfs set mountpoint=/data/sqlroot sqlrootzfs
zfs set compression=lz4 webrootzfs
zfs set compression=lz4 sqlrootzfs
zfs set primarycache=metadata sqlrootzfs
zfs set secondarycache=none webrootzfs
zfs set secondarycache=none sqlrootzfs
zfs set recordsize=16k sqlrootzfs # Matches the db block size
rsync -avP /data/webroot-bak/* /data/webroot/
rsync -avP /data/sqlroot-bak/* /data/sqlroot/

After migrating these the site was then brought back up for approximately 24hr and there were no performance problems observed.

The webroot data which contained mainly user submitted files reached a compression level of about 1.1.
The sql database reached a compression level of about 2.4.

Given the increased performance of the site due to this migration it was decided 24hr later to investigate migrating the main website itself rather than just the backend.

Setting up systemd

The following systemd services and targets were enabled but rebooting the system has not (yet) been tested.

systemctl enable zfs.target
systemctl enable zfs-mount
systemctl start zfs-mount
systemctl enable zfs-import-cache
systemctl start zfs-import-cache
systemctl enable zfs-share 
systemctl start zfs-share

Impact of using ZFS

A nice solution for this was found to already exist quite well. This is the zfs storage driver for docker.


After this was setup the site was brought back online and the performance was notable.

Page load time for the site dropped from about 600ms to 300ms. That is a 50% drop in page load time entirely due to replacing the backend storage with zfs.
This was with the ARC cache running with a 95% hit rate.

Problems Encountered

Unfortunately about 30min of running after of migrating the docker service to use ZFS the site fell over.
(page load times increased to multiple seconds and the backend server load spiked.)

Upon initial inspection it was discovered that the zfs arc cache had dropped to 32M (almost absolute minimum) and the arc-reclaim process was consuming 100% of 1 CPU.

The ZFS arc cache maximum was increased to 10Gb but the cache refused to increase.

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max

Increasing the minimum forced the arc cache to increase however the arc-reclaim process still was consuming 1 CPU core.

Fixing the Problems

A better workaround was found to be to disable the transparent_hugepage using:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

This stopped the arc-reclaim process from consuming 100% CPU as well as triggering the arc cache to start regrowing.
(For the interested this has been reported: https://github.com/zfsonlinux/zfs/issues/4869)

Summary of Tweaks made

A summary of some of the optimizations applied to these pools are:

# ZFS settings
zfs set compression=lz4 webrootzfs # Enable best compression
zfs set compression=lz4 sqlrootzfs # Enable best compression 

zfs set primarycache=all # This is default
zfs set primarycache=metadata sqlrootzfs # Don't store DB in cache
zfs set secondarycache=none webrootzfs # Not using l2arc
zfs set secondarycache=none sqlrootzfs # Not using l2arc
zfs set recordsize=16k sqlrootzfs # Matches the db block size

# Settings changed through /sys
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max # 10Gb max
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min # 4Gb min

# repeat the following for /sys/block/sda and /sys/block/sdb
echo 4096 > /sys/block/sda/queue/nr_requests
echo 0 > /sys/block/sda/queue/iosched/front_merges
echo noop > /sys/block/sda/queue/scheduler
echo 150 > /sys/block/sda/queue/iosched/read_expire
echo 1500 > /sys/block/sda/queue/iosched/write_expire
echo 4096 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 1 > /sys/block/sda/queue/iosched/fifo_batch
echo 16384 > /sys/block/sda/queue/ma

Additionally for the docker-zfs pool:

zfs set primarycache=all zpool-docker
zfs set secondarycache=none zpool-docker
zfs set compression=lz4 zpool-docker

All docker containers built using this engine inherit these properties from the base pool zpool-docker however, a remove/rebuild will be needed to take advantage of settings such as compression.


Marcus Ebert said...

Nicely done and the combination with docker is a very interesting use case.
Have you looked into the impact of "relatime=on" and "xattr=sa" ?

Rob said...

Hi Marcus,

Short answer is no.
I've not looked into this too much but the main reason is that zfs has been such a good solution (aside from the arc cache not growing).
I've not spent too much time formally optimising this once it went into production I just took some example configurables based on other user's experiences online.

With 'relatime=no' there was some fear that relatime is needed by certain lockfiles used within the php code on the website. I may suggest turning this off but some level of caution is needed I think.

As for 'xattr=sa' I suspect this could help disk performance, however we were surprised to discover 'just how good' ZFS is so we haven't explored this,

Migrating the main db and backend storage to ZFS reduced the site startup time from 5-10 minutes to about 1. In addition a huge amount of the disk access that used to be going on has disappeared now zfs was used.
The arc cache in the use case here was found to be extremely efficient!

I'm planning to have a look at tuning some of the configuration parameters within the spl and zfs for this use case so if we discover any effective performance improvements from playing with any settings or setting 'xattr=sa' I will likely make a 2nd blog post here.