26 January 2017

File system tests

Since there is interest in filesystem tests, I have put the script I used for the ZFS/Ext4/XFS tests on a web server. If you test file systems for Grid storage purposes, feel free to give it a try.
It can of course also be used by anyone else, but note that this test does not do any random reads/writes.

In general, it would be good to run this test (or any other test) under 3 different scenarios when using raid systems:

  1. normal working raid system
  2. degraded raid system
  3. rebuild of raid system


The script needs 2 areas:

  1. one that contains the files that are read during the read tests, and
  2. one that the test writes files to. This write area should have no compression enabled since the writes come from /dev/zero.

By default it does reads over all specified files, writes to large files, and writes to small files.
For the reads, it first does a sequential read of all files and then, in a second pass, reads the same set of files in parallel.
For the writes, it first does sequential writes and then, in a second pass, writes in parallel to different files. The same applies to the writing of large files and of small files.

After each read/write pass the cache is flushed, and each single write issues a file sync after the file is written, to make sure that the measured time reflects the file really being written to disk.


The script needs 3 parameters:

  1. the location of a text file that contains the file names, including absolute paths, of all files that you want to include in the read tests
  2. a name used as a description for your test; it can be used to distinguish between different tests (e.g. ZFS-raidz2-12disks or ZFS-raidz3-12disks)
  3. the absolute path to an area where the write test can write its files; this area should have no compression enabled

The parameters inside the script, like the number of parallel reads/writes and the file sizes, can easily be configured. By default, about 5TB of space is needed for the write tests.
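As an illustration, a run might look like the following; the script name, paths, and test label here are assumptions, not the actual ones used.

# build the list of files for the read tests (hypothetical paths)
find /pool/readarea -type f > /tmp/read-files.txt
# run the test: file list, test description, write area without compression
./fs-test.sh /tmp/read-files.txt ZFS-raidz2-12disks /pool/writearea-nocompression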

The script itself can be downloaded here

ZFS auto mount for CentOS7

When testing ZFS installs on servers running CentOS 7.3, it can happen that ZFS is not available after a restart. After some testing, this seems to be related to systemd and probably affects other systemd-based Linux distributions too.

What I used were ZFS installs using different versions of ZFS on Linux. After looking into the system setup, I noticed that by default the ZFS services are simply not enabled. Doing the following solved the problem on the machines I tested:

systemctl enable zfs.target
systemctl start zfs.target
systemctl enable zfs-import-cache.service
systemctl enable zfs-mount.service
systemctl enable zfs-share.service 

This solved all auto mount issues for me on the CentOS systems.
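A quick way to check that everything came back after a reboot is sketched below; the pool name "tank" is an assumption.

systemctl is-enabled zfs.target zfs-import-cache.service zfs-mount.service zfs-share.service
zpool status tank
zfs mount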

Note: At least when using the latest version, 0.6.5.8, one can also use the following command, as explained on the ZFS on Linux web page:

systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target


Everyone who is upgrading to the latest version should also have a look at the ZFS on Linux web page, since the repository address has changed. While the update should happen automatically, if you haven't run any updates for some months it cannot pick up the new repository automatically.

29 December 2016

2016 - nice and boring?

We like GridPP services to be "nice and boring": running smoothly, with uneventful upgrades.

One cannot accuse the year 2016 of having been N&B, with lots of interesting and extraordinary events (also) in science. However, computing and physics also had their share of the seemingly extraordinary number of celebrities we will miss, such as Tomlinson and Minsky in computing; and arguably both Kibble and Rubin should have won Nobel prizes in physics (or, to be precise, shared the prizes that were awarded).

In 2016 we continued to deliver services for LHC despite changing requirements, and also to support the much smaller non-LHC communities. With GridPP as a data e-infrastructure (and "data transfer zone"), we have also revisited connecting GridPP to other data infrastructures and will continue this work in 2017.

GridFTP continues to be the popular and efficient workhorse of data transfers; xroot is also popular but mainly in high energy physics. Tier 2 sites are set to become SRM-less. Accounting will need more work in 2017; hopefully we can do this with the GLUE group and the EGI accounting experts. GridPP also looks forward to contributing to the EPSRC-funded Pathfinder pilot project which should eventually enable connecting DiRAC, eMedLab, and GridPP. So, perhaps, not N&B either.

21 December 2016

Comparative Datamanagementology

GridPP was well represented at the cloud workshop at Crick. (The slides and official writeup have still not appeared as of this blog post.)

The general theme was hybrids, so it is natural to wonder whether it is useful to move data between infrastructures and what is the best way to do it. In the past we have connected, for example, NGS and GridPP (through SRB, using GridFTP), but there was not a strong need for it in the user community. However, with today's "UKT0" activities and more multidisciplinary e-infrastructures, perhaps the need for moving data across infrastructures will grow stronger.

xroot may be a bit too specialised, as it is almost exclusively used by HEP, but GridFTP is widely used by users of Globus and is the workhorse behind WAN transfers. (As an aside, we heard at the AARC meeting that Globus is pondering a move away from certificates towards a more OIDC-style approach, which would be new, as GridFTP has always required client certificate authentication.)

The big question is whether moving data between infrastructures is useful at all: will users make use of it? It is tempting to just upload the data to some remote storage service and share links to it with collaborators. Providing "approved" infrastructure for data sharing helps users avoid the pitfalls of inappropriate data management, but they still need tools to move the data efficiently and to manage permissions. For example, EUDAT's B2STAGE was specifically designed to move data into and out of EUDAT (as EUDAT does not offer compute).

So far we have focused on whether it is possible at all to move data between the infrastructures, the idea being to offer users the ability to do so. The next step is efficiency and performance, as we saw with DiRAC, where we had to tar the files up in order to make the transfer of small files more efficient and to preserve ownerships, permissions, and timestamps.
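A rough sketch of that "tar before transfer" step is shown below; the paths, host name, and stream count are made up for illustration, and the actual DiRAC workflow may differ.

# pack the small files into one archive (ownership, permissions and timestamps are stored in the archive)
tar --create --numeric-owner -f /data/dataset.tar /dirac/project/run042
# move the single large tarball with parallel GridFTP streams
globus-url-copy -p 8 file:///data/dataset.tar gsiftp://datanode.example.ac.uk/store/dataset.tar
# on the receiving side, unpack and restore permissions and owners
tar --extract --preserve-permissions --numeric-owner -f dataset.tar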

16 December 2016

XRDCP and Checksums with DPM

To check whether a file transfer was successful, xrdcp is able to calculate a checksum at the destination. However, while this works well for plain xrootd installations, it does not currently work when xrootd is used together with DPM. The reason seems to be that xrootd is only used as a disk server client and does not go through the redirector component, which would do the translation between logical and physical file names; this translation is done within DPM.
(If the reason why it is not working at the moment is different, please add a comment.)

If a checksum is needed to verify a successful copy of the data, one way is to copy the file from the origin to the disk server first, then transfer it back to the origin server and calculate the checksum on what was transferred back. That always works, but it is not very efficient since it can involve a lot of additional network traffic, especially at sites with a small number of storage servers but a large amount of compute resources, or when transferring files to distant sites. Some experiments implement this method as a fallback when xrdcp fails to return a checksum.

While the DPM developers work on a built-in solution for DPM, there is another method, usable in the meantime, that calculates the checksum without any additional network traffic.
Xrootd provides a very flexible configuration interface. What we can use here is the possibility to specify an external program to calculate the checksum. This can be any executable, in particular a shell script.
To do so, one needs to add the following option to the xrootd config file on the disk servers:

xrootd.chksum adler32 /PATH/TO/SCRIPT.sh


where "adler32" specifies the used checksum algorithm and "/PATH/TO/SCRIPT.sh" specifies which script is used to calculate the checksum and where it is.
(make sure the script is executable)
Xrootd will also automatically pass the logical file name as a parameter to the script

In the script it is then possible to do the logical to physical file name lookup and calculate the checksum. To be able to do so, the DPM tools need to be installed at least on the DPM head node; they can be found in dpm-contrib-admintools when using the EPEL repository. The clients also need a way to contact the DPM head node to do the lookup.
An example script that can be adapted to your own configuration can be found here.
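For orientation, a minimal sketch of such a script is given below. The name of the lookup helper is an assumption; use whatever your DPM installation provides for the logical to physical name translation, and note that xrootd expects only the checksum on stdout.

#!/bin/bash
# Sketch of an external checksum script for xrootd.chksum on a DPM disk server.
LFN="$1"                          # xrootd passes the logical file name as the first argument
PFN=$(dpm-lfn-to-pfn "$LFN")      # hypothetical helper that queries the DPM head node
# xrdadler32 (from the xrootd clients) prints "<checksum> <file>"; keep only the checksum
xrdadler32 "$PFN" | awk '{print $1}'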


28 September 2016

Co-evolving data nodes

Bing! A mail comes in from our friends in the States saying: look! Here's someone in New Zealand who has set up an iRODS node to GridFTP data to/from their site. It is a very detailed document, yet it looks a lot like the DiRAC/GridPP data node document. They have solved many of the same problems we have solved, independently.

The basic idea is to have a node outside your institute/organisation which can be used to transfer data to/from your datastore/cluster. With a GridFTP endpoint, you could move data with FTS (as we do with DiRAC), people can use Globus (used by STFC's facilities, for example), or data can be moved to/from other e-infrastructures (such as EUDAT's B2STAGE) or EGI. Regardless of the underlying storage, there will be common topics like security, monitoring, performance, how to (or not to) firewall it, how to make it discoverable, etc. It could be the data node in a Science DMZ.
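As a rough sketch of the FTS route, a transfer between two GridFTP endpoints could be submitted as below; the FTS endpoint, host names, and paths are made up for illustration, and a valid grid proxy plus the fts-rest CLI tools are assumed.

fts-rest-transfer-submit -s https://fts3.example.ac.uk:8446 \
    gsiftp://datanode.institute.ac.uk/store/run042/data.tar \
    gsiftp://se01.gridpp-site.ac.uk/dpm/site.ac.uk/home/vo/run042/data.tar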

The suggestion is that we (= GridPP, DiRAC, and in fact anyone else who is willing and able) contribute to a detailed writeup which can be published as an OGF document (open access publishing for free, and because GridFTP is an OGF protocol), either as community practice or as experiences, and then produce a less detailed paper which could be submitted to a conference or published in a journal.

22 September 2016

Upgrading and Expanding Lustre Storage (part4)

With a 1.5 PB Lustre file system set up, we now need to transfer our data from the old Lustre system, conveniently also 1.5 PB in size, before we can put it into production.

Migration of Data:

It was found that it was not possible to mount both Lustre 1.8 and 2.8 on the same client; therefore migration of the data had to be done via rsync between two clients mounting the different Lustre file systems. Setting up an rsync daemon on the clients was found to be an order of magnitude quicker than using rsync over ssh for transferring data between the two clients. Hard links, ACLs and extended attributes are preserved by using the "-HAX" option when transferring data. Up to a dozen clients were utilised over the course of about six weeks to transfer the 1.5 PB of data between the old and new Lustre file systems. After the initial transfer the old and new systems were kept in sync with repeated rsync runs, remembering to use the "--delete" option to remove files that no longer existed on the live Lustre system. MD5 checksums were compared for a small random selection of files. The final transfer from the old to the new Lustre took about a day, during which the file system was unavailable to external users. Then all clients were updated to the new Lustre version.
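A minimal sketch of the daemon approach is shown below; the module name, paths, and host name are assumptions, and the exact options used were those described above.

# on a client mounting the old Lustre: export it via a read-only rsync module
cat > /etc/rsyncd.conf <<'EOF'
[lustre_old]
    path = /mnt/lustre_old
    read only = true
    uid = root
    gid = root
EOF
rsync --daemon

# on a client mounting the new Lustre: pull the data, preserving hard links,
# ACLs and extended attributes, and deleting files removed on the source
rsync -aHAX --delete rsync://old-client01/lustre_old/ /mnt/lustre_new/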

Real World Experience:

With the new Lustre system put into production, we then recommissioned the old system to create a 3 PB Lustre file system. The grid cluster has about 4000 job slots in over 200 Lustre client compute nodes. The actual cluster is shown below. Note that the compute nodes fill the bottom 12U of every rack, where the air is cooler, with the storage above them in the next 24U.
Real world performance over half a year, March to September 2016, is shown below. When all job slots are running grid "analysis" workloads, which require access to data stored in Lustre, no slowdown in job efficiency was observed. An average of 4.8 Gb/s is seen for reading data from Lustre and 1.6 Gb/s for writing to Lustre (which is always done through StoRM).
However, in one case a local user simultaneously ran more than 1500 jobs, each accessing a very large number of small files (in this case bioinformatics data) on Lustre, and a slowdown in performance was observed. Once the user was limited to no more than 500 jobs, no further issues were seen. Accessing small files on the Lustre filesystem is expected to be inefficient [1] and should be avoided or limited where possible. A future enhancement to Lustre is planned that will enable small files to be stored on the MDS, which should improve small file performance [1].

The Queen Mary grid site's major workload is for the ATLAS experiment, which keeps detailed statistics of site usage. We are responsible for processing about 2.5% of all ATLAS data internationally and about 20% of the data processed in the UK. Remote data transfer statistics are shown below. Over the last 6 months ATLAS has transferred 2.39 PB of data into the cluster; the weekly totals are shown in the top left plot, with a maximum for one week of 340 TB (an average of 4 Gb/s). The bottom plot shows that 2.3 PB has been sent from Queen Mary to other grid sites around the world.

Future Plans:
  • Double the storage of the cluster to 6 PB in 2018.
  • Consider an upgrade to Lustre 2.9, which will have bug LU-1482 fixed and also provides additional functionality such as user and group ID mapping, which would allow the storage to be used in different clusters. However, Lustre 2.9 is SL/CentOS 7 only.
  • Upgrade the OSS servers from SL6 to SL/CentOS 7.
  • Examine the use of ZFS in place of hardware raid, which might help mitigate the very long raid rebuild times after replacement of a failed hard drive.
Conclusions:

Over the past 4 blog posts we have described a successful major upgrade of Lustre, covering the specification, installation, configuration, migration of data, and operation of the hardware and software.

21 September 2016

Upgrading and Expanding Lustre Storage (part3)

In this post we will describe how we went about benchmarking and optimising our Lustre file system.

Performance Tuning:

A number of optimisations were made to improve the performance of the Lustre OSSs. To test these optimisations the IOzone [6] benchmarking program was used. IOzone performs a variety of read and write tests and is able to operate on a single server or on multiple clients at the same time.

First it is useful to have an estimate of the possible performance before undertaking benchmarking. The typical maximum sustained throughput of a single disk is quoted at approximately 200 MB/s. For a 16 disk raid 6 array, with 14 data disks after excluding the two parity disks, the maximum sustained throughput for a single server is therefore expected to be 2.8 GB/s. For a Lustre system made up of 20 Dell R730XDs, with 16 disks in each, this should scale to 56 GB/s. However, each server is only connected with a 10 Gb/s ethernet connection, so the maximum sustained throughput obtainable is 25 GB/s.

To test a single server, IOzone was run with 12 threads (equal to the number of CPU cores), each transferring a file of size 24 GB in chunks of 1024 kB (iozone -e -+u -t 12 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). As well as the standard sequential read and write tests, results were obtained for stride reads and for mixed workloads, which read and write a file with accesses made to random locations within it. The values were chosen to match the expected workload (i.e. the reading of large, gigabyte-sized files), to reduce caching effects, and to match the 1024k buffer size used in Lustre network transfers.

Using the BeeGFS Tips and Recommendations for Storage Server Tuning [7] as a reference, we applied different sets of optimisations to the storage servers.

Optimisation 1
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/sdb/queue/nr_requests
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Optimisation 2, or optimisation 3 with the values given in brackets, is used in conjunction with optimisation 1 and tunes the Linux file system caching, which Lustre relies on to help improve performance.
echo 5 > /proc/sys/vm/dirty_background_ratio     # optimisation 3: echo 1
echo 10 > /proc/sys/vm/dirty_ratio               # optimisation 3: echo 75
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 50 > /proc/sys/vm/vfs_cache_pressure
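To make these vm settings survive a reboot, one option is a sysctl drop-in file, sketched below with the optimisation 2 values; the file name is an assumption.

cat > /etc/sysctl.d/90-lustre-oss.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.min_free_kbytes = 262144
vm.vfs_cache_pressure = 50
EOF
sysctl -p /etc/sysctl.d/90-lustre-oss.conf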

To reduce raid alignment complications the file system was made directly on the storage device (e.g. /dev/sdb), taking into account the raid configuration (block size, stripe size and width). Lustre uses an ext4-based file system (ldiskfs), although it is possible to use ZFS instead.
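As an illustration of the alignment, assuming a 16 disk raid 6 array (14 data disks) with a 256 kB chunk size, values which are assumptions rather than the configuration used here, an OST could be formatted roughly as follows.

# stride = chunk size / 4 kB block = 256 / 4 = 64 blocks
# stripe_width = stride x 14 data disks = 896 blocks
mkfs.lustre --ost --fsname=lustre01 --index=0 --mgsnode=mgs@tcp \
    --mkfsoptions="-E stride=64,stripe_width=896" /dev/sdb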

The results of six different IOzone tests on a single server with different optimisations are shown in the figure below (top). The results clearly show the benefit of applying optimisations to the OS to improve file system performance. As optimisation 1+3 shows the highest throughput, it has been applied to the Lustre file system.

The single server tests were carried out for each of the 20 R730XD servers as a cross check of performance and as a check for hardware issues. All servers were found to produce similar performance. As a further cross check of the single server benchmark, the optimisation 1 test was repeated with the storage servers limited to only 2 GB of RAM, to remove caching effects, and the results were found to be consistent with those presented here.

A near complete 1.5 PB Lustre file system with 20 Dell R730XD servers was created, with up to 24 client nodes dedicated to the benchmark tests. Lustre is set up such that individual files remain on a single OSS (i.e. there is no striping of files across OSSs). The well known Lustre client tunings were included by default [1].
echo 256 > /proc/fs/lustre/osc/*/max_pages_per_rpc
echo 1024 > /proc/fs/lustre/osc/*/max_dirty_mb
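The same client tunings can also be applied with lctl, as sketched below; note that a plain set_param does not persist across remounts, so it needs to be reapplied (or set persistently, depending on the Lustre version).

lctl set_param osc.*.max_pages_per_rpc=256
lctl set_param osc.*.max_dirty_mb=1024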

For Lustre benchmarking using multiple clients, IOzone is run with the "-+m filename" option to specify the client nodes (iozone -+m iozone_client_list_file -+h [IP of master IOzone node] -e -+u -t 10 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). The figure above (bottom) shows the benchmark results for different numbers of clients. Each client has a 10 Gb/s network connection, so this sets the upper limit on the storage performance until there are more than 20 clients (black solid line). As the number of clients increases, the performance first increases and then falls off for all but the initial write test. The maximum performance of the storage is seen with 18 clients. The anomalous reread result for 18 clients is reproducible and may be due to client side caching effects. With 24 clients the mixed workload performance is below that for 8 clients. The fall off in performance for a large number of active clients is probably due to contention for resources when seeking data on the file system, which would matter less for the initial write tests.

If we assume that a typical data analysis job uses 5 MB/s and there is a maximum of 4000 job slots, then a throughput of 20 GB/s from the complete Lustre system would be required for our cluster. The measured read performance of the benchmark Lustre system is of the order of 15-20 GB/s. The performance of the full Lustre file system, including 20 R730XDs and 70 R510s, is expected to be at least double that of the benchmarked system. If the real world workload is dominated by read type workflows, as expected, then the full Lustre system should be able to provide the 20 GB/s performance required.

NOTE: A number of network optimisations were deployed in production, based on recommendations found on the ESnet Fasterdata web site [8], for both data transfers within the cluster and for those done over the WAN by StoRM; these have not been benchmarked.

For the final part of this story we will discuss the real world Lustre system we have had in production for over 6 months.

[6] IOzone: http://www.iozone.org

[7] BeeGFS Tips and Recommendations for Storage Server Tuning: http://www.beegfs.com/wiki/StorageServerTuning

[8] ESnet Fasterdata Knowledge Base: http://fasterdata.es.net