24 May 2012

Day 3 of CHEP - Data Plenaries

The third day of CHEP is always a half-day, with space in the afternoon for the tours.
With that in mind, there were only 6 plenary talks to attend, but 4 of those were of relevance to Storage.

First up, Fons Rademakers gave the ROOT overview talk, pointing to all the other ROOT development talks distributed across the CHEP schedule. Many changes are planned for ROOT's I/O system, some of which reflect the experiments' need for more parallelism in their workflows. Hence, parallel merges are being improved (removing some locking issues that still remained), and ROOT is moving to a new threading model where there can be dedicated "IO helper threads" as part of the process space. Hopefully, this will even out I/O load for ROOT-based analysis and improve performance.
Another improvement aimed at performance is the addition of asynchronous prefetching to the I/O pipeline, which should reduce latencies for streamed data. While I'm still on the fence about I/O streaming vs staging, prefetching is another "load smearing" technique which might reduce the seekiness on target disk volumes enough to make me happy.
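To make the idea concrete, here is a minimal sketch of asynchronous prefetching in plain Python. It is not ROOT's implementation or API: the function name `prefetching_reader` and its `block_size` and `depth` parameters are made up for illustration. A helper thread reads ahead into a bounded queue, so the consumer usually finds the next block already waiting.

```python
import io
import queue
import threading


def prefetching_reader(stream, block_size=4, depth=2):
    """Yield blocks from `stream` while a helper thread reads ahead.

    Hypothetical helper for illustration only; `depth` bounds how many
    blocks may be fetched ahead of the consumer.
    """
    blocks = queue.Queue(maxsize=depth)  # bounded read-ahead window
    end_of_stream = object()             # sentinel marking the last block

    def helper():
        # Runs in the background: keeps reading until the stream is drained.
        while True:
            block = stream.read(block_size)
            if not block:
                break
            blocks.put(block)            # blocks while the window is full
        blocks.put(end_of_stream)

    threading.Thread(target=helper, daemon=True).start()

    while True:
        block = blocks.get()             # normally already prefetched
        if block is end_of_stream:
            return
        yield block


data = io.BytesIO(b"abcdefghij")
result = list(prefetching_reader(data))
print(result)
```

The bounded queue is what does the "smearing": the helper stays at most `depth` blocks ahead, so reads are issued steadily rather than in bursts when the consumer asks. This sketch assumes the consumer reads to the end; a real implementation would also need a way to cancel the helper.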

The next talk of note was this year's iteration of the always interesting (and a tiny bit Cassandra-ish) DPHEP talk on Data Preservation. There was far too much in this talk to summarise - I instead encourage the interested to read the latest report from the DPHEP group, out only a few days ago, at http://arxiv.org/abs/1205.4667

In the second session, two more interesting talks with storage relevance followed.
First, Jacek Becla gave a wide-ranging talk on analysis with very large datasets, discussing the scaling problems of manipulating that much data (beginning with the statement "Storing petabytes is easy. It is what you do with them that matters"). One of the most interesting notes was that indexes on large datasets can be worse for performance once you get above a critical size. The time and I/O needed to update the indexes costs more than the indexes gain, and the inherently random access that seeking from an index produces on the storage system is very bad for throughput when the file being seeked in is sufficiently large. Even SSDs don't totally remove the penalty from the extremely high seek rates that Jacek showed.
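A back-of-envelope model shows why the random access hurts. This is my own sketch, not something from Jacek's slides: the function names are hypothetical, the hardware figures are illustrative assumptions rather than measurements, and the model ignores the index update cost entirely.

```python
def scan_seconds(dataset_bytes, throughput_bytes_per_s):
    """Time to read the whole dataset sequentially."""
    return dataset_bytes / throughput_bytes_per_s


def index_seconds(n_lookups, seek_s, row_bytes, throughput_bytes_per_s):
    """Time to fetch n_lookups rows, paying one seek per row."""
    return n_lookups * (seek_s + row_bytes / throughput_bytes_per_s)


def break_even_lookups(dataset_bytes, seek_s, row_bytes, throughput_bytes_per_s):
    """Number of indexed lookups that take as long as one full scan."""
    per_lookup = seek_s + row_bytes / throughput_bytes_per_s
    return scan_seconds(dataset_bytes, throughput_bytes_per_s) / per_lookup


# Illustrative assumptions: 1 TB of 1 kB rows on a spinning disk
# with 100 MB/s sequential throughput and 10 ms per seek.
DATASET_BYTES = 1e12
ROW_BYTES = 1e3
THROUGHPUT = 1e8
SEEK_S = 0.01

n_rows = DATASET_BYTES / ROW_BYTES
n_break_even = break_even_lookups(DATASET_BYTES, SEEK_S, ROW_BYTES, THROUGHPUT)
print(f"full scan: {scan_seconds(DATASET_BYTES, THROUGHPUT):.0f} s")
print(f"break-even: {n_break_even:.0f} lookups "
      f"({100 * n_break_even / n_rows:.2f}% of rows)")
```

With these assumed numbers, a query that touches more than roughly a million rows, about 0.1% of the dataset, is already slower through the index than a plain sequential scan. Faster seeks move the break-even point but do not remove it.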

Second, Andreas Joachim Peters gave a talk on the Past and Future of very large filesystems, which was actually a good overview, and avoided promoting EOS too much! Andreas made a good case for non-POSIX filesystems for archives, and for taking an agile approach to filesystem selection.
