24 June 2015

The firewall did it

Now that we have sort of mostly finished setting up the DiRAC data transfers to RAL, we look at the weeks it took and wonder (a) was it worth it and (b) why did it take weeks?

While for now we only back up data from the DiRAC sites - starting with Durham - into the RAL Tier 1, the reason we set them up as a grid VO is so that the grid tools can drive the data transfers. The thinking is that although there is an overhead in setting it up and getting it working, the tools that moved nearly a quarter of an exabyte last year will then move the data with the highest possible efficiency. Initially we are going to let it run as fast as it can until someone complains or we hit a reasonable target/limit of 300-400 megabytes per second.


[Edit: updated the image as I had inadvertently put a link in to a 'live' image rather than the snapshot]

The green stuff in the plot is primarily DiRAC data coming in at some 250 MB/s; the spike is not related to DiRAC (this would be a case where the most prominent feature in the plot is of no interest to the discussion...a good way to capture readers, perhaps?)
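
To put those rates in perspective, here is a small back-of-the-envelope sketch that converts a sustained transfer rate into a daily volume. It assumes decimal units (1 TB = 1,000,000 MB) and a rate held constant for a full 24 hours, which a real transfer will not quite manage; the function name is just made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(rate_mb_per_s: float) -> float:
    """Convert a sustained rate in MB/s to TB/day.

    Assumes decimal units (1 TB = 10**6 MB) and a perfectly steady rate.
    """
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    # 250 MB/s is the rate seen in the plot; 300 and 400 MB/s bracket the target.
    for rate in (250, 300, 400):
        print(f"{rate} MB/s sustained is about {terabytes_per_day(rate):.1f} TB/day")
```

So the 250 MB/s in the plot works out at roughly 21.6 TB a day if it keeps going, and the 300-400 MB/s target would be somewhere around 26-35 TB a day.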

The advantage of having them griddified is also that in the future if we decide to do more stuff, like move the data elsewhere or start doing analysis, it's all ready to go.

So why does it take time to set up? Part of it is all the technical things that need setting up - VOs, local accounts, mailing lists, certificates, gridmap files, monitoring. None of them is too onerous, but each one takes some time to fill in a form and process, may have changed since the last time we did it, and takes time to debug if it isn't working properly. In the worst case, only one person knows how to do it (or is authorised to) and that person is on leave/off sick/busy.

Then there are the processes: since access rights are to some very high end computing and storage systems, there are processes for reviewing authorisations, proposals, permissions, allocations and quotas, etc. These, too, take time, particularly if a panel review is involved.

Finally there's putting all the pieces together to see if it works. And when it doesn't, is it the VO's fault - they may be new to the business and do something strange - or is there something wrong with the infrastructure, which is not unlikely if something new has been set up for them? In our case it didn't work, and it turned out that GridFTP as the data movement protocol now uses UDP and the Durham firewall blocked UDP. With firewalls there is a tradeoff between the efficiency of the transfer (less firewall is better) and the security they provide (more firewall is better). GridFTP needs both "control" ports, where services are listening all the time, and "data" ports, which are ephemeral and so need to be opened in a known port range.
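
As a rough illustration of the kind of check that helps when debugging this, here is a minimal Python sketch that probes TCP ports on a remote host and tries to tell "nothing is listening" apart from "something is silently dropping packets". Treat it as a sketch under assumptions, not what we actually ran: the host name and the data port range are made up, 2811 is the conventional GridFTP control port, and the probe is TCP only, so it would not have caught our UDP problem. A data port will also normally show as refused when no transfer is in progress, and a firewall that actively rejects connections looks the same as a closed port.

```python
import socket

GRIDFTP_CONTROL_PORT = 2811              # conventional GridFTP control port
DATA_PORT_RANGE = range(50000, 50011)    # hypothetical; use the range agreed with the site


def probe_tcp(host, port, timeout=2.0):
    """Classify a TCP port as 'open', 'refused', 'filtered' or 'error'.

    'refused'  : the host answered, but nothing is listening on the port
    'filtered' : no answer before the timeout, often a firewall dropping packets
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered"
    except OSError:
        # DNS failure, unreachable network, and so on
        return "error"


def probe_ports(host, ports, timeout=2.0):
    """Return a dict mapping each port to its probe result."""
    return {port: probe_tcp(host, port, timeout) for port in ports}


if __name__ == "__main__":
    host = "gridftp.example.org"  # hypothetical endpoint
    print("control:", probe_tcp(host, GRIDFTP_CONTROL_PORT))
    for port, state in probe_ports(host, DATA_PORT_RANGE).items():
        print(port, state)
```

A whole range coming back as "filtered" while the control port is "open" is a decent hint to go and talk to whoever runs the firewall.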


