23 November 2011

The best rate to expect from ATLAS's SONAR tests involving RAL can be assumed to be that of internal transfers from one space token at RAL to another space token at RAL. The SONAR plot for large files (over 1 GB) for the last six months is:

Averaging this leads to:

This gives an average rate of 18.4 MB/s, with spikes in the 12-hour average to above 80 MB/s. Individual file transmission rates across the network (excluding overhead) have been seen at over 110 MB/s, which relates well to the 1 Gbps NIC limit on the disk servers in question.
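As a quick sanity check on that last point, a 1 Gbps NIC caps payload throughput at 125 MB/s before any protocol overhead, so peaks of ~110 MB/s per file are about what one would expect. A one-line sketch of the arithmetic:

```python
# Sanity check: convert the 1 Gbps NIC limit into MB/s (decimal megabytes).
nic_gbps = 1
max_mb_s = nic_gbps * 1000 / 8  # 1 Gb/s = 125 MB/s before protocol overhead
print(max_mb_s)  # 125.0 - so observed ~110 MB/s peaks are plausible
```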

Of the StoRM, dCache, DPM and CASTOR systems deployed within the UK, we know that CASTOR tends to have the longest interaction overhead for transfers. The overhead for RAL-RAL transfers over the last week varied between 14 and 196 seconds, with an average of 47 seconds and a standard deviation of 24 seconds.
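The fixed per-transfer overhead goes a long way towards explaining the gap between the ~110 MB/s raw network rate and the 18.4 MB/s SONAR average. A rough sketch, plugging in the approximate figures from the text (~1 GB files, ~110 MB/s raw rate, ~47 s average overhead):

```python
# Sketch: how a fixed per-transfer overhead drags down the effective rate.
# Figures are the approximate ones quoted above; this is illustrative only.

def effective_rate(file_mb: float, raw_mb_s: float, overhead_s: float) -> float:
    """Effective transfer rate once fixed per-transfer overhead is included."""
    return file_mb / (file_mb / raw_mb_s + overhead_s)

rate = effective_rate(file_mb=1024, raw_mb_s=110, overhead_s=47)
print(f"{rate:.1f} MB/s")  # lands close to the observed 18.4 MB/s average
```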

12 November 2011

Storage is popular

Storage is popular: why, only this morning GridPP storage received an offer of marriage from a woman from Belarus (via our generic contact list). I imagine they will stick wheels on the rack of disk servers so they can push it down the aisle. We need a health and safety risk assessment. Do they have doorsteps in churches? Do they have power near the altar, or should we bring an extension? If they have raised floors, can we lay the cables under the floor? And what about cooling?

Back to our more normal storage management, it is worth noting that our friends in WLCG have kicked off a TEG working group on storage. TEG, since you ask, means Technical Evolution Group - the evolution being presumably the way to move forward without rocking the boat too much, i.e. without disrupting services. The group's role is to look at the current state, successes and issues, and how to then move forward - looking ahead about five years. In good and very capable hands with chairs Daniele Bonacorsi from INFN and our very own Wahid Bhimji from Edinburgh, the group membership is notable for being inclusive, in the sense of having WLCG experiments, sites, middleware providers, and storage admins involved. Although the work focuses on the needs of WLCG, it will also be interesting to compare with some of the wider data management activities.

11 November 2011

RAL T1 copes with a spike of ATLAS transfers.

Following recent issues at the RAL T1, we were worried not just about the overall load on our SRM caused by ATLAS using the RAL FTS, but also about the rate at which they put load on the system.
At ~10pm on 10 November 2011 (UTC), ATLAS went from running almost empty to almost full on the FTS channels involving RAL that are controlled by the RAL FTS server. This can be seen in the plot of the number of active transfers:

This was caused by ATLAS suddenly submitting many transfers to the ATLAS FTS, which can be seen in the "Ready" queue:

This led to a high transfer rate, as shown here:
It is also seen in our own internal network monitoring:

The FTS rate covers only transfers going through the RAL FTS (i.e. it does not include puts by the CERN FTS, gets from other T1s, or the chaotic background of dq2-gets, dq2-puts and lcg-cps, none of which appear in these plots). Hopefully this means our current FTS settings can cope with the start of these ATLAS data transfer spikes. We have seen previously that these large spikes lead to a temporary backlog (for a typical size of spike) which clears well within a day.
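The "clears well within a day" claim is easy to reason about with a back-of-the-envelope drain calculation. The numbers below are purely illustrative, not taken from FTS logs:

```python
# Rough sketch (hypothetical numbers): how long a queued spike takes to
# drain if the channel serves transfers at a fixed steady rate and no
# new transfers arrive on top of the backlog.

def drain_hours(queued_files: int, files_per_hour: float) -> float:
    """Hours to clear a backlog at a constant service rate."""
    return queued_files / files_per_hour

# e.g. a hypothetical 5000-file spike served at 400 transfers/hour
print(drain_hours(5000, 400))  # 12.5 hours, i.e. well within a day
```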