GridPP storage news

05 February 2020

CentOS8.1 (teaching an old server new tricks)

At Edinburgh we recently retired some old(er) storage servers from our Tier2.

Each of these storage servers is a Dell PowerEdge R610 with 2 RAID controllers: an internal PERC H710 for managing the boot disks and an external PERC H800 which connects to 3 MD1200 DAS enclosures.

We still maintain very similar hardware on our Tier2 as part of our grid Storage Element and we've been considering skipping CentOS7 and jumping straight to CentOS8 on this hardware.

As an experiment I recently attempted an update from CentOS7 to CentOS8 on a VM. Whilst this can be done in a Saturday, I wouldn't recommend it, as the resulting OS isn't really production stable. With that in mind, a clean install is the way to go.

This would be simple if it weren't for 1 gotcha: the RAID controllers on these servers are out of support in CentOS8: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8-beta/html/8.0_beta_release_notes/removed_functionality

There is, however, a solution: https://wiki.centos.org/Manuals/ReleaseNotes/CentOS8.1905#Known_Issues

This solution is to make use of the driver update images compiled by the nice people behind CentOS. These images allow the installer to pick up the 3rd party module and install it into your system as if it were fully supported.

However, following this proved to be problematic.

For CentOS8.0 the installer failed to identify the disks behind either controller, and on more than 1 occasion this led to either the kernel or the installer crashing (although admittedly we had also mis-configured this card in a minor way during this testing).

CentOS8.1, on the other hand, worked out of the box with the following images:

> sha256sum dd-megaraid_sas-07.707.51.00-1.el8_1.elrepo.iso CentOS-8.1.1911-x86_64-dvd1.iso

31a169d5eab1371893347c4d8482896e0fcc9b0a813b9210b1c0e77f68b09702 dd-megaraid_sas-07.707.51.00-1.el8_1.elrepo.iso

3ee3f4ea1538e026fff763e2b284a6f20b259d91d1ad5688f5783a67d279423b CentOS-8.1.1911-x86_64-dvd1.iso
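If you want to script the same check rather than eyeball the sha256sum output, a minimal Python sketch might look like the following. The function names are my own, and it assumes both ISOs sit in the directory you pass in; the checksums are the ones quoted above.

```python
import hashlib
from pathlib import Path

# Checksums quoted in the post for the two images.
EXPECTED = {
    "dd-megaraid_sas-07.707.51.00-1.el8_1.elrepo.iso":
        "31a169d5eab1371893347c4d8482896e0fcc9b0a813b9210b1c0e77f68b09702",
    "CentOS-8.1.1911-x86_64-dvd1.iso":
        "3ee3f4ea1538e026fff763e2b284a6f20b259d91d1ad5688f5783a67d279423b",
}


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED):
    """Map each expected file name to True, False, or None (file missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (sha256_of(path) == want) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, ok in verify(".").items():
        print(name, labels[ok])
```

Reading in chunks keeps memory use flat, which matters for a multi-gigabyte DVD image.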

To get the installer to work you will need 2 drives connected to the system you want to deploy on.

NB: Don't do the 'obvious thing' of burning the driver update disk (DUD) image to a disk. It will not work if you do this.

The following worked for me:

Burn your CentOS8.1 install image
Format a 2nd drive as FAT (msdos) and copy the driver update image dd-megaraid....iso to it
Start up the install image
Before you boot into the installer, add the following to the boot command line of the installer kernel:
inst.dd
Now wait for the installer to ask for the driver update disk. Navigate to the drive containing the ISO, select it and continue to boot.
Now perform your install as usual.

This isn't quite the full story as there was a minor gotcha which seemed to hit occasionally.

To fix the system not booting after a kernel update:

echo 'force_drivers+="megaraid_sas"' >> /etc/dracut.conf.d/force_drivers.conf
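Because the echo above appends a new line every time it is run, here is a minimal Python sketch of an idempotent version. The helper name ensure_line is made up for illustration; the line itself and the file path in the usage comment are the ones quoted above.

```python
from pathlib import Path

# The dracut setting quoted in the post, reproduced verbatim.
LINE = 'force_drivers+="megaraid_sas"'


def ensure_line(conf_path, line=LINE):
    """Append `line` to conf_path unless an identical line is already present.

    Returns True if the file was changed, False if nothing needed doing.
    """
    path = Path(conf_path)
    text = path.read_text() if path.exists() else ""
    if line in (entry.strip() for entry in text.splitlines()):
        return False
    # Start on a fresh line if the existing file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}{line}\n")
    return True


# Example (needs root on a real system):
# ensure_line("/etc/dracut.conf.d/force_drivers.conf")
```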

The CentOS8 installer has changed in subtle but important ways compared to CentOS7, but nothing was scary, other than being greeted on my first attempt by a GUI desktop that thought it was in New York, because I had been more focussed on my storage network configuration.

There are some unresolved problems with using non-standard kernels such as kernel-ml and kernel-plus with 3rd party dkms drivers such as ZFS. However, the out of the box kernel is 4.18, which is relatively recent and performs quite well from initial testing.

Next is to test the performance of this as a mock storage node, before we consider deploying it across the rest of our Tier2.

03 October 2019

Modern account mapping for a Ceph/Xrootd or Ceph/GridFTP service.

One of the advantages of the RAL ECHO service having "gone first" in terms of setting up a Ceph object store with direct connections to xrootd and gridftp services, is that when we are doing the same thing at Scotgrid-Glasgow, we can try new things.

One such change for us is how we do account authorisation and mapping.

The RAL Echo system is deliberately conservative, and has a two stage process:

User DNs are mapped via a simple grid-mapfile to a specific account.
That account name is then associated with a set of capabilities via an xrootd authdb file.

(These capabilities correspond to access permissions for a small number of ceph pools on the backend, usually one per VO.)

We know that works, but it's unwieldy - you need big grid-mapfiles full of DNs for all the users, and users are hard to map to more than one account.
Additionally, privacy and security concerns have led to policies for voms servers being restricted - it's hard or impossible to even request a list of member DNs for some VOs now.

It would be nice if we could do something more modern, using the VOMS extensions in the certificates. (It would be even nicer if we could, whilst we're doing this, call out to an ARGUS server for banning, as that's a cheap way to provide central banning for our SE.)

It turns out that we can do this, with the magic of a >6 year old technology from Nikhef called LCMAPS. The below replaces the grid-mapfile parts of the RAL configuration - you still need the authdb part to map the resulting account names to the underlying capabilities.
(And in the magical world where we just pass capability tokens around, we can probably make this a single step mapping.)

Doing this needs a bit of work, but since we're already compiling our own version of xrootd, and our own gridftp-ceph plugin, a bit more compilation never hurts.

The underlying LCMAPS configuration we're using (in /etc/lcmaps/lcmaps.db) looks like this, with a bit of unique data obscured:

vomsmapfile2 = "lcmaps_voms_localaccount.mod"
"-gridmap /etc/grid-security/voms-mapfile"
verifyproxynokey = "lcmaps_verify_proxy.mod"
"--allow-limited-proxy"
"--discard_private_key_absence"
" -certdir /etc/grid-security/certificates"
pepc = "lcmaps_c_pep.mod"
"--pep-daemon-endpoint-url https://ourargusserverauthzpoint"
"--resourceid ourcephresourceid"
"--actionid http://glite.org/xacml/action/execute"
"--capath /etc/grid-security/certificates/"
"--certificate /etc/grid-security/hostcert.pem"
"--key /etc/grid-security/hostkey.pem"
"--banning-only-mode"
good = "lcmaps_dummy_good.mod"
bad = "lcmaps_dummy_bad.mod"

mapping_pol:
verifyproxynokey -> pepc | bad
pepc -> vomsmapfile2 | bad
vomsmapfile2 -> good

Here, the vomsmapfile2 entry uses the lcmaps_voms_localaccount plugin to map by VOMS extension only, to a small number of accounts. So, our local services don't need to maintain a large and brittle grid-mapfile, or call out anywhere with a cron job to update it.

(The voms-mapfile is as simple as, for example:

/dteam* dteamaccount

to map all /dteam* VOMS extensions to the single dteamaccount )
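To make the idea concrete, here is a rough Python sketch of wildcard mapping from VOMS FQANs to an account. It is only an illustration: the function names are mine, the example FQAN is hypothetical, and the real lcmaps_voms_localaccount plugin may differ in its matching rules and ordering.

```python
from fnmatch import fnmatchcase


def parse_mapfile(text):
    """Parse '<pattern> <account>' lines into (pattern, account) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip lines without both a pattern and an account
        pattern, account = parts
        entries.append((pattern.strip('"'), account))
    return entries


def map_fqan(fqans, entries):
    """Return the account of the first matching (FQAN, pattern) pair, or None."""
    for fqan in fqans:
        for pattern, account in entries:
            # Shell-style matching: '*' also matches '/' characters here.
            if fnmatchcase(fqan, pattern):
                return account
    return None


entries = parse_mapfile("/dteam* dteamaccount\n")
print(map_fqan(["/dteam/Role=NULL/Capability=NULL"], entries))
```

The point is that one short pattern covers every member of the VO, so no per-user DN list is needed.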

The pepc entry uses the lcmaps_c_pep plugin to call out to our local ARGUS server. Unlike for glExec on worker nodes, or CEs, the only thing we care about here is whether the ARGUS server returns a "Permit" or not. As a result, the policy on the local PAP in our ARGUS server has no obligations in it - in fact, including the local_environment_mapping obligation breaks our chain, since we don't have (or need) pool accounts on these servers by design. We still need to add a policy for the corresponding resourceid we pass, and remember to reload the config on the PDP and PEP afterwards.
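The mapping_pol block in the configuration above reads as a small state machine: as I understand the LCMAPS policy language, "a -> b | c" means "run a, then go to b on success or c on failure". A minimal Python sketch of that walk is below. The module callables are stand-ins that I have made up; nothing here calls the real plugins.

```python
def run_policy(rules, start, modules):
    """Walk an LCMAPS-style policy.

    rules:   {module: (next_on_success, next_on_failure)}; either may be None.
    modules: {module: callable returning True (success) or False (failure)}.
    Returns (succeeded, trace), where trace lists the modules that were run.
    """
    trace = []
    current = start
    outcome = False
    while current is not None:
        trace.append(current)
        outcome = modules[current]()
        on_success, on_failure = rules.get(current, (None, None))
        current = on_success if outcome else on_failure
    # The result of the policy is the result of the last module that ran.
    return outcome, trace


# The mapping_pol rules from the configuration, transcribed by hand.
MAPPING_POL = {
    "verifyproxynokey": ("pepc", "bad"),
    "pepc": ("vomsmapfile2", "bad"),
    "vomsmapfile2": ("good", None),
}

# Stand-in modules: ARGUS says "Permit" and the VOMS mapping succeeds.
modules = {
    "verifyproxynokey": lambda: True,
    "pepc": lambda: True,
    "vomsmapfile2": lambda: True,
    "good": lambda: True,
    "bad": lambda: False,
}
print(run_policy(MAPPING_POL, "verifyproxynokey", modules))
```

Swapping the pepc stand-in for one that returns False shows the banning path: the chain stops at bad and vomsmapfile2 is never consulted.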

So far, so easy (and all the packages needed are in UMD4 and easy to get).

Getting LCMAPS to work with the vanilla versions of globus-gridftp-server and xrootd is not completely trivial, however.

In gridftp's case:

globus-gridftp-server is perfectly capable of interfacing with lcmaps, but all of the shipped versions in EPEL and UMD come without the necessary configuration to do so. (In particular, a set of environment variables needs to be present in the environment of the gridftp server daemon; without them set, the configured LCMAPS will fail with odd errors about gridftp still being mapped to the root user.)

We can fix this with the addition of a /etc/sysconfig/globus-gridftp-server file containing:

export LCMAPS_DB_FILE=/etc/lcmaps/lcmaps.db

export LLGT_LIFT_PRIVILEGED_PROTECTION=1

export LLGT_RUN_LCAS=no

export conf=/etc/gridftp.conf

where the LLGT_RUN_LCAS=no line also prevents the configured gridftp service from trying to load LCAS (which we don't need here - since banning is being farmed out to ARGUS).
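Since the failure mode for a missing variable is an odd error rather than a clear one, a quick sanity check of the sysconfig file can save some head scratching. The following Python sketch is just that, a sketch: the function names are mine, and the list of "required" variables is simply the three LCMAPS-related ones shown above, not an authoritative list.

```python
# The LCMAPS-related variables from the sysconfig file shown above.
REQUIRED = ("LCMAPS_DB_FILE", "LLGT_LIFT_PRIVILEGED_PROTECTION", "LLGT_RUN_LCAS")


def parse_exports(text):
    """Parse 'export KEY=VALUE' (or 'KEY=VALUE') lines into a dict."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            env[key.strip()] = value.strip().strip("\"'")
    return env


def missing_vars(text, required=REQUIRED):
    """Return the required variable names that the file does not set."""
    env = parse_exports(text)
    return [name for name in required if name not in env]


sample = """\
export LCMAPS_DB_FILE=/etc/lcmaps/lcmaps.db
export LLGT_LIFT_PRIVILEGED_PROTECTION=1
export LLGT_RUN_LCAS=no
export conf=/etc/gridftp.conf
"""
print(missing_vars(sample))
```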

We also need to install the lcas-lcmaps-gt4-interface rpm, which provides the glue to let gsi call out via LCMAPS.

Finally, install the /etc/grid-security/gsi-authz.conf file to tell gridftp how to authenticate gsi stuff:

globus_mapping liblcas_lcmaps_gt4_mapping lcmaps_callout

(The more exciting thing with gridftp is getting the ceph and authdb stuff to work, about which more in another post)

In Xrootd's case:

This needs a little more work: xrootd does not have an officially packaged security plugin for interfacing with lcmaps.

Luckily, however, OSG have done some sterling work on this (in fact, most of this blog post is based on their documentation, plus the Nikhef LCMAPS docs), and there's a git repository containing a working xrootd-lcmaps plugin, here: https://github.com/opensciencegrid/xrootd-lcmaps.git

In order to build this, we also need the development libraries for the underlying technologies: voms-devel, lcmaps-devel and lcmaps-common-devel, as well as a host of globus libs that you probably already have installed (as well as the xrootd development headers, which we already have since we build xrootd locally too).

Building this, and installing the resulting libXrdLcmaps.so into a suitable place, we just need to add the following to our xrootd config for the externally visible service:

sec.protocol /opt/xrootd/lib64 gsi -certdir:/etc/grid-security/certificates \
-cert:/etc/grid-security/hostcert.pem \
-key:/etc/grid-security/xrd/hostkey.pem \
-crl:1 \
-authzfun:libXrdLcmaps.so \
-authzfunparms:lcmapscfg=/etc/lcmaps/lcmaps.db,loglevel=1,policy=mapping_pol \
-gmapopt:10 -gmapto:0

where here we configure the xrootd service to call out to the library we built (and we have to, unlike with gridftp, specify the policy to use from the file - gridftp will use the only policy present if there's just one).
We need a second copy of the hostkey, you'll notice, because the xrootd service doesn't run as the same user as the gridftp service - but gridftp won't let you have a hostkey which is accessible by more than one user. (So we need two copies, one for gridftp and one for xrootd.)
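Making that second copy is a one-off job, but it is easy to leave the permissions too open along the way. Here is a hedged Python sketch of one way to do it; the function name is mine, and the owner-read-only mode is my assumption of a sensible default. Changing ownership to the xrootd service account is left out, since it needs root and the account name is site-specific.

```python
import os
import shutil


def install_key_copy(src, dst, mode=0o400):
    """Copy a private key to dst, readable only by its owner.

    Ownership is not changed here; chown dst to the service account
    separately (that step needs root).
    """
    dst = os.path.abspath(dst)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    if os.path.exists(dst):
        os.remove(dst)  # a read-only copy from a previous run cannot be reopened
    # Create the file with restrictive permissions from the start, rather
    # than copying first and tightening afterwards.
    fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
        shutil.copyfileobj(inp, out)
    os.chmod(dst, mode)  # enforce the mode regardless of the umask
    return dst


# Example with the paths used in this post (needs root on a real system):
# install_key_copy("/etc/grid-security/hostkey.pem",
#                  "/etc/grid-security/xrd/hostkey.pem")
```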

EXAMPLE

Once you configure your authdb for the capability mapping you're ready to go!

Here are the LCMAPS logs from a transfer with a VOMS-enabled proxy (using globus-url-copy in this case, but it's the same with xrdcp):

Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: Starting policy: mapping_pol
... (some certificate verification) ...
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_verify_proxy-plugin_run(): verify proxy plugin succeeded
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_c_pep-plugin_run(): Using endpoint OURARGUSENDPPOINT, try #1
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_c_pep-plugin_run(): c_pep plugin succeeded
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_gridmapfile: Found mapping dteamaccount for "/dteam/*" (line 1)
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_voms_localaccount-plugin_run(): voms_localaccount plugin succeeded
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_dummy_good-plugin_run(): good plugin succeeded
Oct 3 15:33:25 cephs02 globus-gridftp-server: lcmaps: LCMAPS CRED FINAL: mapped uid:'xxx',pgid:'xxx',sgid:'xxx',sgid:'xxx'
Oct 3 15:33:25 cephs02 globus-gridftp-server: Callout to "LCMAPS" returned local user (service file): "dteamaccount"

and then we go into the gridftp.log for the authdb:

[344940] Thu Oct 3 15:33:25 2019 :: globus_l_gfs_ceph_send: started
[344940] Thu Oct 3 15:33:25 2019 :: globus_l_gfs_ceph_send: rolename is dteamaccount
[344940] Thu Oct 3 15:33:25 2019 :: globus_l_gfs_ceph_send: pathname: dteam:testfile1/
[344940] Thu Oct 3 15:33:25 2019 :: INFO globus_l_gfs_ceph_send: acc.success: 'RETR' operation allowed
[344940] Thu Oct 3 15:33:25 2019 :: ceph_posix_stat64 : pathname = /dteam:testfile1

where our capabilities are checked (and the dteamaccount is, indeed, allowed to READ from objects in the dteam pool).

15 March 2019

Some success in testing sites with no local production data storage for ATLAS VO.

For a while we have had a few of the smaller sites (T3s) in the UK running for ATLAS without any storage at the site. We recently tried to run Birmingham as a completely diskless site, using storage at Manchester. This was mostly successful, although the saturation of the WAN connection at Manchester, which was always considered a worrying possibility, was seen. This has helped inform ATLAS's opinions on how to implement diskless sites, which were then presented at the ATLAS Site Jamboree this month.

We intend to try using XCache at Birmingham instead, to see whether that is an alternative approach which might succeed. We are also looking into using the ARC Control Tower to pre-place data for ARC-CEs. The main issue is how this conflicts with the VO's wish for last-minute payload changes within pilot jobs.

I would also just like to remind readers why (IMHO) we are looking into optimising ATLASDATADISK storage.

From a small site perspective, storage requires a substantial amount of effort to maintain. Compared to the volume of storage provided, this effort could be more efficiently used in other activities. Below is a plot of the percentage of current ATLASDATADISK provided by each site. The VO also benefits from not using smaller sites, as it has fewer logical endpoints to track.

This plot shows that making the 10 smallest sites (of which 5 are in the UK) diskless would still allow ATLAS to use 99% of the space from only 88% of sites. ATLASSCRATCHDISK and ATLASLOCALGROUPDISK usage/requirements also need to be taken into consideration when deciding if a site should become a fully diskless or a caching/buffering site.

08 March 2019

ATLAS Jamboree 2019 view from the offsite perspective.

I didn't go in person to the ATLAS Jamboree this year held at CERN. For those who are allowed to view I suggest looking at: https://indico.cern.ch/event/770307/

But I did join for some of it via Vidyo!
Here are my musings about the talks I saw. (Shame I couldn't get involved in the coffee/dinner discussions, which are often the most fruitful moments of these meetings):

Even before the main meeting started, there was an interesting talk regarding HPC data access in the US at ANL.

In particular, I like the thought of globus usage and incorporating rucio into DTNs at the sites. Similar to what was discussed at other sites at the rucio community workshop last week.

In the preview talk, I picked out that the switch to Fast Sim rather than Full Sim will increase the output rate by a factor of 10. A good reminder that user workflow changes could drastically alter computing requirements.
From the main meeting, the following talks will be of interest from a data storage perspective:

Data Organization and Management Activities: Third Party Copy and (storage) Quality of Service
TPC: details on DPM

DOMA ACCESS: Caches

DDM Ops overview

Diskless and lightweight sites: consolidation of storage

Data Carousel

Networking - best practice for sites, and evolution

WLCG evolution strategy

One thing it did was cause me to think: what if (and I stress the "if" is me musing, not ATLAS) ATLAS wanted to read 1PB of data a day from tape at RAL and then distribute it across the world?
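To put a rough number on that musing, here is a back-of-the-envelope calculation in Python. It assumes decimal petabytes (10^15 bytes) and a perfectly even spread over 24 hours, both of which are my simplifications; real tape recalls would be burstier.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15  # decimal petabyte, in bytes (an assumption)


def sustained_gbit_per_s(bytes_per_day):
    """Average rate in Gbit/s needed to move bytes_per_day in one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e9


rate = sustained_gbit_per_s(1 * PB)
print(f"1 PB/day is about {rate:.1f} Gbit/s sustained")  # about 92.6 Gbit/s
```

In other words, the read from tape alone would come close to filling a 100Gb link around the clock, before any of the onward distribution is counted.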

06 March 2019

Rucio 2nd Community Workshop roundup

There was an interesting workshop for members of the rucio community last week:
https://indico.cern.ch/event/773489/timetable/#all.detailed

Here is the summary from the workshop:
Summary
● Presentation from 25 communities, 66 attendees!
● Many different use-cases have been presented
○ Please join us on Slack and the rucio-users mailing list for follow-ups!
● Documentation!
○ Examples
○ Operations documentation
○ Easy way for communities to contribute to the documentation
○ Documentation/Support on Monitoring (Setup, Interpretation, Knowledge)
○ Recommendations on data layout/scheme → Very difficult decision for new communities
● Databases
○ Larger-Scale evaluation of non-Oracle databases would be very beneficial for the community
● Drop of Python 2.6 support for Rucio clients
● LTS/Gold release model
○ Will propose a release model with LTS/gold releases with Security/Critical bug fixes
● Archive support
● Metadata support
○ Existing metadata features (generic metadata) need more evaluation/support
○ More documentation/examples needed
● Additional authentication methods needed
○ OpenID, Edugain, …
● Interfacing/Integration with DIRAC
○ Many communities interested
○ Possibility for joint effort?

Here is my summary of these and other snippets from the talks that I found interesting/thought-provoking. I'll point out that I was not at the meeting, so the tone of the talks may have been lost on me.

Network talk from GEANT:
Lots of 100Gb links!

DUNE talk:
Replacing SAM (THE product from where I started my data management journey...)
Unfortunately not able to use significant amount of storage at RAL Echo
– Dynafed WebDAV interface can’t handle files larger than 5GB
– Latest Davix can do direct upload of large files (as a multi-part upload), but not third-party transfers
– Maybe use the S3 interface directly instead?
• Recent work done on improving Rucio S3 URL signing

CMS talk:
Some concerns called out by review panel which we will have to address or solve:
▪ Automated and centralized consistency checking — person is assigned
▪ No verification that files are on tape before they are allowed to be deleted from buffer
★ FTS has agreed to address this

BelleII talk:
My thoughts:
using a BelleII specific version of DIRAC
They are evaluating rucio
naming schema is "interesting"
power users seem a GoodIdea (TM)
BelleII thoughts:
• Looking ahead, the main challenge is achieving balance between conflicting requirements:
• bringing Rucio into Belle II operations quickly enough to avoid duplication of development effort
• supporting the old system for a running experiment

FTS talk:
I didn't know about multi hop support:
Multi-hop transfers support – Transfers from A->C, but also A->B->C

XENONnT talk:
This is a dark matter experiment with some familiar HEP sites using rucio.

IceCube experiment:
Interesting data movement issues when based at the South Pole!
●Raw data is ~1 TB/day, sent via cargo ship; 1 shipment/year
●Filtered data is ~80 GB/day, sent via satellite; daily transfer

CTA talk:
CTA is an experiment for cosmic rays as well as CERN's new tape system!

Rucio DB at CERN talk:
rucio DB numbers for ATLAS are "impressive" (1014M DIDs)

SKA talk:
RAL members get a thank you specifically.

NSLSII talk:
Similar needs to Diamond Light Source (DLS), possible collaboration?
Another site which does both HEP and photonics.
Has tested using globus endpoints.

XDC talk:
rucio and dynafed and storage all in one!

CTA (tape system) talk:
Initial deployments: predict a need of 70PB of disk space just in the disk cache! (Am I reading this slide correctly?)

LCLSII talk:
Linear version of NSLSII based at SLAC, similar to BNL. Use of FTS still to be tested. Prod system needed in the next year.

LSST talk:
Using docker release of rucio
Nice set of setup tests. Things look promising!
– FTS has proven its efficiency for data transfers, either standalone or paired with Rucio
– Rucio makes data management easier in a multi-site context, and tasks can be highly automated
– These features could prove beneficial to LSST
● Evaluation is still ongoing
– discussions with the LSST DM team at NCSA are taking place

Dynafed talk:
Dynafed as a Storage Element is work in progress
– Not the design purpose of Dynafed

RAL/IRIS talk:
I would be interested to hear how this talk went down with the people present.

ARC at Nordugrid talk:
I still think the ARC Control Tower (ACT) is the future. Rucio integration with volatile RSEs is nice.

28 February 2019

Fixing an iptables problem with DPM at Edinburgh

After spending some time examining the output from iptables on our DPM servers in Edinburgh, I came across a small problem combining our iptables rules with SRM.

For brevity the iptables rule which caused the problems is:

 *filter  
 :INPUT ACCEPT [0:0]  
 -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT  
 -A INPUT -i lo -j ACCEPT  
 -A INPUT -p icmp --icmp-type any -j ACCEPT  
 ...  
 -A INPUT -p tcp -m multiport --dports 8446 -m comment --comment "allow srmv2.2" -m state --state NEW -j ACCEPT  
 ...

The problem caused by this is that packets like the following are dropped and show up in the logs:

 IN=eth0 OUT= MAC=aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn SRC=192.168.12.34 DST=192.168.123.45 LEN=52 TOS=0x00 PREC=0x00 TTL=47 ID=19246 DF PROTO=TCP SPT=55012 DPT=8446 WINDOW=27313 RES=0x00 ACK FIN URGP=0

Here, ACK FIN shows how the dropped packet appears to be associated with closing a connection which iptables has already seen as closed.
(This is the case at least with the DPM 1.10 srmv2.2 builds on the latest security kernels for both SL6 and CentOS7.)
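If you want to find out how many of your own dropped packets fit this pattern, a small log parser helps. The Python sketch below splits an iptables LOG line into key=value fields and bare flags, then tests for a late connection-closing packet to the SRM port. The function names and the "FIN without SYN" test are my own choices for illustration; the sample line is the one shown above.

```python
def parse_iptables_log(line):
    """Split an iptables LOG line into key=value fields and bare flag tokens."""
    fields, flags = {}, []
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
        else:
            flags.append(token)  # e.g. DF, ACK, FIN
    return fields, flags


def is_late_close(line, port=8446):
    """True for a TCP packet to `port` carrying FIN but not SYN."""
    fields, flags = parse_iptables_log(line)
    return (
        fields.get("PROTO") == "TCP"
        and fields.get("DPT") == str(port)
        and "FIN" in flags
        and "SYN" not in flags
    )


sample = (
    "IN=eth0 OUT= MAC=aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn "
    "SRC=192.168.12.34 DST=192.168.123.45 LEN=52 TOS=0x00 PREC=0x00 TTL=47 "
    "ID=19246 DF PROTO=TCP SPT=55012 DPT=8446 WINDOW=27313 RES=0x00 "
    "ACK FIN URGP=0"
)
print(is_late_close(sample))
```

Running is_late_close over every logged drop and counting the True results gives a quick measure of how much of the firewall noise comes from this one rule.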

In Edinburgh we have historically had problems with many connections which don't appear to close correctly, in particular with the SRM protocol. If uncorrected, the service would run for several hours and then appear to hang, not accepting any further connections.

We now suspect that this dropping of packets was potentially causing the issues we were seeing.

In order to fix this the above rule should either be changed to:

 ...  
 -A INPUT -p tcp -m multiport --dports 8446 -m comment --comment "allow srmv2.2" -m state --state NEW,INVALID -j ACCEPT  
 ...

or, the state module shouldn't be used to filter only NEW packets associated with the srmv2.2 protocol.

With this in mind we've now removed the firewall requirement that packets be NEW to be accepted by our srmv2.2 service. This has enjoyed an active uptime of several days without hanging and refusing further connections.

An advantage of this is that most of the packets rejected by the firewall of our DPM head node were actually associated with this rule. Now that the number of packets being rejected by our firewall has dropped significantly, examining rejected connections for further patterns/problems becomes much easier.

18 February 2019

Understanding Globus connect/online... is it doing a lot??

I have made further progress in understanding the Globus transfer tool (one thing I still struggle with is what to call it...). What I know I still need to understand is its authentication and authorization mechanisms. Of interest (to me at least) was to look at the usage of our Globus endpoints at RAL: 20TB in the last 3 months. Now to work out whether that is a lot or not compared to other similar Globus endpoints and/or other communities...
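As a first step towards judging that figure, it can be turned into an average rate. The Python sketch below assumes decimal terabytes and takes "3 months" as 90 days, both of which are my approximations; it says nothing about peaks, which are probably what matters more.

```python
TB = 10**12  # decimal terabyte, in bytes (an assumption)


def average_mbit_per_s(total_bytes, days):
    """Average transfer rate in Mbit/s for total_bytes moved over `days` days."""
    return total_bytes * 8 / (days * 86_400) / 1e6


avg = average_mbit_per_s(20 * TB, days=90)
print(f"20 TB over 90 days averages about {avg:.1f} Mbit/s")  # about 20.6 Mbit/s
```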