28 February 2007

dCache.org clarify release procedure

The dCache team have reviewed and updated their release procedure. In summary, any new releases (minor or major) will be announced on the user-forum and announce@dcache.org mailing lists, along with a change-log for each of the rpms in the release. In addition, the rpms in the dCache stable repository


(and similarly for other OS and architecture combinations) will be kept in synch with the rpms that are listed on the dCache.org webpage. If this is not the case, a bug should be submitted.

The change-log should also be available in cvs. In the future, an RSS feed of the release information may also be made available.

26 February 2007

New dCache release

There is a new release of dCache:


dCache server is now 1.7.0-29.
dCache client is now 1.7.0-28.

There was also a 1.7.1 release, but this was only for testing purposes. Sites are recommended to use YAIM to perform the upgrade. Email the support list if there are any problems.

23 February 2007

SE-posix SAM test (the trial run)

Here is a summary of a set of grid jobs that I ran yesterday to test posix access to site SEs. The job will eventually become the SAM test for this type of storage access. In summary it lcg-cr's a small test file to the SE, then reads it back again using a GFAL client, checks for consistency and then deletes the file using lcg-del.

srm.epcc.ed.ac.uk Passed
svr018.gla.scotgrid.ac.uk Passed

fal-pygrid-20.lancs.ac.uk Failed
lcgse1.shef.ac.uk Passed

epgse1.ph.bham.ac.uk Passed
serv02.hep.phy.cam.ac.uk Passed
t2se01.physics.ox.ac.uk Passed
lcgse01.phy.bris.ac.uk Passed

gfe02.hep.ph.ic.ac.uk Failed (both IC-HEP and LeSC)
se01.esc.qmul.ac.uk Passed
dgc-grid-34.brunel.ac.uk Passed
se1.pp.rhul.ac.uk Passed
gw-3.ccc.ucl.ac.uk Passed

For the sites that are not listed, the jobs that I submitted either still claim to be running or are scheduled (it turns out that both Durham and RAL-PPD are in downtime). I've tested RAL before and it was failing due to problems with CASTOR.

IC-HEP failed due to there being no route to host during the GFAL read. Possibly the relevant ports are not open in the firewall (22125 for dcap and 22128 for gsidcap)?
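A quick way to rule the firewall in or out is to try a plain TCP connection to the door ports from a WN or UI. The sketch below is only that: the port numbers are the ones quoted above, while the function names and the timeout are my own choices, and a successful connect says nothing about whether the door itself is healthy.

```python
import socket

# Ports taken from the post: 22125 for dcap, 22128 for gsidcap.
DCACHE_DOOR_PORTS = {"dcap": 22125, "gsidcap": 22128}


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", "connection refused" and timeouts.
        return False


def check_doors(host, ports=DCACHE_DOOR_PORTS, timeout=5.0):
    """Map each door name to whether its port accepted a connection."""
    return {name: port_is_reachable(host, port, timeout)
            for name, port in ports.items()}
```

Calling check_doors("gfe02.hep.ph.ic.ac.uk") from outside the site would then show which, if either, of the two ports is reachable.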

IC-LeSC failed with a Connection timed out error during the GFAL read.

Lancaster failed as the file could not even be lcg-cr'd to the SE. There was a no such file or directory error.

I'll run the tests next week once people have had a chance to look at some of these issues.

22 February 2007

SE-posix SAM test

Everyone should be aware that a new SAM test will soon be introduced to test posix-like access to your site SE from our WNs. The test first uses lcg-cr to copy a file to the SE, then uses GFAL (the Grid File Access Library) to perform the open(), read() and close() of the file. GFAL does all of the translation between the LFN (Logical File Name) and the TURL that will be used to access the file. It supports rfio, (gsi)dcap and gridftp so it does not matter if your site has a dCache, DPM or CASTOR. The file is removed after the test via an lcg-del.
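The shape of the test can be sketched roughly as follows. This is not the SAM test itself. The lcg-cr and lcg-del options (--vo, -d, -l, -a) are the ones I remember from lcg_util and should be checked against the man pages; the local os.open/os.read/os.close calls stand in for the GFAL calls, which follow the same open/read/close pattern; and the SHA-256 comparison is my assumption for the consistency check, since the real test may do it differently.

```python
import hashlib
import os


def build_lcg_cr(vo, se_host, lfn, local_path):
    """Command line for copying a local file to an SE and registering it."""
    return ["lcg-cr", "--vo", vo, "-d", se_host, "-l", lfn,
            "file:" + local_path]


def build_lcg_del(vo, lfn):
    """Command line for deleting all replicas of the given LFN."""
    return ["lcg-del", "--vo", vo, "-a", lfn]


def posix_read(path, chunk_size=65536):
    """Read a whole file with the open/read/close sequence (local stand-in)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # empty read means end of file
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)


def is_consistent(original, read_back):
    """Compare the written and re-read data by SHA-256 digest."""
    return (hashlib.sha256(original).hexdigest()
            == hashlib.sha256(read_back).hexdigest())
```

The command lists could be handed to subprocess on a UI with a valid proxy; the host, VO and LFN arguments are whatever applies at your site.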

It is essential to test this property of the SEs since it is an expected access pattern for the VOs when they are running analysis jobs at sites. In many cases it is more efficient to use the posix-like access rather than copying the entire file (or files) to the WN before processing starts.

In the first instance this new test will be NON-CRITICAL, so sites should not worry if they are not passing. We will use the information gathered from the test to solve site-specific issues. Once things become more stable it is likely that this test will move into the existing replica management super-test.

13 February 2007

Draining dCache pools

I did a bit of housekeeping on the Edinburgh dCache yesterday. dteam was sharing a bunch of pools with atlas so I wanted to drain the dteam files to their own dedicated set of disks. Once you have been through the process a few times, it is relatively painless, but it's certainly not as easy as just submitting a single command.

First, you need to use the copy module to copy files from a source pool to a destination pool or pool group. The module gives you some tools that allow you to limit which files are copied, e.g.

maintenance> load pool pool1_01

loads the source pool.

maintenance> exclude atlas:GENERATED

allows you to exclude all files in the atlas:GENERATED storage class (not SRM v2.2 classes BTW) from the list of files to be copied. You then just start the transfer by running,

maintenance> copyto pools pool1_20 pool1_21

Once the copy has finished, I found that there were a number of errors returned ( ls stat ) for certain PNFSids. Running pathfinder on these showed that these PNFSids were orphaned files in that although they appeared in the pool and physically on the disk, they were no longer in the namespace. This has been observed a number of times by many people. These files that have been left behind after the copy process can be clearly identified in the companion database by following the query in the wiki here. The wiki also contains some examples of scripts that can be used to interface with the dCache ssh admin shell in order to allow you to check the status of orphaned files and remove them if necessary.

Once these orphaned files were identified and removed I then ran the same query on the companion database to ensure that each of the (non-orphaned) source dteam files resided on 2 pools (source and destination). Once this was confirmed, I could then issue the relevant command (rep rm -force ) in each of the pool cells to force the removal of the dteam PNFSids from the source pools. This is best done by scripting the admin interface. You should follow the examples in the links given above for doing this.
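The "is it safe to remove?" check is easy to get wrong by hand, so here is a minimal sketch of the logic. It assumes you have already turned the companion database query into a mapping from PNFSid to the set of pools holding that file; that mapping, the function names and the example IDs are all hypothetical. The only dCache command it produces is the rep rm -force mentioned above, and it only prints the lines rather than sending them anywhere.

```python
def safe_to_remove(replicas, source_pool, destination_pools):
    """Return PNFSids on source_pool that also have a copy on a destination.

    replicas maps each PNFSid to the set of pools that hold it
    (hypothetical input, e.g. built from the companion database query).
    """
    destinations = set(destination_pools)
    return sorted(
        pnfsid
        for pnfsid, pools in replicas.items()
        if source_pool in pools and destinations & set(pools)
    )


def removal_commands(pnfsids):
    """One 'rep rm -force' line per PNFSid, to be run in the source pool cell."""
    return ["rep rm -force " + pnfsid for pnfsid in pnfsids]


if __name__ == "__main__":
    # Made-up example data: only file "0002" is still unique to the source.
    example = {
        "0001": {"pool1_01", "pool1_20"},
        "0002": {"pool1_01"},
        "0003": {"pool1_01", "pool1_21"},
    }
    for line in removal_commands(
            safe_to_remove(example, "pool1_01", ["pool1_20", "pool1_21"])):
        print(line)
```

Anything that does not appear in the output still has its only copy on the source pool and needs another look before the pool is emptied.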

In summary, it would be great if there was simply a drain-pool command that did all of this for you. In saying that, the situation has improved with the new functionality available in the copy module. Maybe I will try to combine the above scripts into a single step (or at least a smaller number of steps).

08 February 2007

It now seems that ATLAS don't want the DPM ACL changes for the atlas/generated directory. I got this email from Simone:

On 9 Feb 2007, at 03:54, Simone Campana wrote:

Hi Graeme.

We discussed this with the DPM devels.
The reason to change the generated
directory ACL is to allow the SGM to run
SAM test. Not to complicate things further,
I would request to have the change only to the
atlas/dq2 directory with the production
role and leave alone the generated directory.
I will change instructions accordingly and deal
with SAM test differently.

So it seems that the ACL script should only be run on the dq2 directory - at least that simplifies things greatly for the sites.
So I had suggested to the DPM team that what I needed to do to overcome the error reported in the blog was to add a new ACL, which would allow the ordinary (i.e., non-VOMS role) ATLAS users to write into the generated directories, regardless of who had first created them.

The magic command to do this is:

# dpns-setacl -m d:g:atlas:rwx /dpm/gla.scotgrid.ac.uk/home/atlas/generated

(Your domain may vary...)

The initial "d" means "default" and ensures that this ACL is inherited by all newly created sub-directories (and the "g" means a "group" ACL).
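To make the anatomy of that ACL string explicit, here is a small sketch that splits an entry into its parts. It only understands the pieces described above, plus "u" for a user entry, which I am assuming by analogy with POSIX-style ACLs. The function name and the returned dictionary are my own invention, not part of any DPM tool.

```python
# "g" is explained in the post; "u" (user) is an assumption by analogy.
ENTRY_TYPES = {"u": "user", "g": "group"}


def parse_acl_entry(entry):
    """Split an entry such as 'd:g:atlas:rwx' into its components."""
    fields = entry.split(":")
    is_default = fields[0] == "d"
    if is_default:
        fields = fields[1:]  # drop the leading "d" marker
    if len(fields) != 3:
        raise ValueError("expected [d:]type:name:perms, got %r" % entry)
    kind, name, perms = fields
    if kind not in ENTRY_TYPES:
        raise ValueError("unsupported entry type %r" % kind)
    if not perms or set(perms) - set("rwx-"):
        raise ValueError("unexpected permission string %r" % perms)
    return {
        "default": is_default,
        "type": ENTRY_TYPES[kind],
        "name": name,
        "permissions": perms,
    }
```

For the entry used above, this yields a default group ACL for atlas with rwx permissions.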

I set this two days ago and found a friendly Melbourne ATLAS user (thanks Glen!) to help me test this - and it worked. In the 2007-02-07 directory, which had been created with the atlas/Role=lcgadmin group, Glen was able to write a file as a normal ATLAS user.

So, the current situation for fixing up your DPM for ATLAS involves:

  1. Running the new script, which doesn't mess up the ACL owning user and group of the "root" directory - you can get that from http://www.physics.gla.ac.uk/~graeme/misc/update_acl_formysql.tar.gz.
  2. Then adding the additional ACL above to the generated directory.

Of course, you might just want to wait for some more complete fix to emerge from WLCG.

In addition, I have also raised the issue of the dq2 directory with ATLAS, but have yet to receive any response - so at the moment I haven't added any ACLs here.

06 February 2007

It is now possible to use jython as a front end to the admin interface of dCache. This should make scripting administration tasks much easier (well, for those who use python). Ideally I would like to see tools provided that use this interface in order to carry out common tasks, like adding pools, pool groups, links, changing mover queues etc. Hopefully these will be made available over time, either from the developers or through community contributions. You can find an example of a script at the page below (no, it's not a typo). I've already started working on something.

Most of the DPM sites will be aware that Atlas recently asked for everyone to update the ACLs on the home/atlas/dq2 and home/atlas/generated directories of the DPM namespace. The fix to do this was initially provided as a binary (provided by the DPM devels) that would parse a configuration file and make changes in the MySQL database. Initially there were a few problems with this:

1. No source code was provided with the binary (actually called a script in the atlas email), even though the operation was tagged as being EXTREMELY DELICATE.
2. The binary had already been through one bug fix after limited deployment, so confidence in it wasn't exactly high.
3. Subsequent bugs have been found after running it in the UK. For example, looking at the atlas/generated directory on the Glasgow DPM:

drwxrwxr-x 298 143 103 0 Feb 06 11:43 2007-01-29
drwxrwxr-x 13 143 103 0 Feb 06 11:34 2007-01-30
drwxrwxr-x 1 117 117 0 Jan 31 22:40 2007-01-31
drwxrwxr-x 2 117 117 0 Feb 01 22:46 2007-02-01

GID 103 is the normal atlas group. GID 117 is atlas/Role=lcgadmin. (Thanks to Graeme for this).
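Spotting this by eye is fine for four directories but not for a few hundred. A rough sketch of how the listing could be grouped by owning GID is below. It assumes the nine-column layout shown above (mode, link count, UID, GID, size, month, day, time, name) and the function name is made up.

```python
def directories_by_gid(listing):
    """Group entry names from a long listing by their numeric GID."""
    result = {}
    for line in listing.splitlines():
        # Split into at most nine fields so names with spaces stay intact.
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        gid = int(fields[3])
        result.setdefault(gid, []).append(fields[8])
    return result


if __name__ == "__main__":
    sample = """\
drwxrwxr-x 298 143 103 0 Feb 06 11:43 2007-01-29
drwxrwxr-x 13 143 103 0 Feb 06 11:34 2007-01-30
drwxrwxr-x 1 117 117 0 Jan 31 22:40 2007-01-31
drwxrwxr-x 2 117 117 0 Feb 01 22:46 2007-02-01
"""
    for gid, names in sorted(directories_by_gid(sample).items()):
        print(gid, names)
```

With the listing above, the two older directories fall under GID 103 and the two newer ones under GID 117, which is exactly the split that causes trouble for ordinary ATLAS users.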

It would have been better if the tools had been available for site admins to perform this ACL update without having to resort to direct connections to the MySQL DB. dpns-setacl could have been used, but since this does not have a recursive mode it wouldn't have been all that user friendly.

I think this is another example of where the administration tools of the storage middleware are lacking.