Vladimir's grid notes

From BeSTGRID

Various notes I'm taking while working on setting up the Grid Node

Grid Setup TODO (both local config and gateway node)

create local gridmap configuration

  • list local users in /opt/vdt/edg/etc/grid-mapfile-local
  • list VOMSes in /opt/vdt/edg/etc/edg-mkgridmap.conf

Works for me:

 group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID/Universities/Canterbury ucgriduser
 group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID griduser
 group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID/Role=GridUser griduser

Beware: no trailing slashes and no Group= in vomss URL! Otherwise, vomss silently fails (output in /opt/vdt/edg/log/edg-mkgridmap.log), reporting VOMS Internal Server Error.
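
Because the failure is silent, it can help to check the configuration before running edg-mkgridmap. The following is a minimal sketch, assuming the "group <vomss URL> <account>" line layout shown above; check_vomss_lines is a hypothetical helper name, not part of VDT or edg-mkgridmap.

```shell
#!/bin/sh
# check_vomss_lines: hypothetical helper, not part of VDT or edg-mkgridmap.
# Reads edg-mkgridmap.conf-style lines on stdin and prints every "group" line
# whose vomss URL ends in a slash or contains "Group=".
# Returns 0 if nothing suspicious was found, 1 otherwise.
check_vomss_lines() {
    bad=0
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while read -r keyword url rest || [ -n "$keyword" ]; do
        [ "$keyword" = "group" ] || continue
        case "$url" in
            vomss://*) ;;          # only vomss URLs are checked
            *) continue ;;
        esac
        case "$url" in
            */|*Group=*)
                echo "suspicious: $url"
                bad=1
                ;;
        esac
    done
    return "$bad"
}
```

Possible usage: check_vomss_lines < /opt/vdt/edg/etc/edg-mkgridmap.conf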

  • re-create: run edg-mkgridmap

Note: All output (including edg-mkgridmap -help) goes to /opt/vdt/edg/log/edg-mkgridmap.log.

Doc: http://vdt.cs.wisc.edu/extras/edg-mkgridmap.html

man edg-mkgridmap.conf (7)

setup services to be started automatically

Usage from mailing list:

 vdt-register-service --name sshd -type init --init-script /opt/vdt/globus/sbin/SXXsshd --enable

Log entry from installing voms - service is installed as disabled

 /opt/vdt/vdt/sbin/vdt-register-service --name voms --type init --disable --init-script /opt/vdt/post-install/voms

Worked for me:

 /opt/vdt/vdt/sbin/vdt-register-service -name voms --enable
 /opt/vdt/vdt/sbin/vdt-register-service -name vomrs --enable --type init --init-script /opt/vdt/vomrs-1.3/etc/init.d/vomrs-wrap-all
 vdt-control --on voms 
 vdt-control --on vomrs 

Contents of hand-crafted /opt/vdt/vomrs-1.3/etc/init.d/vomrs-wrap-all

 #!/bin/sh
 #
 # Header written by hand by Vladimir
 # VDT_LOCATION = /opt/vdt
 #
 # chkconfig: 345 99 99
 # description: Virtual organization membership registration server
 ### BEGIN INIT INFO
 # Provides: vomrs
 # Required-Start: $network $mysql $voms $tomcat-5
 # Required-Stop:
 # Default-Start: 3 4 5
 # Default-Stop: 1 2 6
 # Description: Virtual organization membership registration server
 ### END INIT INFO
 
 if [ -e /opt/vdt/setup.sh ]; then source /opt/vdt/setup.sh; fi
 
 
 VOMRS_NAMES="BeSTGRID"
 ### could be automatically obtained from directory listings
 
 VOMRS_LOCATION=/opt/vdt/vomrs-1.3/
 export VOMRS_LOCATION
 
 for VO_NAME in $VOMRS_NAMES ; do
   /opt/vdt/vomrs-1.3/etc/init.d/vomrs "$@" "$VO_NAME"
 done

set up certificate request data

 $VDT_LOCATION/vdt/setup/setup-cert-request 

Apache / Tomcat 5 configuration

 <?xml version='1.0' encoding='UTF-8'?>
 <!DOCTYPE Server>
 <Server port='8005' shutdown='SHUTDOWN'>
   <Service name='Catalina'>
     <Connector sslProtocol='TLS' maxThreads='150' maxSpareThreads='75' secure='true' enableLookups='false' sslKey='/etc/grid-security/http/httpkey.pem' sslCAFiles='/etc/grid-security/certificates/*.0' crlFiles='/etc/grid-security/certificates/*.r0' minSpareThreads='25' disableUploadTimeout='true' sSLImplementation='org.glite.security.trustmanager.tomcat.TMSSLImplementation' acceptCount='100' clientAuth='true' debug='0' sslCertFile='/etc/grid-security/http/httpcert.pem' scheme='https' port='8443' log4jConfFile='/opt/vdt/tomcat/v5/conf/log4j-trustmanager.properties'/>
     <Engine name='Catalina' defaultHost='localhost'>
      <Logger className="org.apache.catalina.logger.FileLogger" prefix="catalina_log." suffix=".txt" timestamp="true"/>
       <Logger className="org.apache.catalina.logger.FileLogger" directory="logs"  prefix="localhost_log." suffix=".txt" timestamp="true"/>
       <Host name='localhost' appBase='webapps'/>
     </Engine>
   </Service>
 </Server>

Copy some .jar files to the right place

   # cd /opt/vomrs-1.3/server/lib && cp glite-security-trustmanager.jar glite-security-util-java.jar puretls.jar log4j-1.2.8.jar /opt/vdt/tomcat/v5/server/lib/
   # cd /opt/vdt/tomcat/v5/server/lib/ && chown daemon:daemon glite-security-trustmanager.jar glite-security-util-java.jar puretls.jar log4j-1.2.8.jar

Note: dissecting the default (VDT) Tomcat server.xml: it has almost the same content as the one asked for by APAC (except that it uses the Apache connector instead of the SSL one).

 Server contains
   Service name=Catalina contains
     x Connector
     1x Engine (special case of a Container)
         contains:
           Logger
           Realm (?some kind of data storage - linked to mem/fs/database)
           (virtual) Host appBase="webapps"
                 may contain Cluster, Valve(s),
                 contains Logger

Crypto

OpenSSL useful commands

To test server or client with openssl

 openssl s_client -host vdtcentos.bestgrid -port 8443 -cert ~/.globus/usercert.pem -key ~/.globus/userkey.pem -verify 0 -CApath /etc/grid-security/certificates/
 openssl s_server -accept 18443 -key /etc/grid-security/hostkey.pem -cert /etc/grid-security/hostcert.pem -verify 0 -CApath /etc/grid-security/certificates/

To export a PEM certificate into PKCS12:

 openssl pkcs12 -export -chain -inkey ~/.globus/userkey.pem -in ~/.globus/usercert.pem -out ~/.globus/usercert.p12 -CApath /etc/grid-security/certificates/ -name MyCertificateName

To import a PKCS12 certificate into PEM format:

 openssl pkcs12 -in ~/.globus/usercert.p12 -out ~/.globus/usercert+key.pem


To remove passphrase from an RSA key:

 umask 077
 openssl rsa -in ~/.globus/userkey.pem -out ~/.globus/userkey.pem

To set a passphrase for an RSA key (encrypting with Triple DES):

 umask 077
 openssl rsa -in ~/.globus/userkey.pem -3des -out ~/.globus/userkey.pem

To view certificate, certificate request, private key:

 openssl x509 -text -in ~/.globus/usercert.pem
 openssl req -text -in ~/.globus/usercert_request.pem
 openssl rsa -text -in ~/.globus/userkey.pem
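
A related check that is often useful is whether a certificate and a private key belong together. The sketch below compares the RSA modulus printed by the same openssl x509 and openssl rsa subcommands used above; cert_matches_key is a hypothetical helper name, and the approach assumes RSA keys.

```shell
#!/bin/sh
# cert_matches_key: hypothetical helper, not part of Globus or VDT.
# Succeeds (exit 0) if the RSA modulus of the certificate in $1 equals
# the RSA modulus of the private key in $2; exit 1 if they differ,
# exit 2 if openssl could not read one of the files.
cert_matches_key() {
    cert_mod=$(openssl x509 -noout -modulus -in "$1") || return 2
    key_mod=$(openssl rsa -noout -modulus -in "$2") || return 2
    [ "$cert_mod" = "$key_mod" ]
}
```

Possible usage: cert_matches_key ~/.globus/usercert.pem ~/.globus/userkey.pem && echo match (openssl asks for the passphrase if the key is encrypted).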

Grid Crypto commands

New user creation (with dummy CA)

As user:

 grid-cert-request -dir ~/.globus-other/ -nopw -verbose -cn "John Q Public" -int ### -int recommended

As root on machine with CA key:

 grid-ca-sign -in ~mencl/.globus-other/usercert_request.pem -out ~mencl/.globus-other/usercert.pem

If grid-ca-sign refuses to sign:

 openssl x509 -req -in ~mencl/.globus-testnamespace/usercert_request.pem -out ~mencl/.globus-testnamespace/usercert.pem -days 365 -set_serial 293 -CA /root/.globus/simpleCA/cacert.pem -CAkey /root/.globus/simpleCA/private/cakey.pem -extfile /root/.globus/simpleCA/grid-ca-ssl.conf -extensions x509v3_extensions

Host and Service certificate request

Service certificate (http service on host vdtcentos.bestgrid):

 $VDT_LOCATION/globus/bin/grid-cert-request -service http -host vdtcentos.bestgrid
   /C=NZ/O=BeSTGRID/OU=Advanced Technologies Group/CN=http/vdtcentos.bestgrid
 The private key is stored in /etc/grid-security/http/httpkey.pem
 The request is stored in /etc/grid-security/http/httpcert_request.pem


Issues with requesting certificates

  • $VDT_LOCATION/vdt/setup/setup-cert-request reports a sed error and leaves the grid-security.conf.1e12d831 file empty.
    • OK, it's fine to select the CA with -ca <ca-hash> when calling grid-cert-request
  • non-interactive grid-cert-request fails with an openssl error - EOF received when an email address is expected.
    • works in interactive mode (-int)
  • BeSTGRID UoC Test CA:
    1. does not ask for email address (and hence works fine with non-interactive grid-cert-request)
    2. asks for a second-level OU (and insists on putting it into DN, default 2nd OU=bestgrid)

Command lines that work:

  1. BeSTGRID UoC Test CA:
 grid-cert-request -dir ~/.globus-test/  -verbose -cn "Test Hest" -ca 76f36b70
  2. APACGrid CA:
 grid-cert-request -dir ~/.globus-test/  -verbose -cn "Test Hest" -ca 1e12d831 -int

Miscellaneous

Start services currently needed

 . /opt/vdt/setup.sh
 /opt/vdt/post-install/voms start
 VOMRS_LOCATION=/opt/vdt/vomrs-1.3/ /opt/vdt/vomrs-1.3/etc/init.d/vomrs start BeSTGRID
 /opt/vdt/post-install/apache start 

edg-mkgridmap problem

Problem:

 edg-mkgridmap
 /opt/vdt/edg/sbin/edg-mkgridmap: line 100: [: missing `]'

Fix:

--- edg-mkgridmap.orig-vdt      2006-12-18 13:07:39.000000000 +1300
+++ edg-mkgridmap       2007-03-05 16:50:39.000000000 +1300
@@ -97,7 +97,7 @@
   # overwrite the grid-mapfile unless it's changed. (See below)
   # We also make sure that ${GRIDMAP}.new is empty if we don't have
   # an existing grid-mapfile.
-  if [ -e ${GRIDMAP}.new]; then
+  if [ -e ${GRIDMAP}.new ]; then
     rm ${GRIDMAP}.new
   fi
   if [ -e ${GRIDMAP} ]; then

Doc: http://vdt.cs.wisc.edu/extras/edg-mkgridmap.html

GRIS

From VPAC wiki: GRIS is old and fairly useless. We're using MonALISA for grid info at the moment.

User management issues

  • http://www.vpac.org/twiki/bin/view/APACgrid/PlanGridStaging
  • if a user already has a local account, use it, otherwise use the appropriate 'generic user' for the project.
  • (eg) "access to Abaqus is only available via the grid when run as existing logon accounts".
  • ensure that grid usage mapped to a logon user is not counted twice by local sites.
  • ...=> Virtual Account

Understanding Job Ids

A job may have two externally visible IDs. Both have a similar form (a long string of hexadecimal digits), but they differ for a single job.

One of these is the Idempotence ID. It is generated by the client and uniquely identifies the client's attempt to submit the job. It provides a way to find out whether a particular submission attempt succeeded, which avoids double submission when job submission is interrupted and the client thinks it needs to resubmit the task. Example: c61344ae-e344-11dc-ac0e-00163e84b599.

The other is the job ResourceID. It is generated by the server, forms part of the job's end-point reference (EPR), and is used throughout the Globus server to identify the job. Example: c698b490-e344-11dc-8ba9-ffac443c90f7.

Idempotence ID is used in the following:

  • When submitting a job with globusrun-ws -submit, the ID printed on standard output of globusrun-ws is the idempotence ID.
  • When submitting a job with globusrun-ws with streaming (-s), the standard output and error files in the grid user's home directory are named after the idempotence ID (~/${IDEMPOTENCE_ID}.{stdout,stderr}). This happens because the job description is created by the globusrun-ws client.

ResourceID is used in the following:

  • The directory created for the PBS submit scripts in ~/.globus is named after the ResourceID.
  • If a job delegated proxy is stored in ~/.globus/gram_job_proxy_some_hex_id, the file is named based on the ResourceID.

Furthermore, there is also a Local ID. In the case of the Fork scheduler it takes a similar hexadecimal form, but is again based on a number different from both the ResourceID and the Idempotence ID, and has the PID of the process appended. Example: 1cba76d2-e346-11dc-9053-00163e8b5002:28685 (with the ResourceID being 1c4a9ce0-e346-11dc-8ba9-ffac443c90f7 and the Idempotence ID being 1bc60ee4-e346-11dc-a13b-00163e84b599). For other local schedulers, the local job ID is generated by the scheduler, and typically includes the hostname of the cluster's headnode and a sequence number.
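
When matching a Fork-scheduler Local ID against running processes, the two parts can be separated with plain shell parameter expansion. This is a small sketch that only assumes the <hex-id>:<pid> form of the example above; split_local_id is a hypothetical helper name.

```shell
#!/bin/sh
# split_local_id: hypothetical helper for illustration only.
# Splits a Fork-scheduler Local ID of the form <hex-id>:<pid> into its
# two parts. Returns 1 if the argument contains no colon.
split_local_id() {
    local_id=$1
    case "$local_id" in
        *:*) ;;
        *)
            echo "not a Fork-style Local ID: $local_id" >&2
            return 1
            ;;
    esac
    # ${var%:*} drops the last ":..." suffix, ${var##*:} keeps only it.
    echo "id=${local_id%:*} pid=${local_id##*:}"
}
```

Possible usage: split_local_id 1cba76d2-e346-11dc-9053-00163e8b5002:28685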

Debugging pbs.pm

Andrew Sharpe sent me a number of tips on how to debug what's happening in PBS.pm. The crucial part is this patch to several scripts around pbs.pm: http://www.hpc.jcu.edu.au/projects/apac/svn/gateway/globus/globus_perl.patch

The patch adds the missing pieces to allow the JobManager framework to log to a file.

Additional logging code may go directly to pbs.pm:

  • this bit goes near the top
  if(defined($self->{logdir})) {
      $description->save($self->{logdir} . "/description.pl");
  }
  • this bit goes just before submission
  if(defined($self->{logdir})) {
      system("cp $pbs_job_script_name $self->{logdir}/pbs.sh");
  }
  • then all you have to do to enable the extras is uncomment the following line in $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager.pm (about line 90)
  $self->{logdir} = "/tmp/" . $ENV{'USER'} . "/" . $id;

Get gridftplist working

The gridftplist command is a part of the SRM-V1-Client VDT package. When invoked, it complains that SRM_PATH is not set. The Apache Commons Logging library also needs to be added to the class path. Thus, the patch to get the command working is:

--- /opt/vdt/srm-v1-client/bin/gridftplist.orig      2006-11-22 11:53:19.000000000 +1300
+++ /opt/vdt/srm-v1-client/bin/gridftplist 2008-03-04 11:32:56.000000000 +1300
@@ -1,5 +1,9 @@
 #! /bin/sh
 
+### VM ###
+if [ -z "$SRM_PATH" ] ; then
+  SRM_PATH=/opt/vdt/srm-v1-client
+fi
 #DEBUG=true
 #SECURITY_DEBUG=true 
 #DEBUG=false
@@ -19,6 +23,9 @@
 SRM_CP=$SRM_PATH/lib/srm_client.jar
 SRM_CP=$SRM_CP:$SRM_PATH/lib/srm.jar
 
+### VM ###
+SRM_CP=$SRM_CP:$SRM_PATH/lib/axis/commons-logging-1.0.4.jar
+
 # globus cog
 SRM_CP=$SRM_CP:$SRM_PATH/lib/globus/cryptix.jar
 SRM_CP=$SRM_CP:$SRM_PATH/lib/globus/ce-jdk13-120.jar

However, I would rather recommend installing UberFTP, which provides a nice text-mode shell in the style of the traditional unix FTP client.

Recommended reading

  1. Globus Toolkit Primer PDF
  2. VOMRS User Guide PDF - VOMRS Glossary

Grid related problems I've been struggling with

... and hopefully solved.

RFT staging fails

I occasionally saw a gateway rejecting all job staging, including the implicit cleanup stage, which occurs even for jobs submitted with a globusrun-ws -submit -s ... -c command. The error message was:

   globusrun-ws: Job failed: Staging error for RSL element fileCleanUp.

After examining container-real.log, I saw that RFT's start method was failing with a NullPointerException. Further reading revealed that RFT had failed to create a MySQL connection at the time it was started, as also shown by the following excerpt from the Globus-WS container startup captured in container-real.log:

2008-01-15 17:38:04,733 WARN  service.ReliableFileTransferHome [main,initialize:97] \
   All RFT requests will fail and all GRAM jobs that require file staging will      \
   fail.com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.net.ConnectException
MESSAGE: Connection refused

STACKTRACE:

java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:520)
        at java.net.Socket.connect(Socket.java:470)
        at java.net.Socket.<init>(Socket.java:367)
        at java.net.Socket.<init>(Socket.java:209)
        at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:173)
        at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:268)
        at com.mysql.jdbc.Connection.createNewIO(Connection.java:2745)

It is crystal clear: MySQL wasn't running when RFT was starting, so RFT initialization failed and all job staging fails as a result. The problem can be solved by restarting the Globus-WS container. However, I have also looked at why this can happen in the first place.

# ls -1 /etc/rc.d/rc3.d/S99*
/etc/rc.d/rc3.d/S99globus-ws
/etc/rc.d/rc3.d/S99local
/etc/rc.d/rc3.d/S99mysql

Yes, that explains it: globus-ws is started before mysql, so it comes down to a race between mysql finishing its startup and the globus-ws container reaching, in its background initialization, the point where it starts up RFT....
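
To see the effective order at a glance, the S* links can be listed the way the shell expands them. This is a sketch assuming that /etc/rc.d/rc iterates over the S* links in shell glob order, which is what makes S99globus-ws come before S99mysql; start_order is a hypothetical helper name.

```shell
#!/bin/sh
# start_order: hypothetical helper for illustration only.
# Prints the names of the S* entries of an rc directory, one per line,
# in the order in which the shell expands the glob.
start_order() {
    rcdir=$1
    for link in "$rcdir"/S*; do
        # Skip the unexpanded pattern when the directory has no S* entries.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link"
    done
}
```

Possible usage: start_order /etc/rc.d/rc3.d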

Well, I did not decide on the startup order, vdt-control did.... I have reported this to vdt-discuss and I'll see whether a fix emerges. Otherwise, a manual edit of the symlinks in rc.d would do...

Fixing startup order

To fix the startup order to avoid the above problem, run the following commands:

sed '/^# chkconfig:/c # chkconfig: 345 97 09' --in-place=.ORI /etc/rc.d/init.d/mysql 
sed '/^# chkconfig:/c # chkconfig: 345 98 04' --in-place=.ORI /etc/rc.d/init.d/globus-ws 
chkconfig mysql reset
chkconfig globus-ws reset
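
To confirm what the edited headers now say, the "# chkconfig:" line can be read back from each init script. The sketch below only parses the header format used in the sed commands above (runlevels, start priority, stop priority); chkconfig_header is a hypothetical helper name.

```shell
#!/bin/sh
# chkconfig_header: hypothetical helper for illustration only.
# Prints the runlevels, start priority and stop priority found in the
# first "# chkconfig:" line of the init script given as $1.
chkconfig_header() {
    sed -n 's/^# chkconfig:[[:space:]]*\([^[:space:]]*\)[[:space:]]*\([0-9]*\)[[:space:]]*\([0-9]*\).*/levels=\1 start=\2 stop=\3/p' "$1" | head -n 1
}
```

Possible usage: chkconfig_header /etc/rc.d/init.d/mysql; chkconfig_header /etc/rc.d/init.d/globus-ws. With the values above, mysql gets start priority 97 and globus-ws gets 98, so mysql is started first.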

Fixing shutdown

In CentOS 4.6 (I can't tell for earlier releases), /etc/rc.d/rc won't run the shutdown sequence for services which did not put a stamp with their name in /var/lock/subsys. On Ng2 gateways, that particularly applies to mysql and globus-ws. The VDT-created control scripts do not do that, and consequently globus-ws and mysql won't shut down cleanly when the gateway is shut down. Below are patches which add proper interaction with the CentOS subsystem management to the VDT-created scripts.

Note that even after applying these patches, to have the services stopped correctly the next time the virtual machine shuts down, you have to manually create the subsystem stamps:

touch /var/lock/subsys/{globus-ws,mysql}
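
Whether the stamps are in place can be checked before a shutdown. This is a minimal sketch that only tests for the stamp files described above; missing_subsys_locks is a hypothetical helper name.

```shell
#!/bin/sh
# missing_subsys_locks: hypothetical helper for illustration only.
# Usage: missing_subsys_locks <lock-directory> <service>...
# Prints the name of every listed service that has no stamp file in the
# lock directory.
missing_subsys_locks() {
    lockdir=$1
    shift
    for svc in "$@"; do
        [ -e "$lockdir/$svc" ] || echo "$svc"
    done
}
```

Possible usage: missing_subsys_locks /var/lock/subsys globus-ws mysql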

Apply this patch to /etc/rc.d/init.d/globus-ws:

--- globus-ws-fixed-start-seq   2008-02-05 16:29:47.000000000 +1300
+++ globus-ws   2008-02-12 16:00:18.000000000 +1300
@@ -37,6 +37,7 @@
     fi

     container_exit=$?
+    if [ $container_exit -eq 0 ] ; then touch /var/lock/subsys/globus-ws ; fi

     if [ $container_exit -eq 3 ]; then
         # Error 3 means that it is already running. We don't consider that to be an error
@@ -47,6 +48,7 @@

 elif [ "$1" = "stop" ] ; then
     $VDT_LOCATION/globus/sbin/globus-stop-container-detached
+    rm -f /var/lock/subsys/globus-ws
 else

   echo "Usage: [start | stop]"


And this one to /etc/rc.d/init.d/mysql:

--- mysql-fixed-start-order     2008-02-05 16:29:43.000000000 +1300
+++ mysql       2008-02-12 15:50:45.000000000 +1300
@@ -219,6 +219,7 @@
       then
         touch /opt/vdt/mysql/var/mysql
       fi
+      touch /var/lock/subsys/mysql
     else
       log_failure_msg "Can't execute $bindir/mysqld_safe"
     fi
@@ -240,6 +241,7 @@
       then
         rm -f /opt/vdt/mysql/var/mysql
       fi
+      rm -f /var/lock/subsys/mysql
     else
       log_failure_msg "MySQL PID file could not be found!"
     fi

/C=NZ/O=BeSTGRID certificates not being recognized

That happened when I had an old CA bundle in ~/.globus/certificates that still contained the old signing_policy for the APACGrid CA, and this bundle took priority over /etc/grid-security/certificates. Until recently, the old bundle was still distributed with Grix, but Markus has since fixed the Grix distribution and also included a provision to replace the old bundle in the user's ~/.globus/certificates directory with a new one.

GPIcalc job not running for a user

If the compile job of gpicalc fails with a GridFTP error, check that the user's credentials are accepted at ng2.vpac.org (where the source code for gpicalc is transferred from). That basically means that only members of NGAdmin can run the gpicalc test job.

GPIcalc compile job failing on HPC

The GPIcalc compile job may fail in the CleanUp state: this is because the job assumes a .o file would be created as an intermediate product by the Fortran compiler, and the compile job comes with two CleanUp directives: remove the source code (.f) and the .o file, and leave only the executable. The issue can be easily fixed by removing the directive for the .o file from the job description.