Vladimir's grid notes
From BeSTGRID
Various notes I'm taking while working on setting up the Grid Node
Grid Setup TODO (both local config and gateway node)
create local gridmap configuration
- list local users in /opt/vdt/edg/etc/grid-mapfile-local
- list VOMSes in /opt/vdt/edg/etc/edg-mkgridmap.conf
Works for me:
group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID/Universities/Canterbury ucgriduser
group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID griduser
group vomss://vdtcentos.bestgrid:8443/voms/BeSTGRID?/BeSTGRID/Role=GridUser griduser
Beware: no trailing slashes and no Group= in vomss URL! Otherwise, vomss silently fails (output in /opt/vdt/edg/log/edg-mkgridmap.log), reporting VOMS Internal Server Error.
- re-create: run edg-mkgridmap
Note: All output (including edg-mkgridmap -help) goes to /opt/vdt/edg/log/edg-mkgridmap.log.
Doc: http://vdt.cs.wisc.edu/extras/edg-mkgridmap.html
man edg-mkgridmap.conf (7)
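Since the failure is silent, the two pitfalls noted above (a trailing slash or a Group= component in a vomss URL) could be caught before running edg-mkgridmap. The following is only a sketch: the function name check_vomss_urls is made up for illustration, and it assumes each entry has the "group URL account" layout shown above.

```shell
# check_vomss_urls FILE   (hypothetical helper, not part of VDT)
# Prints every "group vomss://..." line whose URL ends in a slash or
# contains "Group=", prefixed with its line number.
# Returns 1 if any such line is found, 0 otherwise.
check_vomss_urls() {
    conf=$1
    bad=$(awk '$1 == "group" && $2 ~ /^vomss:\/\// {
                   if ($2 ~ /\/$/ || $2 ~ /Group=/) print NR ": " $0
               }' "$conf")
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Possible use:
#   check_vomss_urls /opt/vdt/edg/etc/edg-mkgridmap.conf && edg-mkgridmap
```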
setup services to be started automatically
Usage from mailing list:
vdt-register-service --name sshd -type init --init-script /opt/vdt/globus/sbin/SXXsshd --enable
Log entry from installing voms - service is installed as disabled
/opt/vdt/vdt/sbin/vdt-register-service --name voms --type init --disable --init-script /opt/vdt/post-install/voms
Worked for me:
/opt/vdt/vdt/sbin/vdt-register-service -name voms --enable
/opt/vdt/vdt/sbin/vdt-register-service -name vomrs --enable --type init --init-script /opt/vdt/vomrs-1.3/etc/init.d/vomrs-wrap-all
vdt-control --on voms
vdt-control --on vomrs
Contents of hand-crafted /opt/vdt/vomrs-1.3/etc/init.d/vomrs-wrap-all
#!/bin/sh
#
# Header written by hand by Vladimir
# VDT_LOCATION = /opt/vdt
#
# chkconfig: 345 99 99
# description: Virtual organization membership registration server
### BEGIN INIT INFO
# Provides: voms
# Required-Start: $network $mysql $voms $tomcat-5
# Required-Stop:
# Default-Start: 3 4 5
# Default-Stop: 1 2 6
# Description: Virtual organization membership registration server
### END INIT INFO

if [ -e /opt/vdt/setup.sh ]; then source /opt/vdt/setup.sh; fi

VOMRS_NAMES="BeSTGRID" ### could be automatically obtained from directory listings
VOMRS_LOCATION=/opt/vdt/vomrs-1.3/
export VOMRS_LOCATION

for VO_NAME in $VOMRS_NAMES ; do
    /opt/vdt/vomrs-1.3/etc/init.d/vomrs "$@" "$VO_NAME"
done
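The comment in the wrapper says the VO names could be obtained from directory listings. A possible sketch of that idea follows. It assumes there is some directory holding one subdirectory per VO; these notes do not say which directory that is, so it is passed in as a parameter, and list_vo_names is a made-up name.

```shell
# list_vo_names DIR   (hypothetical helper)
# Prints the names of the immediate subdirectories of DIR on one line,
# separated by spaces, in a form suitable for VOMRS_NAMES.
list_vo_names() {
    names=
    for d in "$1"/*/; do
        [ -d "$d" ] || continue      # nothing matched: the pattern stays literal
        d=${d%/}                     # drop the trailing slash
        names="$names${names:+ }${d##*/}"
    done
    printf '%s\n' "$names"
}

# Possible use in the wrapper (the path is an assumption, not from VOMRS docs):
#   VOMRS_NAMES=$(list_vo_names /path/to/per-VO/directories)
```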
set up certificate request data
$VDT_LOCATION/vdt/setup/setup-cert-request
Apache / Tomcat 5 configuration
- Source: http://www.vpac.org/twiki/bin/view/APACgrid/VmdetailsVomrs#Step_Four_Configuring_the_VOMS_V
- Do not run Apache linked with Tomcat (JkMount); instead, configure Tomcat with an extra HTTPS connector
- Back up the current tomcat-5 /opt/vdt/tomcat/v5/conf/server.xml and create a new one
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE Server>
<Server port='8005' shutdown='SHUTDOWN'>
<Service name='Catalina'>
<Connector sslProtocol='TLS' maxThreads='150' maxSpareThreads='75' secure='true' enableLookups='false' sslKey='/etc/grid-security/http/httpkey.pem' sslCAFiles='/etc/grid-security/certificates/*.0' crlFiles='/etc/grid-security/certificates/*.r0' minSpareThreads='25' disableUploadTimeout='true' sSLImplementation='org.glite.security.trustmanager.tomcat.TMSSLImplementation' acceptCount='100' clientAuth='true' debug='0' sslCertFile='/etc/grid-security/http/httpcert.pem' scheme='https' port='8443' log4jConfFile='/opt/vdt/tomcat/v5/conf/log4j-trustmanager.properties'/>
<Engine name='Catalina' defaultHost='localhost'>
<Logger className="org.apache.catalina.logger.FileLogger" prefix="catalina_log." suffix=".txt" timestamp="true"/>
<Logger className="org.apache.catalina.logger.FileLogger" directory="logs" prefix="localhost_log." suffix=".txt" timestamp="true"/>
<Host name='localhost' appBase='webapps'/>
</Engine>
</Service>
</Server>
Copy some .jar files to the right place
# cd /opt/vomrs-1.3/server/lib && cp glite-security-trustmanager.jar glite-security-util-java.jar puretls.jar log4j-1.2.8.jar /opt/vdt/tomcat/v5/server/lib/
# cd /opt/vdt/tomcat/v5/server/lib/ && chown daemon:daemon glite-security-trustmanager.jar glite-security-util-java.jar puretls.jar log4j-1.2.8.jar
Note: dissecting the default (VDT) Tomcat server.xml: it has almost the same content as asked for by APAC (except for an Apache connector instead of SSL).
Server contains
- Service name=Catalina, which contains
  - x Connector
  - 1x Engine (special case of a Container), which contains:
    - Logger
    - Realm (presumably some kind of data storage, linked to memory/filesystem/database)
    - (virtual) Host appBase="webapps"
      - may contain Cluster, Valve(s)
      - contains Logger
Crypto
OpenSSL useful commands
To test server or client with openssl
openssl s_client -host vdtcentos.bestgrid -port 8443 -cert ~/.globus/usercert.pem -key ~/.globus/userkey.pem -verify 0 -CApath /etc/grid-security/certificates/
openssl s_server -accept 18443 -key /etc/grid-security/hostkey.pem -cert /etc/grid-security/hostcert.pem -verify 0 -CApath /etc/grid-security/certificates/
To export a PEM certificate into PKCS12:
openssl pkcs12 -export -chain -inkey ~/.globus/userkey.pem -in ~/.globus/usercert.pem -out ~/.globus/usercert.p12 -CApath /etc/grid-security/certificates/ -name MyCertificateName
To import a PKCS12 certificate into PEM format:
openssl pkcs12 -in ~/.globus/usercert.p12 -out ~/.globus/usercert+key.pem
To remove passphrase from an RSA key:
umask 077
openssl rsa -in ~/.globus/userkey.pem -out ~/.globus/userkey.pem
To set a passphrase for an RSA key (encrypting with Triple DES):
umask 077
openssl rsa -in ~/.globus/userkey.pem -3des -out ~/.globus/userkey.pem
To view certificate, certificate request, private key:
openssl x509 -text -in ~/.globus/usercert.pem
openssl req -text -in ~/.globus/usercert_request.pem
openssl rsa -text -in ~/.globus/userkey.pem
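The umask 077 in the passphrase commands above is there so that a rewritten key file is not readable by group or others. Whether a key file ended up private could be checked with a sketch like the one below. The name key_is_private is made up, and the check relies on the standard "ls -l" mode string layout.

```shell
# key_is_private FILE   (hypothetical helper)
# Succeeds if FILE is a regular file with no permission bits set for
# group or others (for example mode 600 or 400).
# Returns 2 if FILE is not a regular file, 1 if it is too permissive.
key_is_private() {
    [ -f "$1" ] || return 2
    # Characters 5-10 of the mode string are the group and other bits.
    perms=$(ls -ld -- "$1" | cut -c5-10)
    [ "$perms" = "------" ]
}

# Possible use:
#   key_is_private ~/.globus/userkey.pem || echo "userkey.pem is too permissive"
```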
Grid Crypto commands
New user creation (with dummy CA)
As user:
grid-cert-request -dir ~/.globus-other/ -nopw -verbose -cn "John Q Public" -int ### -int recommended
As root on machine with CA key:
grid-ca-sign -in ~mencl/.globus-other/usercert_request.pem -out ~mencl/.globus-other/usercert.pem
If grid-ca-sign refuses to sign:
openssl x509 -req -in ~mencl/.globus-testnamespace/usercert_request.pem -out ~mencl/.globus-testnamespace/usercert.pem -days 365 -set_serial 293 -CA /root/.globus/simpleCA/cacert.pem -CAkey /root/.globus/simpleCA/private/cakey.pem -extfile /root/.globus/simpleCA/grid-ca-ssl.conf -extensions x509v3_extensions
Host and Service certificate request
Host certificate:
$VDT_LOCATION/globus/bin/grid-cert-request -service http -host vdtcentos.bestgrid
/C=NZ/O=BeSTGRID/OU=Advanced Technologies Group/CN=http/vdtcentos.bestgrid
The private key is stored in /etc/grid-security/http/httpkey.pem
The request is stored in /etc/grid-security/http/httpcert_request.pem
Issues with requesting certificates
- $VDT_LOCATION/vdt/setup/setup-cert-request reports a sed error and leaves the grid-security.conf.1e12d831 file empty.
- OK, it's fine to select the CA with -ca <ca-hash> when calling grid-cert-request
- non-interactive grid-cert-request fails with an openssl error - EOF received when an email address is expected.
- works in interactive mode (-int)
- BeSTGRID UoC Test CA:
- does not ask for email address (and hence works fine with non-interactive grid-cert-request)
- asks for a second-level OU (and insists on putting it into DN, default 2nd OU=bestgrid)
Command lines that work:
- BeSTGRID UoC Test CA:
grid-cert-request -dir ~/.globus-test/ -verbose -cn "Test Hest" -ca 76f36b70
- APACGrid CA:
grid-cert-request -dir ~/.globus-test/ -verbose -cn "Test Hest" -ca 1e12d831 -int
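The difference between the two working command lines could be captured in a small helper that picks the flags by CA hash. This is a sketch only: cert_request_flags is a made-up name, and the default branch assumes that a CA not listed in these notes works non-interactively, which has not been tested.

```shell
# cert_request_flags CA_HASH   (hypothetical helper)
# Prints the CA-related grid-cert-request flags recorded in the notes above.
cert_request_flags() {
    case $1 in
        1e12d831) printf '%s\n' "-ca $1 -int" ;;  # APACGrid CA: asks for an email, needs -int
        76f36b70) printf '%s\n' "-ca $1" ;;       # BeSTGRID UoC Test CA: non-interactive works
        *)        printf '%s\n' "-ca $1" ;;       # assumption: other CAs behave like the Test CA
    esac
}

# Possible use (unquoted on purpose, so the flags split into words):
#   grid-cert-request -dir ~/.globus-test/ -verbose -cn "Test Hest" $(cert_request_flags 1e12d831)
```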
Miscellaneous
Start services currently needed
. /opt/vdt/setup.sh
/opt/vdt/post-install/voms start
VOMRS_LOCATION=/opt/vdt/vomrs-1.3/ /opt/vdt/vomrs-1.3/etc/init.d/vomrs start BeSTGRID
/opt/vdt/post-install/apache start
edg-mkgridmap problem
Problem:
edg-mkgridmap
/opt/vdt/edg/sbin/edg-mkgridmap: line 100: [: missing `]'
Fix:
--- edg-mkgridmap.orig-vdt 2006-12-18 13:07:39.000000000 +1300
+++ edg-mkgridmap 2007-03-05 16:50:39.000000000 +1300
@@ -97,7 +97,7 @@
# overwrite the grid-mapfile unless it's changed. (See below)
# We also make sure that ${GRIDMAP}.new is empty if we don't have
# an existing grid-mapfile.
- if [ -e ${GRIDMAP}.new]; then
+ if [ -e ${GRIDMAP}.new ]; then
rm ${GRIDMAP}.new
fi
if [ -e ${GRIDMAP} ]; then
Doc: http://vdt.cs.wisc.edu/extras/edg-mkgridmap.html
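The error fixed by the patch can be reproduced in isolation. "[" is an ordinary command that requires "]" as its own final argument, so without the space the closing bracket becomes part of the file name. The sketch below uses a made-up path and runs each form in a child shell so that only the exit status is recorded.

```shell
# Broken form: "]" is glued to the last word, so "[" never sees its
# closing argument and exits with an error status (greater than 1).
broken_status=0
sh -c 'GRIDMAP=/nonexistent/grid-mapfile; [ -e ${GRIDMAP}.new]' 2>/dev/null || broken_status=$?

# Fixed form, as in the patch, with the expansion quoted as well.
# The file does not exist, so the test is simply false (status 1).
fixed_status=0
sh -c 'GRIDMAP=/nonexistent/grid-mapfile; [ -e "${GRIDMAP}.new" ]' || fixed_status=$?

echo "broken form exit status: $broken_status"
echo "fixed form exit status:  $fixed_status"
```

Quoting the expansion is not part of the original patch; it only guards against paths containing spaces.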
GRIS
From VPAC wiki: GRIS is old and fairly useless. We're using MonALISA for grid info at the moment.
User management issues
- http://www.vpac.org/twiki/bin/view/APACgrid/PlanGridStaging
- if a user already has a local account, use it, otherwise use the appropriate 'generic user' for the project.
- (eg) "access to Abaqus is only available via the grid when run as existing logon accounts".
- ensure that grid usage mapped to a logon user is not counted twice by local sites.
- ...=> Virtual Account
Understanding Job Ids
A job may have two externally visible IDs. Both take the form of a long string of hexadecimal digits, but they differ for a single job.
One of these is the Idempotence ID. It is generated by the client and uniquely identifies the client's attempt to submit the job: it gives the client a way to find out whether a particular submission attempt succeeded, and so avoids double submission when a submission is interrupted and the client thinks it needs to resubmit the task. Example: c61344ae-e344-11dc-ac0e-00163e84b599.
The other is the job ResourceID. It is generated by the server, forms part of the job's end-point reference (EPR), and is used throughout the Globus server to identify the job. Example: c698b490-e344-11dc-8ba9-ffac443c90f7.
Idempotence ID is used in the following:
- When submitting a job with globusrun-ws -submit, the ID printed on standard output of globusrun-ws is the idempotence ID.
- When submitting a job with globusrun-ws with streaming (-s), the standard output and error files in the grid user's home directory are named after the idempotence ID (~/${IDEMPOTENCE_ID}.{stdout,stderr}). This happens because the job description is created by the globusrun-ws client.
ResourceID is used in the following:
- The directory created for the PBS submit scripts in ~/.globus is named after the ResourceID.
- If a job delegated proxy is stored in ~/.globus/gram_job_proxy_some_hex_id, the file is named based on the ResourceID.
Furthermore, there is also a Local ID, which in the case of the Fork scheduler takes a similar hexadecimal form, but is again based on a number different from both the ResourceID and the Idempotence ID, and has the PID of the process appended. Example: 1cba76d2-e346-11dc-9053-00163e8b5002:28685 (with the ResourceID being 1c4a9ce0-e346-11dc-8ba9-ffac443c90f7 and the Idempotence ID being 1bc60ee4-e346-11dc-a13b-00163e84b599). For other local schedulers, the local job ID is generated by the scheduler, and typically includes the hostname of the cluster's headnode and a sequence number.
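When chasing a job through the logs, it can help to take these IDs apart mechanically. The sketch below splits a Fork-style Local ID into its hexadecimal part and PID, and derives the streaming output file names from an Idempotence ID. Both function names are made up for illustration.

```shell
# split_fork_local_id LOCAL_ID   (hypothetical helper)
# Prints "HEXPART PID" for a Fork-style local ID of the form HEXPART:PID.
# Returns 1 if there is no ":PID" suffix.
split_fork_local_id() {
    case $1 in
        *:*) printf '%s %s\n' "${1%:*}" "${1##*:}" ;;
        *)   return 1 ;;
    esac
}

# stream_files IDEMPOTENCE_ID [HOME_DIR]   (hypothetical helper)
# Prints the stdout and stderr file names that globusrun-ws -s uses,
# one per line; HOME_DIR defaults to $HOME.
stream_files() {
    home=${2:-$HOME}
    printf '%s\n' "$home/$1.stdout" "$home/$1.stderr"
}

# Possible use:
#   split_fork_local_id 1cba76d2-e346-11dc-9053-00163e8b5002:28685
#   stream_files c61344ae-e344-11dc-ac0e-00163e84b599
```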
Debugging pbs.pm
Andrew Sharpe sent me a number of tips on how to debug what's happening in PBS.pm. The crucial part is this patch to several scripts around pbs.pm: http://www.hpc.jcu.edu.au/projects/apac/svn/gateway/globus/globus_perl.patch
The patch adds the missing pieces to allow the JobManager framework to log to a file.
Additional logging code may go directly to pbs.pm:
- this bit goes near the top
if(defined($self->{logdir})) {
$description->save($self->{logdir} . "/description.pl");
}
- this bit goes just before submission
if(defined($self->{logdir})) {
system("cp $pbs_job_script_name $self->{logdir}/pbs.sh");
}
- then all you have to do to enable the extras is uncomment the following line in $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager.pm (about line 90)
$self->{logdir} = "/tmp/" . $ENV{'USER'} . "/" . $id;
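With the extras enabled, each job gets its own directory under /tmp/$USER holding description.pl and pbs.sh. A sketch for locating the most recent one and checking what was captured follows. The function names are made up, and picking the "latest" directory by modification time is an assumption about how one would want to browse these logs.

```shell
# latest_job_logdir BASE   (hypothetical helper)
# Prints the most recently modified subdirectory of BASE.
# Returns 1 if BASE has no subdirectories.
latest_job_logdir() {
    latest=$(ls -td "$1"/*/ 2>/dev/null | head -n 1)
    [ -n "$latest" ] || return 1
    printf '%s\n' "${latest%/}"
}

# show_job_log LOGDIR   (hypothetical helper)
# Reports which of the two captured files exist in one job log directory.
show_job_log() {
    for f in description.pl pbs.sh; do
        if [ -f "$1/$f" ]; then
            echo "found: $1/$f"
        else
            echo "missing: $1/$f"
        fi
    done
}

# Possible use:
#   show_job_log "$(latest_job_logdir "/tmp/$USER")"
```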
Get gridftplist working
The gridftplist command is a part of the SRM-V1-Client VDT package. When invoked, it complains about SRM_PATH not set. It also needs to add the Apache logging API to the class path. Thus, the patch to get the command working is:
--- /opt/vdt/srm-v1-client/bin/gridftplist.orig 2006-11-22 11:53:19.000000000 +1300
+++ /opt/vdt/srm-v1-client/bin/gridftplist 2008-03-04 11:32:56.000000000 +1300
@@ -1,5 +1,9 @@
 #! /bin/sh
+### VM ###
+if [ -z "$SRM_PATH" ] ; then
+ SRM_PATH=/opt/vdt/srm-v1-client
+fi
 #DEBUG=true
 #SECURITY_DEBUG=true
 #DEBUG=false
@@ -19,6 +23,9 @@
 SRM_CP=$SRM_PATH/lib/srm_client.jar
 SRM_CP=$SRM_CP:$SRM_PATH/lib/srm.jar
+### VM ###
+SRM_CP=$SRM_CP:$SRM_PATH/lib/axis/commons-logging-1.0.4.jar
+
 # globus cog
 SRM_CP=$SRM_CP:$SRM_PATH/lib/globus/cryptix.jar
 SRM_CP=$SRM_CP:$SRM_PATH/lib/globus/ce-jdk13-120.jar
However, I would rather recommend installing UberFTP, which provides a nice text-mode shell in the style of the traditional unix FTP client.
Recommended reading
- Globus Toolkit Primer PDF
- VOMRS User Guide PDF
- VOMRS Glossary
Grid related problems I've been struggling with
... and hopefully solved.
RFT staging fails
I occasionally saw a gateway rejecting all job staging, including the implicit cleanup stage which occurs even for jobs submitted with globusrun-ws -s -s submit ... -c command. The error message was:
globusrun-ws: Job failed: Staging error for RSL element fileCleanUp.
After examining container-real.log, I saw that RFT's start method was failing with a NullPointer exception. Further reading revealed that the RFT failed to create a MySQL connection at the time it was started - as also shown by the following excerpt from the Globus-WS container startup captured in container-real.log:
2008-01-15 17:38:04,733 WARN service.ReliableFileTransferHome [main,initialize:97] \
All RFT requests will fail and all GRAM jobs that require file staging will \
fail.com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:
** BEGIN NESTED EXCEPTION **
java.net.ConnectException
MESSAGE: Connection refused
STACKTRACE:
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:520)
at java.net.Socket.connect(Socket.java:470)
at java.net.Socket.<init>(Socket.java:367)
at java.net.Socket.<init>(Socket.java:209)
at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:173)
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:268)
at com.mysql.jdbc.Connection.createNewIO(Connection.java:2745)
It is crystal clear: MySQL wasn't running when RFT was starting, RFT initialization failed, and all job staging then fails. The problem can be solved by restarting the Globus-WS container. However, I have also looked at why it can happen in the first place.
# ls -1 /etc/rc.d/rc3.d/S99*
/etc/rc.d/rc3.d/S99globus-ws
/etc/rc.d/rc3.d/S99local
/etc/rc.d/rc3.d/S99mysql
Yes, that explains it: both scripts have the same priority (S99), so globus-ws is started before mysql, and it boils down to a race condition: whether mysql manages to start before the globus-ws container, initializing in the background, gets to the point where it starts up RFT.
Well, I did not decide on the startup order, vdt-control did. I have reported this to vdt-discuss and I'll see whether a fix emerges. Otherwise, a manual edit of the symlinks in rc.d would do.
Fixing startup order
To fix the startup order to avoid the above problem, run the following commands:
sed '/^# chkconfig:/c # chkconfig: 345 97 09' --in-place=.ORI /etc/rc.d/init.d/mysql
sed '/^# chkconfig:/c # chkconfig: 345 98 04' --in-place=.ORI /etc/rc.d/init.d/globus-ws
chkconfig mysql reset
chkconfig globus-ws reset
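Whether the new priorities took effect can be read off the start links in the runlevel directory. The sketch below compares the link names of two services. The name starts_before is made up, and the check assumes that rc runs the S* links in plain lexical order of their names.

```shell
# starts_before RCDIR FIRST SECOND   (hypothetical helper)
# Succeeds if the S-link of service FIRST sorts before that of SECOND in RCDIR.
# Returns 2 if either service has no S-link there, 1 if the order is wrong.
starts_before() {
    rcdir=$1 first=$2 second=$3
    a=$(cd "$rcdir" && ls S[0-9][0-9]"$first" 2>/dev/null | head -n 1)
    b=$(cd "$rcdir" && ls S[0-9][0-9]"$second" 2>/dev/null | head -n 1)
    [ -n "$a" ] && [ -n "$b" ] || return 2
    # Sort the two link names bytewise and see which one comes out first.
    first_sorted=$(printf '%s\n%s\n' "$a" "$b" | LC_ALL=C sort | head -n 1)
    [ "$a" != "$b" ] && [ "$first_sorted" = "$a" ]
}

# Possible use:
#   starts_before /etc/rc.d/rc3.d mysql globus-ws || echo "globus-ws may start before mysql"
```

With the original links (S99globus-ws and S99mysql) the check fails, because "globus-ws" sorts before "mysql" at equal priority.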
Fixing shutdown
In CentOS 4.6 (can't tell for past releases), /etc/rc.d/rc won't run the shutdown sequence for services which did not put a stamp with their name in /var/lock/subsys. On Ng2 gateways, that particularly applies to mysql and globus-ws. The VDT-created control scripts do not do that, and consequently, globus-ws and mysql won't shut down cleanly when the gateway is shut down. Below are patches which add proper interaction with the CentOS subsystem management to the VDT-created scripts.
Note that even after applying these patches, to have the services stopped correctly the next time the virtual machine shuts down, you have to manually create the subsystem stamps:
touch /var/lock/subsys/{globus-ws,mysql}
Apply this patch to /etc/rc.d/init.d/globus-ws:
--- globus-ws-fixed-start-seq 2008-02-05 16:29:47.000000000 +1300
+++ globus-ws 2008-02-12 16:00:18.000000000 +1300
@@ -37,6 +37,7 @@
fi
container_exit=$?
+ if [ $container_exit -eq 0 ] ; then touch /var/lock/subsys/globus-ws ; fi
if [ $container_exit -eq 3 ]; then
# Error 3 means that it is already running. We don't consider that to be an error
@@ -47,6 +48,7 @@
elif [ "$1" = "stop" ] ; then
$VDT_LOCATION/globus/sbin/globus-stop-container-detached
+ rm -f /var/lock/subsys/globus-ws
else
echo "Usage: [start | stop]"
And this one to /etc/rc.d/init.d/mysql:
--- mysql-fixed-start-order 2008-02-05 16:29:43.000000000 +1300
+++ mysql 2008-02-12 15:50:45.000000000 +1300
@@ -219,6 +219,7 @@
then
touch /opt/vdt/mysql/var/mysql
fi
+ touch /var/lock/subsys/mysql
else
log_failure_msg "Can't execute $bindir/mysqld_safe"
fi
@@ -240,6 +241,7 @@
then
rm -f /opt/vdt/mysql/var/mysql
fi
+ rm -f /var/lock/subsys/mysql
else
log_failure_msg "MySQL PID file could not be found!"
fi
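Since a missing stamp means a service is silently skipped at shutdown, it may be worth checking for the stamps on a running gateway. A minimal sketch, with a made-up function name and the stamp directory passed in as a parameter:

```shell
# missing_stamps DIR SERVICE...   (hypothetical helper)
# Prints each SERVICE that has no stamp file in DIR.
# Returns 1 if any stamp is missing, 0 otherwise.
missing_stamps() {
    dir=$1
    shift
    rc=0
    for svc in "$@"; do
        if [ ! -e "$dir/$svc" ]; then
            echo "$svc"
            rc=1
        fi
    done
    return $rc
}

# Possible use:
#   missing_stamps /var/lock/subsys globus-ws mysql
```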
/C=NZ/O=BeSTGRID certificates not being recognized
That happened when I had an old CA bundle in ~/.globus/certificates that still had the old signing_policy for the APACGrid CA, and this bundle took priority over /etc/grid-security/certificates. Until recently, the old bundle was still distributed with Grix, but Markus has since fixed the Grix distribution and also included a provision to replace the old bundle with a new one in the user's ~/.globus/certificates directory.
GPIcalc job not running for a user
If the compile job of gpicalc fails with a GridFTP error, check that the user's credentials are accepted at ng2.vpac.org (where the source code for gpicalc is transferred from). That basically means that only members of NGAdmin can run the gpicalc test job.
GPIcalc compile job failing on HPC
The GPIcalc compile job may fail in the CleanUp state. This is because the job assumes that the Fortran compiler creates a .o file as an intermediate product: the compile job comes with two CleanUp directives, which remove the source code (.f) and the .o file and leave only the executable. The issue can easily be fixed by removing the directive for the .o file from the job description.
