From bf41fff73cb57ae65e1e4941c16d2d1fa492f8b5 Mon Sep 17 00:00:00 2001 From: Olga <olga.vladimirovna.datskova@cern.ch> Date: Wed, 18 Nov 2020 14:47:18 +0100 Subject: [PATCH] Importing vobox docs from website --- docs/site/vobox.md | 82 ++++++++ docs/site/vobox_container.md | 3 +- docs/site/vobox_htc_arc.md | 390 +++++++++++++++++++++++++++++++++++ docs/site/vobox_proxy.md | 197 ++++++++++++++++++ mkdocs.yml | 8 +- 5 files changed, 676 insertions(+), 4 deletions(-) create mode 100644 docs/site/vobox.md create mode 100644 docs/site/vobox_htc_arc.md create mode 100644 docs/site/vobox_proxy.md diff --git a/docs/site/vobox.md b/docs/site/vobox.md new file mode 100644 index 0000000..3b40f32 --- /dev/null +++ b/docs/site/vobox.md @@ -0,0 +1,82 @@ +# VOBox + +This document describes how to install and configure the site VO-Box to support ALICE VO services. +This is a node on which long-lived agents and services will be deployed. +These are expected to be provided at the sites. +The agents/services deployment and support on the VO-Box is under VO responsibility. + +See the following quick links to setup steps depending on your preferred deployment approach: + +| | | +|-|-| +| __Generic/VM__ | Step 1: [General requirements](#requirements), [Network setup](#network)<br>Step 2: [WLCG VO-Box Installation](#wlcg-vo-box)<br>Step 3: [HTCondor/ARC Specifics](../vobox_htc_arc/) | +| __Container__ | Step 1: [Container requirements](../vobox_container/#requirements), [Network Setup](../vobox_container/#setup-networking)<br>Step 2: [Install HTCondor/ARC VOBox container](../vobox_container/#create-container) | + +## Requirements + +General requirements for the VO node agents/services are as follows: + +| | | +|-|-| +| __OS__ | SL6 or CentOS/EL7, 64-bit Linux. 
The machine will usually need to be a WLCG VOBOX | +| __Hardware__ | Minimum 4GB RAM, any standard CPU, 20GB for logs, 5GB cache | + +## Network + +The following network connectivity is expected for the VO-Box services: + +| Port | Access | Service | +|:----:|:-------|:--------| +| 8098 | TCP Incoming from your __site WN__ | JAliEn/Java Serialized Object stream | +| 8097 | TCP Incoming from your __site WN__ | JAliEn/WebSocketS | +| 8084 | TCP Incoming from __CERN__ and your __site WN__ | ClusterMonitor | +| 1093 | TCP Incoming from the World | MonALISA FDT server, SE tests | +| 8884 | UDP Incoming from your __site WN__ and your __site SE nodes__ | Monitoring info | +| 9930 | UDP Incoming from your __site SE nodes__ | XRootD metrics | +| _9991_ | TCP Incoming from __CERN__ | PackMan _(Only if not using CVMFS)_ | +| | ICMP Incoming and Outgoing | Network topology for file placement and access | + +!!! note "" + In general, the assumption is that the __outgoing__ connectivity from the VO-box and the WNs is __unrestricted__. + +__CERN__ has multiple networks that may all be used for Central Services, now or in the future: + +| Protocol | IP Range | Note | +|:--------:|:---------|:-----| +| IPv4 | __128.141.0.0/16__ | | +| | __128.142.0.0/16__ | | +| | __137.138.0.0/16__ | <- part of Central Services are here | +| | __188.184.0.0/15__ | <- part of Central Services are here | +| | __185.249.56.0/22__ | | +| | __192.65.196.0/23__ | | +| | __192.91.242.0/24__ | | +| | __194.12.128.0/18__ | | +| IPv6 | __2001:1458::/32__ | | +| | __2001:1459::/32__ | | + +!!! 
hint + Please mind the address masks in the above table. + +## WLCG VO-Box + +The VO-Box should usually be preinstalled as a standard __WLCG VO-Box__, following the instructions given at:<br><br> +<https://twiki.cern.ch/twiki/bin/view/LCG/WLCGvoboxDeployment> + +This procedure sets up a standard gLite UI, with the following additions (provided in particular by the ```lcg-vobox``` RPM): + +* Only one local user account __alicesgm__ (or equivalent), with no special privileges. Please DO NOT configure pool accounts for the SGM user on the VO-Box! +* Access via gsissh, with selected users from the ALICE LCG VO mapped to the __alicesgm__ account (YAIM handles that) +* A proxy renewal service running, for the automatic renewal of registered proxies via the MyProxy mechanism (```idem```) +* A host certificate, issued by one of the trusted LCG Certification Authorities. The machine also needs to be registered as a trusted host in the CERN MyProxy server, ```myproxy.cern.ch```. + +!!! hint "MyProxy" + To have the machine registered as a trusted host in myproxy.cern.ch, send an email with the host certificate DN to <Maarten.Litmaath@cern.ch>. You can get the host certificate DN by issuing the following command: + ```console + VO-Box> openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject + ``` + +Additionally, specifically for ALICE, the following configuration details are required: + +* The home directory should not be mounted via NFS from some server (for performance reasons and because lock files may be kept there) +* The experiment software is provided on the VO-box and Worker nodes through CVMFS. See the 'Setup CVMFS' section. + diff --git a/docs/site/vobox_container.md b/docs/site/vobox_container.md index 1f309cd..57ade26 100644 --- a/docs/site/vobox_container.md +++ b/docs/site/vobox_container.md @@ -1,6 +1,7 @@ # VOBox Container -This guide describes how to create a networked Docker container for VO-Box use. 
+This guide describes how to create a networked Docker container for VO-Box use.<br> +For the previous guide refer to [this](../vobox_legacy) page. ## Requirements diff --git a/docs/site/vobox_htc_arc.md b/docs/site/vobox_htc_arc.md new file mode 100644 index 0000000..8dfa636 --- /dev/null +++ b/docs/site/vobox_htc_arc.md @@ -0,0 +1,390 @@ +# HTCondor/ARC Installation on VOBox + +This documentation describes how to configure the VOBox to enable it to submit ALICE jobs to [HTCondor CEs](#htcondor) or [ARC](#arc). +Refer to the appropriate section as needed. + +## HTCondor + +The VOBox will run its own HTCondor services that are __independent__ of the HTCondor services for your CE and batch system. +The following instructions assume you are using __SL6.8+__ or __CentOS/EL 7.5+__. + +### Install HTCondor + +1. Go to the repositories folder: + + ```console + ~# cd /etc/yum.repos.d/ + ``` + +2. Depending on your OS version, download the relevant repository: + + | | | + |-|-| + |__SL6__| ```~# wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo``` | + |__CentOS/EL7__| ```~# wget http://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel7.repo``` | + +3. Import the RPM key for the repository: + + ```console + ~# cd /etc/pki/rpm-gpg/ + ~# wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor + ~# rpm --import RPM-GPG-KEY-HTCondor + ``` + +4. Install HTCondor 8.5.5 or later: + + ```console + ~# cd + ~# yum update + ~# yum install condor + ``` + +### AliEn Configuration + +This configuration is needed for HTCondor running a _JobRouter_. + +1. Go to the HTCondor configuration folder: + + ```console + ~# cd /etc/condor + ``` + +2. Create a local configuration file for HTCondor: + + ```console + ~# touch config.d/01_alice_jobrouter.config + ``` + +3. Add and adjust the following configuration content: + + ??? 
info "config.d/01_alice_jobrouter.config" + + ```bash + DAEMON_LIST = MASTER, SCHEDD, JOB_ROUTER, COLLECTOR + + # the next line is needed since recent HTCondor versions + + COLLECTOR_HOST = $(FULL_HOSTNAME) + + CERTIFICATE_MAPFILE = /etc/condor/certificate_mapfile + GSI_DAEMON_DIRECTORY = /etc/grid-security + GSI_DAEMON_CERT = $(GSI_DAEMON_DIRECTORY)/hostcert.pem + GSI_DAEMON_KEY = $(GSI_DAEMON_DIRECTORY)/hostkey.pem + GSI_DAEMON_TRUSTED_CA_DIR = $(GSI_DAEMON_DIRECTORY)/certificates + + SEC_CLIENT_AUTHENTICATION_METHODS = FS, GSI + SEC_DEFAULT_AUTHENTICATION_METHODS = FS, GSI + SEC_DAEMON_AUTHENTICATION_METHODS = FS, GSI + + COLLECTOR.ALLOW_ADVERTISE_MASTER = condor@fsauth/$(FULL_HOSTNAME) + COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(FULL_HOSTNAME) + + GRIDMAP = /etc/grid-security/grid-mapfile + + ALL_DEBUG = D_FULLDEBUG D_COMMAND + SCHEDD_DEBUG = D_FULLDEBUG + + # NOTE: the max jobs parameters below will need to be increased + + # MaxJobs: typically ~10% more than the number of 1-core slots in the batch system + + JOB_ROUTER_DEFAULTS = \ + [ requirements=target.WantJobRouter is True; \ + EditJobInPlace = True; \ + MaxIdleJobs = 50; \ + MaxJobs = 200; \ + delete_WantJobRouter = true; \ + delete_JobLeaseDuration = True; \ + set_JobUniverse = 9; \ + set_remote_jobuniverse = 5; \ + ] + + # NOTE: it typically is better _not_ to use such static entries, but rather the command below + + #JOB_ROUTER_ENTRIES = \ + # [ GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \ + # eval_set_GridResource = "condor your-CE.your-domain your-CE.your-domain:9619"; \ + # name = "My cluster"; \ + # ] + + # configure a script to get the proper entries from the ALICE LDAP server (provided below) + + JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh + + JOB_ROUTER_ENTRIES_REFRESH = 300 + + JOB_ROUTER_POLLING_PERIOD = 10 + + JOB_ROUTER_ROUND_ROBIN_SELECTION = True + + JOB_ROUTER_SCHEDD2_NAME = $(FULL_HOSTNAME) + + JOB_ROUTER_SCHEDD2_POOL = $(FULL_HOSTNAME):9618 + 
JOB_ROUTER_DEBUG = D_FULLDEBUG + + GRIDMANAGER_DEBUG = D_FULLDEBUG + JOB_ROUTER_SCHEDD2_SPOOL=/var/lib/condor/spool + + FRIENDLY_DAEMONS = condor@fsauth/$(FULL_HOSTNAME), root@fsauth/$(FULL_HOSTNAME), $(FULL_HOSTNAME) + + ALLOW_DAEMON = $(FRIENDLY_DAEMONS) + + SCHEDD.ALLOW_WRITE = $(FRIENDLY_DAEMONS), *@cern.ch/$(FULL_HOSTNAME) + ALLOW_DAEMON = $(ALLOW_DAEMON) $(FRIENDLY_DAEMONS) + + # ========== FULL DEBUGS ============= + + GRIDMANAGER_DEBUG = D_FULLDEBUG + + # more stuff from the CERN VOBOXes + + CONDOR_FSYNC = False + GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 1000 # to be increased (see MaxJobs above) + + GRIDMANAGER_JOB_PROBE_INTERVAL = 600 + + GRIDMANAGER_MAX_PENDING_REQUESTS = 500 + GRIDMANAGER_GAHP_CALL_TIMEOUT = 3600 + GRIDMANAGER_SELECTION_EXPR = (ClusterId % 2) # 2 should be enough already + GRIDMANAGER_GAHP_RESPONSE_TIMEOUT = 300 + GRIDMANAGER_DEBUG = + ALLOW_DAEMON = $(ALLOW_DAEMON), $(FULL_HOSTNAME), $(IP_ADDRESS), unauthenticated@unmapped + COLLECTOR.ALLOW_ADVERTISE_MASTER = $(COLLECTOR.ALLOW_ADVERTISE_MASTER), $(ALLOW_DAEMON) + COLLECTOR.ALLOW_ADVERTISE_SCHEDD = $(COLLECTOR.ALLOW_ADVERTISE_SCHEDD), $(ALLOW_DAEMON) + + DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0 + + GSI_SKIP_HOST_CHECK = true + ``` + +4. Restart HTCondor now and automatically at boot time: + + ```console + ~# service condor restart + ~# chkconfig condor on + ``` + +5. 
Check that HTCondor is running and produces the following initial output: + + ```console + ~# pstree | grep condor + + |-condor_master-+-condor_collecto + | |-condor_job_rout + | |-condor_procd + | |-condor_schedd + | `-condor_shared_p + ``` + +### LDAP and VOBox Configuration + +In the __Environment__ section, add/adjust the values as needed: + +| Definition | Description | +|:-----------|:------------| +| ```USE_JOB_ROUTER=( 1 | 0)``` | Whether to use the job router service | +| ```GRID_RESOURCE=condor your-CE.your-domain your-CE.your-domain:9619``` | HTCondor resource explicitly defined for<br> submission to the vanilla universe; otherwise<br> the system default resource will be selected | +| ```ROUTES_LIST=[ your-ce01.your-domain:9619 ] [ your-ce02.your-domain:9619 ]``` | Routes list example | +| ```USE_EXTERNAL_CLOUD=(1 | 0)``` | Whether to use external cloud | +| ```SUBMIT_ARGS=-append "+TestClassAd=1"```<br>```SUBMIT_ARGS=<String>``` | Specify extra options for the condor_submit<br> command. Example: add extra ClassAds<br> to the job description | + +In ```~/.alien/Environment``` on the VOBox: + +```console +d=$HOME/htcondor +mkdir -p $d + +export HTCONDOR_LOG_PATH=$d +``` + +!!! warning "" + Mind the firewall settings on the VOBox. See [Network setup](../vobox/#network) for more details. + +### Miscellaneous Scripts + +The following script helps fill the routes list from LDAP: + +??? 
info "Routes script" + + ```bash + #!/bin/bash + # print HTCondor job routes obtained from the ALICE LDAP server + # + # example settings in /etc/condor/config.d: + # + # JOB_ROUTER_ENTRIES_CMD = /var/lib/condor/get_job_routes.sh + # JOB_ROUTER_ENTRIES_REFRESH = 600 + # + # version 1.3 (2017/04/04) + # author: Maarten Litmaath + + usage() + { + echo "Usage: $0 [-n] [ FQHN ]" >&2 + exit 1 + } + + LOG=/tmp/job-routes-$(date '+%y%m%d').log + LDAP_ADDR=alice-ldap.cern.ch:8389 + h=$(hostname -f) + + case $1 in + -n) + LOG= + shift + esac + + case $1 in + -*) + usage + ;; + ?*.?*.?*) + h=$1 + ;; + ?*) + usage + esac + + f="(&(objectClass=AlienCE)(host=$h))" + + # + # wrapped example output lines returned by the ldapsearch: + # + # environment: ROUTES_LIST=\ + # [ "condor ce503.cern.ch ce503.cern.ch:9619" ] \ + # [ "condor ce504.cern.ch ce504.cern.ch:9619"; optional extra stuff ] \ + # [ "condor ce505.cern.ch ce505.cern.ch:9619" ] \ + # [ "condor ce506.cern.ch ce506.cern.ch:9619" ] + # + # or a simpler format (the port currently is needed for the SAM VO feed): + # + # environment: ROUTES_LIST=\ + # [ ce503.cern.ch:9619 ] \ + # [ ce504.cern.ch:9619; optional extra stuff ] \ + # [ ce505.cern.ch:9619 ] \ + # [ ce506.cern.ch:9619 ] + # + # the next line may even be absent: + # + # environment: USE_EXTERNAL_CLOUD=0 + # + + if [ "x$LOG" = x ] + then + LOG=/dev/null + else + echo == $(date) >> $LOG + exec 2>> $LOG + fi + + ldapsearch -LLL -x -h $LDAP_ADDR -b o=alice,dc=cern,dc=ch "$f" environment | + perl -p00e 's/\r?\n //g' | perl -ne ' + if (s/^environment: ROUTES_LIST *= *//i) { + s/\[ *([^]" ]+)(:\d+) *([];])/[ "condor $1 $1$2" $3/g; + s/\[ *([^]" ]+) *([];])/[ "condor $1 $1:9619" $2/g; + s/\[ *[^"]*"/[ "/g; + s/\[ *("[^"]+")/[ GridResource = $1; eval_set_GridResource = $1/g; + $routes = $_; + next; + } + if (s/^environment: USE_EXTERNAL_CLOUD *= *//i) { + $extern = "; set_WantExternalCloud = True" if /1/; + next; + } + END { + $extern .= " ]"; + $routes =~ s/;? 
*]/$extern/eg; + print $routes; + } + ' | tee -a $LOG + ``` + +Cleanup script that removes old job logs and stdout/stderr files: + +??? info "Clean up script" + ```bash + #!/bin/sh + + cd ~/htcondor || exit + + GZ_SIZE=10k + GZ_MINS=60 + GZ_DAYS=2 + RM_DAYS=7 + + STAMP=.stamp + prefix=cleanup- + log=$prefix`date +%y%m%d` + exec >> $log 2>&1 < /dev/null + echo === START `date` + for d in `ls -d 20??-??-??` + do + ( + echo === $d + stamp=$d/$STAMP + [ -e $stamp ] || touch $stamp || exit + if find $stamp -mtime +$RM_DAYS | grep . > /dev/null + then + echo removing... + /bin/rm -r $d < /dev/null + exit + fi + cd $d || exit + find . ! -name .\* ! -name \*.gz \( -mtime +$GZ_DAYS -o \ + -size +$GZ_SIZE -mmin +$GZ_MINS \) -exec gzip -9v {} \; + ) + done + find $prefix* -mtime +$RM_DAYS -exec /bin/rm {} \; + echo === READY `date` + ``` + +Crontab line for the cleanup script: + +```console +37 * * * * /bin/sh $HOME/htcondor-cleanup.sh +``` + +## ARC + +!!! warning "ARC Instructions: Work In Progress" + Please note that these instructions are being actively updated and may not be complete. + +### LDAP Configuration + +Add and adjust the following configuration as needed: + +??? info "LDAP configuration" + + ```bash + # VOMS organization to be used with the BDII job status search + ALIEN_VOBOX_ORG="alice" + + # LDAP address of the BDII for the site. The VOBox CE module takes the number of running and queued jobs from it. 
+ + CE_SITE_BDII=ldap://site-bdii.gridpp.rl.ac.uk:2170/mds-vo-name=RAL-LCG2,o=grid + + # specifies whether to use the BDII to retrieve the number of running/queued jobs + CE_USE_BDII - (1 - use it, 0 - use arcstat instead) + + # a list of ARC CEs to be used for jobagent submission + # a list of resources can also be set through the ~/.arc/client.conf file (see `man client.conf`) + CE_LCGCE=(arc-ce01.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M,arc-ce02.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M,arc-ce03.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M,arc-ce04.gridpp.rl.ac.uk:2811/nordugrid-Condor-grid3000M) + + # arguments for the arcsub command + CE_SUBMITARG=" -b FastestQueue" + + # Additional parameters to pass to arcsub; used when CE_USE_BDII=1. Additional parameters for XRSL generation can be passed as a space-separated list: "xrsl:a=b xrsl:c=d" + CE_SUBMITARG_LIST + + # specifies the delay in minutes after which to try to clean up completed jobs (default: 1440 mins) + ARCCLEAN_RUN_DELAY=1000 + + # specifies the delay in minutes after which to check whether the jobs file is sane (default: 60 minutes) + ARC_VALIDATE_JOBS_FILE_DELAY=100 + ``` + +!!! example "Debug ARC for Operations" + Set the following variable in the ```~/.alien/Environment``` file to make the ```arc*``` CLI tools write debug output to the ```CE.log``` file: + ```bash + ARC_DEBUG=1 + ``` diff --git a/docs/site/vobox_proxy.md b/docs/site/vobox_proxy.md new file mode 100644 index 0000000..09141ef --- /dev/null +++ b/docs/site/vobox_proxy.md @@ -0,0 +1,197 @@ +# Manage VOBox Proxy + + +## For the Impatient! 
+ +In all of the following, commands prompted by ```VO-Box>``` are to be issued once logged in on the VO-Box itself, while ```LCG-UI>``` means a generic gLite/EMI User Interface: + +* Register a fresh proxy on the VO-Box: + ```bash + LCG-UI> voms-proxy-init --voms alice:/alice/Role=lcgadmin + + LCG-UI> export GT_PROXY_MODE=rfc # note: will only work with a WLCG VOBOX + + LCG-UI> myproxy-init -s myproxy.cern.ch -d -n -t 48 -c 3000 + + LCG-UI> gsissh -p 1975 your-VOBOX + + VO-Box> vobox-proxy register -t 48 + ``` +* Then define the correct proxy in your environment before (re)starting AliEn services on the VOBOX (replace the dots with the long file name of your registered proxy): + ```bash + VO-Box> export X509_USER_PROXY=/var/lib/vobox/alice/proxy_repository/..... + + VO-Box> /cvmfs/alice.cern.ch/bin/aliend restart + ``` + +## Involved Proxies + +The VO-Box uses several proxies for different tasks. Apart from the proxy used by the VO-Box administrator to log in on the machine, there are two more resident on the machine and one stored remotely on a server. The machinery is best understood by first describing the operations in sequence, with details to follow. + + +1. The user generates a proxy on some LCG UI, which is then used to ```gsissh``` to the VO-Box. +2. The user stores a long-lived (e.g. one month) myProxy on ```myproxy.cern.ch```. +3. From the VO-Box, the user registers the _login_ proxy to the VO-Box Proxy Renewal Service. The proxy thus generated will be called, in the following, the _user proxy_. Since gLite 3.1, this needs to be a VOMS-extended proxy (i.e. a proxy that carries extra VO-specific information). +4. The VO-Box Proxy Renewal Service keeps the _user proxy_ alive by periodically getting a new one from the MyProxy server. To authenticate to the latter, it uses its copy of the machine proxy. +5. The AliEn CE running on the VO-Box uses the _user proxy_ to submit jobs to the CREAM CE. 
Just before submitting a bunch of job agents, the AliEn CE will itself check the lifetime of the _user proxy_ and try to restart the proxy renewal daemon if the duration is significantly less than 48h. + +!!! hint "Summary" + The proxy certificates involved in the management of an LCG VO-Box for ALICE are as follows: + + * the __[login proxy](#the-login-proxy)__, which is used by the manager to log in (via ```gsissh```) to the VO-Box + * the __[myProxy](#the-myproxy)__ registered on the MyProxy server (```myproxy.cern.ch```) + * the __[machine proxy](#the-machine-proxy)__, which is used by the VO-Box to authenticate to the MyProxy server + * the __[user proxy](#the-user-proxy-and-the-proxy-renewal-service)__ (a VOMS proxy), which is used by AliEn to submit jobs to LCG. + +Proxies can be examined in two ways: + +* By using Globus tools (possibly wrapped in AliEn commands). If you don't specify the proxy file name, the ```$X509_USER_PROXY``` environment variable will be used. By default (i.e. if neither is specified) proxies are stored in ```/tmp/x509up_uXXXX```, where XXXX is the local numeric userid of the user owning the proxy. + ```bash + LCG-UI> grid-proxy-info [-f <proxy-file>] + ``` +* By directly using ```openssl``` tools (```man openssl``` for more detailed help, probably more than you will ever want to know), e.g.: + ```bash + VO-Box> openssl x509 -in <proxy-file> -noout -text + ``` +* However, in order to also show the VOMS extensions, you'll need a different command: + ```bash + LCG-UI> voms-proxy-info -all [-f <proxy-file>] + ``` + +!!! info "VOMS-extended Proxy" + Since gLite 3.1, the job submission services need a VOMS-extended proxy, i.e. a proxy with some VO-specific information attached; in our implementations, these are just VO membership and role. 
+ For more information about VOMS please refer to the relevant gLite user guide section and the [VOMS user guide](http://www.google.it/url?sa=t&ct=res&cd=3&url=http%3A%2F%2Fegee.cesnet.cz%2Fen%2Fvoce%2Fvoms-guide.pdf&ei=QQJySL_lI4mC7gXLgvCCBA&usg=AFQjCNGcfSHrU1aE0QeP0mojS02ppdtYTA&sig2=HclwwnXSfEKv39I4Ml931Q). + Thus, two proxies need to have VOMS extensions: the user proxy and the login proxy. + +### The _login proxy_ + +This is a plain user proxy that the VO-Box administrator uses to log in on the machine, via: + +```bash +LCG-UI> voms-proxy-init --voms alice:/alice/Role=lcgadmin +LCG-UI> gsissh -p 1975 your-VOBOX +``` + +On the VO-Box, upon login the ```$X509_USER_PROXY``` variable will point to it, i.e. to a file in the ```/tmp``` directory called something like ```/tmp/x509up_p17069.fileuEDDS2.1```. Please note that this proxy is __not__ the one used to start services or submit jobs, nor is it in any way automatically managed. If this proxy expires, the AliEn services should not be affected (they shouldn't even notice). + +!!! info "" + You will need ```$X509_USER_PROXY``` to point to a valid [__user__ proxy](#the-user-proxy-and-the-proxy-renewal-service) in order to make the AliEn services work. + +### The _myProxy_ + +The _myProxy_ is a special long-lived proxy that resides on a remote server (in our instance, ```myproxy.cern.ch```) and is used by a service to obtain shorter-lived _delegated proxies_. For details, please check the relevant section ("Advanced proxy management") in the [LCG User Guide](http://egee.itep.ru/User_Guide.html#SECTION00068000000000000000) and possibly the [original MyProxy literature](http://grid.ncsa.illinois.edu/myproxy/). 
+ +To generate a MyProxy to be subsequently used by the VO-Box, issue the following commands from an LCG UI on which the user's certificate and key are installed: + +```bash +LCG-UI> export GT_PROXY_MODE=rfc # note: will only work with a WLCG VOBOX + +LCG-UI> myproxy-init -s myproxy.cern.ch -d -n -t 48 -c 3000 +``` + +The command-line options are important; their meaning is as follows: + +| Option | Description | +|:-------|:------------| +| ```-d``` | Use the user's certificate subject as username | +| ```-s myproxy.cern.ch``` | Use this particular MyProxy server. This is the one all VO-Boxes are registered to. | +| ```-n``` | Allow retrieval of a proxy without a password | +| ```-c 3000``` | Lifetime, in hours, of the MyProxy stored in the server. <br> This value (4 months) is a suggestion: the actual value can be as large as the remaining <br> number of hours in the lifetime of your certificate. The proxy renewal daemon running <br> on the VO-Box can warn you (by email) some time before the expiration date (see below). | +| ```-t 48``` | The maximum lifetime (in hours) of derived proxies | + +!!! info "Note" + The default value for the ```-t``` option (12 hours) is too short for our application, so it is important not to forget this option. + One fishy issue with this is that if you forget the option and the system tries to obtain a longer proxy, no error message will be issued and the derived proxy will just be of the maximum allowable length. 
+ There is a check in the AliEn code (since v2-10), so one way to diagnose this problem is to check the CE log file ```~/ALICE/alien-logs/CE.log``` and look for something like this: + ```console + Dec 31 21:30:23 info Proxy timeleft is 43188 (threshold is 165600) + ``` + It should look like this instead: + ```console + Dec 31 21:30:23 info Proxy timeleft is 172090 (threshold is 165600) + ``` + This is done by checking the actual remaining lifetime of the received proxy, so please disregard differences of a few seconds. + To fix the problem, generate a fresh MyProxy again with a longer value for ```-t```. + Unfortunately, there is apparently no way to query the MyProxy server for the value that was used for ```-t```, other than requesting a very long proxy and seeing what comes back. + +The MyProxy server can be queried (to obtain e.g. the lifetime of the MyProxy) by issuing: + +```console +LCG-UI> myproxy-info -s myproxy.cern.ch -d +``` + +### The _user proxy_ and the Proxy Renewal Service + +This is the most important proxy, since it is the one used by AliEn to start the services and to submit jobs to the LCG. +It is generated by registering the login proxy to a database managed by the VO-Box Proxy Renewal Service (PRS), which will take care of renewing it: + +```console +VO-Box> vobox-proxy register -t 48 +``` + +A couple of other ```vobox-proxy``` options can be useful: + +| Option | Description | +|:-------|:------------| +| ```--proxy-safe 3600``` | This tells the PRS to warn you 3600 seconds (one hour) before the user proxy expires.<br> Since the PRS renews it more often, you should never get such a message; <br> if you get one, it probably means the VOBOX has some problem. | +| ```--myproxy-safe 864000``` | Tells the PRS to warn you 10 days before the long-lived proxy stored on the server expires.<br> If you get such a message, you are supposed to generate a fresh one by [```myproxy-init```](#the-myproxy). 
| +| ```--email your-address``` | The email address for the alert messages.<br>Please note that site firewall rules in most cases will prevent mail messages from the VO-Box from being sent. | + +The _user certificate_ owner must match the AliEn user declared in ```~alicesgm/.alien/Environment```. + +!!! warning "Important" + If you change the ```ALIEN_USER``` in ```~alicesgm/.alien/Environment```, it is __mandatory__ to restart all the services, in order to have them running with the appropriate credentials.<br><br> + Upon such registration, a delegated proxy is placed in ```/var/lib/vobox/alice/proxy_repository``` with a file name that matches the DN of the user.<br><br> + This proxy is periodically renewed by the PRS, each time obtaining a fresh proxy with the requested duration, by default 12h. + This being too short for most ALICE jobs, please use the ```-t 48``` argument of the ```vobox-proxy register``` command, to allow the PRS to handle all renewals. + +The PRS database can be queried with the following command (see ```vobox-proxy --help``` for more options): + +```bash +VO-Box> vobox-proxy query -dn all +``` + +The script that starts/stops the Proxy Renewal Service is ```/etc/init.d/alice-box-proxyrenewal```. +This should already be in the init.d services list, so you should not need to do anything. <br><br> +When you log in on the VO-Box, the ```$X509_USER_PROXY``` points to your [login proxy](#the-login-proxy). Please define the correct proxy to be used before (re)starting AliEn services (replace the dots with the long file name of your registered proxy): + +```bash +VO-Box> export X509_USER_PROXY=/var/lib/vobox/alice/proxy_repository/..... + +VO-Box> /cvmfs/alice.cern.ch/bin/aliend restart +``` + +### The _machine proxy_ + +This (last!) proxy is used by the Proxy Renewal Service running on the VO-Box to authenticate with the MyProxy server. 
+This is kept in the ```/var/lib/vobox/alice``` directory: + +```console +[alicesgm@alibox2 ~]$ grid-proxy-info -f /var/lib/vobox/alice/renewal-proxy.pem + + +subject : /C=IT/O=INFN/OU=Host/L=Torino/CN=alibox2.to.infn.it/CN=441535218 +issuer : /C=IT/O=INFN/OU=Host/L=Torino/CN=alibox2.to.infn.it +identity : /C=IT/O=INFN/OU=Host/L=Torino/CN=alibox2.to.infn.it +type : RFC 3820 compliant impersonation proxy +strength : 1024 bits +path : /var/lib/vobox/alice/renewal-proxy.pem +timeleft : 21:47:13 +``` + +As you can see, it's not a user proxy, but it's periodically generated from the host certificate ```/etc/grid-security/hostcert.pem``` (which is, of course, root-owned) by a cron job that calls the following command: + +```console +~$ /etc/init.d/alice-box-proxyrenewal proxy +``` + +!!! hint "Summary" + The ```/etc/init.d/alice-box-proxyrenewal``` script has two different functions: + + * It is called in the ```/etc/rc.d``` sequence with ```[start|stop]``` options. + In this mode, it will start (or stop) the proxy renewal service and go through the services (if any) in ```/var/lib/vobox/alice/[start|stop]``` and start/stop them. + * It is then called by a cron job (check ```/etc/cron.d/alice-box-proxyrenewal```) with the argument ```proxy```. + In this mode, it renews the machine proxy ```/var/lib/vobox/alice/renewal-proxy.pem```. + + Since the ```alicesgm``` user has no root privileges on the VO-Box, if this proxy expires you need to ask the site manager (or whoever manages the VO-Box) to check what's wrong and generate a new one. 
+ diff --git a/mkdocs.yml b/mkdocs.yml index 9bd6ccf..74bf77d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -34,10 +34,12 @@ nav: - Tutorials: user/tutorials.md - Authentication: user/auth.md - Migration to JAliEn: user/migration.md - - Troubleshooting: user/help.md + - Troubleshooting: user/help.md - Site: - - VOBox Container: site/vobox_container.md - - VOBox Container (Legacy): site/vobox_legacy.md + - VOBox: site/vobox.md + - VOBox (Container): site/vobox_container.md + - VOBox (HTCondor/ARC): site/vobox_htc_arc.md + - Manage VOBox Proxy: site/vobox_proxy.md - Computing Element Guide: site/ce_guide.md - Reference: - alien.py: alienpy_commands.md -- GitLab