Machine/Job Features Scripts 
============================

See https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesImplementations
for more about the mjf-scripts implementations and Machine/Job Features.

These files can either be used directly, or the torque-rpm and htcondor-rpm 
Makefile targets can be used to build RPMs for SL 6.x. This README assumes
you have built the RPM yourself or downloaded the pre-built RPM from
https://repo.gridpp.ac.uk/machinejobfeatures/mjf-scripts/

1. Common configuration
2. Torque/PBS configuration
3. HTCondor configuration
4. Only Machine Features 
5. DIRAC Benchmark (DB12)

1. Common configuration
-----------------------

In the simplest case, just install either the Torque or the HTCondor RPM and the
/etc/rc.d/init.d/mjf script will be run to create /etc/machinefeatures.
$MACHINEFEATURES is set to this path by /etc/profile.d/mjf.sh and
/etc/profile.d/mjf.csh, which are (likely to be) sourced by new logins/jobs.

The files /etc/sysconfig/mjf and /var/run/mjf are read when creating the
$MACHINEFEATURES (and $JOBFEATURES) directories, and can provide default 
values for Machine/Job Features key/value pairs. /var/run/mjf values take
precedence. Note that files in /var/run are deleted at system boot time. 

Values given this way override values obtained from the system (eg the
total number of logical processors), but are overridden in turn when 
per-job values can be determined from the batch system (eg the number 
of logical processors allocated to this job.)
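
The precedence described above can be pictured with a short Python sketch.
This is not the code the mjf scripts use: the function names and sample
values are hypothetical, and the parser assumes plain key=value lines. Only
the ordering (system value, then /etc/sysconfig/mjf, then /var/run/mjf, then
the batch system) is taken from the text.

```python
def parse_mjf_text(text):
    """Parse key=value lines into a dict (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # assumption: lines without '=' are ignored
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def resolve(key, system, sysconfig, var_run, batch):
    """Return the value for key; later sources override earlier ones."""
    result = None
    for source in (system, sysconfig, var_run, batch):
        if key in source:
            result = source[key]
    return result


# Illustrative values only.
system = {"total_cpu": "32"}                             # from the OS
sysconfig = parse_mjf_text("hs06=99.99\ntotal_cpu=16\n")  # /etc/sysconfig/mjf
var_run = parse_mjf_text("total_cpu=12\n")                # /var/run/mjf

machine_total_cpu = resolve("total_cpu", system, sysconfig, var_run, {})
# machine_total_cpu is "12": /var/run/mjf wins over the other two sources
```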

If you know the HS06 of the worker node, you can also include a line like
hs06=99.99 in /etc/sysconfig/mjf or /var/run/mjf, which will be picked up
when populating /etc/machinefeatures/ (you can force updates after changing
the file with  service mjf start  as the mjf script looks like a SysV
service.) This is then used to create $MACHINEFEATURES/hs06 for the whole WN.
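
From inside a job, a value such as hs06 can then be read back as a small
file in the $MACHINEFEATURES directory. The helper below is a sketch for
illustration; read_feature is a hypothetical name and is not provided by
the mjf-scripts RPMs.

```python
import os


def read_feature(env_var, key, environ=os.environ):
    """Return the contents of $env_var/key as a string, or None if unavailable."""
    directory = environ.get(env_var)
    if not directory:
        return None  # the variable is not set for this login/job
    try:
        with open(os.path.join(directory, key)) as f:
            return f.read().strip()
    except OSError:
        return None  # the key file does not exist or cannot be read


hs06 = read_feature("MACHINEFEATURES", "hs06")
```

Returning None rather than raising lets a job treat a missing key as
"not published" and fall back to its own estimate.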

By default, the per-job $JOBFEATURES directories will be created under
/tmp/mjf-$USER but you can use a directory other than /tmp by setting 
mjf_tmp_dir=/DESIRED/PATH in either mjf file.

2. Torque/PBS
-------------

The mjf-torque RPM installs /var/lib/torque/mom_priv/prologue.user which is 
run by Torque at the start of each job to create 
$JOBFEATURES=/tmp/mjf-$USER/jobfeatures-$PBS_JOBID (by default), and installs 
/var/lib/torque/mom_priv/epilogue.user which runs at the end of the job to clean up
that directory. 
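
As an illustration of the default layout, the path of the per-job directory
can be built as follows. This is a sketch only: jobfeatures_dir is a
hypothetical helper, the user name and job ID are made up, and it assumes
that mjf_tmp_dir simply replaces the leading /tmp.

```python
import posixpath


def jobfeatures_dir(user, pbs_jobid, mjf_tmp_dir="/tmp"):
    """Build the per-job $JOBFEATURES path (assumed layout)."""
    return posixpath.join(mjf_tmp_dir, f"mjf-{user}", f"jobfeatures-{pbs_jobid}")


default_path = jobfeatures_dir("alice", "1234.server")
# default_path is "/tmp/mjf-alice/jobfeatures-1234.server"
custom_path = jobfeatures_dir("alice", "1234.server", mjf_tmp_dir="/scratch")
# custom_path is "/scratch/mjf-alice/jobfeatures-1234.server"
```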

$JOBFEATURES/hs06_job is calculated from $MACHINEFEATURES/hs06 with a pro-rata
share for the job in question, based on $JOBFEATURES/allocated_cpu which is in
turn taken from the Torque ppn for the job (default 1.)
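
The pro-rata share can be worked through with a small example. The formula
below (node HS06 scaled by allocated_cpu over total_cpu) is an assumption
about what "pro-rata" means here, not a copy of the script's arithmetic,
and the numbers are invented.

```python
def hs06_job(machine_hs06, allocated_cpu, total_cpu):
    """Pro-rata share of the node's HS06 for one job (assumed formula)."""
    if total_cpu <= 0:
        raise ValueError("total_cpu must be positive")
    return machine_hs06 * allocated_cpu / total_cpu


# A 16-processor node rated at 160 HS06, running a job with ppn=2.
share = hs06_job(machine_hs06=160.0, allocated_cpu=2, total_cpu=16)
# share is 20.0
```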

When creating $MACHINEFEATURES/total_cpu, the /usr/sbin/mjf-get-total-cpu 
script uses the value obtained by running the pbsnodes command for the node.
This can be overridden by setting total_cpu in either mjf file. If the value
cannot otherwise be found, it is obtained by counting 'processor' lines in 
/proc/cpuinfo.
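
The /proc/cpuinfo fallback amounts to counting the entries whose key is
'processor'. A minimal Python sketch of that count is given below; the
function names are hypothetical and the real script may do this differently.

```python
def count_processors(cpuinfo_text):
    """Count lines whose key is exactly 'processor' in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":
            count += 1
    return count


def total_cpu_fallback(path="/proc/cpuinfo"):
    """Read the given file and return the number of logical processors listed."""
    with open(path) as f:
        return count_processors(f.read())


sample = "processor\t: 0\nmodel name\t: Example CPU\nprocessor\t: 1\n"
sample_count = count_processors(sample)
# sample_count is 2
```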

3. HTCondor
-----------

The mjf-htcondor RPM installs the /usr/sbin/make-jobfeatures script which must
be run as part of the HTCondor user job wrapper. If a job wrapper is not
already defined, then this can simply be done by setting
USER_JOB_WRAPPER = /usr/sbin/mjf-job-wrapper in the HTCondor configuration.
If a job wrapper is already being used, then it must be modified to run
/usr/sbin/make-jobfeatures in the way mjf-job-wrapper does,
including exporting $JOBFEATURES and $MACHINEFEATURES to the job itself.

$JOBFEATURES/hs06_job is calculated from $MACHINEFEATURES/hs06 with a pro-rata
share for the job in question, based on $JOBFEATURES/allocated_cpu which is in
turn taken from the CpusProvisioned value in the job ad (default 1.)

When creating $MACHINEFEATURES/total_cpu, the /usr/sbin/mjf-get-total-cpu 
script uses the value obtained by running  condor_config_val NUM_CPUS  to
discover the number of logical processors HTCondor can allocate to jobs.
This can be overridden by setting total_cpu in either mjf file. If the value
cannot otherwise be found, it is obtained by counting 'processor' lines in 
/proc/cpuinfo.

4. Only Machine Features
------------------------

The mjf-onlymf RPM only installs the common scripts to create
$MACHINEFEATURES/hs06 (if hs06 is defined) and $MACHINEFEATURES/total_cpu.
total_cpu can also be overridden by setting total_cpu in either mjf file. 
If the value cannot otherwise be found, it is obtained by counting 'processor'
lines in /proc/cpuinfo.

$JOBFEATURES is not defined and its files are not created. The mjf-onlymf RPM
should only be used on systems other than Torque/PBS or HTCondor, so that at
least $MACHINEFEATURES is available.

5. DIRAC Benchmark (DB12)
-------------------------

Support for the DIRAC fast benchmark (DB12) is also included, which is
implemented by analogy with HEPSPEC06: $MACHINEFEATURES/db12 and
$JOBFEATURES/db12_job are created if the DB12 measurements are available.
The key/value pairs db12 and db12_job can be included in /etc/sysconfig/mjf
or /var/run/mjf as with hs06 and hs06_job as described above.

However, it will normally be more convenient to create the file /etc/db12/db12
by simply installing the mjf-db12 RPM which runs the DB12 benchmark early in the
boot process when the machine is otherwise idle. The /etc/rc.d/init.d/db12 script 
stores the result in /etc/db12/db12 along with /etc/db12/total_cpu, equal to the
number of DB12 benchmark instances run in parallel to make the measurement.

If /etc/db12/total_cpu exists before the db12 script is run, then it is used
to determine the number of instances to run. Otherwise the number of logical
processors is counted from the operating system and /etc/db12/total_cpu is
created. 
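
The decision described above can be sketched in Python as follows. This is
not the db12 script itself: db12_instances is a hypothetical name, and
os.cpu_count() stands in for however the script counts logical processors.

```python
import os


def db12_instances(total_cpu_path="/etc/db12/total_cpu"):
    """Return how many DB12 instances to run, creating the file if it is absent."""
    try:
        with open(total_cpu_path) as f:
            return int(f.read().strip())  # a pre-existing value is respected
    except FileNotFoundError:
        count = os.cpu_count() or 1  # assumption: ask the OS for logical processors
        with open(total_cpu_path, "w") as f:
            f.write(f"{count}\n")
        return count
```

Because an existing file always wins, a value written at install time (for
example by Kickstart) is used instead of the operating system's count.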

Since /etc/rc.d/init.d/db12 is run very early in the boot process,
if /etc/db12/total_cpu is different from the number of logical processors,
then it must be created during the original installation (typically by Kickstart)
and not by subsequent configuration by a system such as Puppet which will be
started after db12 has run. 

/etc/db12/total_cpu should match $MACHINEFEATURES/total_cpu so that the number of
DB12 instances run matches the number of processors available to be allocated to
jobs.