Machine/Job Features Scripts 
============================

See https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesImplementations
for more about the mjf-scripts implementations and Machine/Job Features.

These files can either be used directly, or the torque-rpm, htcondor-rpm,
gridengine-rpm, and onlymf-rpm Makefile targets can be used to build RPMs for
SL 6.x. This README assumes you have built the RPM yourself or downloaded the
pre-built RPM from
  https://repo.gridpp.ac.uk/machinejobfeatures/mjf-scripts/

1. Common configuration
2. Torque/PBS configuration
3. HTCondor configuration
4. Grid Engine configuration
5. Only Machine Features 
6. DIRAC Benchmark (DB12)

1. Common configuration
-----------------------

In the simplest case, just install either the Torque, HTCondor, or Grid
Engine RPM and the /etc/rc.d/init.d/mjf script will be run to create 
/etc/machinefeatures. $MACHINEFEATURES is set to this value by 
/etc/profile.d/mjf.sh and /etc/profile.d/mjf.csh which are (likely to be)
sourced by new logins/jobs.
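
As an illustration of how a job might consume these values, the following
Python sketch reads one key from the directory named by $MACHINEFEATURES (or
$JOBFEATURES). It assumes each key is a small text file inside that directory,
as paths such as $MACHINEFEATURES/total_cpu in this README suggest. The helper
name read_feature is hypothetical and is not part of mjf-scripts.

```python
import os


def read_feature(env_var, key, default=None):
    """Return the value of a Machine/Job Features key as a string.

    env_var is the name of the environment variable ("MACHINEFEATURES" or
    "JOBFEATURES") and key is the file name inside that directory.
    The default is returned if the variable is unset or the file is missing.
    """
    directory = os.environ.get(env_var)
    if not directory:
        return default
    try:
        with open(os.path.join(directory, key)) as f:
            return f.read().strip()
    except OSError:
        # Missing or unreadable key file: fall back to the default.
        return default
```

A job could then call read_feature("MACHINEFEATURES", "total_cpu") and convert
the result to a number itself, treating None as "not published on this node".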

The files /etc/sysconfig/mjf and /var/run/mjf are read when creating the
$MACHINEFEATURES (and $JOBFEATURES) directories, and can provide default 
values for Machine/Job Features key/value pairs. /var/run/mjf values take
precedence. Note that files in /var/run are deleted at system boot time. 
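
The precedence rule above can be sketched in Python as follows. This is only
an illustration, not the code used by the mjf scripts: it assumes both files
hold one key=value pair per line (as in the hs06=99.99 example later in this
README), and the handling of blank lines and '#' comments is an assumption.
The function names are hypothetical.

```python
def parse_mjf_file(path):
    """Parse key=value lines from one mjf file; a missing file gives {}."""
    values = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Assumption: blank lines, '#' comments and lines without
                # an '=' sign are ignored.
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                values[key.strip()] = value.strip()
    except OSError:
        # e.g. /var/run/mjf does not exist after a reboot
        pass
    return values


def load_mjf_defaults(sysconfig_path="/etc/sysconfig/mjf",
                      run_path="/var/run/mjf"):
    """Merge both files, letting /var/run/mjf override /etc/sysconfig/mjf."""
    merged = parse_mjf_file(sysconfig_path)
    merged.update(parse_mjf_file(run_path))
    return merged
```

With total_cpu=8 in /etc/sysconfig/mjf and total_cpu=4 in /var/run/mjf, the
merged result holds total_cpu=4 until /var/run/mjf is removed at the next boot.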

These files can contain the following $MACHINEFEATURES keys:
  total_cpu hs06 grace_secs shutdowntime

These files can contain the following $JOBFEATURES keys:
  allocated_cpu hs06_job wall_limit_secs cpu_limit_secs 
  max_rss_bytes max_swap_bytes scratch_limit_bytes 

Values given this way override values obtained from the system (eg the
total number of logical processors), but are overridden in turn when 
per-job values can be determined from the batch system (eg the number 
of logical processors allocated to this job.)

The values cpu_limit_secs_per_cpu, max_rss_bytes_per_cpu, 
max_swap_bytes_per_cpu, and scratch_limit_bytes_per_cpu can be set in
either mjf file to cause the scripts to calculate the corresponding 
per-job value (eg cpu_limit_secs) by multiplying by $JOBFEATURES/allocated_cpu
(which will be determined from the batch system if available, otherwise 1.)
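
The per-CPU calculation described above amounts to the following Python
sketch. It assumes the configured values are integers; how the real scripts
parse or round them is not stated here. The names PER_CPU_KEYS and
derive_per_job_limits are hypothetical.

```python
# Maps each *_per_cpu setting to the per-job key it produces.
PER_CPU_KEYS = {
    "cpu_limit_secs_per_cpu": "cpu_limit_secs",
    "max_rss_bytes_per_cpu": "max_rss_bytes",
    "max_swap_bytes_per_cpu": "max_swap_bytes",
    "scratch_limit_bytes_per_cpu": "scratch_limit_bytes",
}


def derive_per_job_limits(settings, allocated_cpu=1):
    """Multiply each *_per_cpu setting by the job's allocated_cpu.

    settings is a dict of strings as read from the mjf files.
    allocated_cpu defaults to 1, matching the fallback used when the
    batch system does not supply a value.
    """
    derived = {}
    for per_cpu_key, job_key in PER_CPU_KEYS.items():
        if per_cpu_key in settings:
            # Assumption: values are plain integers.
            derived[job_key] = int(settings[per_cpu_key]) * allocated_cpu
    return derived
```

For example, cpu_limit_secs_per_cpu=3600 with 4 allocated processors gives a
cpu_limit_secs of 14400.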

If you know the HS06 of the worker node, you can include a line like
hs06=99.99  in either mjf file, which will be picked up when populating
/etc/machinefeatures/ and used to create $MACHINEFEATURES/hs06 for the whole
WN. After changing the file you can force an update with  service mjf start
as the mjf script behaves like a SysV service.

By default, the per-job $JOBFEATURES directories will be created under
/tmp/mjf-$USER but you can use a directory other than /tmp by setting 
mjf_tmp_dir=/DESIRED/PATH in either mjf file.

2. Torque/PBS
-------------

The mjf-torque RPM installs /var/lib/torque/mom_priv/prologue.user which is 
run by Torque at the start of each job to create 
$JOBFEATURES=/tmp/mjf-$USER/jobfeatures-$PBS_JOBID (by default), and installs 
/var/lib/torque/mom_priv/epilogue.user which runs at the end of the job to
clean up that directory. 
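
To show how the default directory name is put together, here is a small
Python sketch. It follows the /tmp/mjf-$USER/jobfeatures-$PBS_JOBID pattern
given above and the mjf_tmp_dir setting from the common configuration. The
function name is hypothetical and the real prologue may build the path
differently.

```python
def jobfeatures_dir(user, job_id, mjf_tmp_dir="/tmp"):
    """Return the per-job $JOBFEATURES path for a Torque job.

    user and job_id stand in for $USER and $PBS_JOBID; mjf_tmp_dir is the
    optional override that can be set in either mjf file.
    """
    return "%s/mjf-%s/jobfeatures-%s" % (mjf_tmp_dir.rstrip("/"), user, job_id)
```

Because the job ID is part of the name, concurrent jobs from the same user
each get their own directory under the shared mjf-$USER parent.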

$JOBFEATURES/hs06_job is calculated from $MACHINEFEATURES/hs06 with a pro-rata
share for the job in question, based on $JOBFEATURES/allocated_cpu which is in
turn taken from the Torque ppn for the job (default 1.)
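
One plausible reading of the pro-rata share is sketched in Python below. It
assumes the share is the job's allocated_cpu as a fraction of the node's
total_cpu; the exact formula and any rounding applied by the scripts are not
stated in this README, so treat this as an assumption. The function name is
hypothetical.

```python
def hs06_job(machine_hs06, allocated_cpu, total_cpu):
    """Return the job's share of the node's HS06 rating.

    Assumption: the share is allocated_cpu / total_cpu of the whole-node
    value published in $MACHINEFEATURES/hs06.
    """
    if total_cpu <= 0:
        raise ValueError("total_cpu must be positive")
    return machine_hs06 * allocated_cpu / total_cpu
```

Under that assumption, a 2-processor job on an 8-processor node rated at
100 HS06 would be credited with 25 HS06.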

When creating $MACHINEFEATURES/total_cpu, the /usr/sbin/mjf-get-total-cpu 
script uses the value obtained by running the pbsnodes command for the node.
This can be overridden by setting total_cpu in either mjf file. If the value
cannot otherwise be found, it is obtained by counting 'processor' lines in 
/proc/cpuinfo.
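
The order in which total_cpu is chosen can be sketched in Python as below.
This is not the mjf-get-total-cpu script itself: the batch system value is
passed in as an argument rather than obtained by running pbsnodes, and both
function names are hypothetical.

```python
def count_processor_lines(cpuinfo_text):
    """Count the 'processor' lines in the text of /proc/cpuinfo."""
    return sum(1 for line in cpuinfo_text.splitlines()
               if line.split(":")[0].strip() == "processor")


def choose_total_cpu(configured, batch_value, cpuinfo_text):
    """Pick total_cpu using the precedence described above.

    configured  - total_cpu from an mjf file, or None if not set
    batch_value - value reported by the batch system, or None if unknown
    """
    if configured is not None:
        return int(configured)
    if batch_value is not None:
        return int(batch_value)
    # Last resort: count logical processors listed by the kernel.
    return count_processor_lines(cpuinfo_text)
```

On a real worker node the last argument would be the contents of
/proc/cpuinfo, for example open("/proc/cpuinfo").read().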

3. HTCondor
-----------

The mjf-htcondor RPM installs the /usr/sbin/make-jobfeatures script which must
be run as part of the HTCondor user job wrapper. If a job wrapper is not
already defined, then this can simply be done by setting
USER_JOB_WRAPPER = /usr/sbin/mjf-job-wrapper in the HTCondor configuration.
If a job wrapper is already being used, then it must be modified to run
/usr/sbin/make-jobfeatures in the way mjf-job-wrapper does,
including exporting $JOBFEATURES and $MACHINEFEATURES to the job itself.

$JOBFEATURES/hs06_job is calculated from $MACHINEFEATURES/hs06 with a pro-rata
share for the job in question, based on $JOBFEATURES/allocated_cpu which is in
turn taken from the CpusProvisioned value in the job ad (default 1.)

When creating $MACHINEFEATURES/total_cpu, the /usr/sbin/mjf-get-total-cpu 
script uses the value obtained by running  condor_config_val NUM_CPUS  to
discover the number of logical processors HTCondor can allocate to jobs.
This can be overridden by setting total_cpu in either mjf file. If the value
cannot otherwise be found, it is obtained by counting 'processor' lines in 
/proc/cpuinfo.

4. Grid Engine (in early development!)
--------------------------------------

The mjf-gridengine RPM installs the /usr/sbin/make-jobfeatures script which
must be run as part of the user environment setup to create the $JOBFEATURES
directory. The files /etc/profile.d/mjf.sh and mjf.csh are installed to do this.

NOTE that the value of wall_limit_secs MUST be set in either /etc/sysconfig/mjf
or /var/run/mjf as this value is not supplied to jobs by Grid Engine.

$JOBFEATURES/hs06_job is calculated from $MACHINEFEATURES/hs06 with a pro-rata
share for the job in question, based on $JOBFEATURES/allocated_cpu which is in
turn taken from $NSLOTS set by Grid Engine for the job (default 1.)

Setting total_cpu in either mjf file will set the value to use for
$MACHINEFEATURES/total_cpu. Otherwise it is obtained by counting 'processor'
lines in /proc/cpuinfo.

5. Only Machine Features
------------------------

The mjf-onlymf RPM only installs the common scripts to create
$MACHINEFEATURES/hs06 (if hs06 is defined) and $MACHINEFEATURES/total_cpu.
total_cpu can also be overridden by setting total_cpu in either mjf file. 
If the value cannot otherwise be found, it is obtained by counting 'processor'
lines in /proc/cpuinfo.

$JOBFEATURES is not defined and its files are not created. The mjf-onlymf RPM
should only be used on systems other than Torque/PBS, HTCondor, or Grid
Engine so at least $MACHINEFEATURES is available.

6. DIRAC Benchmark (DB12)
-------------------------

Support for the DIRAC fast benchmark (DB12) is also included, which is
implemented by analogy with HEPSPEC06: $MACHINEFEATURES/db12 and
$JOBFEATURES/db12_job are created if the DB12 measurements are available.
The key/value pairs db12 and db12_job can be included in /etc/sysconfig/mjf
or /var/run/mjf as with hs06 and hs06_job as described above.

However, it will normally be more convenient to create the file /etc/db12/db12
by simply installing the mjf-db12 RPM which runs the DB12 benchmark early in the
boot process when the machine is otherwise idle. The /etc/rc.d/init.d/db12 script 
stores the result in /etc/db12/db12 along with /etc/db12/total_cpu, equal to the
number of DB12 benchmark instances run in parallel to make the measurement.

If /etc/db12/total_cpu exists before the db12 script is run, then it is used
to determine the number of instances to run. Otherwise the number of logical
processors is counted from the operating system and /etc/db12/total_cpu is
created. 
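
The decision described above can be sketched in Python as follows. This is
an illustration only: the real db12 script may count processors differently
(os.cpu_count() is used here as a stand-in for "counted from the operating
system"), and the function name is hypothetical.

```python
import os


def db12_instance_count(db12_dir="/etc/db12"):
    """Return the number of DB12 instances to run in parallel.

    Uses <db12_dir>/total_cpu if it already exists; otherwise counts the
    logical processors, records the count in that file and returns it.
    """
    path = os.path.join(db12_dir, "total_cpu")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Assumption: os.cpu_count() approximates the script's own count.
        count = os.cpu_count() or 1
        os.makedirs(db12_dir, exist_ok=True)
        with open(path, "w") as f:
            f.write("%d\n" % count)
        return count
```

Passing a scratch directory as db12_dir lets the logic be tried out without
touching /etc/db12 on a live machine.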

Since /etc/rc.d/init.d/db12 is run very early in the boot process,
if /etc/db12/total_cpu is different from the number of logical processors,
then it must be created during the original installation (typically by Kickstart)
and not by subsequent configuration by a system such as Puppet which will be
started after db12 has run. 

/etc/db12/total_cpu should match $MACHINEFEATURES/total_cpu so that the number of
DB12 instances run matches the number of processors available to be allocated to
jobs.