Each configuration parameter is a name and a value separated by an equals (=) sign. There must be a space character on each side of the equals sign. Valid configuration parameters look like this:
name = value
The global configuration file is installed by the condor.rpm package. An example of what the global configuration file looks like is in Example A.1, “The default global configuration file”.
MRG Grid locates the global configuration file by checking the following locations, in order:
The file specified by the CONDOR_CONFIG environment variable
/etc/condor/condor_config
/usr/local/etc/condor_config
~condor/condor_config
If a file is specified in the CONDOR_CONFIG environment variable and there is a problem reading that file, MRG Grid will print an error message and exit. It will not continue to search the other options. Leaving the CONDOR_CONFIG environment variable blank will ensure that MRG Grid will search through the other options.
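For example, to point MRG Grid at a configuration file in a non-default location before starting the services, the environment variable can be exported first. This is only an illustrative sketch; the path shown is hypothetical:
# export CONDOR_CONFIG=/srv/condor/condor_config
# service condor start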
The local configuration directory is /etc/condor/config.d. This location is defined by the default global configuration file. The local configuration directory provides an easy way to extend the configuration of MRG Grid by placing files that contain configuration parameters inside the directory.
File names in the local configuration directory use the following numbering convention:
00 - personal condor (included by default)
10-40 - user configuration files
50-80 - MRG Grid package configuration files
99 - Reserved for the remote configuration feature
The LOCAL_CONFIG_FILE parameter in the global configuration file can be used to specify the location of files with configuration to be read. Note that the LOCAL_CONFIG_FILE parameter is used by the remote configuration feature. Do not set this parameter if remote configuration is used.
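As an illustration only (the file path is hypothetical), the parameter could be set in the global configuration file like this:
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local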
The LOCAL_CONFIG_DIR_REGEXP configuration parameter informs MRG Grid of the files to ignore in the local configuration directory. Text editors, install scripts or user activity may leave extra files in the local configuration directory. Without this parameter, MRG Grid is unable to differentiate between the extra files and legitimate configuration files. The default setting of this parameter is for MRG Grid to ignore any files that begin with "." or "#", or end with "~", ".rpmsave" or ".rpmnew". For more information about this parameter, see Section A.3, “System Wide Configuration File Variables”.
For example, to create a new configuration file in the /etc/condor/config.d directory:
# touch /etc/condor/config.d/10myconfigurationfile
Then restart the condor service:
# service condor restart
Stopping condor services:                                  [  OK  ]
Starting condor services:                                  [  OK  ]
Configuration parameters can also be set using environment variables. The variable name must be prefixed with _CONDOR_ or _condor_. MRG Grid parses environment variables last, so any settings made this way will override conflicting settings in the configuration files.
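As a brief sketch, the CONDOR_HOST parameter (used elsewhere in this guide) could be overridden for the current shell session by exporting a prefixed environment variable; the host name shown is illustrative:
$ export _CONDOR_CONDOR_HOST=central-manager.example.com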
Configuration parameters use the following syntax:
SUBSYSTEM_NAME.CONFIG_VARIABLE = VALUE
The following rules apply:
Any line beginning with the # symbol will be treated as a comment and ignored. The # symbol can only appear at the beginning of a line. It will not create a comment if it is used in the middle of a line. For example:
#This is a comment
The SUBSYSTEM_NAME prefix is optional.
There must be a space character on each side of the = sign.
Long entries can be continued by placing a \ character at the end of the line to be continued. For example:
ADMIN_MACHINES = condor.example.com, raven.example.com, \
stork.example.com, ostrich.example.com \
bigbird.example.com
Note that line continuation also applies to comments:
# This comment has line continuation \
characters, so FOO will not be set \
FOO = BAR
Review the configuration in /etc/condor/condor_config and /etc/condor/config.d/00personal_condor.config before starting MRG Grid. The default installation is a Personal Condor. Personal Condor is a specific style of installation suited for individual users who do not have their own pool of machines. To allow other machines to join the pool, specify the ALLOW_WRITE option in the local configuration directory.
Create a new file in the local configuration directory /etc/condor/config.d:
# touch /etc/condor/config.d/10pool_access
The ALLOW_WRITE configuration parameter must be specified in order to allow machines to join your pool and submit jobs. Any machine that you give write access to using the ALLOW_WRITE option should also be given read access using the ALLOW_READ option:
ALLOW_WRITE = *.your.domain.com
It is also possible to set ALLOW_WRITE = * in the configuration file. However, this will allow anyone to submit jobs or add machines to your pool. This is a serious security risk and therefore not recommended.
Configuration values can also be obtained by running a program. To do this, place a | character at the end of the line. This syntax will only work with the configuration variable LOCAL_CONFIG_FILE. For example, to run a program located at /bin/make_the_config, use the following entry:
LOCAL_CONFIG_FILE = /bin/make_the_config|
The program /bin/make_the_config must output the configuration parameters on standard output for the configuration parameters to be included in the configuration.
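A minimal sketch of such a program is shown below. The contents are purely illustrative; the only requirement stated above is that the program writes configuration parameters to standard output:
#!/bin/sh
# Hypothetical /bin/make_the_config: print configuration parameters on standard output
echo "START = TRUE"
echo "SUSPEND = FALSE"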
condor_q
, condor_status
, and condor_userprio
WM_CLOSE
message, or hard-killed automatically based upon policy expressions. For example, a job can be suspended whenever keyboard or mouse, or non-Condor created CPU activity is detected, and continue the job after the machine has been idle for a specified amount of time
run_as_owner
feature is disabled
HKEY_CURRENT_USER
. This can be useful if the job requires direct access to the user's registry entries. It can also be useful when the job requires an application that needs registry access.
condor_startd
, but only operates with the dedicated run account. For security reasons, the profiles are removed after the job has completed and exited. This ensures that any malicious jobs cannot discover the details of any previous jobs, or sabotage the registry for future jobs. It also ensures that the next job has a fresh registry hive.
load_profile = True
.exe
) file.
ActivePerl
. It is also possible to use Windows Scripting Host scripts, although some configuration changes are necessary.
HKEY_CLASSES_ROOT
registry hive.
HKEY_CLASSES_ROOT\FileType\Shell\OpenVerb\Command
OpenVerb
identifies the verb. This is set in the Condor configuration file, and aids the registry lookup.
FileType
is the name of a file type, and is obtained from the file name extension. The file name extension sets the name of the Condor configuration variable. This variable name is of the form:
OPEN_VERB_FOR_EXT_FILES
where EXT represents the file name extension. In the following example, the Open2 verb is specified for a Windows Scripting Host registry lookup for several scripting languages:
OPEN_VERB_FOR_JS_FILES = Open2
OPEN_VERB_FOR_VBS_FILES = Open2
OPEN_VERB_FOR_VBE_FILES = Open2
OPEN_VERB_FOR_JSE_FILES = Open2
OPEN_VERB_FOR_WSF_FILES = Open2
OPEN_VERB_FOR_WSH_FILES = Open2
Open2
verb has been specified instead of the default Open
verb for several scripts, including Windows Scripting Host scripts (using the extension .wsh
). The Open2
verb in Windows Scripting Host scripts allows standard input, standard output, and standard error to be redirected as needed for Condor jobs.
CScript Error: Loading your settings failed. (Access is denied.)
load_profile
command in the job's submit description file:
load_profile = True
ActivePerl
does not by default require access to the user's registry hive.
condor_startd
service on the execute machine spawns a condor_starter
process (referred to as the starter). The starter then creates:
condor-reuse-slotX
, where X
is the slot number of the starter. This account is added to the group Users
.
dir_XXX
, where XXX
is the process ID of the starter. The directory is created in the $(EXECUTE)
directory as specified in the configuration file. Condor then grants write permission to this directory for the user account newly created for the job.
USE_VISIBLE_DESKTOP = True
in the job submit file.
condor_shadow
service (referred to as the shadow), which is running on the submitting machine, and copies the job's executable and input files. These files are placed into the temporary working directory for the job.
$(EXECUTE)/dir_XXX
, where XXX
is the process ID of the starter).
STARTER_UPDATE_INTERVAL
configuration parameter.
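For example, to have the starter send updates every five minutes, the parameter could be set as follows (the value shown is only illustrative, not a stated default):
STARTER_UPDATE_INTERVAL = 300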
condor_submit
command. Once all the output files are safely transferred back, the job is removed from the queue.
condor_startd
is forced to kill the job before all output files are transferred, the job is not removed from the queue but is instead transitioned back to the Idle
state.
condor_startd
vacates a job prematurely, the starter sends a WM_CLOSE
message to the job. If the job spawned multiple child processes, the WM_CLOSE
message is only sent to the parent process (that is, the one started by the starter). The WM_CLOSE
message is the preferred way to terminate a process on Microsoft Windows, since this method allows the job to clean up properly and free any resources that have been allocated.
when_to_transfer_output
is set to the default ON_EXIT
in the submit description file, the job will switch states from Running
to Idle
, and no files will be transferred back. However, if it is set to ALWAYS
, any files in the job's temporary working directory which were changed or modified will be sent back to the submitting machine. The shadow will put the files into a subdirectory under the SPOOL
directory on the submitting machine. The job is then switched back to the Idle
state until a different machine is found on which it can be run. When the job is restarted, Condor puts the executable and input files into the temporary working directory as before, as well as any files stored in the submit machine's SPOOL
directory for that job.
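As a short sketch, the relevant line of a submit description file could look like this; the choice of ALWAYS is only an example of the behaviour described above:
when_to_transfer_output = ALWAYS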
WM_CLOSE
message is sent, the process receiving the message will exit. In some cases, the job can be coded to ignore it and not exit, but in this instance eventually the condor_startd
will hard kill the job (if that is the policy desired by the administrator).
SetConsoleCtrlHandler()
function can be used to intercept a WM_CLOSE
message. A WM_CLOSE
message generates a CTRL_CLOSE_EVENT
. See SetConsoleCtrlHandler()
in the Win32 documentation for more information.
condor_startd
will attempt to clean up. If for some reason the condor_startd
should disappear as well (which is only likely to happen if the machine is suddenly rebooted), the condor_startd
will clean up once Condor has been restarted.
C:\WINNT
, then no Condor job that is run on that machine would be able to write to that location. The only files jobs can access on the execute machine are files accessible by the Users
and Everyone
groups, and files within the job's own temporary working directory.
run_as_owner
. This section outlines several ways to use Condor with networked files.
net use
command and a login and password.
@echo off
net use \\myserver\someshare MYPASSWORD /USER:MYLOGIN
copy \\myserver\someshare\my-program.exe my-program.exe
@echo off
net use \\myserver\someshare
copy \\myserver\someshare\my-program.exe my-program.exe
MYSERVER
as the Condor temporary user. If the GUEST
account is enabled on the server, the user will be authenticated to the server as user GUEST
. Set the access control lists (ACLs) so that the GUEST
user or the EVERYONE
group has access to the share someshare
and the directories and files there. The disadvantage of this method is that the GUEST
account must be enabled on the file server.
NULL
security descriptor.
net use z: \\myserver\someshare /USER:""
z:\my-program.exe
someshare
is in the list of allowed NULL
session shares. To edit the list, run regedit.exe
and navigate to this key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\NullSessionShares
0x00
) and the final entry in the list is terminated with two nulls.
condor_submit
and condor_config
are always clients, sending requests to services. Services can also interact with each other, where the daemon making the request acts as the client.
ALLOW_READ
ALLOW_WRITE
ALLOW_ADMINISTRATOR
ALLOW_CONFIG
ALLOW_SOAP
ALLOW_OWNER
ALLOW_NEGOTIATOR
ALLOW_DAEMON
DENY_READ
DENY_WRITE
DENY_ADMINISTRATOR
DENY_SOAP
DENY_CONFIG
DENY_OWNER
DENY_NEGOTIATOR
DENY_DAEMON
ALLOW_DAEMON
and DENY_DAEMON
settings.
ALLOW_ADVERTISE_MASTER
ALLOW_ADVERTISE_STARTD
ALLOW_ADVERTISE_SCHEDD
DENY_ADVERTISE_MASTER
DENY_ADVERTISE_STARTD
DENY_ADVERTISE_SCHEDD
ALLOW_CLIENT
DENY_CLIENT
ALLOW_CLIENT
and DENY_CLIENT
should be considered from the point of view of a client determining which servers to allow or deny. All authorization settings are defined by a comma-separated list of fully qualified users. Each fully qualified user is described using the following format:
username@domain/hostname
host.example.com
10.1.1.1
*@userdomain/host.example.com
userdomain
, where the command is originating from the machine host.example.com
. Another valid example is:
user@userdomain/*.example.com
userdomain
domain, and issued by user.
network/netmask
DENY_*
parameters take precedence over ALLOW_*
parameters where there is a conflict. This implies that if a specific user is both denied and granted authorization, the conflict is resolved by denying access.
example.com
domain WRITE
access, all machines READ
access, and the user admin
will have ADMINISTRATOR
level access to all machines in the example.com
domain:
ALLOW_WRITE = *.example.com
ALLOW_READ = *
ALLOW_ADMINISTRATOR = admin@*.example.com
READ
READ
access is required to view the status of the pool using condor_status
; check a job queue with condor_q
; view user priorities with condor_userprio
. READ
access will not allow changes to be made, and it will not allow jobs to be submitted.
WRITE
WRITE
access is required for submitting jobs using condor_submit
; advertising a machine so it appears in the pool, which is usually done automatically by the condor_startd
service. The WRITE
context implies READ
access.
ADMINISTRATOR
ADMINISTRATOR
rights are required to change user priorities using condor_userprio -set
; turn Condor on and off using condor_on
and condor_off
. The ADMINISTRATOR
context implies both READ
and WRITE
access.
SOAP
SOAP
is not a general context, and should not be used with configuration variables for authentication, encryption, and integrity checks.
CONFIG
condor_config_val
. By default, this level of access can change any configuration parameters of a Condor pool. The CONFIG
context implies READ
access.
OWNER
OWNER
context can be used to perform commands such as the condor_vacate
command, which causes the condor_startd
to vacate any job currently running on a machine.
DAEMON
condor_startd
sending ClassAd updates to the condor_collector
. This context is only required for the user account that runs the Condor services. The DAEMON
context implies both READ
and WRITE
access. Any configuration setting for this context that is not defined will default to the corresponding setting for the WRITE
context.
NEGOTIATOR
condor_negotiator
service, which runs on the central manager. Commands requiring this context are those that instruct condor_schedd
to begin negotiating, and those that tell an available condor_startd
that it has been matched to a condor_schedd
with jobs to run. The NEGOTIATOR
level of access implies READ
access.
ADVERTISE_MASTER
condor_master
to the collector. Any configuration setting for this context that is not defined will default to the corresponding setting for the DAEMON
context.
ADVERTISE_STARTD
condor_startd
to the collector. Any configuration setting for this context that is not defined will default to the corresponding setting for the DAEMON
context.
ADVERTISE_SCHEDD
condor_schedd
to the collector. Any configuration setting for this context that is not defined will default to the corresponding setting for the DAEMON
context.
CLIENT
Activity | Service | Security Context
---|---|---
Reconfigure a service with condor_reconfig | All services | WRITE
Signalling | All services | DAEMON
Keeps alive | All services | DAEMON
Read configuration | All services | READ
Runtime configuration | All services | ALLOW
Daemon off | All services | ADMINISTRATOR
Fetch or purge logs | All services | ADMINISTRATOR
Activate, request, or release a claim | condor_startd | WRITE
Retrieve startd or job information with condor_preen | condor_startd | READ
Heartbeat | condor_startd | DAEMON
Deactivate claim | condor_startd | DAEMON
condor_vacate (used to stop running jobs) | condor_startd | OWNER
Retrieve negotiation information | condor_startd | NEGOTIATOR
ClassAd commands | condor_startd | WRITE
VM Universe commands | condor_startd | DAEMON
ClassAd commands | condor_starter | WRITE
Hold jobs | condor_starter | DAEMON
Create job security session | condor_starter | DAEMON
Start SSHD | condor_starter | READ
Initiate a new negotiation cycle | condor_negotiator | WRITE
Retrieve the current user priorities with userprio | condor_negotiator | READ
Set user priorities with userprio -set | condor_negotiator | ADMINISTRATOR
Reschedule | condor_negotiator | DAEMON
Reset usage | condor_negotiator | ADMINISTRATOR
Delete user | condor_negotiator | ADMINISTRATOR
Set usage statistics | condor_negotiator | ADMINISTRATOR
Update the condor_collector with new condor_master ClassAds | condor_collector | ADVERTISE_MASTER
Update the condor_collector with new condor_schedd ClassAds | condor_collector | ADVERTISE_SCHEDD
Commands that update the condor_collector with new condor_startd ClassAds | condor_collector | ADVERTISE_STARTD
All other commands that update the condor_collector with new ClassAds | condor_collector | DAEMON
All commands that query the condor_collector for ClassAds | condor_collector | READ
Query the collector | condor_collector | ADMINISTRATOR
Invalidate all AdTypes except STARTD, SCHEDD, or MASTER | condor_collector | DAEMON
Update the collector | condor_collector | ALLOW
Merge STARTD | condor_collector | NEGOTIATOR
Begin negotiating to match jobs | condor_schedd | NEGOTIATOR
Get matches | condor_schedd | DAEMON
Begin negotiation cycle with condor_reschedule | condor_schedd | WRITE
View the status of the job queue | condor_schedd | READ
Reconfigure | condor_schedd | OWNER
File operations: release claim, kill job, or spool job | condor_schedd | WRITE
Reuse shadow | condor_schedd | DAEMON
Update shadow | condor_schedd | DAEMON
Startd heartbeat | condor_schedd | DAEMON
Store credentials | condor_schedd | WRITE
Write to the job queue | condor_schedd | WRITE
Transfer the job queue | condor_schedd | WRITE
All commands | condor_master | ADMINISTRATOR
All high availability functionality | condor_had | DAEMON
Security settings are specified with configuration variables of the form:
SEC_CONTEXT_FEATURE = VALUE
The FEATURE portion is one of:
AUTHENTICATION
ENCRYPTION
INTEGRITY
NEGOTIATION
The CONTEXT portion is one of the following access levels:
CLIENT
READ
WRITE
ADMINISTRATOR
CONFIG
OWNER
DAEMON
NEGOTIATOR
ADVERTISE_MASTER
ADVERTISE_STARTD
ADVERTISE_SCHEDD
DEFAULT
The DEFAULT value provides a way to set a policy for all access levels that do not have a specific configuration variable defined.
The VALUE is one of:
REQUIRED
PREFERRED
OPTIONAL
NEVER
SEC_CLIENT_FEATURE
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_READ_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
condor_schedd
requires authentication to be set, it will have had this policy specified:
SEC_DEFAULT_AUTHENTICATION = REQUIRED
condor_submit
, always require authentication, regardless of the specified policy. Other commands, such as condor_q
, do not always require authentication. In this example, the server's policy would force any condor_q
queries to be authenticated, whereas a different policy could allow condor_q to occur without authentication.
SEC_DEFAULT_AUTHENTICATION
SEC_CLIENT_AUTHENTICATION
SEC_DEFAULT_AUTHENTICATION
SEC_READ_AUTHENTICATION
SEC_WRITE_AUTHENTICATION
SEC_ADMINISTRATOR_AUTHENTICATION
SEC_CONFIG_AUTHENTICATION
SEC_OWNER_AUTHENTICATION
SEC_DAEMON_AUTHENTICATION
SEC_NEGOTIATOR_AUTHENTICATION
SEC_ADVERTISE_MASTER_AUTHENTICATION
SEC_ADVERTISE_STARTD_AUTHENTICATION
SEC_ADVERTISE_SCHEDD_AUTHENTICATION
If no value is set for SEC_access-level_AUTHENTICATION, a default value of OPTIONAL will be used. In this case, authentication is required for any operation which modifies the job queue, such as condor_qedit and condor_rm.
If no value is set for SEC_access-level_AUTHENTICATION_METHODS, a default value of FS, KERBEROS will be defined. On a Microsoft Windows machine, it will default to NTSSPI, KERBEROS.
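As an illustrative example only, the authentication methods for the WRITE access level could be set explicitly; the values shown are drawn from the defaults quoted above:
SEC_WRITE_AUTHENTICATION_METHODS = FS, KERBEROS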
SEC_WRITE_AUTHENTICATION = REQUIRED
This example requires authentication for all commands in the WRITE context.
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_context
_AUTHENTICATION_METHODS SEC_context
_CRYPTO_METHODS
SSL
)
KERBEROS
)
PASSWORD
)
FS
)
FS_REMOTE
)
NTSSPI
)
CLAIMTOBE
)
ANONYMOUS
)
AUTH_SSL_CLIENT_CERTFILE
specifies the location of the client (the process that initiates the connection) certificate file
AUTH_SSL_SERVER_CERTFILE
specifies the location for the server (the process that receives the connection) certificate file
AUTH_SSL_CLIENT_KEYFILE
specifies the location of the client key
AUTH_SSL_SERVER_KEYFILE
specifies the location of the server key
AUTH_SSL_CLIENT_CAFILE
specifies a path and file name for the location of client certificates issued by trusted certificate authorities
AUTH_SSL_SERVER_CAFILE
specifies a path and file name for the location of server certificates issued by trusted certificate authorities
AUTH_SSL_CLIENT_CADIR
specifies a directory containing client certificates that have been prepared using the OpenSSL c_rehash
utility
AUTH_SSL_SERVER_CADIR
specifies a directory containing server certificates that have been prepared using the OpenSSL c_rehash
utility
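The following is a hedged sketch of an SSL configuration using these variables; every path shown is hypothetical:
AUTH_SSL_CLIENT_CERTFILE = /etc/condor/ssl/client.crt
AUTH_SSL_CLIENT_KEYFILE = /etc/condor/ssl/client.key
AUTH_SSL_CLIENT_CAFILE = /etc/condor/ssl/ca.crt
AUTH_SSL_SERVER_CERTFILE = /etc/condor/ssl/server.crt
AUTH_SSL_SERVER_KEYFILE = /etc/condor/ssl/server.key
AUTH_SSL_SERVER_CAFILE = /etc/condor/ssl/ca.crt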
KERBEROS_MAP_FILE = /path/to/etc/condor.kmap
KERB.REALM = UID.domain.name
CS.WISC.EDU = cs.wisc.edu
ENGR.WISC.EDU = ee.wisc.edu
KERBEROS_SERVER_PRINCIPAL
parameter. If a unique name is not defined, then it will default to host
.
KERBEROS_SERVER_PRINCIPAL
parameter is set as:
KERBEROS_SERVER_PRINCIPAL = condor-daemon
condor-daemon/the.host.name@YOUR.KERB.REALM
WRITE
or ADMINISTRATOR
level.
SEC_WRITE_AUTHENTICATION = REQUIRED
SEC_WRITE_AUTHENTICATION_METHODS = KERBEROS
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = KERBEROS
SEC_PASSWORD_FILE
parameter.
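For example (the path is illustrative only):
SEC_PASSWORD_FILE = /var/lib/condor/pool_password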
condor_store_cred -f /path/to/password/file
-c
option:
condor_store_cred -c add
condor_master
is running.
CONFIG
security context is required. The pool password can be set remotely, but this method is only recommended if it takes place using an encrypted channel.
/tmp
directory. The service then checks the ownership of the file. If the file permissions match, the identity is verified and the file system becomes trusted.
/tmp
directory, but is specified by the FS_REMOTE_DIR
configuration variable.
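For example, a hypothetical shared directory could be configured as:
FS_REMOTE_DIR = /shared/condor/fs_remote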
SEC_DEFAULT_ENCRYPTION
SEC_CLIENT_ENCRYPTION
SEC_DEFAULT_ENCRYPTION
SEC_READ_ENCRYPTION
SEC_WRITE_ENCRYPTION
SEC_ADMINISTRATOR_ENCRYPTION
SEC_CONFIG_ENCRYPTION
SEC_OWNER_ENCRYPTION
SEC_DAEMON_ENCRYPTION
SEC_NEGOTIATOR_ENCRYPTION
SEC_ADVERTISE_MASTER_ENCRYPTION
SEC_ADVERTISE_STARTD_ENCRYPTION
SEC_ADVERTISE_SCHEDD_ENCRYPTION
SEC_CONFIG_ENCRYPTION = REQUIRED
SEC_DEFAULT_ENCRYPTION = REQUIRED
SEC_DEFAULT_CRYPTO_METHODS
SEC_CLIENT_CRYPTO_METHODS
SEC_DEFAULT_CRYPTO_METHODS
SEC_READ_CRYPTO_METHODS
SEC_WRITE_CRYPTO_METHODS
SEC_ADMINISTRATOR_CRYPTO_METHODS
SEC_CONFIG_CRYPTO_METHODS
SEC_OWNER_CRYPTO_METHODS
SEC_DAEMON_CRYPTO_METHODS
SEC_NEGOTIATOR_CRYPTO_METHODS
SEC_ADVERTISE_MASTER_CRYPTO_METHODS
SEC_ADVERTISE_STARTD_CRYPTO_METHODS
SEC_ADVERTISE_SCHEDD_CRYPTO_METHODS
3DES
BLOWFISH
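As a brief example, 3DES could be selected for all access levels that do not set a more specific value:
SEC_DEFAULT_CRYPTO_METHODS = 3DES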
SEC_DEFAULT_INTEGRITY
SEC_CLIENT_INTEGRITY
SEC_DEFAULT_INTEGRITY
SEC_READ_INTEGRITY
SEC_WRITE_INTEGRITY
SEC_ADMINISTRATOR_INTEGRITY
SEC_CONFIG_INTEGRITY
SEC_OWNER_INTEGRITY
SEC_DAEMON_INTEGRITY
SEC_NEGOTIATOR_INTEGRITY
SEC_ADVERTISE_MASTER_INTEGRITY
SEC_ADVERTISE_STARTD_INTEGRITY
SEC_ADVERTISE_SCHEDD_INTEGRITY
SEC_CONFIG_INTEGRITY = REQUIRED
SEC_DEFAULT_INTEGRITY = REQUIRED
yum
command:
wallaby
wallaby-utils
condor-wallaby-base-db
condor-wallaby-tools
# yum install wallaby wallaby-utils condor-wallaby-base-db condor-wallaby-tools
# service wallaby start
$ wallaby load /var/lib/condor-wallaby-base-db/condor-base-db.snapshot
name
type
default
description
conflicts
depends
level
must_change
restart
false
, a condor_reconfig
will be issued instead of a condor_restart
for those subsystems.
name
params
conflicts
depends
includes
name
memberships
Internal Default Group
, which will always exist and be applied to all nodes within the store at the lowest priority.
name
params
restart
field.
cm.example.com
has the Collector
feature installed. Because Collector
depends on Master
and NodeAccess
, these features must both be installed on cm.example.com
too.
Scheduler
feature includes JobQueueLocation
and BaseScheduler
. These inclusions are listed in order of priority, from most to least important.
Scheduler
, Wallaby begins with the configuration for the included feature with the lowest priority; in this case it's BaseScheduler
. It then merges the configuration for JobQueueLocation
so that parameter settings from JobQueueLocation
take precedence. Finally, parameters explicitly set in the Scheduler
configuration are merged; doing so ensures they take precedence over mapping set in features included by Scheduler
.
F
is considered to have installed all the features included by F
. As per the example above, Master
and NodeAccess
must be installed on any node with Scheduler
installed as these features are depended on by BaseJobScheduler
. A node that installs Scheduler
, Master
, and NodeAccess
can likewise install JobRouter
(that depends on BaseScheduler
) without explicitly installing BaseScheduler
.
Scheduler
and HAScheduler
features. This is because the features are mutually exclusive.
>=
, &&=
, or ||=
they are not replaced when added to the working configuration. In this case, Wallaby appends the value from the applied configuration to the value in the working configuration.
>=
, Wallaby will append the applied configuration value to the working configuration value, separating them with a comma. This is useful for generating comma-separated lists of values from several Wallaby features. Values beginning with &&=
and ||=
are appended with &&
or ||
as the delimiter between the value in the working configuration and the applied configuration value. These are useful for creating conjunctions or disjunctions of ClassAd expressions.
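The following sketch illustrates the append behaviour; the parameter name and values are purely hypothetical:
# Feature one (lower priority) sets:  MY_LIST >= alpha
# Feature two (higher priority) sets: MY_LIST >= beta
# Resulting working configuration:    MY_LIST = alpha, beta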
Internal Default Group
will always be the lowest priority group that a node is a member of, so it will be inspected first. The store will then evaluate the priority of the groups that the node is a member of, and finally evaluate any features or parameters applied to the node itself. If a parameter has multiple values set in multiple features or groups, the value given in the node's configuration will be the one determined by the highest priority group or feature.
condor_configure_store
condor_configure_store
tool is used to add, remove, and edit parameters, features, groups, subsystems, and nodes in the configuration store. Only one action (add, remove, or edit) can be performed with each command, but multiple targets (parameters, features, groups, subsystems, or nodes) can be acted upon each time.
The condor_configure_store tool does not have to run on the same machine as the configuration store, nor on the same machine as the broker the configuration store is communicating with. By default, it will look for the AMQP broker on the machine it is running on, but it can be instructed to look for the broker in other locations, even if the broker uses a non-standard port.
condor_configure_store
command with the --add
or -a
option, the target type, and the target:
$ condor_configure_store -a -f feature1 -n node1,node2
feature1
and two nodes called node1
and node2
to the configuration store.
$EDITOR
, and will default to vi. See Editing metadata for information about using the editor.
--edit
or -e
option, the target type, and the target:
$ condor_configure_store -e -s subsys1,subsys2 -p param1
subsys1
and subsys2
, and a parameter called param1
.
""
) is given, the condor_configure_store
tool will ask if the parameter should use the default value defined in the parameter's metadata.
--delete
or -d
option, the target type, and the target:
$ condor_configure_store -d -s subsys1,subsys2 -p param1 -f feature1 -n node1,node2
subsys1
and subsys2
, a parameter named param1
, a feature named feature1
, and nodes named node1
and node2
from the configuration store.
--help
or -h
option:
$ condor_configure_store -h
''
or ""
).
-
) followed by a single whitespace and the value. For example:
a_list:
  - value1
  - value2
[]
).
:
) and a single whitespace character. For example:
a_map:
  value1: This is a string
  value2: '4'
{}
)
The condor-wallaby-client package is installed on each node and used to manage configurations for that node. It installs a configuration file in the MRG Grid local configuration directory, which enables it to control configuration for the node. The condor-wallaby-client package contains a service that will check in with the store and listen for configuration change notifications. When it receives a new configuration from the store, the service will write it into the local configuration file for the node. The location of the local configuration file is defined in the configuration file installed by the condor-wallaby-client package.
condor_configure_store
(even if it already exists in the store) in order to change from unprovisioned to provisioned. Nodes are represented by their fully qualified domain names, and each node name in the store must be unique.
yum
command:
condor-wallaby-client
# yum install condor-wallaby-client
QMF_BROKER_HOST = wallaby_broker
condor_configure_pool
The condor_configure_pool tool is used to apply entities in the configuration store to physical nodes. It is also used to manage configurations within the configuration store. Only one node or group can be acted upon with each command, but multiple features and parameters can be acted upon each time.
The condor_configure_pool tool does not have to run on the same machine as the configuration store, nor on the same machine as the broker the configuration store is communicating with. By default, it will look for the AMQP broker on the machine it is running on, but it can be instructed to look for the broker in other locations, even if the broker uses a non-standard port.
condor_configure_pool
command with the --list-all-type
option. The --list-all-type
command will list the all the names of the type specified. It is possible to list more than one type by using successive commands:
$ condor_configure_pool --list-all-nodes --list-all-features
Nodes
in the store and Features
in the store.
-l
, command:
$ condor_configure_pool -l -f Master,NodeAccess
Master
and NodeAccess
.
-a
option with the names of the entities and the target node:
$ condor_configure_pool -a -n node1 -f feature1 -p param1,param2
-i
option with the names of the entities and the target node:
$ condor_configure_pool -i -n node1 -f feature1 -p param1,param2
feature1
and parameters called param1
and param2
to a node called node1
.
-e
option:
$ condor_configure_pool -e -n node1
node1
. The text editor is defined in $EDITOR
, and will default to vi. See Editing metadata for information about using the editor.
--default-group
option as the target instead of a specific node or group name.
Y
to instruct the tool to begin making the changes.
Y
and provide a snapshot name. If a snapshot is not desired, answer N
.
Y
otherwise answer N
.
--activate
option:
$ condor_configure_pool --activate
--activate
option will not generate a snapshot if a configuration is successfully activated.
--delete
or -d
option, the target type, and the target:
$ condor_configure_pool -d -g group1 -f feature1,feature2
feature1
and feature2
from a group of nodes called group1
.
--take-snapshot
option with an appropriate name:
$ condor_configure_pool --take-snapshot "A snapshot name"
A snapshot name
.
--load-snapshot
option with the name of the snapshot to be loaded:
$ condor_configure_pool --load-snapshot "A snapshot name"
A snapshot name
.
--activate
option.
--remove-snapshot
option with the name of the snapshot to be removed:
$ condor_configure_pool --remove-snapshot "A snapshot name"
A snapshot name
.
--help
or -h
option:
$ condor_configure_pool -h
condor_configure_store
condor_configure_store
tool is used to configure parameters, features, groups, subsystems, and nodes in the store. Only one action can be performed at a time with this tool.
condor_configure_pool
The condor_configure_pool tool is used to apply configurations to a specific node or group of nodes. It uses the parameters, features, groups, and nodes stored in the configuration store by the condor_configure_store tool. Only one node or group can be acted upon at a time, but multiple features and parameters can be added at once.
wallaby [broker options] command
[command options]
wallaby dump [outfile]
wallaby dump
command is used to export objects in the store into plain text. The output can be placed into a file and loaded back into the store with wallaby load
.
wallaby load [file]
wallaby load
command is used to load a file generated by wallaby dump
into the configuration store. When a new database is loaded into the store, it will replace the entire contents of the store with the new information.
wallaby inventory
wallaby inventory
tool is used to list details of the nodes being managed by the configuration store. It provides the node name, information about the last time the node checked in with the store, and whether the node was explicitly provisioned in wallaby or whether it checked in for a configuration before wallaby knew about it.
Workers
group, and add all five nodes to the store:
$ condor_configure_store -a -g Workers -n node1,node2,node3,node4,node5
node3
, node4
, and node5
, to the Workers
group by setting their node membership in the editor:
memberships:
  - Workers
node1
the central manager. Answer N
when asked to activate the changes:
$ condor_configure_pool -n node1 -a -f CentralManager
node2
the scheduler. Answer N
when asked to activate the changes:
$ condor_configure_pool -n node2 -a -f Scheduler
Workers
group execute nodes. Answer N
when asked to activate the changes:
$ condor_configure_pool -g Workers -a -f ExecuteNode
$ condor_configure_pool --default-group -a -f Master
CONDOR_HOST
. This must be set for the configuration to be valid, and in this configuration should be node1
. Answer Y
when asked to set the parameter. If the configuration is correct, answer Y
when asked to activate the changes.
NodeAccess
feature. This can be resolved by setting it on the default group:
$ condor_configure_pool --default-group -a -f NodeAccess
ALLOW_READ
and ALLOW_WRITE
. These must be set for the configuration to be valid. Answer Y
when asked to set the parameters. If the configuration is correct, answer Y
when asked to activate the changes.
$ condor_configure_pool --list-all-snapshots
Snapshots:
Automatically generated snapshot at date time -- hash
condor_dagman
daemon. It allows users to submit lightweight jobs to be run immediately, alongside the condor_schedd
daemon on the host machine. Scheduler universe jobs are not matched with a remote machine, and will never be pre-empted.
condor_submit
, which requires a file called a submit description file. The submit description file contains the name of the executable, the initial working directory, and any command-line arguments.
condor_submit
how to run the job, what input and output to use, and where any additional files are located.
This example queues one copy of a program called physica. It uses /dev/null for all STDIN, STDOUT and STDERR. A log file, called physica.log, will be created. When the job finishes, its exit conditions will be noted in the log file. It is recommended that you always have a log file.
Executable = physica
Log = physica.log
Queue
This example queues two copies of a program called mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, STDIN will be test.data, STDOUT will be loop.out, and STDERR will be loop.error. There will be two sets of files written, as the files for each job are written to the individual directories. The job will be run in the vanilla universe.
Executable = mathematica
Universe = vanilla
input = test.data
output = loop.out
error = loop.error
Log = mathematica.log
Initialdir = run_1
Queue
Initialdir = run_2
Queue
This example queues 150 runs of a program called chemistria. STDIN, STDOUT, and STDERR will refer to in.0, out.0 and err.0 for the first run of the program, and in.1, out.1 and err.1 for the second run of the program. A log file will be written to chemistria.log.
Executable = chemistria
Requirements = Memory >= 32 && OpSys == "LINUX" && Arch =="X86_64"
Rank = Memory >= 64
Image_Size = 28 Meg
Error = err.$(Process)
Input = in.$(Process)
Output = out.$(Process)
Log = chemistria.log
Queue 150
Universe=vm
Executable=testvm
Log=$(cluster).kvm.log
VM_TYPE=kvm
VM_MEMORY=512
VM_DISK=/var/lib/libvirt/images/your_image.img:vda:w:raw,/var/lib/libvirt/your_cdrom_data.iso:hdc:r
Queue
condor_submit
command. Full details of the condor_submit
command can be found on the condor_submit manual page.
condor_status
command.
condor_q
command.
condor_rm
command.
JobStatus
will indicate that the job is running.
deferral_prep_time
with an integer expression that evaluates to a number of seconds. At this number of seconds before the deferral time, the job may be matched with a machine.
condor_hold
command is issued, the job is removed from the execution machine and put on hold.
JobStatus
attribute will always show the job as running
when job deferral is used. As of the 2.0 release of MRG Grid, there is no way to distinguish between a job that is executing and a job that has been deferred and is waiting to begin execution.
date
program from the shell prompt with the following syntax:
$ date --date "MM/DD/YYYY HH:MM:SS" +%s
deferral_time = 1199178000
deferral_time = (CurrentTime + 60)
deferral_window = 120
deferral_time = 1262336400
deferral_prep_time = 60
SIGKILL
command. This does not allow the job to perform a graceful shutdown, and is referred to as a hard-kill. To avoid this, it is possible for each job to define a custom kill signal. In this case, when the job is killed, the custom signal will be sent first. This allows the job to perform necessary functions for a graceful shutdown, such as writing out summary data. The starter will wait a period of time after initiating the job termination before determining that the job is not responding and needs to be killed.
kill_sig
parameter:
kill_sig = 3
3
is SIGQUIT
, or:
kill_sig = SIGQUIT
killing_timeout
configuration variable, less one second. It is also possible to set a timeout value in the job description file, using the kill_sig_timeout
parameter. The starter will wait the shorter of the two values.
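As a short sketch, a submit description file could combine the two parameters like this (the values are illustrative):
kill_sig = SIGQUIT
kill_sig_timeout = 30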
condor_submit
daemon, but can also be manually constructed and edited.
condor_status
to view ClassAdscondor_status
command to view ClassAds information from the machines available in the pool.
$ condor_status
Name       Arch     OpSys   State     Activity  LoadAv  Mem   ActvtyTime
adriana.ex x86_64   LINUX   Claimed   Busy      1.000   64    0+01:10:00
alfred.exa x86_64   LINUX   Claimed   Busy      1.000   64    0+00:40:00
amul.examp x86_64   LINUX   Owner     Idle      1.000   128   0+06:20:04
anfrom.exa x86_64   LINUX   Claimed   Busy      1.000   32    0+05:16:22
anthrax.ex x86_64   LINUX   Claimed   Busy      0.285   64    0+00:00:00
astro.exam x86_64   LINUX   Claimed   Busy      0.949   64    0+05:30:00
aura.examp x86_64   LINUX   Owner     Idle      1.043   128   0+14:40:15
condor_status
command has options that can be used to view the data in different ways. The most common options are:
condor_status -available shows only machines which are available to run jobs.
condor_status -run shows only machines which are currently running jobs.
condor_status -l lists the entire machine ClassAds.
$ man condor_status
for a complete list of options.
requirements
and rank
expressions. For machines, this information is determined by the configuration.
rank
expression is used by a job to specify which requirements to use to rank potential machine matches.
rank
expression to set constraints and preferences for jobsrank
expression to specify preferences a job has for a machine.
Requirements = Arch=="x86_64" && OpSys == "LINUX"
Rank = TARGET.Memory + TARGET.Mips
rank
expression. The condor_negotiator
daemon will satisfy the required attributes first, then deliver the best resource available by matching the rank expression.
Friend = Owner == "arthur" || Owner == "lancelot"
ResearchGroup = Owner == "miso" || Owner == "ramen"
Trusted = Owner != "rival" && Owner != "riffraff"
START = Trusted && ( ResearchGroup || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
RANK = Friend + ResearchGroup*10
ResearchGroup
but will never run jobs owned by users rival
and riffraff
. Jobs submitted by Friends
are preferred to foreign jobs, and jobs submitted by the ResearchGroup
are preferred to jobs submitted by Friends
.
condor_status
and condor_q
tools. Some common examples are shown here:
$ man condor_status
and $ man condor_q
for a complete list of options.
Using the condor_status command with the -constraint option:
$ condor_status -constraint 'KeyboardIdle > 20*60 && Memory > 100'
Name       Arch     OpSys   State     Activity  LoadAv  Mem   ActvtyTime
eliza.exam x86_64   LINUX   Claimed   Busy      1.000   128   0+03:45:01
kate.examp x86_64   LINUX   Claimed   Busy      1.000   128   0+00:15:01
mary.examp x86_64   LINUX   Claimed   Busy      1.000   1024  0+01:05:00
beatrice.e x86_64   LINUX   Claimed   Busy      1.000   128   0+01:30:02
[output truncated]
              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
x86_64/LINUX         3      0        3          0        0           0
x86_64/LINUX        21      0       21          0        0           0
x86_64/LINUX         3      0        3          0        0           0
x86_64/LINUX         1      0        0          1        0           0
x86_64/LINUX         1      0        1          0        0           0
Total               29      0       28          1        0           0
ad
contains ClassAd information:
$ cat ad
MyType = "Generic"
FauxType = "DBMS"
Name = "random-test"
Machine = "f05.example.com"
MyAddress = "<128.105.149.105:34000>"
DaemonStartTime = 1153192799
UpdateSequenceNumber = 1
condor_advertise
sends the file ad
containing the ClassAd information to the Collector as a Generic Ad type.
$ condor_advertise UPDATE_AD_GENERIC ad
condor_status
to constrain the search with a regular expression containing a ClassAd function:
$ condor_status -any -constraint 'FauxType=="DBMS" && regexp("random.*", Name, "i")'
MyType               TargetType           Name
Generic              None                 random-test
condor_shared_port
. Most daemons listen on a dynamically assigned port. To send a message, Condor daemons and tools locate the correct port by querying the condor_collector
then extracting the port number from the ClassAd. The full IP address and port number the daemon is listening on is one of the attributes included in every daemon's ClassAd.
condor_collector
all daemons and tools must know the port number the condor_collector
is listening on. The condor_collector
is the only daemon with a well-known fixed port, by default this is port 9618. You can change the port by following the instructions for Changing the Fixed Port for the condor_collector
.
<SUBSYS>_ADDRESS_FILE
configuration variables.
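For example, the address file for the condor_collector could be placed in the log directory. The path shown is illustrative; the variable name follows the <SUBSYS>_ADDRESS_FILE pattern described above:
COLLECTOR_ADDRESS_FILE = $(LOG)/.collector_address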
condor_negotiator
also listened on a fixed, well-known port (the default was 9614). Beginning with Condor version 6.7.5, the condor_negotiator
behaves like all other daemons, and publishes its own ClassAd to the condor_collector
. This includes the dynamically assigned port the condor_negotiator
is listening on. All tools and daemons that communicate with the condor_negotiator
will either use the NEGOTIATOR_ADDRESS_FILE
or will query the condor_collector
for the condor_negotiator
's ClassAd.
condor_ckpt_server
will listen to 4 fixed ports: 5651, 5652, 5653, and 5654. There is currently no way to configure alternative values for any of these ports.
condor_collector
condor_collector
daemon. To use a different port number for this daemon, the configuration variables that relay communication details are modified.
CONDOR_HOST = machX.cs.wisc.edu
COLLECTOR_HOST = $(CONDOR_HOST)
with the following:
CONDOR_HOST = machX.cs.wisc.edu
COLLECTOR_HOST = $(CONDOR_HOST):9650
COLLECTOR_HOST
(including the port). This setting should be modified in the global configuration file, condor_config
. If a single configuration file is not being shared, the value must be duplicated across all configuration files in the pool.
condor_collector
for a remote pool running on a non standard port, any tool that accepts the -pool
argument can optionally be given a port number. For example:
% condor_status -pool foo.bar.org:1234
condor_negotiator
daemon. This section examines the semantics of evaluating constraints.
MyType = "Machine" TargetType = "Job" Machine = "froth.example.com" Arch = "x86_64" OpSys = "LINUX" Disk = 35882 Memory = 128 KeyboardIdle = 173 LoadAvg = 0.1000 Requirements = TARGET.Owner=="smith" || LoadAvg<=0.3 && KeyboardIdle>15*60
UNDEFINED
and ERROR
are used to identify expressions that contain names of attributes that have no associated value or that attempt to use values in a way that is inconsistent with their types.
TRUE
represents 1
and FALSE
represents 0
.
character
"
characters. A \
character can be used as an escape character
UNDEFINED
represents an attribute that has not been given a value.
ERROR
represents an attribute with a value that is inconsistent with its type, or badly constructed.
MY.
or TARGET.
ClassAd A
which is being evaluated in relation to ClassAd B
:
MY.
the attribute will be looked up in ClassAd A
. If the attribute exists in ClassAd A
, the value of the reference becomes the value of the expression bound to the attribute name. If the attribute does not exist in ClassAd A
, the value of the reference becomes UNDEFINED
TARGET.
the attribute is looked up in ClassAd B
. If the attribute exists in ClassAd B
the value of the reference becomes the value of the expression bound to the attribute name. If the attribute does not exist in ClassAd B
, the value of the reference becomes UNDEFINED
ClassAd A
the value of the reference is the value of the expression bound to the attribute name in ClassAd A
ClassAd B
the value of the reference is the value of the expression bound to the attribute name in ClassAd B
CurrentTime
, which evaluates to the integer value returned by the system call time(2)
.
UNDEFINED
ERROR
-
takes the highest precedence in a string. In order, operators take the following precedence:
-
(unary negation)
*
and /
+
(addition) and -
(subtraction)
<
<=
>=
and >
==
!=
=?=
and =!=
&&
||
*
/
+
and -
operate arithmetically on integers and real literals
UNDEFINED
and ERROR
ERROR
==
!=
<=
<
>=
and >
operate on integers, reals and strings
=?=
and =!=
behave similarly to ==
and !=
, but are not strict. Semantically, =?=
tests if its operands have the same type and the same value. For example, 10 == UNDEFINED
and UNDEFINED == UNDEFINED
both evaluate to UNDEFINED
, but 10 =?= UNDEFINED
will evaluate to FALSE
and UNDEFINED =?= UNDEFINED
will evaluate to TRUE
. The =!=
operator tests for not identical conditions
=?=
and =!=
which perform case sensitive comparisons when both sides are strings
ERROR
==
!=
<=
<
and >= >
are strict with respect to both UNDEFINED
and ERROR
&&
and ||
operate on integers and reals. The zero value of these types are considered FALSE
and non-zero values TRUE
UNDEFINED
and ERROR
values when possible. For example, UNDEFINED && FALSE
evaluates to FALSE
, but UNDEFINED || FALSE
evaluates to UNDEFINED
ERROR
operand for a logical operator. For example TRUE && "string"
evaluates to ERROR
ReturnType
FunctionName(ParameterType1
parameter1
,ParameterType2
parameter2,
...)
AnyType
. Where the type is Integer
, but only returns the value 1
or 0
(True
or False
), it is described as Boolean
. Optional parameters are given in square brackets.
AnyType
ifThenElse(AnyType
IfExpr
,AnyType
ThenExpr
, AnyType
ElseExpr
)IfExpr
evaluates to true
, return the value as given by ThenExpr
false
, return the value as given by ElseExpr
UNDEFINED
, return the value UNDEFINED
ERROR
, return the value ERROR
IfExpr
evaluates to 0.0
, return the value as given by ElseExpr
IfExpr
evaluates to a non-0.0
or Real value, return the value as given by ThenExpr
IfExpr
evaluates to give a value of type String
, return the value ERROR
ERROR
Boolean
isUndefined(AnyType
Expr
)True
if Expr
evaluates to UNDEFINED
. Otherwise, returns False
ERROR
Boolean
isError(AnyType
Expr
)True
, if Expr
evaluates to ERROR
. Otherwise, returns False
ERROR
Boolean
isString(AnyType
Expr
)True
if Expr
gives a value of type String
. Otherwise, returns False
ERROR
Boolean
isInteger(AnyType
Expr
)True
, if Expr
gives a value of type Integer
. Otherwise, returns False
ERROR
Boolean
isReal(AnyType
Expr
)True
if Expr
gives a value of type Real
. Otherwise, returns False
ERROR
Boolean
isBoolean(AnyType
Expr
)True
, if Expr
returns an integer value of 1
or 0
. Otherwise, returns False
ERROR
Integer
int(AnyType
Expr
)Expr
Expr
is Real
the value is rounded down to an integer
Expr
is String
the string is converted to an integer using a C-like atoi()
function. If the result is not an integer, ERROR
is returned
Expr
is ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Real
real(AnyType
Expr
)Expr
Expr
is Integer
the return value is the converted integer
Expr
is String
the string is converted to a real value using a C-like atof()
function. If the result is not real
ERROR
is returned
Expr
is ERROR
or UNDEFINED
, ERROR
is returned
ERROR
String
formatTime([ Integer time
] [ , String format
])time
is interpreted as coordinated universe time in seconds, since midnight of January 1, 1970. If not specified, time
will default to the value of attribute CurrentTime
.
strftime
function. It consists of arbitrary text plus placeholders for elements of the time. These placeholders are percent signs (%
) followed by a single letter. To have a percent sign in the output, use a double percent sign (%%
). If the format is not specified, it defaults to %c
(local date and time representation).
strftime()
to implement this, and some versions implement extra, non-ANSI C options, the exact options available to an implementation may vary. An implementation is only required to use the ANSI C options, which are:
%a - abbreviated weekday name
%A - full weekday name
%b - abbreviated month name
%B - full month name
%c - local date and time representation
%d - day of the month (01-31)
%H - hour in the 24-hour clock (00-23)
%I - hour in the 12-hour clock (01-12)
%j - day of the year (001-366)
%m - month (01-12)
%M - minute (00-59)
%p - local equivalent of AM or PM
%S - second (00-59)
%U - week number of the year, with Sunday as the first day of the week (00-53)
%w - weekday as a number (0-6, Sunday is 0)
%W - week number of the year, with Monday as the first day of the week (00-53)
%x - local date representation
%X - local time representation
%y - year without century (00-99)
%Y - year with century
%Z - time zone name, if available
String
string(AnyType
Expr
)Expr
string
value will be converted to a string
Expr
is ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Integer
floor(AnyType
Expr
)Expr
is Integer
, returns the integer that results from the evaluation of Expr
Expr
is anything other than Integer
, function real(Expr)
is called. Its return value is then used to return the largest integer that is not higher than the returned value
Real(Expr)
returns ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Integer
ceiling(AnyType
Expr
)Expr
is Integer
, returns the integer that results from the evaluation of Expr
Expr
is anything other than Integer
, function real(Expr)
is called. Its return value is then used to return the smallest integer that is not less than the returned value
Real(Expr)
returns ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Integer
round(AnyType
Expr
)Expr
is Integer
, returns the integer that results from the evaluation of Expr
Expr
is anything other than Integer
, function real(Expr)
is called. Its return value is then used to return the integer that results from a round-to-nearest rounding method. The nearest integer value to the return value is returned, except in the case of the value at the exact midpoint between two values. In this case, the even valued integer is returned
Real(Expr)
returns ERROR
or UNDEFINED
, or the integer does not fit into 32 bits ERROR
is returned
ERROR
Integer
random([ AnyType
Expr
])Expr
evaluates to Integer
or Real
, the return value is the integer or real r
randomly chosen from the interval 0 <= r < x
random(1.0)
ERROR
ERROR
String
strcat(AnyType
Expr1
[ , AnyType
Expr2
... ])String
by function string(Expr)
ERROR
or UNDEFINED
, ERROR
is returned
String
substr(String s, Integer
offset
[ , Integer
length
])s
, from the position indicated by offset
, with optional length
characters
s
is at offset 0
. If the length
argument is not used, the substring extends to the end of the string
offset
is negative, the value of length - offset
is used for offset
length
is negative, an initial substring is computed, from the offset to the end of the string. Then, the absolute value of length characters are deleted from the right end of the initial substring. Further, where characters of this resulting substring lie outside the original string, the part that lies within the original string is returned. If the substring lies completely outside of the original string, the null string is returned
ERROR
Integer
strcmp(AnyType
Expr1
, AnyType
Expr2
)String
by string(Expr)
0
if Expr1
is less than Expr2
0
if Expr1
is equal to Expr2
0
if Expr1
is greater than Expr2
ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Integer
stricmp(AnyType
Expr1
, AnyType
Expr2
)strcmp
function, except that letter case is not significant
String
toUpper(AnyType
Expr
)String
by the string(Expr)
ERROR
or UNDEFINED
, ERROR
is returned
ERROR
String
toLower(AnyType
Expr
)String
by the string(Expr)
ERROR
or UNDEFINED
, ERROR
is returned
ERROR
Integer
size(AnyType
Expr
)string(Expr)
function
ERROR
or UNDEFINED
, ERROR
is returned
ERROR
, |
" with string listsstringListSize
function to demonstrate how a string delimiter of ", |
" (a comma, followed by a space character, followed by a pipe) operates.
StringListSize("1,2 3|4&5", ", |")
"1" and "2 3|4&5"
"1", "2" and "3|4&5"
"1", "2", "3" and "4&5"
string delimiter
is optional in the following functions. If no string delimiter
is defined, the default string delimiter of " ,
" (a space character, followed by a comma) is used.
Integer
stringListSize(String
list
[ , String
delimiter
])String
list
, as delimited by the String
delimiter
ERROR
ERROR
Integer
stringListSum(String
list
[ , String
delimiter
]) OR Real
stringListSum(String
list
[ , String
delimiter
])String
list
, as delimited by the String
delimiter
ERROR
Real
stringListAve(String
list
[ , String
delimiter
])String
list
, as delimited by the String
delimiter
ERROR
0.0
Integer
stringListMin(String
list
[ , String
delimiter
]) OR Real
stringListMin(String
list
[ , String
delimiter
])String
list
, as delimited by the String
delimiter
ERROR
UNDEFINED
Integer
stringListMax(String
list
[ , String
delimiter
]) OR Real
stringListMax(String
list
[ , String
delimiter
])String
list
, as delimited by the String
delimiter
ERROR
UNDEFINED
Boolean
stringListMember(String
x
, String
list
[ , String
delimiter
])TRUE
if item x
is in the string
list
, as delimited by the String
delimiter
FALSE
if item x
is not in the string
list
strcmp()
function
ERROR
Boolean
stringListIMember(String
x
, String
list
[ , String
delimiter
])stringListMember
function, except that the comparison is done with the stricmp()
function, so letter case is not significant
Option | Description |
---|---|
I or i
| Ignore letter case |
M or m
|
Modifies the interpretation of the carat (^ ) and dollar sign ($ ) characters, so that ^ matches the start of a string, as well as after each new line character and $ matches before a new line character
|
S or s
|
Modifies the interpretation of the period (. ) character to match any character, including the new line character.
|
X or x
|
Ignore white space and comments within the pattern. A comment is defined by starting with the # character, and continuing until the new line character.
|
Boolean regexp(String pattern, String target [ , String options ])
    Returns TRUE if the String target is matched by the regular expression described by pattern. Otherwise returns FALSE. If any argument is not a String, or if pattern does not describe a valid regular expression, returns ERROR.
String regexps(String pattern, String target, String substitute [ , String options ])
    The regular expression pattern is applied to target. If the String target is matched by the regular expression described by pattern, the String substitute is returned, with backslash expansion performed. If any argument is not a String, returns ERROR.
Boolean stringListRegexpMember(String pattern, String list [ , String delimiter ] [ , String options ])
    Returns TRUE if any of the strings within the list is matched by the regular expression described by pattern. Otherwise returns FALSE. If any argument is not a String, or if pattern does not describe a valid regular expression, returns ERROR. The String delimiter is not required; if a specific delimiter is not specified, the default value of " ," (a space character followed by a comma) is used.
Integer time()
    Returns the current time, which is the value of CurrentTime. This is the time, in seconds, since midnight on January 1, 1970.
String interval(Integer seconds)
    Converts seconds into a String of the form days+hh:mm:ss, representing an interval of time. Leading values of zero are omitted from the string.
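For example, interval(100000) evaluates to the string "1+03:46:40", since 100000 seconds is one day, three hours, 46 minutes and 40 seconds.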
condor_startd
daemon is able to divide system resources among all available slots, by changing how they are advertised to the collector for match-making purposes. The following parameter causes all jobs to execute inside a wrapper that enforces limits on RAM, disk space, and swap space.
USER_JOB_WRAPPER=$(LIBEXEC)/condor_limits_wrapper.sh
# service condor restart Stopping condor services: [ OK ] Starting condor services: [ OK ]
nobody
. The nobody
user is often used by the system for other jobs, or non-condor tasks as well, which can make killing processes owned by nobody
complicated. To avoid this issue, create a dedicated low-privilege user account for each job execution slot on every machine. These user accounts can then be used for running jobs instead of the nobody
account.
condorusr1
, and condorusr2
:
# adduser condorusr1 # adduser condorusr2
SLOT1_USER = condorusr1 SLOT2_USER = condorusr2
DEDICATED_EXECUTE_ACCOUNT_REGEXP
configuration variable to the file. This allows condor to kill all the processes belonging to these users when a job has been completed. The DEDICATED_EXECUTE_ACCOUNT_REGEXP
configuration variable uses a regular expression to match the user accounts:
DEDICATED_EXECUTE_ACCOUNT_REGEXP = condorusr[0-9]+
STARTER_ALLOW_RUNAS_OWNER
configuration variable, so that it no longer runs jobs as the job owner, but uses the dedicated user accounts instead:
STARTER_ALLOW_RUNAS_OWNER = False
Tracking process family by login "condorusr1"
750
to 757
:
# groupadd -g 750-757
USE_GID_PROCESS_TRACKING = True MIN_TRACKING_GID = 750 MAX_TRACKING_GID = 757
condor_procd
daemon. If the USE_GID_PROCESS_TRACKING
configuration variable is set to True
, condor_procd
will be used regardless of the setting for USE_PROCD
.
condor_q
).
condor_startd
daemon will spawn a condor_starter
daemon to manage the execution of the job. The job will then be treated as any other, and can potentially be pre-empted by a higher ranking job.
condor_startd
. Job hooks invoked during a job's lifecycle are handled by the condor_starter
daemon.
HOOK_FETCH_WORK
condor_startd
daemon. The FetchWorkDelay
configuration variable determines how long the daemon will wait between attempts to fetch work.
HOOK_REPLY_FETCH
HOOK_FETCH_WORK
job hook, the condor_startd
decides whether to accept or reject the fetched job and uses HOOK_REPLY_FETCH
job hook to send notification of the decision.
condor_startd
will not wait for the results of HOOK_REPLY_FETCH
before performing other actions. The output and exit status of this hook are ignored.
HOOK_EVICT_CLAIM
HOOK_EVICT_CLAIM
is invoked by condor_startd
in order to evict a fetched job. This hook is also advisory in nature.
HOOK_PREPARE_JOB
condor_starter
invokes HOOK_PREPARE_JOB
. This job hook allows commands to be executed to set up the job environment, such as transferring input files.
condor_starter
will wait for HOOK_PREPARE_JOB
to return before it attempts to execute the job. An exit status of 0
indicates that the job has been prepared successfully. If the hook returns with an exit status that is not 0
, an error has occurred and the job will be aborted.
HOOK_UPDATE_JOB_INFO
STARTER_INITIAL_UPDATE_INTERVAL
configuration variable. After the initial interval, further intervals can be adjusted with the STARTER_UPDATE_INTERVAL
configuration variable. Using the default values, the hook would be invoked for the first time eight seconds after the job has begun executing, and then every five minutes (300 seconds) thereafter.
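As a sketch, the default behavior described above corresponds to the following configuration values:
STARTER_INITIAL_UPDATE_INTERVAL = 8
STARTER_UPDATE_INTERVAL = 300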
HOOK_JOB_EXIT
condor_starter
will wait for this hook to return before taking any further action.
condor_startd
will have the same privileges as the condor user (or the privileges of the user running the startd
, if that is a user other than Condor). Job hooks invoked by the condor_starter
will have the same privileges as the job owner.
condor_startd
daemon will attempt to fetch new work in two circumstances:
condor_startd
evaluates its own state; and
condor_starter
exits after completing fetched work.
condor_startd
checks for new work, this can be prevented. This can be achieved by defining the FetchWorkDelay
configuration variable.
FetchWorkDelay
variable is expressed as the number of seconds to wait in between the last fetch attempt completing and attempting to fetch another job.
Defining the FetchWorkDelay configuration variable
This example tells the condor_startd to wait for 300 seconds (5 minutes) between attempts to fetch jobs, unless the slot is marked as Claimed/Idle. In this case, condor_startd should attempt to fetch a job immediately:
FetchWorkDelay = ifThenElse(State == "Claimed" && Activity == "Idle", 0, 300)
FetchWorkDelay
variable is not defined, condor_startd
will default to a 300 second (5 minute) delay between all attempts to fetch work, regardless of the state of the slot.
condor_startd
will use the global keyword defined in the STARTD_JOB_HOOK_KEYWORD
configuration variable.
HOOK_FETCH_WORK
job hook, the condor_startd
daemon will use the keyword for that job to select the hooks required to execute it.
STARTD_JOB_HOOK_KEYWORD = DATABASE SLOT4_JOB_HOOK_KEYWORD = WEB DATABASE_HOOK_DIR = /usr/local/condor/fetch/database DATABASE_HOOK_FETCH_WORK = $(DATABASE_HOOK_DIR)/fetch_work.php DATABASE_HOOK_REPLY_FETCH = $(DATABASE_HOOK_DIR)/reply_fetch.php WEB_HOOK_DIR = /usr/local/condor/fetch/web WEB_HOOK_FETCH_WORK = $(WEB_HOOK_DIR)/fetch_work.php
DATABASE
and WEB
are very generic terms. It is advised that you choose more specific keywords for your own installation.
condor_startd
daemon to implement policies that perform actions such as:
Owner
to Unclaimed
START
expression evaluates to TRUE
.
Unclaimed
to Owner
START
expression evaluates to FALSE
.
Unclaimed
to Matched
Unclaimed
to Claimed
condor_schedd
initiates the claiming procedure before the condor_startd
receives notification of the match from the condor_negotiator
.
Matched
to Owner
START
expression evaluates to FALSE
.
MATCH_TIMEOUT
timer expires. This occurs when a machine has been matched but not claimed. The machine will eventually give up on the match and become available for a new match.
condor_schedd
has attempted to claim the machine but encountered an error.
condor_startd
receives a condor_vacate
command while it is in the Matched
state.
Matched
to Claimed
Claimed
to Pre-empting
Claimed
state, the only possible destination is the Pre-empting
state. This transition can be caused when:
PREEMPT
expression evaluates to TRUE
condor_startd
receives a condor_vacate
command.
condor_startd
is instructed to shut down.
Pre-empting
to Claimed
Pre-empting
to Owner
PREEMPT
expression evaluated to TRUE
while the machine was in the Claimed
state
condor_startd
receives a condor_vacate
command
START
expression evaluates to FALSE
and the job it was running had finished being evicted when it entered the Pre-empting
state.
Owner
Idle
: This is the only possible activity for a machine in the Owner
state. It indicates that the machine is not currently performing any work for MRG Grid, even though it may be working on other unrelated tasks.
Unclaimed
Idle
: This is the normal activity for machines in the Unclaimed
state. The machine is available to run MRG Grid tasks, but is not currently doing so.
Benchmarking
: This activity only occurs in the Unclaimed
state. It indicates that benchmarks are being run to determine the speed of the machine. How often this activity occurs can be adjusted by changing the RunBenchmarks
configuration variable.
Matched
Idle
: Although the machine is matched, it is still considered Idle
, as it is not currently running a job.
Claimed
Idle
: The machine has been claimed, but the condor_starter
daemon, and therefore the job, has not yet been started. The machine will briefly return to this state when the job finishes.
Busy
: The condor_starter
daemon has started and the job is running.
Suspended
: The job has been suspended. The claim is still valid, but the job is not making any progress and MRG Grid is not currently generating a load on the machine.
Retiring
: When an active claim is about to be pre-empted, it enters retirement while it waits for the current job to finish. The MaxJobRetirementTime
configuration variable determines how long to wait. Once the job finishes or the retirement time expires, the Preempting
state is entered.
Preempting
Vacating
: The job that was running is attempting to exit gracefully.
Killing
: The machine has requested the currently running job to exit immediately.
condor_startd
daemon
The condor_startd daemon evaluates a number of expressions in order to determine when to transition between states and activities. The most important expressions are explained here.
condor_startd
daemon represents the machine or slot on which it is running. This daemon is responsible for publishing characteristics about the machine in the machine's ClassAd. To see the values for the attributes, run condor_status -l hostname
from the shell prompt. On a machine with more than one slot, the condor_startd
will regard the machine as separate slots, each with its own name and ClassAd.
condor_negotiator
evaluates expressions in the machine ClassAd against job ClassAds to see if there is a match. By locally evaluating an expression, the machine only evaluates the expression against its own ClassAd. If the expression references parameters that can only be found in another ClassAd, then the expression can not be locally evaluated. In this case, the expression will usually evaluate locally to UNDEFINED
.
The START expression
The most important expression used by the condor_startd daemon is the START
expression. This expression describes the conditions that must be met for a machine to run a job. This expression can reference attributes in the machine ClassAd - such as KeyboardIdle
and LoadAvg
- and attributes in a job ClassAd - such as Owner
, Imagesize
and Cmd
(the name of the executable the job will run). The value of the START
expression plays a crucial role in determining the state and activity of a machine.
IsOwner
expression to determine if it is capable of running jobs. The default IsOwner
expression is defined in terms of the START
expression, as START =?= FALSE
. Any job ClassAd attributes appearing in the START
expression, and subsequently in the IsOwner
expression, are undefined in this context, and may lead to unexpected behavior. If the START
expression is modified to reference job ClassAd attributes, the IsOwner
expression should also be modified to reference only machine ClassAd attributes.
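As a sketch of this advice (the thresholds are illustrative only), a START expression that references a job ClassAd attribute can be paired with an IsOwner expression that uses only machine ClassAd attributes:
START = (KeyboardIdle > 300) && (TARGET.ImageSize <= 1000000)
IsOwner = (KeyboardIdle <= 300)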
REQUIREMENTS
expression
The REQUIREMENTS
expression is used for matching machines with jobs. When a machine is unavailable for further matches, the REQUIREMENTS
expression is set to FALSE
. When the START
expression locally evaluates to TRUE
, the machine advertises the REQUIREMENTS
expression as TRUE
and does not publish the START
expression.
RANK
expression
A machine can be configured to prefer certain jobs over others using the RANK
expression in the machine ClassAd. It can reference any attribute found in either the machine ClassAd or a job ClassAd. The most common use of this expression is to configure a machine so that it prefers to run jobs from the owner of that machine. Similarly, it is often used for a group of machines to prefer jobs submitted by the owners of those machines.
RANK
expression in the machine ClassAd
This example shows a simple use of the RANK
expression, applied to a group of four machines and their owners:
tenorsax is owned by the user coltrane
piano is owned by the user tyner
bass is owned by the user garrison
drums is owned by the user jones
To prefer jobs from these owners, configure the RANK
expression to reference the Owner
attribute, where it matches one of the people in the group:
RANK = Owner == "coltrane" || Owner == "tyner" \ || Owner == "garrison" || Owner == "jones"
Each of the boolean terms evaluates to 1
or 0
(TRUE
or FALSE
). In this case, if the remote job is owned by one of the preferred users, the RANK
expression will evaluate to 1
. If the remote job is owned by any other user, it would evaluate to 0
. The RANK
expression is evaluated as a floating point number, so it will prefer the group users because it evaluates to a higher number.
RANK
expression in the machine ClassAd
This example shows a more complex use of the RANK
expression. It uses the same basic scenario as Example 9.1, “A simple application of the RANK
expression in the machine ClassAd”, but gives the owner a higher priority on their own machine.
bass
, which is owned by the user garrison
. The following entry would need to be included in a file in the local configuration directory on that machine:
RANK = (Owner == "coltrane") + (Owner == "tyner") \ + ((Owner == "garrison") * 10) + (Owner == "jones")
+
operator has higher default precedence than ==
. Using +
instead of ||
allows the system to match some terms and not others.
bass
, the RANK
expression will evaluate to 0
, as all of the boolean terms evaluate to 0
. If the user jones
submits a job, his job would match this machine and the RANK
expression will evaluate to 1
. Therefore, the job submitted by jones
would pre-empt the running job. If the user garrison
(the owner of the machine) later submits a job, the RANK
expression will evaluate to 10
because the boolean term that matches garrison is multiplied by 10. In this case, the job submitted by garrison
will pre-empt the job submitted by jones
.
RANK
expression can reference parameters other than Owner
. If one machine has an enormous amount of memory and others do not have much at all, the RANK
expression can be used to run jobs with larger memory requirements on the machine best suited to it, by using RANK = ImageSize
. This preference will always service the largest of the jobs, regardless of which user has submitted them. Alternatively, a user could specify that their own jobs should run in preference to those with the largest ImageSize
by using RANK = ((Owner == "user_name") * 1000000000000) + ImageSize
.
Owner
state
The Owner
state represents a resource that is currently in use and not available to run jobs. When the startd
is first spawned, the machine will enter the Owner
state. The machine remains in the Owner
state while the IsOwner
expression evaluates to TRUE
. If the IsOwner
expression is FALSE
, then the machine will transition to Unclaimed
, indicating that it is ready to begin accepting jobs.
IsOwner
is optimized to START =?= FALSE
. This causes the machine to remain in the Owner
state as long as the START
expression locally evaluates to FALSE
. If the START
expression locally evaluates to TRUE
or cannot be locally evaluated (in which case, it will evaluate to UNDEFINED
), the machine will transition to the Unclaimed
state. For dedicated resources, the recommended value for the IsOwner
expression is FALSE
.
IsOwner
expression should not reference job ClassAd attributes as this would result in an evaluation of UNDEFINED
.
Owner
state, the startd
polls the status of the machine. The frequency of this is determined by the UPDATE_INTERVAL
configuration variable. The poll performs the following actions:
startd
has any critical tasks that need to be performed when the machine moves out of the Owner
state
Owner
state. Once a job is started, the value of IsOwner
is no longer relevant and the job will either run to completion or be preempted.
Unclaimed
state
The Unclaimed
state represents a resource that is not currently in use by its owner or by MRG Grid.
Unclaimed
state are:
Owner:Idle
Matched:Idle
Claimed:Idle
condor_negotiator
matches a machine with a job, it sends a notification of the match to each. Normally, the machine will enter the Matched
state before progressing to Claimed:Idle
. However, if the job receives the notification and initiates the claiming procedure before the machine receives the notification, the machine will transition directly to the Claimed:Idle
state.
IsOwner
expression is TRUE, the machine is in the Owner
State. When the IsOwner
expression is FALSE
, the machine goes into the Unclaimed
state. If the IsOwner
expression is not present in the configuration files, then the default value is START =?= FALSE
. This causes the machine to transition to the Owner
state when the START
expression locally evaluates to FALSE
.
Owner
and Unclaimed
states. The most obvious difference is how the resources are displayed in condor_status
and other reporting tools. The only other difference is that benchmarking will only be run on a resource that is in the Unclaimed
state. Whether or not benchmarking is run is determined by the RunBenchmarks
expression. If RunBenchmarks
evaluates to TRUE
while the machine is in the Unclaimed
state, then the machine will transition from the Idle
activity to the Benchmarking
activity. Benchmarking performs and records two performance measures:
Idle
.
Unclaimed
state.
BenchmarkTimer
is used in this example, which records the time since the last benchmark. When this time exceeds four hours, the benchmarks will be run again. A weighted average is used, so the more frequently the benchmarks run, the more accurate the data will be.
BenchmarkTimer = (CurrentTime - LastBenchmark) RunBenchmarks = $(BenchmarkTimer) >= (4 * $(HOUR))
RunBenchmarks
is defined and set to anything other than FALSE
, benchmarking will be run as soon as the machine transitions into the Unclaimed
state. To completely disable benchmarks, set RunBenchmarks
to FALSE
or remove it from the configuration file.
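For example, to disable benchmarking entirely:
RunBenchmarks = FALSE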
Matched
state
The Matched
state occurs when the machine has been matched to a job by the negotiator, but the job has not yet claimed the machine. Machines are in this state for a very short period before transitioning.
Claimed:Idle
state. At any time while the machine is in the Matched
state, if the START
expression locally evaluates to FALSE
the machine will enter the Owner
state.
Matched
state will adjust the START
expression so that the requirements evaluate to FALSE
. This is to prevent the machine being matched again before it has been claimed.
startd
will start a timer when a machine transitions into the Matched
state. This is to prevent the machine from staying in the Matched
state for too long. The length of the timer can be adjusted with the MATCH_TIMEOUT
configuration variable, which defaults to 120 seconds (2 minutes). If the job that was matched with the machine does not claim it within this period of time, the machine gives up, and transitions back into the Owner
state. Normally, it would then transition straight back to the Unclaimed
state to wait for a new match.
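For example, to give a matched job only 60 seconds (an illustrative value) to claim the machine:
MATCH_TIMEOUT = 60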
Claimed
state
The Claimed
state occurs when the machine has been claimed by a job. It is the most complex state, with the most possible transitions.
Claimed
state it is in the Idle
activity. If a job has claimed the machine and the claim will be activated, the machine will transition into the Busy
activity and the job started. If a condor_vacate
arrives, or the START
expression locally evaluates to FALSE
, it will enter the Retiring
activity before transitioning to the Pre-empting
state.
Claimed:Busy
, the startd
daemon will evaluate the WANT_SUSPEND
expression to determine which other expression to evaluate. If WANT_SUSPEND
evaluates to TRUE
, the startd
will evaluate the SUSPEND
expression to determine whether or not to transition to Claimed:Suspended
. Alternatively, if WANT_SUSPEND
evaluates to FALSE
the startd
will evaluate the PREEMPT
expression to determine whether or not to skip the Suspended
state and move to Claimed:Retiring
before transitioning to the Preempting
state.
Claimed
state, the startd
daemon will poll the machine for a change in state much more frequently than while in other states. The frequency can be adjusted by changing the POLLING_INTERVAL
configuration variable.
condor_vacate
command affects the machine when it is in the Claimed
state. There are a variety of events that may cause the startd
daemon to try to suspend or kill a running job. Possible causes could be:
startd
has been instructed to shut down
startd
can be configured to handle interruptions in different ways. Activity on the machine could be ignored, or it could cause the job to be suspended or even killed. Desktop machines can benefit from a configuration that goes through successively more dramatic actions to remove the job. The least costly option to the job is to suspend it. If the owner is still using the machine after suspending the job for a short while, then startd
will attempt to vacate the job. Vanilla jobs are sent a soft kill signal, such as SIGTERM
, so that they can gracefully shut down. If the owner wants to resume using the machine, and vacating can not be completed, the startd
will progress to kill the job. Killing is a quick death to a job. It uses a hard-kill signal that cannot be intercepted by the application. For vanilla jobs, vacating and killing are equivalent actions.
Pre-empting
state
The Pre-empting
state is used to evict a running job from a machine, so that a new job can be started. There are two possible activities while in the Pre-empting
state. Which activity the machine is in is dependent on the value of the WANT_VACATE
expression. If WANT_VACATE
evaluates to TRUE
, the machine will enter the Vacating
activity. Alternatively, if WANT_VACATE
evaluates to FALSE
, the machine will enter the Killing
activity.
Pre-empting
state is to remove the condor_starter
associated with the job. If the condor_starter
associated with a given claim exits while the machine is still in the Vacating
activity, then the job has successfully completed a graceful shutdown, and the application was given the opportunity to intercept the soft kill signal.
Pre-empting
state the machine advertises its Requirements
expression as FALSE
, to signify that it is not available for further matches. This is because it is about to transition to the Owner
state, or because it has already been matched with a job that is currently pre-empting and further matches are not allowed until the machine has been claimed by the new job.
Vacating
activity, it continually evaluates the KILL
expression. As soon as it evaluates to TRUE
, the machine will enter the Killing
activity.
Killing
activity it attempts to force the condor_starter
to immediately kill the job. Once the machine has begun to kill the job, the condor_startd
starts a timer. The length of the timer defaults to 30 seconds and can be adjusted by changing the KILLING_TIMEOUT
macro. If the timer expires and the machine is still in the Killing
activity, it is assumed that an error has occurred with the condor_starter
and the startd will try to vacate the job immediately by sending SIGKILL
to all of the children of the condor_starter
and then to the condor_starter
itself.
condor_starter
has killed all the processes associated with the job and exited, and once the schedd that had claimed the machine is notified that the claim is broken, the machine will leave the Killing
activity. If the job was pre-empted because a better match was found, the machine will enter Claimed:Idle
. If the pre-emption was caused by the machine owner, the machine will enter the Owner
state.
The default policy expressions reference the following macros:
MINUTE
    60
HOUR
    (60 * $(MINUTE))
StateTimer
    (CurrentTime - EnteredCurrentState)
ActivityTimer
    (CurrentTime - EnteredCurrentActivity)
ActivationTimer
    (CurrentTime - JobStart)
NonCondorLoadAvg
    (LoadAvg - CondorLoadAvg)
BackgroundLoad
    0.3
HighLoad
    0.5. If NonCondorLoadAvg goes over this, the CPU is considered too busy, and eviction of the job should start.
StartIdleTime
    15 * $(MINUTE)
ContinueIdleTime
    5 * $(MINUTE)
MaxSuspendTime
    10 * $(MINUTE)
MaxVacateTime
    10 * $(MINUTE)
KeyboardBusy
    KeyboardIdle < $(MINUTE). TRUE when the keyboard is being used.
CPUIdle
    $(NonCondorLoadAvg) <= $(BackgroundLoad). TRUE when the CPU is idle.
CPUBusy
    $(NonCondorLoadAvg) >= $(HighLoad). TRUE when the CPU is busy.
MachineBusy
    ($(CPUBusy) || $(KeyboardBusy))
CPUIsBusy
    A boolean advertising the value of CPUBusy.
CPUBusyTime
    The time in seconds since CPUBusy became TRUE. Evaluates to 0 if CPUBusy is FALSE.
WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) )
WANT_VACATE = ( $(ActivationTimer) > 10 * $(MINUTE) || $(IsVanilla) )
START
START = ( (KeyboardIdle > $(StartIdleTime)) \ && ( $(CPUIdle) || (State != "Unclaimed" \ && State != "Owner")) )
SUSPEND
SUSPEND = ( $(KeyboardBusy) || \ ( (CpuBusyTime > 2 * $(MINUTE)) \ && $(ActivationTimer) > 90 ) )
CONTINUE
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \ && (KeyboardIdle > $(ContinueIdleTime)) )
PREEMPT
PREEMPT = ( ((Activity == "Suspended") && \ ($(ActivityTimer) > $(MaxSuspendTime))) \ || (SUSPEND && (WANT_SUSPEND == False)) )
MaxJobRetirementTime
MaxJobRetirementTime = 0
KILL
KILL = $(ActivityTimer) > $(MaxVacateTime)
coltrane
submits a job. When this occurs, the job should start execution immediately, regardless of what else is happening on the machine at that time.
coltrane
should not be suspended, vacated or killed. This is reasonable because coltrane
will only be submitting very short running programs for testing purposes.
START = ($(START)) || Owner == "coltrane" SUSPEND = ($(SUSPEND)) && Owner != "coltrane" CONTINUE = $(CONTINUE) PREEMPT = ($(PREEMPT)) && Owner != "coltrane" KILL = $(KILL)
CONTINUE
or KILL
expressions. Because the jobs submitted by coltrane
will never be suspended, the CONTINUE
expression is irrelevant. Similarly, because the jobs can not be pre-empted, KILL
is irrelevant.
ClockMin
and ClockDay
attributes. These are special attributes which are automatically inserted by the condor_startd
into its ClassAd, so they can always be referenced in policy expressions. ClockMin
defines the number of minutes that have passed since midnight. ClockDay
defines the day of the week, where Sunday = 0, Monday = 1, and so on to Saturday = 6.
WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \ (ClockDay > 0 && ClockDay < 6) ) AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \ (ClockDay == 0 || ClockDay == 6) )
START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )
WANT_SUSPEND = $(AfterHours)
MachineBusy
macro is used to define the SUSPEND
and PREEMPT
expressions. If you have changed these expressions, you will need to add $(WorkHours)
to your SUSPEND
and PREEMPT
expressions as appropriate.
condor_config
with a toggle that can be set in the local configuration directory.
IsDesktop
is configured, make it an attribute of the machine ClassAd:
STARTD_EXPRS = IsDesktop
START = ( ($(CPUIdle) || (State != "Unclaimed" && State != "Owner")) \ && (IsDesktop =!= True || (KeyboardIdle > $(StartIdleTime))) )
WANT_SUSPEND = ( $(SmallJob) || $(JustCpu) \ || $(IsVanilla) )
WANT_VACATE = ( $(ActivationTimer) > 10 * $(MINUTE) \ || $(IsVanilla) )
SUSPEND = ( ((CpuBusyTime > 2 * $(MINUTE)) && ($(ActivationTimer) > 90)) \ || ( IsDesktop =?= True && $(KeyboardBusy) ) )
CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 300) \ && (IsDesktop =!= True || (KeyboardIdle > $(ContinueIdleTime))) )
PREEMPT = ( ((Activity == "Suspended") && \ ($(ActivityTimer) > $(MaxSuspendTime))) \ || (SUSPEND && (WANT_SUSPEND == False)) )
0
with the desired amount of retirement time for dedicated machines. The other part of the expression forces the whole expression to 0
on desktop machines:
MaxJobRetirementTime = (IsDesktop =!= True) * 0
KILL = $(ActivityTimer) > $(MaxVacateTime)
condor_config
, the local configuration directories for desktops can now be configured with the following line:
IsDesktop = True
MAXJOBRETIREMENTTIME = $(HOUR) * 24 * 2
MAXJOBRETIREMENTTIME
, but this can only cause less retirement time to be used, never more than what the machine offers.
MAXJOBRETIREMENTTIME
limits the killing of jobs, but it does not prevent the pre-emption of resource claims. Therefore, it is technically not a way of disabling pre-emption, but simply a way of forcing pre-empting claims to wait until an existing job finishes or runs out of time.
PREEMPT = False
PREEMPTION_REQUIREMENTS = False
RANK = 0
NEGOTIATOR_CONSIDER_PREEMPTION = False
condor_negotiator
gives the condor_schedd
a match to a machine, the condor_schedd
may hold onto this claim indefinitely, as long as the user keeps supplying more jobs to run. To avoid this behavior, force claims to be retired after a specified period of time by setting the CLAIM_WORKLIFE
variable. This enforces a time limit, beyond which no new jobs may be started on an existing claim. In this case, the condor_schedd
daemon is forced to go back to the condor_negotiator
to request a new match. The execute machine configuration would include a line that forces the schedd to renegotiate for new jobs after 20 minutes:
CLAIM_WORKLIFE = 1200
NEGOTIATOR_CONSIDER_PREEMPTION
to False
, as it can potentially lead to some machines never being matched to jobs.
slot1
only runs jobs identified as high priority jobs; slot2
is set to run jobs according to the usual policy and to suspend them when slot1
is claimed. A policy for a machine with more than one physical CPU may be adapted from this example. Instead of having two slots, you would have twice the number of physical CPUs. Half of the slots would be for high priority jobs and the other half would be for suspendable jobs.
NUM_CPUS = 2
slot1
is the high-priority slot, while slot2
is the background slot:
START = (SlotID == 1) && $(SLOT1_START) || \ (SlotID == 2) && $(SLOT2_START)
slot1
if the job is marked as a high-priority job:
SLOT1_START = (TARGET.IsHighPrioJob =?= TRUE)
slot2
if there is no job on slot1
, and if the machine is otherwise idle. Note that the Busy
activity is only in the Claimed
state, and only when there is an active job:
SLOT2_START = ( (slot1_Activity != "Busy") && \ (KeyboardIdle > $(StartIdleTime)) && \ ($(CPUIdle) || (State != "Unclaimed" && State != "Owner")) )
slot2
if there is keyboard activity or if a job starts on slot1
:
SUSPEND = (SlotID == 2) && \ ( (slot1_Activity == "Busy") || ($(KeyboardBusy)) ) CONTINUE = (SlotID == 2) && \ (KeyboardIdle > $(ContinueIdleTime)) && \ (slot1_Activity != "Busy")
IsHighPrioJob
has no special meaning. It is an invented name chosen for this example. To take advantage of the policy, a user must submit high priority jobs with this attribute defined. The following line appears in the job's submit description file as:
+IsHighPrioJob = True
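A minimal submit description file sketch, with illustrative values, that marks a job as high priority for this policy:
universe = vanilla
executable = /bin/sleep
arguments = 600
+IsHighPrioJob = True
queue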
username
@uid_domain
and is assigned a priority value. The priority value is assigned directly to the username and domain, so the same user can submit jobs from multiple machines. The highest possible priority is 1, and the priority decreases as the number rises. There are two priority values assigned to users:
PRIORITY_HALFLIFE
setting, which measures in seconds. For example, if the PRIORITY_HALFLIFE
is set to 86400 (1 day), and a user whose RUP is 10 removes all their jobs and consumes no further resources, the RUP would become 5 in one day, 2.5 in two days, and so on.
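For example, the one-day half-life described above would be configured as:
PRIORITY_HALFLIFE = 86400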
PREEMPTION_REQUIREMENTS
setting in the configuration file. Set this variable to deny pre-emption when the current job has been running for a relatively short period of time. This limits the number of pre-emptions per resource, per time period. There is more information about the PREEMPTION_REQUIREMENTS
setting in Chapter 2, Configuration.
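A sketch of such a policy, which denies pre-emption unless the running job has already been running for at least one hour (an illustrative threshold):
PREEMPTION_REQUIREMENTS = (CurrentTime - EnteredCurrentState) > (1 * $(HOUR))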
condor_negotiator
daemon is responsible for negotiation.
condor_negotiator
daemon performs the following actions, in this order:
condor_negotiator
daemon has finished the initial actions, it will list every job for each submitter, in EUP order. Since jobs can be submitted from more than one machine, there is further sorting. When the jobs all come from a single machine, they are sorted in order of job priority. Otherwise, all the jobs from a single machine are sorted before sorting the jobs from the next machine.
condor_negotiator
will perform the following tasks for each machine in the pool that can execute jobs:
machine.requirements
is false or job.requirements
is false, ignore the machine
Claimed
state, but not running a job, ignore the machine
No Preemption
machine.RANK
on the submitted job is higher than that of the running job, add this machine to the potential match list with a reason of Rank
PREEMPTION_REQUIREMENTS
is true, and the machine.RANK
on the submitted job is higher than the running job, add this machine to the potential match list with a reason of Priority
NEGOTIATOR_PRE_JOB_RANK
job.RANK
NEGOTIATOR_POST_JOB_RANK
No Preemption
Rank
Priority
PREEMPTION_RANK
NEGOTIATE_ALL_JOBS_IN_CLUSTER
can be used to disable this behaviour. The definition of what makes up a cluster can be modified by use of the SIGNIFICANT_ATTRIBUTES
setting.
group_physics
accounting group and belongs to user mcurie
:
+AccountingGroup = "group_physics.mcurie"
AccountingGroup
attribute is a submitter string. It must be enclosed in double quotation marks, may be of arbitrary length and is case sensitive. The name should not be qualified with a domain, as parts of the system will add the $(UID_DOMAIN)
to the string. For example, the statistics for this accounting group might be displayed as follows:
User Name EUP ------------------------------ --------- group_physics@example.com 0.50 mcurie@example.com 23.11 pvonlenard@example.com 111.13 ...
-delete
option with the condor_userprio
daemon. This action will only work if all jobs have already been removed from the accounting group, and the group is identified by its fully-qualified name. For example:
$ condor_userprio -delete group_physics@example.com
AccountingGroup
attribute. Members of a group quota are called group users. When specifying a group user, you will need to include the name of the group, as well as the username, using the following syntax:
+AccountingGroup = "group
.user
"
group_physics
group was submitting a job in a pool that implements group quotas, the submit description file would be:
+AccountingGroup = "group_physics.mcurie"
<none>
group. The <none>
group contains only those users who do not submit jobs as part of a group.
condor_negotiator
daemon will create a list of all jobs belonging to defined groups before it lists those jobs submitted by individual submitters.
NEGOTIATOR_USE_SLOT_WEIGHTS = FALSE
.
GROUP_NAMES
. Future occurrences of accounting group are matched with the Accountant's version in a case-insensitive manner.
a.user # full submitter entry A.user # full submitter entry, distinguished from 'a.user' a # single group entry, matches against 'a' and 'A'
GROUP_PRIO_FACTOR_
setting. Additionally, if a group is currently allocated the entire quota of machines, and a group user has a submitted job that is not running, the GROUP_AUTOREGROUP_
setting, if true, will allow the group to use surplus quota to run extra jobs. Refer to Section 10.3, “Hierarchical Fair Share (HFS)” for a detailed description of the autoregroup
feature and surplus quota sharing.
GROUP_NAMES = group_physics, group_chemistry
GROUP_QUOTA_group_physics = 20
GROUP_QUOTA_group_chemistry = 10
GROUP_PRIO_FACTOR_group_physics = 1.0
GROUP_PRIO_FACTOR_group_chemistry = 3.0
GROUP_AUTOREGROUP_group_physics = FALSE
GROUP_AUTOREGROUP_group_chemistry = TRUE
GROUP_AUTOREGROUP_
settings indicate that the physics group will never be able to access more than 20 machines, while the chemistry group could potentially get more than ten machines.
condor_prio
command. Jobs with a higher number will run with a higher priority. Job priority works only on a per user basis. It is effective when used by a single user to order their own jobs, but will not impact the order in which they run with other jobs in the pool.
condor_q
with the name of the user to query:
$ condor_q user -- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu ID OWNER SUBMITTED CPU_USAGE ST PRI SIZE CMD 126.0 user 4/11 15:06 0+00:00:00 I 0 0.3 hello 1 jobs; 1 idle, 0 running, 0 held
0
. To change the priority use the condor_prio
with the desired priority:
$ condor_prio -p -15 126.0
condor_q
command again:
$ condor_q user -- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu ID OWNER SUBMITTED CPU_USAGE ST PRI SIZE CMD 126.0 user 4/11 15:06 0+00:00:00 I -15 0.3 hello 1 jobs; 1 idle, 0 running, 0 held
HIBERNATE
configuration variable within the context of the slot. This evaluation occurs in the context of each slot, not on the machine as a whole, therefore any slot can veto a change of machine power state. HIBERNATE
can reference a number of variables. Possibilities include changing the power state only if none of the slots are claimed, or if the slots are not in the Owner state. See Section A.10, “condor_startd
Configuration File Macros ” for a list of the supported state names for HIBERNATE
.
HIBERNATE_CHECK_INTERVAL
configuration variable. Power management is enabled when the HIBERNATE_CHECK_INTERVAL
variable contains a non-zero value. The value is an integer representing seconds and may be large or small. A trade-off needs to be made between extra time at a regular power state and the unnecessary computation to reach machine readiness.
HIBERNATE_CHECK_INTERVAL = 20
StartIdleTime
parameter is not always set to True
. Doing this makes it easier to determine whether the machine is in an Unclaimed
state by using an auxiliary macro named ShouldHibernate
.
TimeToWait = (2 * $(HOUR)) ShouldHibernate = ( (KeyboardIdle > $(StartIdleTime)) \ && $(CPUIdle) \ && ($(StateTimer) > $(TimeToWait)) )
ShouldHibernate
returns True
only if the following are all true:
Unclaimed
for more than two hours.
HIBERNATE
will enter the power state RAM
if ShouldHibernate
returns the value True
. If this doesn't occur the machine will remain in its current state.
HibernateState = "RAM" HIBERNATE = ifThenElse($(ShouldHibernate), $(HibernateState), "NONE" )
"NONE"
that slot vetoes the decision to enter a low power state. Only when values returned by slots are all non-zero, is the a decision to enter a low power state made. If all slots agree to enter the low power state, but differ in which state to enter, the largest magnitude value is selected.
condor_power
can wake a machine from a low power state by sending a UDP Wake On LAN (WOL) packet. Full details of the condor_power
command can be found on the condor_power manual page.
condor_power
under specific conditions, condor_rooster
may be used. The configuration options for condor_rooster
are described on the condor_rooster Configuration File Macros manual page.
condor_power
is run from the rooster machine to send the WOL packet to the hibernating machine.
condor_collector
daemon can be configured for a pool to keep a ClassAd
entry for each machine once it has entered hibernation. This is required by condor_rooster
so that it can evaluate the UNHIBERNATE
configuration variable of offline machines.
OFFLINE_LOG
configuration variable. An optional expiration time for each ClassAd
can be specified with OFFLINE_EXPIRE_ADS_AFTER
. The timing begins from the time the hibernating machine's ClassAd
enters the condor_collector
daemon. See Section A.10, “condor_startd
Configuration File Macros ” for the definitions of OFFLINE_LOG
and OFFLINE_EXPIRE_ADS_AFTER
.
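A sketch of these settings, assuming the expiration value is given in seconds (both values are illustrative):
OFFLINE_LOG = $(LOG)/OfflineLog
OFFLINE_EXPIRE_ADS_AFTER = 3600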
pm-utils
is a set of command line tools that can be used to detect and switch power states. In Condor, this is defined by the string "pm-utils"
.
/sys/power
contains virtual files that can be used to detect and set the power states. In Condor, this is defined by the string "/sys"
.
/proc/acpi
contains virtual files that can be used to detect and set the power states. In Condor, this is defined by the string "/proc"
.
LINUX_HIBERNATION_METHOD
with one of the defined strings listed in Section A.10, “condor_startd
Configuration File Macros ”. If no usable methods are detected, or the method specified by LINUX_HIBERNATION_METHOD
is not detected or invalid, hibernation is disabled.
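For example, a minimal sketch that selects the pm-utils method explicitly:
LINUX_HIBERNATION_METHOD = pm-utils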
D_FULLDEBUG
in the relevant subsystem's log configuration.
powercfg
tool can be used to discover the available power states on the machine.
> powercfg -A The following sleep states are available on this system: Standby (S3) Hibernate Hybrid Sleep The following sleep states are not available on this system: Standby (S1) The system firmware does not support this standby state. Standby (S2) The system firmware does not support this standby state.
> powercfg -h on
powercfg
is insufficient for configuring the machine as required, the Power Options control panel application offers the full range of the machine's power management abilities. Windows 2000 and Windows XP lack the powercfg
tool, so all configuration must be done via the Power Options control panel application.
libvirtd
service must be installed and running. This service is provided by the libvirt
package.
pygrub
program must be available. This program executes virtual machines whose disks contain the kernel they will run. This program is provided by the xen
package.
image_type
to the end of the vm_disk
declaration:
vm_disk = file_name:device:permissions:image_type
vm_disk
lsmod | grep kvm
yum install libvirt condor-vm-gahp
condor-vm-gahp
package:
# yum install condor-vm-gahp
VM_TYPE
setting:
VM_TYPE = kvm
VM_TYPE
setting are:
kvm
xen
condor_vm-gahp
and its configuration file, using the VM_GAHP_SERVER
settings:
VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
condor_vm-gahp
logs. By default, logs are written to /dev/null
, effectively disabling logging. Change the value of the VM_GAHP_LOG
to enable logging:
VM_GAHP_LOG = $(LOG)/VMGahpLogs/VMGahpLog.$(USERNAME)
VM_MEMORY = 512
LIBVIRT_XML_SCRIPT
setting:
LIBVIRT_XML_SCRIPT = $(LIBEXEC)/libvirt_simple_script.awk
VM_NETWORKING
parameter to TRUE
, and specify the permitted network types using the VM_NETWORKING_TYPE
parameter:
VM_NETWORKING = TRUE
VM_NETWORKING_TYPE = nat, bridge
VM_NETWORKING_DEFAULT_TYPE
can also be set. This will allow VM Universe jobs to access the network, even if they have not specified a networking type in their submit description file. To define nat
as the default networking type:
VM_NETWORKING_DEFAULT_TYPE = nat
VM_NETWORKING_BRIDGE_INTERFACE
setting will also need to specified. If it is not defined, then bridge networking will be disabled on the execute node. To specify br1
as the network device:
VM_NETWORKING_BRIDGE_INTERFACE = br1
VM_NETWORKING_BRIDGE_INTERFACE
is a string value that must be set to the networking interface that VM Universe jobs (KVM or Xen only) will use for bridge networking. The bridge network interface must be set up by the system administrator prior to setting VM_NETWORKING_BRIDGE_INTERFACE
.
VM_TYPE = kvm MAX_VM_GAHP_LOG = 1000000 VM_GAHP_DEBUG = D_FULLDEBUG VM_MEMORY = 1024 VM_MAX_MEMORY = 1024
mkinitrd
from the shell prompt and loading the xennet
and xenblk
drivers into it.
XEN_BOOTLOADER
. The bootloader allows you to select a kernel instead of specifying the Dom0 configuration, and allows the use of the xen_kernel = included
specification when submitting a job to the VM universe. A typical bootloader is pygrub
:
XEN_BOOTLOADER = /usr/bin/pygrub
VM_TYPE = xen MAX_VM_GAHP_LOG = 1000000 VM_GAHP_DEBUG = D_FULLDEBUG VM_MEMORY = 1024 VM_MAX_MEMORY = 1024 XEN_BOOTLOADER = /usr/bin/pygrub
condor_startd
daemon on the host. You can do this by running condor_restart
. This should be performed on the central manager machine:
$ condor_restart -subsystem startd
condor_startd
daemon is currently servicing jobs it will let them finish running before restarting. If you want to force the condor_startd
daemon to restart and kill any running jobs, add the -fast
option to the condor_restart
command.
condor_startd
daemon will pause while it performs the following checks:
condor_vm-gahp
condor_status
will record the virtual machine type and version number. These details can be displayed by running the following command from the shell prompt:
$ condor_status -vm machine_name
If this command does not display output after some time, it is likely that condor_vm-gahp
is not able to execute the virtualization software. The problem could be caused by configuration of the virtual machine, the local installation, or a variety of other factors. Check the log file (defined in VM_GAHP_LOG
) for diagnostics.
root
user or administrator. These privileges are required to create a virtual machine on top of a Xen kernel, as well as to use the libvirtd
utility that controls creation and management of Xen guest virtual machines.
condor_schedd
daemon; and
condor_negotiator
and condor_collector
daemons.
condor_schedd
daemon controls the job queue. If the job queue is not functioning then the entire pool will be unable to run jobs. This situation can be made worse if one machine is a dedicated submission point for jobs. When a job on the queue is executed, a condor_shadow
process runs on the machine it was submitted from. The purpose of this process is to handle all input and output functionality for the job. However, if the machine running the queue becomes non-functional, condor_shadow
can not continue communication and no jobs can continue processing.
condor_schedd
daemon became available again. By enabling high availability, management of the job queue can be transferred to other designated schedulers and reduce the chance of an outage. If jobs are required to stop without finishing, they can be restarted from the beginning.
condor_schedd
daemon. To prevent multiple instances of condor_schedd
running, a lock is placed on the job queue. When the machine running the job queue fails, the lock is lifted and condor_schedd
is transferred to another machine. Configuration variables are also used to determine the intervals at which the lock expires, and how frequently polling for expired locks should occur.
condor_schedd
daemon is started, the condor_master
will attempt to discover which machine is currently running the condor_schedd
. It does this by working out which machine holds a lock. If no lock is currently held, it will assume that no condor_schedd
is currently running. It will then acquire the lock and start the condor_schedd
daemon. If a lock is currently held by another machine, the condor_schedd
daemon will not be started.
condor_schedd
daemon renews the lock periodically. If the machine is not functioning, it will fail to renew the lock, and the lock will become stale. The lock can also be released if condor_off
or condor_off -schedd
is executed. When another machine that is capable of running condor_schedd
becomes aware that the lock is stale, it will attempt to acquire the lock and start the condor_schedd
.
condor_schedd
daemon and become the single pool submission point:
MASTER_HA_LIST = SCHEDD SPOOL = /share/spool HA_LOCK_URL = file:/share/spool VALID_SPOOL_FILES = $(VALID_SPOOL_FILES), SCHEDD.lock
MASTER_HA_LIST
macro identifies the condor_schedd
daemon as a daemon that should be kept running.
condor_schedd
. SPOOL
identifies the location of the job queue, and needs to be accessible by all High Availability schedulers. This is typically accomplished by placing the SPOOL
directory in a file system that is mounted on all schedulers. HA_LOCK_URL
identifies the location of the job queue lock. Like SPOOL
, this needs to be accessible by all High Availability Schedulers, and is often found in the same location.
SCHEDD.lock
to the VALID_SPOOL_FILES
variables. This is to prevent condor_preen
deleting the lock file because it is not aware of it.
$ condor_submit -remote schedd_name myjob.submit
condor_schedd
running on a single machine. When high availability is configured, there are multiple possible condor_schedd
daemons, with any one of them providing a single submission point.
SCHEDD_NAME
variable in the local configuration of each potential High Availability Scheduler. They will need to have the same value on each machine that could potentially be running the condor_schedd
daemon. Ensure that the value chosen ends with the @
character. This will prevent MRG Grid from modifying the value set for the variable.
SCHEDD_NAME = ha-schedd@
$ condor_submit -remote ha-schedd@ myjob.submit
condor_negotiator
and condor_collector
daemons are critical to a pool functioning correctly. Both daemons usually run on the same machine, referred to as the central manager. If a central manager machine fails, MRG Grid will not be able to match new jobs or allocate new resources. Configuring high availability in a pool reduces the chance of an outage.
condor_collector
daemons, only a single, active condor_negotiator
will be running. The machine with the condor_negotiator
daemon running is the active central manager. All machines running a condor_collector
daemon are idle central managers. All submit and execute machines are configured to report to all potential central manager machines.
condor_had
. The daemons on each of the machines will communicate to monitor the pool and ensure that a central manager is always available. If the active central manager stops functioning, the condor_had
daemons will detect the failure. The daemons will then select one of the idle machines to become the new active central manager.
condor_had
daemons on each side of the partition will choose a new active central manager. As long as the partition exists, there will be an active central manager on each side. When the partition is removed and the network repaired, the condor_had
daemons will be re-organized and ensure that only one central manager is active.
condor_had
daemon is not running
condor_replication
daemon replicates the state information on all potential central manager machines. The condor_replication
daemon needs to be running on the active central manager as well as all potential central managers.
condor_had
daemons to detect a change in the pool state and recover from this change. It is computed using the following formula:
stabilization period = 12 * [number of central managers] * $(HAD_CONNECTION_TIMEOUT)
CENTRAL_MANAGER1 = cm1.example.com CENTRAL_MANAGER2 = cm2.example.com
CONDOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)
COLLECTOR_HOST = $(CONDOR_HOST)
condor_had
will listen on. The port number must match the port number used when defining HAD_LIST
. This port number is arbitrary, but ensure that there are no port number collisions with other applications:
HAD_PORT = 51450 HAD_ARGS = -p $(HAD_PORT)
condor_replication
will listen on. The port number must match the port number specified for the replication daemon in REPLICATION_LIST
. The port number is arbitrary, but ensure that there are no port number collisions with other applications:
REPLICATION_PORT = 41450 REPLICATION_ARGS = -p $(REPLICATION_PORT)
HAD_LIST
. Additionally, for each hostname specify the port number of the condor_replication
daemon running on that host. This parameter is mandatory and has no default value:
REPLICATION_LIST = $(CENTRAL_MANAGER1):$(REPLICATION_PORT),$(CENTRAL_MANAGER2):$(REPLICATION_PORT)
COLLECTOR_HOST
. Additionally, for each hostname specify the port number of the condor_had
daemon running on that host. The first machine in the list will be considered the primary central manager if HAD_USE_PRIMARY
is set to TRUE
:
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT),$(CENTRAL_MANAGER2):$(HAD_PORT)
2
if the central managers are on the same subnet
5
if security is enabled
10
if the network is very slow, or to reduce the sensitivity of the high availability daemons to network failures
HAD_CONNECTION_TIMEOUT = 2
HAD_CONNECTION_TIMEOUT
value too low can cause the condor_had
daemons to incorrectly assume that the other machines have failed. This can result in multiple central managers running at once. Conversely, setting the value too high can create a delay in fail-over due to the stabilization period.
HAD_CONNECTION_TIMEOUT
value is sensitive to the network environment and topology, and should be tuned based on those conditions.
HAD_LIST
as a primary central manager:
HAD_USE_PRIMARY = true
ALLOW_ADMINISTRATOR = $(COLLECTOR_HOST)
condor_negotiator
. These are trusted central managers. The default value is appropriate for most pools:
ALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
HAD = $(SBIN)/condor_had REPLICATION = $(SBIN)/condor_replication
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
DC_DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
The DC_DAEMON_LIST should also include any other daemons running on the node.
HAD_USE_REPLICATION = true
STATE_FILE = $(SPOOL)/Accountantnew.log
REPLICATION_INTERVAL = 300
MAX_TRANSFERER_LIFETIME = 300
condor_had
to wait in between sending ClassAds to the condor_collector
:
HAD_UPDATE_INTERVAL = 300
MASTER_NEGOTIATOR_CONTROLLER = HAD MASTER_HAD_BACKOFF_CONSTANT = 360
condor_negotiator
churning. This occurs when a constant cycling of the daemons stopping and starting prevents the condor_negotiator
from being able to run long enough to complete a negotiation cycle. Churning causes an inability for any job to start processing. Increasing the MASTER_HAD_BACKOFF_CONSTANT
variable can help solve this problem.
MAX_HAD_LOG = 640000
HAD_DEBUG = D_COMMAND
condor_had
:
HAD_LOG = $(LOG)/HADLog
MAX_REPLICATION_LOG = 640000
REPLICATION_DEBUG = D_COMMAND
condor_replication
:
REPLICATION_LOG = $(LOG)/ReplicationLog
NEGOTIATOR_HOST
and CONDOR_HOST
macros:
NEGOTIATOR_HOST =
CONDOR_HOST =
CENTRAL_MANAGER1 = cm1.example.com CENTRAL_MANAGER2 = cm2.example.com
COLLECTOR_HOST = $(CENTRAL_MANAGER1),$(CENTRAL_MANAGER2)
condor_negotiator
. These are trusted central managers. The default value is appropriate for most pools:
ALLOW_NEGOTIATOR = $(COLLECTOR_HOST)
HAD_USE_REPLICATION
configuration variable to FALSE
. This will disable replication at the configuration level.
REPLICATION
from both the DAEMON_LIST
and DC_DAEMON_LIST
in the configuration file.
HAD
, REPLICATION
, and NEGOTIATOR
settings from the DAEMON_LIST
configuration variable on all machines except the primary machine. This will leave only one condor_negotiator
remaining in the pool.
condor_off -all -subsystem negotiator
condor_off -all -subsystem replication
condor_off -all -subsystem had
condor_had
, condor_replication
and condor_negotiator
daemons.
condor_on -subsystem negotiator
on the machine where the single condor_negotiator
is going to operate.
concurrency_limits
parameter in the job submit file. The concurrency_limits
parameter references a value in the configuration file. The value of the concurrency_limits
parameter can be a floating point number and a job submit file can also reference more than one limit.
condor_negotiator
uses the information in the submit file when attempting to match the job to a resource. Firstly, it checks that the limits have not been reached. It will then store the limits of the job in the matched machine ClassAd.
condor_negotiator
daemon's configuration file. The important configuration variables for concurrency limits are:
*_LIMIT
    Where * is the name of the limit. This variable sets the allowable number of concurrent jobs for jobs that reference this limit in their submit file. Any number of *_LIMIT variables can be set, as long as they all have different names.
CONCURRENCY_LIMIT_DEFAULT
    Sets a default limit. Jobs that reference a limit with no corresponding *_LIMIT variable will use the default limit.
Using *_LIMIT and CONCURRENCY_LIMIT_DEFAULT
This example uses the *_LIMIT and CONCURRENCY_LIMIT_DEFAULT configuration variables. Y_LIMIT is set to 2 and CONCURRENCY_LIMIT_DEFAULT to 1. In this case, any job that includes the line concurrency_limits = y in the submit file will have a limit of 2. All other jobs that reference a limit other than Y will be limited to 1:
CONCURRENCY_LIMIT_DEFAULT = 1 Y_LIMIT = 2
*_LIMIT
variable can also be set without the use of CONCURRENCY_LIMIT_DEFAULT
. With the following configuration, any job that includes the line concurrency_limits = x
in the submit file will have a limit of 5. All other jobs that have a limit other than X
will not be limited:
X_LIMIT = 5
machine_count
. The following example gives the job cluster a single license.
machine_count = 4
concurrency_limits = license:0.25
concurrency_limits
attribute references the *_LIMIT
variables:
universe = vanilla executable = /bin/sleep arguments = 28 concurrency_limits = Y, x, z queue 1
condor_submit
will sort the given concurrency limits and convert them to lowercase:
$ condor_submit job Submitting job(s). 1 job(s) submitted to cluster 28. $ condor_q -long 28.0 | grep ConcurrencyLimits ConcurrencyLimits = "x,y,z"
condor_config_val
. In this case, three configuration variables need to be set. Set the ENABLE_RUNTIME_CONFIG
variable to TRUE
:
ENABLE_RUNTIME_CONFIG = TRUE
CONFIG
access level. This allows you to change the limit from that machine:
ALLOW_CONFIG = $(CONDOR_HOST)
NEGOTIATOR.SETTABLE_ATTRS_CONFIG = *_LIMIT
$ condor_config_val -negotiator -rset "X_LIMIT = 3"
condor_negotiator
to pick up the changes:
$ condor_reconfig -negotiator
condor_userprio
command with the -l
option:
$ condor_userprio -l | grep ConcurrencyLimit ConcurrencyLimit.p = 0 ConcurrencyLimit.q = 2 ConcurrencyLimit.x = 6 ConcurrencyLimit.y = 1 ConcurrencyLimit.z = 0
X
limit, two are using the Q
limit, and none are using the Z
or P
limits. The limits with zero users are returned because they have been used at some point in the past. If a limit has been configured but never used, it will not appear in the list.
X
limit, and X_LIMIT
value is changed to a lower number, all of the original jobs will continue to run. However, no more matches will be accepted against the X
limit until the number of running jobs drops below the new value.
PartitionableSlot=TRUE
and the dynamic slots will have an attribute stating DynamicSlot=TRUE
. These attributes can be used in a START
expression to create detailed policies.
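For example, a hypothetical policy that keeps jobs requesting more than four CPUs away from dynamic slots could be written as follows (illustrative only):
START = $(START) && ( (DynamicSlot =!= TRUE) || (TARGET.RequestCpus <= 4) )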
SLOT_TYPE_X
, SLOT_TYPE_X
_PARTITIONABLE
, NUM_SLOTS
, and NUM_SLOTS_TYPE_X
configuration variables. The X
refers to the number of the slot being configured.
SLOT_TYPE_1 = cpus=100%,disk=100%,swap=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
condor
service:
# service condor restart
Stopping condor services: [ OK ]
Starting condor services: [ OK ]
request_cpus
1
request_memory
ImageSize
or JobVMemory
parameters.
request_disk
DiskUsage
parameter.
JobA:
universe = vanilla
executable = ...
...
request_cpus = 3
request_memory = 1024
request_disk = 10240
...
queue
# Add a second slot dedicated to GPU jobs.
# Subtract from the slot1 dynamic slot totals.
SLOT_TYPE_1 = cpus=25%,disk=25%,swap=25%
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS = 1
# Label the slot as a GPU slot.
SLOT1_GPU = True
SLOT1_STARTD_ATTRS = GPU
# Only accept jobs with NeedsGPU on GPU slots
# Allow non-NeedsGPU jobs on non-GPU slots
STARTD_JOB_EXPRS = NeedsGPU
START = $(START) && TARGET.NeedsGPU
# Carve out the rest of the resources as dynamic partitionable slots
SLOT_TYPE_2 = cpus=75%,disk=75%,swap=75%
SLOT_TYPE_2_PARTITIONABLE = True
NUM_SLOTS_TYPE_2 = 1
NUM_SLOTS = 1
$(attribute_name)
syntax. For example, if a machine named claimedidle
has been idle for ten minutes and met the Idle for long time
trigger, the following syntax:
$(Machine) has been Claimed/Idle for $(TriggerdActivityTime) seconds
claimedidle has been Claimed/Idle for 600 seconds
absent nodes
. This feature requires that all machines in the pool are configured to work with the MRG Management Console. For further information refer to MRG Management Console Installation Guide and Chapter 4, Remote Configuration.
QMF_BROKER_HOST = ip/hostname_of_broker
ENABLE_ABSENT_NODES_DETECTION = TRUE
QMF_BROKER_HOST
is the name or IP address of an AMQP broker communicating with all the nodes in the pool and the remote configuration store.
STARTD_CRON_NAME = TRIGGER_DATA
STARTD_CRON_AUTOPUBLISH = If_Changed
TRIGGER_DATA_JOBLIST = GetData
TRIGGER_DATA_GETDATA_PREFIX = Triggerd
TRIGGER_DATA_GETDATA_EXECUTABLE = $(BIN)/get_trigger_data
TRIGGER_DATA_GETDATA_PERIOD = 5m
TRIGGER_DATA_GETDATA_RECONFIG = FALSE
DAEMON_LIST = $(DAEMON_LIST), TRIGGERD
DC_DAEMON_LIST =+ TRIGGERD
DATA = $(SPOOL)
DATA
sets the location for the trigger service to save the configured triggers. If not specified, DATA
defaults to the same directory as $(SPOOL)
.
TRIGGERD_DEFAULT_EVAL_PERIOD
QMF_BROKER_HOST =ip/hostname_of_broker
QMF_BROKER_PORT =broker_listen_port
condor_trigger_config
tool:
$ /usr/sbin/condor_trigger_config -i broker
broker
parameter should be the name of the broker that communicates with the trigger service.
condor_trigger_config
command are:
Trigger Name: ClassAd Query:
High CPU Usage (TriggerdLoadAvg1Min > 5)
Low Free Mem (TriggerdMemFree <= 10240)
Low Free Disk Space (/) (TriggerdFilesystem_slash_Free < 10240)
Busy and Swapping (State == \"Claimed\" && Activity == \"Busy\" && TriggerdSwapInKBSec > 1000 && TriggerdActivityTime > 300)
Busy but Idle (State == \"Claimed\" && Activity == \"Busy\" && CondorLoadAvg < 0.3 && TriggerdActivityTime > 300)
Idle for long time (State == \"Claimed\" && Activity == \"Idle\" && TriggerdActivityTime > 300)
dprintf Logs (TriggerdCondorLogDPrintfs != \"\")
Core Files (TriggerdCondorCoreFiles != \"\")
Logs with ERROR entries (TriggerdCondorLogCapitalError != \"\")
Logs with error entries (TriggerdCondorLogLowerError != \"\")
Logs with DENIED entries (TriggerdCondorLogCapitalDenied != \"\")
Logs with denied entries (TriggerdCondorLogLowerDenied != \"\")
Logs with WARNING entries (TriggerdCondorLogCapitalWarning != \"\")
Logs with warning entries (TriggerdCondorLogLowerWarning != \"\")
Logs with stack dumps (TriggerdCondorLogStackDump != \"\")
condor_trigger_config
command with the -a
option. Specify the name, query, and trigger text, in the following syntax:
$ condor_trigger_config -a -n name -q query -t text broker
Replace name with the name of the trigger, replace query
with the ClassAd query (which must evaluate to TRUE
for the trigger to run), and text
with the string to be raised in the event. The broker
parameter should be the name of the broker that communicates with the trigger service.
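As an illustration only (the broker host name is hypothetical, and the query reuses an attribute from the default trigger set), a new trigger could be added like this:
$ condor_trigger_config -a -n 'Very High CPU Usage' -q '(TriggerdLoadAvg1Min > 10)' -t '$(Machine) has a very high load average' broker.example.com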
condor_trigger_config
command with the -l
option, in the following syntax:
$ condor_trigger_config -l broker
broker
parameter should be the name of the broker that communicates with the trigger service.
condor_trigger_config
command with the -d
option. Specify the ID number of the trigger, in the following syntax:
$ condor_trigger_config -d ID broker
Replace ID with the unique ID number of the trigger. The broker
parameter should be the name of the broker that communicates with the trigger service.
reply-to
field set, or the jobs will not run. They must also include a unique message ID. If data needs to be submitted with the job, it will need to be compressed and the archive placed in the body of the message.
carod
daemon controls the communication between MRG Messaging and MRG Grid. It will look for parameters in the condor configuration files first. It will then look for its own configuration file at /etc/condor/carod.conf
. This file controls the active broker and other options such as the exchange name, message queue and IP information.
grid
, it could be split into two queues named grid_high
and grid_low
. The client program would then be able to use the appropriate routing key to get to the needed queue. This can also be achieved through the API.
yum
to install these components:
# yum install condor-low-latency
# yum install condor-job-hooks
# yum install python-condorutils
CAROD = $(SBIN)/carod
CAROD_BROKER_IP = <broker ip>
CAROD_BROKER_PORT = 5672
CAROD_BROKER_QUEUE = grid
CAROD_IP = 127.0.0.1
CAROD_PORT = 10000
CAROD_QUEUED_CONNECTIONS = 5
CAROD_LEASE_TIME = 35
CAROD_LEASE_CHECK_INTERVAL = 30
JOB_HOOKS_IP = 127.0.0.1
JOB_HOOKS_PORT = $(CAROD_PORT)
DAEMON_LIST = $(DAEMON_LIST), CAROD
# Startd hooks
LOW_LATENCY_HOOK_FETCH_WORK = $(LIBEXEC)/hooks/hook_fetch_work.py
LOW_LATENCY_HOOK_REPLY_FETCH = $(LIBEXEC)/hooks/hook_reply_fetch.py
# Starter hooks
LOW_LATENCY_JOB_HOOK_PREPARE_JOB = $(LIBEXEC)/hooks/hook_prepare_job.py
LOW_LATENCY_JOB_HOOK_UPDATE_JOB_INFO = \
    $(LIBEXEC)/hooks/hook_update_job_status.py
LOW_LATENCY_JOB_HOOK_JOB_EXIT = $(LIBEXEC)/hooks/hook_job_exit.py
STARTD_JOB_HOOK_KEYWORD = LOW_LATENCY
CAROD_LOG = $(LOG)/CaroLog
MAX_CAROD_LOG = 1000000
JOB_HOOKS_LOG = $(LOG)/JobHooksLog
MAX_JOB_HOOKS_LOG = 10000000
FetchWorkDelay
setting. This setting controls how often the condor-low-latency feature will look for jobs to execute, in seconds:
FetchWorkDelay = ifThenElse(State == "Claimed" && Activity == "Idle", 0, 10)
STARTER_UPDATE_INTERVAL = 30
# service qpidd start
Starting qpidd daemon: [ OK ]
condor
service:
# service condor restart
Stopping condor services: [ OK ]
Starting condor services: [ OK ]
condor_submit
command with the -dump
option:
$ condor_submit myjob.submit -dump output_file
output_file
. This file contains the information from myjob.submit in a format suitable for placing directly into the application header of a message. This method only works when queuing a single message at a time.
myjob.submit
should only have one queue
command with no arguments. For example:
executable = /bin/echo
arguments = "Hello there!"
queue
Configuration variable | Data type | Description
---|---|---
CAROD_BROKER_IP | IP Address | The IP address of the broker that carod is talking to
CAROD_BROKER_PORT | Integer | The port on $(CAROD_BROKER_IP) that the broker is listening to
CAROD_BROKER_QUEUE | String | The queue on the broker for condor jobs
CAROD_IP | IP Address | The IP address of the interface carod is using for connections
CAROD_PORT | Integer | The port carod is listening to for connections
CAROD_QUEUED_CONNECTIONS | Integer | The number of allowed outstanding connections
CAROD_LEASE_TIME | Integer | The maximum amount of time (in seconds) a job is allowed to run without providing an update
CAROD_LEASE_CHECK_INTERVAL | Integer | How often (in seconds) carod checks for lease expiration
CAROD_LOG | String | The location of the file carod should use for logging
MAX_CAROD_LOG | Integer | The maximum size of the carod log file before being rotated
JOB_HOOKS_IP | IP Address | The IP address where carod is listening for connections
JOB_HOOKS_PORT | Integer | The port carod is listening to for connections
JOB_HOOKS_LOG | String | The location of the log file for the job hooks to use for logging
MAX_JOB_HOOKS_LOG | Integer | The maximum size of the job hooks log before rotating
application_headers
field.
The example also sets the reply-to field to a uniquely named queue. It also ensures that the AMQP message has a unique ID.
work_headers = {}
work_headers['Cmd'] = '"/bin/sleep"'
work_headers['Arguments'] = '"5"'
work_headers['Iwd'] = '"/tmp"'
work_headers['Owner'] = '"nobody"'
work_headers['JobUniverse'] = 5
message_props = session.message_properties(application_headers=work_headers)
replyTo = str(uuid4())
message_props.reply_to = session.reply_to('amq.direct', replyTo)
message_props.message_id = uuid4()
condor_schedd
in the same way as ordinary MRG Grid jobs. The condor_schedd
launches the DAG job inside the scheduler universe.
# this file is called diamond.dag
JOB A A_job.submit
JOB B B_job.submit
JOB C C_job.submit
JOB D D_job.submit
PARENT A CHILD B C
PARENT B C CHILD D
condor_submit_dag
tool.
condor_submit_dag
tool.
condor_submit_dag
command with the name of the DAG submit description file:
$ condor_submit_dag diamond_dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor    : diamond_dag.condor.sub
Log of DAGMan debugging messages          : diamond_dag.dagman.out
Log of Condor library output              : diamond_dag.lib.out
Log of Condor library error messages      : diamond_dag.lib.err
Log of the life of condor_dagman itself   : diamond_dag.dagman.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 30072.
-----------------------------------------------------------------------
condor_submit_dag
tool will provide a summary of the submission, including the location of the log files.
condor_dagman
log labeled Log of the life of condor_dagman itself
and referred to as the lifetime log. This file is used to coordinate job execution.
ERROR: a cycle exists in the DAG
.
condor_submit_dag
command, all files will be submitted as one large DAG submission:
$ condor_submit_dag dag_file1, dag_file2, dag_file3
condor_dagman
process.
condor_submit_dag
command multiple times, specifying each individual DAG submit description file. In this case, ensure that the DAG submit description files and job names are all unique, to avoid log and output files being overwritten.
condor_q
command. By specifying the username, the results will show only jobs submitted by that user:
$ condor_q daguser
29017.0  daguser  6/24 17:22  4+15:12:28 H  0  2.7  condor_dagman
29021.0  daguser  6/24 17:22  4+15:12:27 H  0  2.7  condor_dagman
29030.0  daguser  6/24 17:22  4+15:12:34 H  0  2.7  condor_dagman
30047.0  daguser  6/29 09:13  0+00:01:56 R  0  2.7  condor_dagman
30048.0  daguser  6/29 09:13  0+00:01:07 R  0  2.7  condor_dagman
30049.0  daguser  6/29 09:14  0+00:01:07 R  0  2.7  condor_dagman
30050.0  daguser  6/29 09:14  0+00:01:06 R  0  2.7  condor_dagman
30051.0  daguser  6/29 09:14  0+00:01:06 R  0  2.7  condor_dagman
30054.0  daguser  6/29 09:15  0+00:00:01 R  0  0.0  uname -n
30055.0  daguser  6/29 09:15  0+00:00:00 R  0  0.0  uname -n
30056.0  daguser  6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30057.0  daguser  6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30058.0  daguser  6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30059.0  daguser  6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30060.0  daguser  6/29 09:15  0+00:00:00 I  0  0.0  uname -n
15 jobs; 5 idle, 7 running, 3 held
condor_q
command lists the supervising condor_dagman
jobs.
condor_q
command with the -dag
option:
$ condor_q -dag daguser
29017.0  daguser  6/24 17:22  4+15:12:28 H  0  2.7  condor_dagman -f -
29021.0  daguser  6/24 17:22  4+15:12:27 H  0  2.7  condor_dagman -f -
29030.0  daguser  6/24 17:22  4+15:12:34 H  0  2.7  condor_dagman -f -
30047.0  daguser  6/29 09:13  0+00:01:50 R  0  2.7  condor_dagman -f -
30057.0   |-B0    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30058.0   |-C0    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30048.0  daguser  6/29 09:13  0+00:01:01 R  0  2.7  condor_dagman -f -
30055.0   |-A1    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30049.0  daguser  6/29 09:14  0+00:01:01 R  0  2.7  condor_dagman -f -
30056.0   |-A2    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30050.0  daguser  6/29 09:14  0+00:01:00 R  0  2.7  condor_dagman -f -
30059.0   |-B3    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30060.0   |-C3    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
30051.0  daguser  6/29 09:14  0+00:01:00 R  0  2.7  condor_dagman -f -
30054.0   |-A4    6/29 09:15  0+00:00:00 I  0  0.0  uname -n
15 jobs; 7 idle, 5 running, 3 held
-dag
option will show the DAGMan processes that are running, with their associated node listings. In this example, Job 30047.0
is processing child nodes labeled B0
and C0
. These are the names they were given in the DAG submit file. Each DAG manages its own set of nodes, so the names of nodes can be traced back to the DAG submit file for each.
condor_rm
command with the job number. When a DAG job is removed, all jobs associated with it will also be removed.
$ condor_rm 29017.0
-no_submit
option. This instructs DAGman to generate a submission file that can be used by DAGman, but does not submit the job to MRG Grid. This is an advanced feature that allows additional editing of the original DAG submit file prior to the submission. It can also be used by external tooling for pre-processing and deferred DAG submissions:
$ condor_submit_dag -no_submit diamond_dag
"
).
VARS
declaration:
JOB A A_job.submit
JOB B B_job.submit
JOB C C_job.submit
JOB D D_job.submit
PARENT A CHILD B C
PARENT B C CHILD D
VARS A dataset="small_data.txt"
# node job filename: A_job.submit
executable = A_job
log = A.log
error = A.err
arguments = $(dataset)
queue
A
is executed, it will be launched as:
A_job small_data.txt
A_job.submit
file and not the others at runtime.
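Several VARS lines can be used so that each node receives its own value. For instance (the file names are illustrative):
VARS A dataset="small_data.txt"
VARS B dataset="big_data.txt"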
DONE
. Rescue files have the same filename as the original DAG, with .rescueXXX
appended, and by default are written to the same directory as the original DAG submit file.
$ condor_submit_dag diamond_dag
diamond_dag.rescue003
present and no file with a larger increment, then the DAG will be recovered from that file.
condor_submit_dag
command line by using the command:
$ condor_submit_dag -DoRescueFrom 2 diamond_dag
Specify the rescue number as 1, 2, or 3 and not 001, 002, or 003.
PRE
and POST
lines to the DAG submit file:
JOB A A_job.submit
JOB B B_job.submit
JOB C C_job.submit
JOB D D_job.submit
PARENT A CHILD B C
PARENT B C CHILD D
SCRIPT PRE C setup_data.sh $JOB
SCRIPT POST C teardown_data.sh $JOB
$JOB
argument in the script, which represents a job name. This can be useful for using the job name as part of an external filename or directory.
setup_data.sh
might be something like this:
#!/bin/csh
tar -C staging/$argv[1] -zxf /mnt/storage/$argv[1]/data.tar.gz
teardown_data.sh
might be:
#!/bin/csh
rm -fr staging/$argv[1]
# inner.dag
JOB X X_job
JOB Y Y_job
JOB Z Z_job
PARENT X CHILD Y
PARENT Y CHILD Z
SUBDAG EXTERNAL
command:
# outer.dag
JOB A A_job.submit
SUBDAG EXTERNAL B inner.dag
JOB C C_job.submit
JOB D D_job.submit
PARENT A CHILD B C
PARENT B C CHILD D
condor_submit_dag
, its rescue file will be run, which in turn will lead to the rescue file of the inner DAG also being run.
condor_dagman
process. This additional overhead can potentially put a strain on machine resources. An alternative is to use splicing instead of nesting. Splicing includes an external DAG definition inside another. The included nodes become part of a larger DAG that can all be managed by a single condor_dagman
process. If one DAG fails, there will be a single rescue file that represents the state of all node jobs in the spliced DAG.
SPLICE
in the following syntax:
SPLICE <splice name> <DAG file name>
# big.dag
JOB A A_job.submit
SPLICE B inner.dag
JOB C C_job.submit
JOB D D_job.submit
PARENT A CHILD B C
PARENT B C CHILD D
+
character between the splice name and the original DAG job name.
$ENV
syntax. If not done already, ensure that the variables to be used have been set in your Bash shell as follows:
$ export MYEXE="/bin/sleep"
$ export MYARGS="10"
# the dag job node file: dag_job.sub
executable = $ENV(MYEXE)
arguments = $ENV(MYARGS)
output = dags/out/dag_job.out.$(cluster)
error = dags/err/dag_job.err.$(cluster)
# log path can't use macro
log = dags/log/diamond_dag.log
universe = vanilla
notification = NEVER
should_transfer_files = true
when_to_transfer_output = on_exit
queue
Notification = Error
signal 9
. What's wrong?
condor_q -analyze
and condor_q -better-analyze
to check the output they give you
log = path/to/filename.log
in the submit file. From this file you should be able to tell if the jobs are starting to run, or if they are exiting before they begin.
Lost priority, no more jobs
.
No more machines
.
SCHEDD_LOG
file:
[date] [time] Swap space estimate reached! No more jobs can be run!
[date] [time] Solution: get more swap space, or set RESERVED_SWAP = 0
[date] [time] 0 jobs matched, 1 jobs idle
$ condor_status -schedd [hostname] -long | grep VirtualMemory
0
, then you will need to tell the system that it has some swap space. This can be done in two ways:
0
, and change the RESERVED_SWAP
configuration variable to 0
. You will need to perform condor_restart
on the submit machine to pick up the changes.
arch
and opsys
are not specified in the submit description file, they will be added. It will insert the same platform details as the machine from which the job was submitted.
Memory * 1024 > ImageSize
is automatically added. This makes sure that the job runs on a machine with at least as much physical memory as the memory footprint of the job.
Disk >= DiskUsage
is not specified, it will be added. This makes sure that the job will only run on a machine with enough disk space for the job's local input and output.
APPEND_REQUIREMENTS
APPEND_REQ_VANILLA
APPEND_REQ_STANDARD
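As a sketch, an administrator could append an extra clause to the requirements of every submitted job, for example to insist on a minimum amount of machine memory (the value is illustrative):
APPEND_REQUIREMENTS = (Memory >= 512)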
remove_kill_sig = SIGWHATEVER
kill_sig = SIGWHATEVER
SIGTERM
signal will be used. In the case of a hard kill, the SIGKILL
signal is sent instead.
Some time fields in the output of condor_status appear as [?????]. This can be fixed by synchronizing the time on all machines in the pool, using a tool such as NTP (Network Time Protocol).
condor_collector
daemon, but can not find one. If you are not running a condor_collector
daemon, change the COLLECTOR_HOST
configuration variable to nothing:
COLLECTOR_HOST=
condor_submit
is automounted under NFS (Network File System), Condor might try to unmount the volume before the job has completed.
initialdir
command in your submit description file with a reference to the stable access point. For example, if the NFS automounter is configured to mount a volume at /a/myserver.company.com/vol1/user
whenever the directory /home/user
is accessed, add this line to the submit description file:
initialdir = /home/user
main()
method and waits for it to return. When it returns, Condor considers your job to have been completed. This can happen inadvertently if the main()
method is starting threads for processing. To avoid this, ensure you join()
all threads spawned in the main()
method.
$ env CONDOR_CONFIG=ONLY_ENV condor_config_val -dump
PERMISSION DENIED
. What does that mean?
ALLOW_*
and DENY_*
are not configured correctly. Check these parameters and set ALLOW_*
and DENY_*
as appropriate.
condor_status
. What is wrong?
DaemonCore: PERMISSION DENIED to host 128.105.101.15:9618
for command 0 (UPDATE_STARTD_AD)
DEFAULT_DOMAIN_NAME
setting in the configuration file
ALLOW_WRITE
and DENY_WRITE
configuration macros
ALLOW_WRITE = condor.your.domain.com
ALLOW_WRITE = *.your.domain.com
numactl
or taskset
. If you are running jobs from within your own program, use sched_setaffinity
and pthread_{,attr_}setaffinity
to achieve the same result.
schedd
keeps on trying to start but exits with a status 0
. Why is this happening?
schedd
, before it starts on Node B. On node B, the schedd
continually attempts to start and exits with status 0
.
schedd
names. In this case, the schedd
on Node B will continually try to start, but will not be able to because of lock conflicts.
schedd
on both nodes. This will make the schedd
on Node B realize that one is already running, and it doesn't need to start. Change the SCHEDD_NAME
configuration entry on both nodes so that the name is identical.
Only the two nodes providing high availability need to share this SCHEDD_NAME, so you can have HA (on two nodes) and other schedds elsewhere.
condor_startd
crashes. Why does this happen?
procd
. The startd will always wait the value specified in the killing_timeout
parameter before hard-killing the starter. However, by default the starter will wait one second less than that value (killing_timeout - 1) before attempting to hard-kill the job. This means that it is sometimes possible for the startd to be attempting to hard-kill the starter, while the starter is cleaning up and exiting. It causes the starter to stop communicating with the procd
, which makes the startd suffer a communication failure, and then crash.
STARTD.USE_PROCD = FALSE
and STARTER.USE_PROCD = FALSE
in the configuration settings. This is the most reliable way to handle the situation.
kill_sig_timeout
set to a reasonable time in the submit description file. This will require adjustment, as the timing can be dependent on the jobs running, and the load on the startd. Also, kill_sig_timeout
cannot be a larger value than killing_timeout-1
.
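A minimal sketch of the workaround described above, combining the configuration settings with a submit file entry (the timeout value is illustrative):
# In the configuration
STARTD.USE_PROCD = FALSE
STARTER.USE_PROCD = FALSE
# In the submit description file
kill_sig_timeout = 30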
chmod =x
)
D_HOSTNAME
to the logging for the daemon(s) having problems. You can force MRG Grid to use a specific network interface by setting the parameter NETWORK_INTERFACE
to the IP address or network interface device that MRG Grid should use.
FULL_HOSTNAME
HOSTNAME
IP_ADDRESS
TILDE
condor
user.
STARTD
SCHEDD
MASTER
COLLECTOR
NEGOTIATOR
KBDD
SHADOW
STARTER
GRIDMANAGER
HAD
REPLICATION
JOB_ROUTER
ARCH
OPSYS
uname
command
UNAME_ARCH
uname
command's machine
field
UNAME_OPSYS
uname
command's sysname
field
PID
PPID
USERNAME
FILESYSTEM_DOMAIN
UID_DOMAIN
COLLECTOR_HOST
condor_collector
is running for the pool. COLLECTOR_HOST
must be defined for the pool to work properly.
condor_collector
. The port is separated from the host name by a colon. To set the network port to 1234, use the following syntax:
COLLECTOR_HOST = $(CONDOR_HOST):1234
CONDOR_VIEW_HOST
CondorView
server is running. This service is optional, and requires additional configuration to enable it. If CONDOR_VIEW_HOST
is not defined, no CondorView
server is used.
CONDOR_VIEW_CLASSAD_TYPES
CONDOR_VIEW_HOST
. The ad types can be seen with the condor_status -any
command. The default forwarding behavior of the Collector is equivalent to:
CONDOR_VIEW_CLASSAD_TYPES=Machine,Submitter
RELEASE_DIR
bin
, etc
, lib
and sbin
directories. There is no default value for RELEASE_DIR
.
BIN
LIB
LIBEXEC
INCLUDE
SBIN
SBIN
should also be in the path of users acting as administrators.
LOCAL_CONFIG_DIR
LOCAL_DIR
$(TILDE)
, in this format:
LOCAL_DIR = $(TILDE)
$(HOSTNAME)
macro and have a directory with many sub-directories, one for each machine in your pool. For example:
LOCAL_DIR = $(tilde)/hosts/$(hostname)
LOCAL_DIR = $(release_dir)/hosts/$(hostname)
LOCAL_CONFIG_DIR_EXCLUDE_REGEXP
LOCAL_CONFIG_DIR
If a match is made against a file's name, that file is not read by MRG Grid. The default setting excludes files that begin with ".
" or "#
" or end with "~
", ".rpmsave
" or ".rpmnew
".
LOCAL_CONFIG_DIR
is read. Therefore, for this parameter it is acceptable to edit the global configuration file to change the default value if necessary. In general, customization of parameters should be done in the LOCAL_CONFIG_DIR
itself.
LOG
$(LOG)
macro.
SPOOL
condor_schedd
are stored, including the job queue file and the initial executables of any jobs that have been submitted. If a given machine executes jobs but does not submit them, it does not require a SPOOL
directory.
EXECUTE
EXECUTE
directory. To customize the execute directory independently for each batch slot, use SLOTx_EXECUTE
.
REQUIRE_LOCAL_CONFIG_FILE
LOCAL_CONFIG_FILE
cannot be located. If the value is set to false, MRG Grid will ignore any local configuration files that cannot be located and continue. If LOCAL_CONFIG_FILE
is not defined, and REQUIRE_LOCAL_CONFIG_FILE
has not been explicitly set to false, an error will be caused.
CONDOR_IDS
CONDOR_IDS
environment variable. The syntax is:
CONDOR_IDS = UID.GID
CONDOR_IDS = 1234.5678
CONDOR_IDS
is not set and the daemons are run by the root user, MRG Grid will search for a condor user on the system, and use that UID and GID.
CONDOR_ADMIN
CONDOR_SUPPORT_EMAIL
Email address of the local MRG Grid administrator: admin@example.com
CONDOR_ADMIN
.
MAIL
/bin/mail
. The email client must be able to accept mail messages and headers as standard input (STDIN
) and use the -s
command to specify a subject for the message. On all platforms, the default shipped with MRG Grid should work. This setting will only need to be changed if the installation is in a non-standard location. The condor_schedd
will not function unless MAIL
is defined.
RESERVED_SWAP
RESERVED_DISK
LOCK
LOCK
entry to avoid problems with file locking.
LOCK
is provided, the value of LOG
is used.
HISTORY
condor_schedd
to append information, and condor_history
the user-level program used to view the file. The default value is $(SPOOL)/history
. If not defined, no history file will be kept.
ENABLE_HISTORY_ROTATION
MAX_HISTORY_LOG
to define the size of the file and MAX_HISTORY_ROTATIONS
to define the number of files to use when rotation is enabled.
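For example, the following sketch keeps the history in files of roughly 20 MB, with up to three rotated files (the values are illustrative):
ENABLE_HISTORY_ROTATION = TRUE
MAX_HISTORY_LOG = 20000000
MAX_HISTORY_ROTATIONS = 3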
MAX_HISTORY_LOG
MAX_HISTORY_ROTATIONS
MAX_JOB_QUEUE_LOG_ROTATIONS
NO_DNS
DEFAULT_DOMAIN_NAME
.
DEFAULT_DOMAIN_NAME
NO_DNS
is set to true.
EMAIL_DOMAIN
notify_user
in the submit description file, MRG Grid will send any email about that job to username
@UID_DOMAIN
. If all the machines share a common UID domain, but email to this address will not work, you will need to define the correct domain to use. In many cases, you can set EMAIL_DOMAIN
to FULL_HOSTNAME
.
CREATE_CORE_FILES
LOG
directory in the case of a segmentation fault (segfault). When set to false no core files will be created. When left undefined, it will retain the setting that was in effect when the Condor daemons were started. Core files are used primarily for debugging purposes.
ABORT_ON_EXCEPTION
CREATE_CORE_FILES
is also true, MRG Grid will create a core file when an exception occurs.
Q_QUERY_TIMEOUT
condor_q
will wait when trying to connect to condor_schedd
, before causing a timeout error. Defaults to 20 seconds.
DEAD_COLLECTOR_MAX_AVOIDANCE_TIME
condor_collector
daemon. If connections to the dead daemon take very little time to fail, new query attempts become more frequent. Defaults to 3600 (1 hour).
NETWORK_MAX_PENDING_CONNECTS
condor_schedd
can try to connect to large numbers of startds
when claiming them. The negotiator may also connect to large numbers of startds
when initiating security sessions. Defaults to 80% of the process file descriptor limit, except on Windows operating systems, where the default is 1600.
WANT_UDP_COMMAND_SOCKET
MASTER_INSTANCE_LOCK
condor_master
daemons from starting. This is useful when using shared file systems like NFS, where the lock files exist on a local disk. Defaults to $(LOCK)/InstanceLock
. The $(LOCK)
macro can be used to specify the location of all lock files, not just the condor_master
instance lock. If $(LOCK)
is undefined, the master log itself will be locked.
SHADOW_LOCK
ShadowLog
file. It must be a separate file from the ShadowLog
, since the ShadowLog might be rotated and access will need to be synchronized across rotations. This macro is defined relative to the $(LOCK)
macro.
LOCAL_QUEUE_BACKUP_DIR
LOCAL_XACT_BACKUP_FILTER
ALL
local transaction backups will always be kept. When set to NONE
local transaction backups will never be kept. When set to FAILED
local transaction backups will be kept for transactions that have failed to commit.
LOCAL_QUEUE_BACKUP_DIR
must be set to a valid directory and LOCAL_XACT_BACKUP_FILTER
must be set to something other than NONE
.
X_CONSOLE_DISPLAY
:0.0
.
SUBSYSTEM
with the name of the appropriate subsystem.
SUBSYSTEM
_LOG
STARTD_LOG
gives the location of the log file for the condor_startd
daemon.
MAX_SUBSYSTEM
_LOG
.old
. The .old
files are overwritten each time the log is saved, thus the maximum space devoted to logging for any one program will be twice the maximum length of its log file. A value of 0
specifies that the file may grow without bounds. Defaults to 1MB.
TRUNC_SUBSYSTEM
_LOG_ON_OPEN
TRUE
, the log will be restarted with an empty file every time the program is run. When FALSE
new entries will be appended. Defaults to FALSE
.
SUBSYSTEM
_LOCK
SUBSYSTEM
_LOG
file, since that file can be rotated and synchronization should occur across log file rotations. A lock file is only required for log files which are accessed by more than one process. Currently, this includes only the SHADOW subsystem. This macro is defined relative to the LOCK
macro.
FILE_LOCK_VIA_MUTEX
TRUE
logs are able to be locked using a mutex instead of by file locking. This can correct problems on Windows platforms where processes starve waiting for a lock on a log file. Defaults to TRUE
on Windows platforms. Always set to FALSE
on Unix platforms.
ENABLE_USERLOG_LOCKING
TRUE
the job log specified in the submit description file is locked before being written to. Defaults to TRUE
.
TOUCH_LOG_INTERVAL
touch
command) log files, in seconds. The change in last modification time for the log file is useful when a daemon restarts after failure or shut down. The last modification date is printed, and it provides an upper bound on the length of time that the daemon was not running. Defaults to 60 seconds.
LOGS_USE_TIMESTAMP
TRUE
, Unix Epoch Time is used. When FALSE
, the time is printed in the local timezone using the syntax:
[Month]/[Day]/[Year] [Hour]:[Minute]:[Second]
FALSE
.
SUBSYSTEM
_DEBUG
D_ALWAYS
. This logs all messages. Settings are a comma or space-separated list of these values:
D_ALL
D_FULLDEBUG
D_DAEMONCORE
D_PRIV
D_COMMAND
D_LOAD
condor_startd
records the load average on the machine where it is running. Both the general system load average, and the load average being generated by MRG Grid activity are determined. With this flag set, the condor_startd
will log a message with the current state of both of these load averages whenever it computes them. This flag only affects the condor_startd
subsystem.
D_KEYBOARD
condor_startd
subsystem.
D_JOB
condor_schedd
sends to claim the condor_startd
. This flag only affects the condor_startd
subsystem.
D_MACHINE
condor_schedd
sends to claim the condor_startd
. This flag only affects the condor_startd
subsystem.
D_SYSCALLS
D_MATCH
condor_negotiator
.
D_NETWORK
D_HOSTNAME
D_SECURITY
D_PROCFAMILY
D_ACCOUNTANT
D_PROTOCOL
D_PID
D_PID
is set, the process identifier (PID) of the process writing each line to the log file will be recorded.
D_FDS
D_FDS
is set, the file descriptor that the log file was allocated will be recorded.
ALL_DEBUG
ALL_DEBUG = D_ALL
.
TOOL_DEBUG
SUBSYSTEM
_DEBUG
to describe the amount of debugging information sent to STDERR
for Condor tools.
SUBMIT_DEBUG
SUBSYSTEM
_DEBUG
to describe the amount of debugging information sent to STDERR
for condor_submit
.
SUBSYSTEM
_[LEVEL]
_LOG
SUBSYSTEM
_DEBUG
, then all messages of this debug level will be written both to the SUBSYSTEM
_LOG
file and the SUBSYSTEM
_[LEVEL]
_LOG
file.
MAX_SUBSYSTEM
_[LEVEL]
_LOG
MAX_SUBSYSTEM
_LOG
.
TRUNC_SUBSYSTEM
_[LEVEL]
_LOG_ON_OPEN
TRUNC_SUBSYSTEM
_LOG_ON_OPEN
.
EVENT_LOG
MAX_EVENT_LOG
.old
. The .old
files are overwritten each time the log is saved. A value of 0
allows the file to grow continuously. Defaults to 1MB.
EVENT_LOG_USE_XML
TRUE
, events are logged in XML format. Defaults to FALSE
.
EVENT_LOG_JOB_AD_INFORMATION_ATTRS
JobAdInformationEvent
. This new event is placed in the event log in addition to each logged event.
ALLOW...
ALLOW
or DENY
are settings for host-based security.
ENABLE_RUNTIME_CONFIG
condor_config_val
tool has an option -rset
for dynamically setting run-time configuration values (which only affect the in-memory configuration variables). Because of the potential security implications of this feature, Condor daemons will not honor these requests by default. To use this functionality, administrators must specifically enable it by setting ENABLE_RUNTIME_CONFIG
to True
, and specifying which configuration variables can be changed using the SETTABLE_ATTRS...
family of configuration options (described below). This setting defaults to False
.
ENABLE_PERSISTENT_CONFIG
condor_config_val
tool has a -set
option for dynamically setting persistent configuration values. These values override options in the normal configuration files. Because of the potential security implications of this feature, by default, Condor daemons will not honor these requests. To use this functionality, administrators must specifically enable it by setting ENABLE_PERSISTENT_CONFIG
to True
, creating a directory where the Condor daemons will hold these dynamically-generated persistent configuration files (declared using PERSISTENT_CONFIG_DIR
, described below), and specifying which configuration variables can be changed using the SETTABLE_ATTRS...
family of configuration options (described below). This setting defaults to False
.
PERSISTENT_CONFIG_DIR
condor_config_val -set
) This directory should only be writable by root, or by the user the Condor daemons are running as (if non-root). There is no default; administrators who wish to use this functionality must create this directory and define this setting. This directory must not be shared by multiple MRG Grid installations, though it can be shared by all Condor daemons on the same host. Keep in mind that this directory should not be placed on an NFS mount where "root-squashing" is in effect, or Condor daemons running as root will not be able to write to it. A directory (only writable by root) on the local file system is usually the best location.
SETTABLE_ATTRS...
SETTABLE_ATTRS
or SUBSYSTEM
_SETTABLE_ATTRS
are settings used to restrict the configuration values that can be changed using the condor_config_val
command.
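As a sketch, following the same pattern used for concurrency limits earlier in this guide, the following lets the administrator host change only the startd's START expression at run time (illustrative):
ENABLE_RUNTIME_CONFIG = TRUE
ALLOW_CONFIG = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_CONFIG = START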
SHUTDOWN_GRACEFUL_TIMEOUT
SUBSYSTEM
_ADDRESS_FILE
condor_collector
(which listens on a well-known port) to find the address of a given daemon on a given machine. When tools and daemons are all executing on the same single machine, communications do not require a query of the condor_collector
daemon. Instead, they look in a file on the local disk to find the IP/port. This macro causes daemons to write the IP/port of their command socket to a specified file. In this way, local tools will continue to operate, even if the machine running the condor_collector
crashes. Using this file will also generate slightly less network traffic in the pool, since tools including condor_q
and condor_rm
do not need to send any messages over the network to locate the condor_schedd
daemon. This macro is not necessary for the condor_collector
daemon, since its command socket is at a well-known port.
SUBSYSTEM
with the appropriate subsystem string.
SUBSYSTEM
_DAEMON_AD_FILE
condor_collector
, it will also place a copy of the ClassAd in this file. Currently, this setting only works for the condor_schedd
(that is SCHEDD_DAEMON_AD_FILE
).
SUBSYSTEM
_ATTRS
SUBSYSTEM
with the appropriate subsystem string.
condor_kbdd
does not send ClassAds, so this entry does not affect it. The condor_startd
, condor_schedd
, condor_master
and condor_collector
do send ClassAds, so those would be valid subsystems to set this entry for.
condor_startd
is to advertise a string macro, a numeric macro, and a boolean expression, do something similar to:
STRING = This is a string
NUMBER = 666
BOOL1 = True
BOOL2 = CurrentTime >= $(NUMBER) || $(BOOL1)
MY_STRING = "$(STRING)"
STARTD_ATTRS = MY_STRING, NUMBER, BOOL1, BOOL2
DAEMON_SHUTDOWN
condor_collector
, it will evaluate this expression. If it evaluates to True
, the daemon will gracefully shut itself down, exit with the exit code 99, and will not be restarted by the condor_master
(as if it sent itself a condor_off
command). The expression is evaluated in the context of the ClassAd that is being sent to the condor_collector
, so it can reference any attributes that can be seen with condor_status -long [-daemon_type]
command (for example; condor_status -long [-master]
for the condor_master
). Since each daemon's ClassAd will contain different attributes, administrators should define these shutdown expressions specific to each daemon. For example:
STARTD.DAEMON_SHUTDOWN = when to shutdown the startd
MASTER.DAEMON_SHUTDOWN = when to shutdown the master
FALSE
.
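As an illustration only, a startd could be told to shut itself down once it has sat unclaimed and idle for a full day:
STARTD.DAEMON_SHUTDOWN = (State == "Unclaimed") && (Activity == "Idle") && \
   (MyCurrentTime - EnteredCurrentActivity) > 86400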
DAEMON_SHUTDOWN_FAST
DAEMON_SHUTDOWN
(defined above), except the daemon will use the fast shutdown mode (as if it sent itself a condor_off
command using the -fast
option).
USE_CLONE_TO_CREATE_PROCESSES
True
(the default value), the clone
system call is used. Otherwise, the fork
system call is used. clone
provides scalability improvements for daemons using a large amount of memory (e.g. a condor_schedd
with a lot of jobs in the queue). Currently, the use of clone
is available on Linux systems other than IA-64, but not when GCB is enabled.
NOT_RESPONDING_TIMEOUT
SUBSYSTEM
_NOT_RESPONDING_TIMEOUT
NOT_RESPONDING_TIMEOUT
, but controls the timeout for a specific type of daemon. For example, SCHEDD_NOT_RESPONDING_TIMEOUT
controls how long the condor_schedd
's parent daemon will wait without receiving a message from the condor_schedd
before killing it.
NOT_RESPONDING_WANT_CORE
NOT_RESPONDING_WANT_CORE
is true, the parent will send a SIGABRT
instead of SIGKILL
to the child process. If the child process is configured with CREATE_CORE_FILES
enabled, the child process will then generate a core dump. The parent will follow up the SIGABRT
with a SIGKILL
after 600 seconds (not configurable) in case the child does not exit in response to the SIGABRT
.
LOCK_FILE_UPDATE_INTERVAL
tmpwatch
, from deleting long lived lock files. If set to a value less than 60, the update time will be 60 seconds. The default value is 28800, which is 8 hours. This variable only takes effect at the start or restart of a daemon.
BIND_ALL_INTERFACES
False
, network sockets will only bind to the IP address specified with NETWORK_INTERFACE
(described below). If set to True
, the default value, MRG Grid will listen on all interfaces. However, currently MRG Grid is still only able to advertise a single IP address, even if it is listening on multiple interfaces. By default, it will advertise the IP address of the network interface used to contact the collector, since this is the most likely to be accessible to other processes which query information from the same collector.
CCB_ADDRESS
condor_collector
that will serve as this daemon's Condor Connection Broker (CCB). Multiple addresses may be listed (separated by commas and/or spaces) for redundancy. The CCB server must authorize this daemon at DAEMON level for this configuration to succeed. It is highly recommended to also configure PRIVATE_NETWORK_NAME
if you configure CCB_ADDRESS
so communications originating within the same private network do not need to go through CCB.
SUBSYSTEM
_MAX_FILE_DESCRIPTORS
MAX_FILE_DESCRIPTORS
, but it only applies to a specific subsystem. If the subsystem-specific setting is unspecified, MAX_FILE_DESCRIPTORS
is used.
MAX_FILE_DESCRIPTORS
NETWORK_INTERFACE
NETWORK_INTERFACE
should be set to the IP address to use. When BIND_ALL_INTERFACES
is set to True
(the default), this setting simply controls what IP address a given host will advertise.
PRIVATE_NETWORK_NAME
PRIVATE_NETWORK_NAME
, it will attempt to contact this daemon using the PrivateIpAddr
attribute from the classified ad. Even for sites using CCB or GCB, this is an important optimization, since it means that two daemons on the same network can communicate directly, without having to go through the broker.
PRIVATE_NETWORK_NAME
is defined, the PrivateIpAddr
will be defined automatically. Otherwise, you can specify a particular private IP address to use by defining the PRIVATE_NETWORK_INTERFACE
setting (described below). There is no default for this setting.
PRIVATE_NETWORK_INTERFACE
PRIVATE_NETWORK_NAME
are both defined, Condor daemons will advertise some additional attributes in their ClassAds to help other Condor daemons and tools in the same private network to communicate directly.
PRIVATE_NETWORK_INTERFACE
defines what IP address a given multi-homed machine should use for the private network. If another Condor daemon or tool is configured with the same PRIVATE_NETWORK_NAME
, it will attempt to contact this daemon using the IP address specified here.
PRIVATE_NETWORK_NAME
, and the PRIVATE_NETWORK_INTERFACE
will be defined automatically. Unless CCB/GCB is enabled, there is no default for this setting.
HIGHPORT
HIGHPORT
and LOWPORT
(given below) must be defined.
LOWPORT
HIGHPORT
(given above) and LOWPORT
must be defined.
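For example, to restrict MRG Grid to a hypothetical range of ports between 9600 and 9700:
LOWPORT = 9600
HIGHPORT = 9700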
IN_LOWPORT
IN_LOWPORT
and IN_HIGHPORT
. A range of port numbers less than 1024 may be used for daemons running as root. Do not specify IN_LOWPORT
in combination with IN_HIGHPORT
such that the range crosses the port 1024 boundary. Applies only to Unix machine configuration. Use of IN_LOWPORT
and IN_HIGHPORT
overrides any definition of LOWPORT
and HIGHPORT
.
IN_HIGHPORT
IN_LOWPORT
and IN_HIGHPORT
. A range of port numbers less than 1024 may be used for daemons running as root. Do not specify IN_LOWPORT
in combination with IN_HIGHPORT
such that the range crosses the port 1024 boundary. Applies only to Unix machine configuration. Use of IN_LOWPORT
and IN_HIGHPORT
overrides any definition of LOWPORT
and HIGHPORT
.
OUT_LOWPORT
OUT_LOWPORT
and OUT_HIGHPORT
. A range of port numbers less than 1024 is inappropriate, as not all daemons and tools will be run as root. Applies only to Unix machine configuration. Use of OUT_LOWPORT
and OUT_HIGHPORT
overrides any definition of LOWPORT
and HIGHPORT
.
OUT_HIGHPORT
OUT_LOWPORT
and OUT_HIGHPORT
. A range of port numbers less than 1024 is inappropriate, as not all daemons and tools will be run as root. Applies only to Unix machine configuration. Use of OUT_LOWPORT
and OUT_HIGHPORT
overrides any definition of LOWPORT
and HIGHPORT
.
UPDATE_COLLECTOR_WITH_TCP
False
. If your site needs to use TCP connections to send ClassAd updates to your collector set to this to True
. At this time, this setting only affects the main condor_collector
for the site. If enabled, also define COLLECTOR_SOCKET_CACHE_SIZE
at the central manager, so that the collector will accept TCP connections for updates, and will keep them open for reuse. For large pools, it is also necessary to ensure that the collector has a high enough file descriptor limit (e.g. using MAX_FILE_DESCRIPTORS
).
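A sketch of a central manager configured for TCP updates (the cache size is illustrative and should be at least as large as the number of machines updating the collector):
UPDATE_COLLECTOR_WITH_TCP = True
COLLECTOR_SOCKET_CACHE_SIZE = 1000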
TCP_UPDATE_COLLECTORS
SUBSYSTEM
_TIMEOUT_MULTIPLIER
NONBLOCKING_COLLECTOR_UPDATE
True
. When True
, the establishment of TCP connections to the condor_collector
daemon for a security-enabled pool are done in a nonblocking manner.
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT
True
. When True
, the establishment of TCP connections from the condor_negotiator
daemon to the condor_startd
daemon for a security-enabled pool are done in a nonblocking manner.
NET_REMAP_ENABLE
True
, enables a network remapping service. The service to use is controlled by NET_REMAP_SERVICE
. This boolean value defaults to False
.
NET_REMAP_SERVICE
NET_REMAP_ENABLE
is defined to True
, this setting controls what network remapping service should be used. Currently, the only value supported is GCB. The default is undefined.
NET_REMAP_INAGENT
condor_master
chooses one at random from among the working brokers in the list. There is no default if not defined.
NET_REMAP_ROUTE
MASTER_WAITS_FOR_GCB_BROKER
condor_master
with GCB enabled. It defaults to True
.
MASTER_WAITS_FOR_GCB_BROKER
is True
; if there is no GCB broker working when the condor_master
starts, or if communications with a GCB broker fail, the condor_master
waits while attempting to find a working GCB broker.
MASTER_WAITS_FOR_GCB_BROKER
is False
; if no GCB broker is working when the condor_master
starts the condor_master
fails and exits without restarting. If the condor_master
has successfully communicated with a GCB broker at start-up but the communication fails, the condor_master
kills all its children, exits, and restarts.
condor_vm-gahp
.
VM_GAHP_SERVER
condor_vm-gahp
. There is no default value for this required configuration variable.
VM_GAHP_LOG
condor_vm-gahp
log. If not specified on a Unix platform, the condor_starter
log will be used for condor_vm-gahp
log items. There is no default value for this required configuration variable on Windows platforms.
MAX_VM_GAHP_LOG
condor_vm-gahp
log will be allowed to grow.
VM_TYPE
kvm
or xen
. There is no default value for this required configuration variable.
VM_MEMORY
VM_MAX_NUMBER
NUM_CPUS
. When it evaluates to Undefined
, as is the case when not defined with a numeric value, no meaningful limit is imposed.
VM_STATUS_INTERVAL
condor_starter
to see if the job has finished. The default value is 60
seconds and a minimum value of 30
seconds is enforced.
VM_GAHP_REQ_TIMEOUT
condor_starter
to the condor_vm-gahp
to be completed. When a command times out, an error is reported to the condor_starter
. Defaults to 300
(five minutes).
VM_RECHECK_INTERVAL
condor_startd
waits after a virtual machine error is reported by the condor_starter
, and before checking a final time on the status of the virtual machine. If the check fails, Condor disables starting any new vm universe jobs by removing the VM_Type
attribute from the machine ClassAd. Default to 600
(ten minutes).
VM_SOFT_SUSPEND
False
, causing Condor to free the memory of a vm universe job when the job is suspended. When True
, the memory is not freed.
VM_UNIV_NOBODY_USER
ALWAYS_VM_UNIV_USE_NOBODY
False
. When True
, all vm universe jobs (independent of their UID domain) will run as the user defined in VM_UNIV_NOBODY_USER
.
VM_NETWORKING
False
.
VM_NETWORKING_TYPE
VM_NETWORKING
is True
. Defined strings are:
bridge
nat
nat, bridge
VM_NETWORKING_DEFAULT_TYPE
VM_NETWORKING_TYPE
, this optional configuration variable identifies which to use. Therefore, for
VM_NETWORKING_TYPE = nat, bridge
nat
or bridge
. Where multiple networking types are given in VM_NETWORKING_TYPE
, and this variable is not defined, a default of nat
is used.
VM_SCRIPT
VM_NETWORKING_BRIDGE_INTERFACE
XEN_BOOTLOADER
XEN_LOCAL_SETTINGS_FILE
vm
universe.
VMP_HOST_MACHINE
VMP_VM_LIST
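A minimal sketch of a vm universe execute node configured for KVM, using the variables described above (all paths and values are illustrative):
VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
VM_GAHP_LOG = $(LOG)/VMGahpLog
VM_TYPE = kvm
VM_MEMORY = 1024
VM_MAX_NUMBER = 2
VM_NETWORKING = TRUE
VM_NETWORKING_TYPE = nat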
condor_master
Configuration File Macros DAEMON_LIST
condor_master
will start and monitor. The list is a comma or space separated list of subsystem names. For example:
DAEMON_LIST = MASTER, STARTD, SCHEDD
DC_DAEMON_LIST
DAEMON_LIST
which use the Condor DaemonCore library. The condor_master
must differentiate between daemons that use DaemonCore and those that do not, so it uses the appropriate inter-process communication mechanisms.
DC_DAEMON_LIST
value by placing the plus character (+) before the first entry in the DC_DAEMON_LIST
definition. For example:
DC_DAEMON_LIST = +NEW_DAEMON
SUBSYSTEM
condor_master
to start, you must provide it with the full path to each of these binaries. For example:
MASTER = $(SBIN)/condor_master
STARTD = $(SBIN)/condor_startd
SCHEDD = $(SBIN)/condor_schedd
$(SBIN)
macro.
SUBSYSTEM
with the appropriate subsystem string as defined in previous sections.
DAEMONNAME
_ENVIRONMENT
DAEMON_LIST
, you may specify changes to the environment that daemon is started with by setting DAEMONNAME
_ENVIRONMENT
, where DAEMONNAME
is the name of a daemon listed in DAEMON_LIST
. It should use the same syntax for specifying the environment as the environment specification in a condor_submit
file. For example, if you wish to redefine the TMP
and CONDOR_CONFIG
environment variables seen by the condor_schedd
, you could place the following in the config file:
SCHEDD_ENVIRONMENT = "TMP=/new/value CONDOR_CONFIG=/special/config"
condor_schedd
was started by the condor_master
, it would see the specified values of TMP
and CONDOR_CONFIG
.
SUBSYSTEM_ARGS
condor_master
. List the desired arguments using the same syntax as the arguments specification in a condor_submit
submit file, with one exception: do not escape double-quotes when using the old-style syntax (this is for backward compatibility). Set the arguments for a specific daemon with this macro, and the macro will affect only that daemon. Define one of these for each daemon the condor_master
is controlling. For example, set $(STARTD_ARGS)
to specify any extra command line arguments to the condor_startd
.
SUBSYSTEM
with the appropriate subsystem string.
PREEN
DAEMON_LIST
, the condor_master
also starts up a special process called condor_preen
to clean out junk files that have been left behind. This macro determines where the condor_master
finds the condor_preen
binary. This macro can be commented out to prevent condor_preen
from running.
PREEN_ARGS
condor_preen
behaves by allowing the specification of command-line arguments. This macro works as SUBSYSTEM
_ARGS
does. The difference is that you must specify this macro for condor_preen
if you want it to do anything. condor_preen
takes action only because of command line arguments. The -m
switch will instruct MRG Grid to send e-mail about files that should be removed. -r
means you want condor_preen
to actually remove these files.
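For example, to have condor_preen both send mail about and remove the files it finds, using the two switches just described:
PREEN_ARGS = -m -r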
PREEN_INTERVAL
condor_preen
should be started. It is defined in terms of seconds and defaults to 86400
(once a day).
PUBLISH_OBITUARIES
condor_master
can send e-mail to the address specified by CONDOR_ADMIN
with an obituary letting the administrator know that the daemon died, the cause of death (which signal or exit status it exited with), and (optionally) the last few entries from that daemon's log file. If you want obituaries, set this macro to True
.
OBITUARY_LOG_LENGTH
20
lines.
START_MASTER
False
the condor_master
will exit as soon as it starts. This setting is useful if the boot scripts for your entire pool are centralized but you do not want MRG Grid to run on certain machines. This entry is most effectively used in a file in the local configuration directory, not a global configuration file.
START_DAEMONS
START_MASTER
macro described above. This macro, however, does not force the condor_master
to exit; instead preventing it from starting any of the daemons listed in the DAEMON_LIST
. The daemons may be started later with a condor_on
command.
MASTER_UPDATE_INTERVAL
condor_master
sends a ClassAd update to the condor_collector
. It is defined in seconds and defaults to 300
(every 5 minutes).
MASTER_CHECK_NEW_EXEC_INTERVAL
condor_master
checks the timestamps of the running daemons. If any daemons have been modified, the master restarts them. It is defined in seconds and defaults to 300
(every 5 minutes).
MASTER_NEW_BINARY_DELAY
condor_master
has discovered a new binary, this macro controls how long it waits before attempting to execute it. This delay exists because the condor_master
might notice a new binary while it is in the process of being copied, in which case trying to execute it yields unpredictable results. The entry is defined in seconds and defaults to 120
(2 minutes).
SHUTDOWN_FAST_TIMEOUT
condor_master
kills them outright. It is defined in seconds and defaults to 300
(5 minutes).
MASTER_BACKOFF_CONSTANT
and MASTER_name
_BACKOFF_CONSTANT
condor_master
uses an exponential back off delay before restarting it (see the "Backoff Delays" section below for details on how these parameters work together). These settings define the constant value of the expression used to determine how long to wait before starting the daemon again (and, effectively becomes the initial backoff time). It is an integer in units of seconds, and defaults to 9
seconds.
$(MASTER_name
_BACKOFF_CONSTANT)
is the daemon-specific form of MASTER_BACKOFF_CONSTANT
; if this daemon-specific macro is not defined for a specific daemon, the non-daemon-specific value will be used.
MASTER_BACKOFF_FACTOR
and MASTER_name
_BACKOFF_FACTOR
condor_master
uses an exponential back off delay before restarting it; (see the "Backoff Delays" section below for details on how these parameters work together). This setting is the base of the exponent used to determine how long to wait before starting the daemon again. It defaults to 2
seconds.
$(MASTER_name
_BACKOFF_FACTOR)
is the daemon-specific form of MASTER_BACKOFF_FACTOR
; if this daemon-specific macro is not defined for a specific daemon, the non-daemon-specific value will be used.
MASTER_BACKOFF_CEILING
and MASTER_name
_BACKOFF_CEILING
condor_master
uses an exponential back off delay before restarting it; (see the "Backoff Delays" section below for details on how these parameters work together). This entry determines the maximum amount of time you want the master to wait between attempts to start a given daemon. (With 2.0 as the $(MASTER_BACKOFF_FACTOR)
, 1 hour is obtained in 12 restarts). It is defined in terms of seconds and defaults to 3600
(1 hour).
$(MASTER_name
_BACKOFF_CEILING)
is the daemon-specific form of MASTER_BACKOFF_CEILING
; if this daemon-specific macro is not defined for a specific daemon, the non-daemon-specific value will be used.
MASTER_RECOVER_FACTOR
and MASTER_name
_RECOVER_FACTOR
300
(5 minutes).
$(MASTER_name
_RECOVER_FACTOR)
is the daemon-specific form of MASTER_RECOVER_FACTOR
; if this daemon-specific macro is not defined for a specific daemon, the non-daemon-specific value will be used.
SUBSYSTEM
_USERID
condor_master
will restart the daemon after a delay (a back off). The length of this delay is based on how many times it has been restarted, and gets larger after each crash. The equation for calculating this backoff time is given by:
t = c + k^n
t
is the calculated time, c
is the constant defined by MASTER_BACKOFF_CONSTANT
, k
is the factor defined by MASTER_BACKOFF_FACTOR
, and n
is the number of restarts already attempted (0
for the first restart, 1
for the next, etc.).
MASTER_BACKOFF_FACTOR
(which defaults to 2
) to the power of the number of times the daemon has restarted, and add MASTER_BACKOFF_CONSTANT
(which defaults to 9
). Thus:
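With the default values (c = 9, k = 2), the delays work out as follows:
1st restart (n = 0): 9 + 2^0 = 10 seconds
2nd restart (n = 1): 9 + 2^1 = 11 seconds
3rd restart (n = 2): 9 + 2^2 = 13 seconds
...
10th restart (n = 9): 9 + 2^9 = 521 seconds
13th restart (n = 12): 9 + 2^12 = 4105 seconds
This delay is bounded by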
MASTER_BACKOFF_CEILING
, which defaults to 3600
, so the daemon would be restarted after only 3600 seconds, not 4105. The condor_master
tries again every hour (since the numbers would get larger and would always be capped by the ceiling). Should the daemon stay alive for the time set in MASTER_RECOVER_FACTOR
(defaults to 5 minutes), the count of how many restarts this daemon has performed is reset to 0
.
MASTER_NAME
condor_master
daemon on a machine. For a condor_master
running as root, it defaults to the fully qualified host name. When not running as root, it defaults to the user that instantiates the condor_master
, concatenated with an at symbol (@), concatenated with the fully qualified host name. If more than one condor_master
is running on the same host, then the MASTER_NAME
for each condor_master
must be defined to uniquely identify the separate daemons.
MASTER_NAME
is presumed to be of the form identifying-string@full.host.name
. If the string does not include an @ sign, one will be appended, followed by the fully qualified host name of the local machine. The identifying-string portion may contain any alphanumeric ASCII characters or punctuation marks, except the @ sign. We recommend that the string does not contain the : (colon) character, since that might cause problems with certain tools. The string will not be modified if it contains an @ sign. This is useful for remote job submissions under the high availability of the job queue.
If the MASTER_NAME
setting is used, and the condor_master
is configured to spawn a condor_schedd
, the name defined with MASTER_NAME
takes precedence over the SCHEDD_NAME
setting. Since the assumption is that there is only one instance of the condor_startd
running on a machine, the MASTER_NAME
is not automatically propagated to the condor_startd
. However, in situations where multiple condor_startd
daemons are running on the same host, the STARTD_NAME
should be set to uniquely identify the condor_startd
daemons.
Once a daemon (master, schedd or startd
) has been given a unique name, all tools that need to contact that daemon can be told what name to use via the -name
command-line option.
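For example, to query a specifically named condor_schedd (a sketch; the daemon name is illustrative):
condor_q -name juggler@submit1.example.com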
MASTER_ATTRS
This macro is described in SUBSYSTEM_ATTRS.
MASTER_DEBUG
This macro is described in SUBSYSTEM_DEBUG.
MASTER_ADDRESS_FILE
This macro is described in SUBSYSTEM_ADDRESS_FILE.
ALLOW_ADMIN_COMMANDS
If set to NO
for a given host, this macro disables administrative commands, such as condor_restart
, condor_on
and condor_off
, to that host.
MASTER_INSTANCE_LOCK
Defines the name of a file for the condor_master
daemon to lock in order to prevent multiple condor_masters
from starting. This is useful when using shared file systems like NFS which do not technically support locking in the case where the lock files reside on a local disk. If this macro is not defined, the default file name will be LOCK
/InstanceLock
. LOCK
can instead be defined to specify the location of all lock files, not just the condor_master
's InstanceLock
. If LOCK
is undefined, then the master log itself is locked.
ADD_WINDOWS_FIREWALL_EXCEPTION
A boolean value; when set to False, the condor_master
will not automatically add MRG Grid to the Windows Firewall list of trusted applications. Such trusted applications can accept incoming connections without interference from the firewall. This only affects machines running Windows XP SP2 or higher. The default is True
.
WINDOWS_FIREWALL_FAILURE_RETRY
An integer value (default 60
) that represents the number of times the condor_master
will retry to add firewall exceptions. When a Windows machine boots up, MRG Grid starts up by default as well. Under certain conditions, the condor_master
may have difficulty adding exceptions to the Windows Firewall because of a delay in other services starting up. Examples of services that may possibly be slow are the SharedAccess service, the Netman service, or the Workstation service. This configuration variable allows administrators to set the number of times (once every 10 seconds) that the condor_master
will retry to add firewall exceptions. A value of 0
means that it will retry indefinitely.
USE_PROCESS_GROUPS
A boolean value that defaults to True
. When False
, Condor daemons on UNIX machines will not create new sessions or process groups. Process groups help to track the descendants of processes that have been created. This can cause problems when MRG Grid is run under another job execution system.
condor_startd
Configuration File Macros START
A boolean expression that, when True
, indicates that the machine is willing to start running a job. START
is considered when the condor_negotiator
daemon is considering evicting the job to replace it with one that will generate a better rank for the condor_startd
daemon, or a user with a higher priority.
SUSPEND
A boolean expression that, when True
, causes a running job to be suspended. The machine may still be claimed, but the job makes no further progress, and no load is generated on the machine.
PREEMPT
A boolean expression that, when True
, causes a currently running job to be stopped.
WANT_HOLD
A boolean expression that defaults to False. When True and the machine policy determines that a running job should be preempted, the job is instead put on hold. For example, the following policy puts a job on hold when it uses too much virtual memory:
VIRTUAL_MEMORY_AVAILABLE_MB = (VirtualMemory*0.9)
MEMORY_EXCEEDED = ImageSize/1024 > $(VIRTUAL_MEMORY_AVAILABLE_MB)
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
WANT_HOLD_REASON = \
   ifThenElse( $(MEMORY_EXCEEDED), \
               "Your job used too much virtual memory.", \
               undefined )
WANT_HOLD_REASON
An expression that defines the string used to set the job's HoldReason attribute when a job is put on hold due to WANT_HOLD
. If not defined or if the expression evaluates to Undefined, a default hold reason is provided.
WANT_HOLD_SUB_CODE
An integer expression that specifies the value of the job ClassAd attribute HoldReasonSubCode
when a job is put on hold due to WANT_HOLD
. If not defined or if the expression evaluates to Undefined
, the value is set to 0
. Note that HoldReasonCode
is always set to 21
.
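Continuing the memory example above, a site might tag such holds with a custom sub-code (a sketch; the sub-code value 102 is arbitrary):
WANT_HOLD_SUB_CODE = ifThenElse( $(MEMORY_EXCEEDED), 102, undefined )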
CONTINUE
A boolean expression that, when True, causes a previously suspended job to continue executing.
KILL
A boolean expression that, when True
, causes the execution of a currently running job to stop without delay.
RANK
RANK
.
WANT_SUSPEND
A boolean expression that, when True
, will evaluate the SUSPEND
expression.
WANT_VACATE
A boolean expression that, when True
, defines that a preempted job is to be vacated, instead of killed.
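A minimal policy sketch combining these expressions, reusing the helper macros (MINUTE, KeyboardBusy, CPUIdle, ActivityTimer, MaxSuspendTime, ContinueIdleTime) defined in the default global configuration file shown in Example A.1; the thresholds are illustrative, not a recommended policy:
WANT_SUSPEND = $(KeyboardBusy)
SUSPEND = ( $(KeyboardBusy) && (CpuBusyTime > 2 * $(MINUTE)) )
CONTINUE = ( $(CPUIdle) && (KeyboardIdle > $(ContinueIdleTime)) )
WANT_VACATE = True
PREEMPT = ( (Activity == "Suspended") && ($(ActivityTimer) > $(MaxSuspendTime)) )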
IS_OWNER
A boolean expression that defaults to:
IS_OWNER = (START =?= FALSE)
Job ClassAd attributes should not be used in defining IS_OWNER
, as they would be Undefined
.
STARTER
This macro holds the full path to the condor_starter
binary that the condor_startd
should spawn. It is normally defined relative to $(SBIN)
.
POLLING_INTERVAL
When the condor_startd
enters the claimed state, this macro determines how often the state of the machine is polled to check the need to suspend, resume, vacate or kill the job. It is defined in terms of seconds and defaults to 5
.
UPDATE_INTERVAL
Determines how often the condor_startd
should send a ClassAd update to the condor_collector
. The condor_startd
also sends an update on any state or activity change, or if the value of its START
expression changes. This macro is defined in terms of seconds and defaults to 300
(5 minutes).
UPDATE_OFFSET
Determines how long (in seconds) the condor_startd
should wait before sending its initial update, and the first update after a condor_reconfig
command is sent to the condor_collector
. The time of all other updates sent after this initial update is determined by UPDATE_INTERVAL
. Thus, the first update will be sent after UPDATE_OFFSET
seconds, and the second update will be sent after UPDATE_OFFSET
+ UPDATE_INTERVAL
. This is useful when used in conjunction with the RANDOM_INTEGER
macro for large pools, to spread out the updates sent by a large number of condor_startd
daemons. Defaults to 0
.
startd.UPDATE_INTERVAL = 300
startd.UPDATE_OFFSET = $RANDOM_INTEGER(0,300)
This example causes the initial update to be sent after a random delay of between 0
and 300
, with all further updates occurring at fixed 300
second intervals following the initial update.
MAXJOBRETIREMENTTIME
An integer expression giving the maximum job retirement time in seconds. The default value of 0
(when the configuration variable is not present) implements the expected policy that there is no retirement time.
CLAIM_WORKLIFE
Once the negotiator gives a schedd a claim to a slot, the schedd will, by default, keep running jobs on that slot (as long as it has jobs with matching requirements) without returning the slot to the unclaimed state and renegotiating for machines. CLAIM_WORKLIFE can be used to force the claim to stop running additional jobs after a certain amount of time (in seconds). Once CLAIM_WORKLIFE
expires, any existing job may continue to run as usual, but once it finishes or is preempted, the claim is closed.
The default value for CLAIM_WORKLIFE
is -1
, which is treated as an infinite claim worklife so claims may be held indefinitely (as long as they are not preempted and the schedd
does not relinquish them). A value of 0 has the effect of not allowing more than one job to run per claim, since it immediately expires after the first job starts running.
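For example, to close each claim after 20 minutes of reuse (a sketch; the value is illustrative):
CLAIM_WORKLIFE = 1200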
MAX_CLAIM_ALIVES_MISSED
The number of keep alive messages that can be missed by the condor_startd
before it considers a resource claim by a condor_schedd
no longer valid. The default is 6
.
The condor_schedd
sends periodic keep alive updates to each condor_startd
. If the condor_startd
does not receive any keep alive messages it assumes that something has gone wrong with the condor_schedd
and that the resource is not being effectively used. Once this happens the condor_startd
considers the claim to have timed out. It releases the claim and starts advertising itself as available for other jobs. As keep alive messages are sent via UDP and are sometimes dropped by the network, the condor_startd
has some tolerance for missed keep alive messages. If a few keep alive messages are not received, the condor_startd
will not immediately release the claim. This macro sets the number of missed messages that will be tolerated.
STARTD_HAS_BAD_UTMP
When the condor_startd
is computing the idle time of all the users of the machine (both local and remote), it checks the utmp
file to find all the currently active ttys, and only checks access time of the devices associated with active logins. Unfortunately, on some systems, utmp
is unreliable, and the condor_startd
might miss keyboard activity by doing this. So, if your utmp
is unreliable, set this macro to True
and the condor_startd
will check the access time on all tty and pty devices.
CONSOLE_DEVICES
This macro allows the condor_startd
to monitor console (keyboard and mouse) activity by checking the access times on special files in /dev. Activity on these files shows up as ConsoleIdle
time in the condor_startd
's ClassAd. Give a comma-separated list of the names of devices considered the console, without the /dev/ portion of the path name. The defaults vary from platform to platform, and are usually correct.
On Linux, the default includes mouse, since most installations have a soft link from /dev/mouse pointing to the actual device (such as /dev/psaux
for a PS/2 bus mouse, or /dev/tty00
for a serial mouse connected to com1). However, if your installation does not have this soft link, you will need to either add it or change this macro to point to the right device.
STARTD_JOB_EXPRS
When a machine is claimed, the condor_startd
can also advertise arbitrary attributes from the job ClassAd in the machine ClassAd. List the attribute names to be advertised.
STARTD_SENDS_ALIVES
A boolean value that defaults to False
, such that the condor_schedd
daemon sends keep alive signals to the condor_startd
daemon. When True
, the condor_startd
daemon sends keep alive signals to the condor_schedd
daemon, reversing the direction. This may be useful if the condor_startd
daemon is on a private network or behind a firewall.
STARTD_SHOULD_WRITE_CLAIM_ID_FILE
The condor_startd
can be configured to write out the ClaimId for the next available claim on all slots to separate files. This boolean attribute controls whether the condor_startd
should write these files. The default value is True
.
STARTD_CLAIM_ID_FILE
This macro controls the file name used when STARTD_SHOULD_WRITE_CLAIM_ID_FILE
is True
. By default, the ClaimId will be written into a file in the LOG
directory called .startd_claim_id.slotX
, where X
is the value of SlotID
, the integer that identifies a given slot on the system, or 1 on a single-slot machine. If you define your own value for this setting, you should provide a full path, and the .slotX
portion of the file name will be automatically appended.
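For example (a sketch; the path is hypothetical):
STARTD_CLAIM_ID_FILE = /var/lib/condor/startd_claim_id
With this setting, the file for the first slot would be /var/lib/condor/startd_claim_id.slot1.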
NUM_CPUS
An integer value that tells the condor_startd
daemon about how many CPUs a machine has. When set, it overrides automatic detection of CPUs.
If NUM_CPUS is used to advertise more CPUs than the machine physically has, consider using the STARTD_ATTRS
setting to advertise the fact in the machine's ClassAd. This will allow jobs submitted in the pool to specify if they do not want to be matched with machines that are only offering these fractional CPUs.
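A minimal sketch of such a configuration, using a hypothetical custom attribute named HasFractionalCpus that jobs can test in their requirements (the CPU count and attribute name are illustrative):
NUM_CPUS = 8
HasFractionalCpus = True
# Add HasFractionalCpus to any existing STARTD_ATTRS list in your configuration
STARTD_ATTRS = HasFractionalCpus
A job that wants to avoid such machines could then add (HasFractionalCpus =!= True) to its requirements expression.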
NUM_CPUS cannot be changed with a simple reconfiguration by sending a SIGHUP
or by using the condor_reconfig
command. To change this macro you must restart the condor_startd
daemon. The command is:
condor_restart -subsystem startd
MAX_NUM_CPUS
An integer value used as a ceiling for the number of CPUs detected on a machine. This value is ignored if NUM_CPUS
is set. If set to zero, there is no ceiling. If not defined, the default value is zero, and thus there is no ceiling.
Like NUM_CPUS, this setting cannot be changed with a simple reconfiguration by sending a SIGHUP
or by using the condor_reconfig
command. To change this, restart the condor_startd
daemon for the change to take effect. The command will be:
condor_restart -startd
COUNT_HYPERTHREAD_CPUS
This macro controls how hyper-threaded processors are counted. When True
(the default), it includes virtual CPUs in the default value of NUM_CPUS
. On dedicated cluster nodes, counting virtual CPUs can sometimes improve total throughput at the expense of individual job speed. However, counting them on desktop workstations can interfere with interactive job performance.
MEMORY
The amount of memory is normally detected automatically. Set MEMORY
to state how much physical memory (in MB) your machine has, overriding the automatic value.
RESERVED_MEMORY
The amount of memory (in MB) to withhold from MRG Grid. If RESERVED_MEMORY
is defined, this value is subtracted from the amount of memory advertised as available.
STARTD_NAME
Used to give an alternative value to the Name
attribute in the condor_startd
's ClassAd. This esoteric configuration macro might be used in the situation where there are two condor_startd
daemons running on one machine, and each reports to the same condor_collector
. Different names will distinguish the two daemons. See the description of MASTER_NAME
in section Section A.9, “condor_master
Configuration File Macros ” for defaults and composition of valid Condor daemon names.
RUNBENCHMARKS
True
. If RunBenchmarks
is specified and set to anything other than False
, additional benchmarks will be run when the condor_startd
initially starts. To disable start up benchmarks, set RunBenchmarks
to False
, or comment it out of the configuration file.
DedicatedScheduler
STARTD_RESOURCE_PREFIX
By default, the prefix is slot
. This setting enables sites to define what string the condor_startd
will use to name the individual resources on an SMP machine if they prefer to use something other than slot
.
SLOTS_CONNECTED_TO_CONSOLE
An integer indicating how many of the machine slots the condor_startd
is representing should be "connected" to the console (that is slots that notice when there is console activity). This defaults to all slots (N
in a machine with N
CPUs).
SLOTS_CONNECTED_TO_KEYBOARD
An integer indicating how many of the machine slots the condor_startd
is representing should be "connected" to the keyboard (for remote tty activity, as well as console activity). Defaults to 1
.
DISCONNECTED_KEYBOARD_IDLE_BOOST
For slots not connected to the keyboard or console, the keyboard idle time reported will be the time since the condor_startd
was spawned plus the value of this macro. It defaults to 1200
seconds (20 minutes).
This ensures that such slots are available to start running jobs as soon as the condor_startd
starts up (if the slot is configured to ignore keyboard activity), instead of having to wait for 15 minutes (which is the default time a machine must be idle before a job will start) or more.
If you do not want this boost, set the value to 0
. Increase this macro's value if you change your START
expression to require more than 15 minutes before a job starts, but you still want jobs to start right away on some of your SMP nodes.
STARTD_SLOT_ATTRS
The list of ClassAd attribute names that should be shared across all slots on the same machine. For example, if the configuration contains:
STARTD_SLOT_ATTRS = State, Activity, EnteredCurrentActivity
then the machine ClassAd for each slot will contain attributes of the form:
slot1_State = "Claimed"
slot1_Activity = "Busy"
slot1_EnteredCurrentActivity = 1075249233
slot2_State = "Unclaimed"
slot2_Activity = "Idle"
slot2_EnteredCurrentActivity = 1075240035
MAX_SLOT_TYPES
The maximum number of different slot types. It defaults to 10
(you should only need to change this setting if you define more than 10 separate slot types).
SLOT_TYPE_N
Defines a slot type by specifying what portion of each shared system resource (such as RAM and swap space) that type of slot receives. This setting has no effect unless a corresponding NUM_SLOTS_TYPE_N is also defined. N
can be any integer from 1
to the value of MAX_SLOT_TYPES
, such as SLOT_TYPE_1
.
SLOT_TYPE_N
_PARTITIONABLE
A boolean value that defaults to False
. When set to True
, this slot permits dynamic slots.
NUM_SLOTS_TYPE_N
NUM_SLOTS
NUM_CPUS
.
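A minimal sketch of defining two slot types with SLOT_TYPE_N and NUM_SLOTS_TYPE_N on a 4-CPU machine (the proportions are illustrative):
SLOT_TYPE_1 = cpus=1, ram=25%, swap=25%, disk=25%
NUM_SLOTS_TYPE_1 = 2
SLOT_TYPE_2 = cpus=2, ram=50%, swap=50%, disk=50%
NUM_SLOTS_TYPE_2 = 1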
ALLOW_VM_CRUFT
A boolean value that defaults to True
. When True
, MRG Grid looks for configuration variables named with the previously used string VM after searching unsuccessfully for variables named with the currently used string SLOT
. When False
, it does not look for variables named with the previously used string VM after searching unsuccessfully for the string SLOT
.
STARTD_CRON_NAME
Defines a logical name to be used in the naming of related configuration macros. For example, defining:
STARTD_CRON_NAME = HAWKEYE
causes the other STARTD_CRON_* configuration macros described below to instead use the string "HAWKEYE" in their name.
STARTD_CRON_CONFIG_VAL
Specifies the path and executable name of the condor_config_val
program which the modules (jobs) should use to get configuration information from the daemon. If this is provided, an environment variable by the same name with the same value will be passed to all modules.
If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_CONFIG_VAL
to $(STARTD_CRON_NAME)_CONFIG_VAL
. Example:
HAWKEYE_CONFIG_VAL = /usr/local/condor/bin/condor_config_val
STARTD_CRON_AUTOPUBLISH
An optional setting that determines whether the condor_startd
should automatically publish a new update to the condor_collector
after any of the cron modules produce output.
never
This default value causes the condor_startd
to not automatically publish updates based on any cron modules. Instead, updates rely on the usual behavior for sending updates, which is periodic, based on the UPDATE_INTERVAL
configuration setting, or whenever a given slot changes state.
always
Causes the condor_startd
to always send a new update to the condor_collector
whenever any module exits.
if_changed
Causes the condor_startd
to only send a new update to the condor_collector
if the output produced by a given module is different than the previous output of the same module. The only exception is the LastUpdate
attribute (automatically set for all cron modules to be the timestamp when the module last ran), which is ignored when STARTD_CRON_AUTOPUBLISH
is set to if_changed
.
STARTD_CRON_AUTOPUBLISH
does not honor the STARTD_CRON_NAME
setting described above. Even if STARTD_CRON_NAME
is defined, STARTD_CRON_AUTOPUBLISH
will have the same name.
STARTD_CRON_JOBLIST
This configuration variable is defined by a whitespace-separated list of the logical names of the modules (jobs) to run. If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_JOBLIST
to $(STARTD_CRON_NAME)_JOBLIST
.
STARTD_CRON_ModuleName
_PREFIX
Specifies a string that is prepended to the names of all attributes the module generates. For example, if the prefix is "xyz_", and an individual attribute is named "abc", the resulting attribute would be xyz_abc. Although the prefix can be quoted, it can contain only alphanumeric characters.
STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_PREFIX
to $(STARTD_CRON_NAME)_ModuleName
_PREFIX
.
STARTD_CRON_ModuleName
_EXECUTABLE
Specifies the full path to the executable to run for this module. If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_EXECUTABLE
to $(STARTD_CRON_NAME)_ModuleName
_EXECUTABLE
.
STARTD_CRON_ModuleName
_PERIOD
The period specifies how often the module runs. It may be given in seconds (append the value with the character 's
'), in minutes (append value with the character 'm
'), or in hours (append value with the character 'h
'). For example, 5m
starts the execution of the module every five minutes. If no character is appended to the value, seconds are used as a default. The minimum valid value of the period is 1 second.
STARTD_CRON_NAME
is defined, this configuration macro name is changed from STARTD_CRON_ModuleName
_PERIOD
to $(STARTD_CRON_NAME)_ModuleName
_PERIOD
.
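Pulling these macros together, a single monitoring module might be configured as follows (a sketch; the module name, path, and period are illustrative):
STARTD_CRON_JOBLIST = DISKUSAGE
STARTD_CRON_DISKUSAGE_PREFIX = disk_
STARTD_CRON_DISKUSAGE_EXECUTABLE = /usr/local/libexec/disk_usage.sh
STARTD_CRON_DISKUSAGE_PERIOD = 10m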
STARTD_CRON_ModuleName
_MODE
Specifies the mode in which the module runs. In the default mode, the module is expected to be started by the condor_startd
daemon, gather and publish its data, and then exit.
In the alternative WaitForExit mode, the condor_startd
daemon interprets the "period" differently. In this case, it refers to the amount of time to wait after the module exits before restarting it. With a value of 1
, the module is kept running nearly continuously.
STARTD_CRON_ModuleName
_RECONFIG
Controls whether the module is sent a SIGHUP signal when the condor_startd
daemon is reconfigured. The module is expected to reread its configuration at that time. A value of "True" enables this setting, and "False" disables it.
If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_RECONFIG
to:
$(STARTD_CRON_NAME)_ModuleName
_RECONFIG.
STARTD_CRON_ModuleName
_KILL
If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_KILL
to $(STARTD_CRON_NAME)_ModuleName
_KILL
.
This boolean setting controls the behavior of the condor_startd
when it detects that the module's executable is still running when it is time to start the module for a run. If enabled, the condor_startd
will kill and restart the process in this condition. If not enabled, the existing process is allowed to continue running.
STARTD_CRON_ModuleName
_ARGS
The command-line arguments to pass to the module when it is executed. If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_ARGS
to $(STARTD_CRON_NAME)_ModuleName
_ARGS
.
STARTD_CRON_ModuleName_ENV
The environment string to pass to the module. The syntax is the same as that of DAEMONNAME_ENVIRONMENT.
If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_ENV
to $(STARTD_CRON_NAME)_ModuleName
_ENV
.
STARTD_CRON_ModuleName
_CWD
The working directory in which to start the module. If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_CWD
to $(STARTD_CRON_NAME)_ModuleName
_CWD
.
STARTD_CRON_ModuleName
_OPTIONS
If STARTD_CRON_NAME
is defined, then this configuration macro name is changed from STARTD_CRON_ModuleName
_OPTIONS
to $(STARTD_CRON_NAME)_ModuleName
_OPTIONS
.
STARTD_CRON_JOBS
When the STARTD_CRON_NAME logical name is set to Hawkeye
, this is usually named HAWKEYE_JOBS
. This configuration variable is defined by a white space or newline separated list of jobs (called modules) to run, where each module is specified using the format:
modulename:prefix:executable:period[:options]
foo:foo_:"c:/some dir/foo.exe":10m
modulename
: The logical name of the module. This must be unique (no two modules may have the same name). See STARTD_CRON_JOBLIST
.
prefix
: See STARTD_CRON_ModuleName
_PREFIX
.
executable
: See STARTD_CRON_ModuleName
_EXECUTABLE
.
period
: See STARTD_CRON_ModuleName
_PERIOD
.
options: See STARTD_CRON_ModuleName_OPTIONS.
The individual option values are described under STARTD_CRON_ModuleName_OPTIONS above.
# Hawkeye Job Definitions
HAWKEYE_JOBS =\
    JOB1:prefix_:$(MODULES)/job1:5m:nokill\
    JOB2:prefix_:$(MODULES)/job1_co:1h
HAWKEYE_JOB1_ARGS =-foo -bar
HAWKEYE_JOB1_ENV = xyzzy=somevalue
HAWKEYE_JOB2_ENV = lwpi=somevalue
# Hawkeye Job Definitions
HAWKEYE_JOBS =

# Job 1
HAWKEYE_JOBS = $(HAWKEYE_JOBS) JOB1:prefix_:$(MODULES)/job1:5m:nokill
HAWKEYE_JOB1_ARGS =-foo -bar
HAWKEYE_JOB1_ENV = xyzzy=somevalue

# Job 2
HAWKEYE_JOBS = $(HAWKEYE_JOBS) JOB2:prefix_:$(MODULES)/job2:1h
HAWKEYE_JOB2_ENV = lwpi=somevalue
STARTD_COMPUTE_AVAIL_STATS
A boolean value that determines whether the condor_startd
computes resource availability statistics. The default is False
.
If STARTD_COMPUTE_AVAIL_STATS
= True
, the condor_startd
will define the following ClassAd attributes for resources:
AvailTime
LastAvailInterval
AvailSince
AvailTimeEstimate
STARTD_AVAIL_CONFIDENCE
Sets the confidence level of the condor_startd
daemon's AvailTime estimate. By default, the estimate is based on the 80th percentile of past values (that is, the value is initially set to 0.8).
STARTD_MAX_AVAIL_PERIOD_SAMPLES
Sets the maximum number of samples of past available intervals stored by the condor_startd
to limit memory and disk consumption. Each sample requires 4 bytes of memory and approximately 10 bytes of disk space.
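A sketch enabling availability statistics (the confidence and sample values are illustrative):
STARTD_COMPUTE_AVAIL_STATS = True
STARTD_AVAIL_CONFIDENCE = 0.9
STARTD_MAX_AVAIL_PERIOD_SAMPLES = 100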
JAVA
JAVA_MAXHEAP_ARGUMENT
-Xmx
.
JAVA_CLASSPATH_ARGUMENT
JAVA_CLASSPATH_SEPARATOR
JAVA_CLASSPATH_DEFAULT
JAVA_EXTRA_ARGUMENTS
SLOTN_JOB_HOOK_KEYWORD
The keyword used to define which set of hooks a particular compute slot should invoke. The "N
" in "SLOTN
" should be replaced with the slot identification number, for example, on slot1
, this setting would be called SLOT1_JOB_HOOK_KEYWORD
. There is no default keyword. Sites that wish to use these job hooks must explicitly define the keyword (and the corresponding hook paths).
STARTD_JOB_HOOK_KEYWORD
The keyword used to define which set of hooks the condor_startd
should invoke. This setting is only used if a slot-specific keyword is not defined for a given compute slot. There is no default keyword. Sites that wish to use these job hooks must explicitly define the keyword (and the corresponding hook paths).
HOOK_FETCH_WORK
The full path to the program to invoke whenever the condor_startd
wants to fetch work. The actual configuration setting must be prefixed with a hook keyword. There is no default.
HOOK_REPLY_CLAIM
The full path to the program to invoke whenever the condor_startd
finishes fetching a job and decides what to do with it. The actual configuration setting must be prefixed with a hook keyword. There is no default.
HOOK_EVICT_CLAIM
The full path to the program to invoke whenever the condor_startd
needs to evict a fetched claim. The actual configuration setting must be prefixed with a hook keyword. There is no default.
FetchWorkDelay
An expression that defines the number of seconds the condor_startd
should wait after an invocation of HOOK_FETCH_WORK
completes before the hook should be invoked again. The expression is evaluated in the context of the slot ClassAd, and the ClassAd of the currently running job (if any). The expression must evaluate to an integer. If not defined, the condor_startd
will wait 300
seconds (five minutes) between attempts to fetch work.
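Putting the hook settings together, a minimal sketch using a hypothetical hook keyword DB and hypothetical script paths:
STARTD_JOB_HOOK_KEYWORD = DB
DB_HOOK_FETCH_WORK = /usr/local/libexec/db_fetch_work
DB_HOOK_REPLY_CLAIM = /usr/local/libexec/db_reply_claim
DB_HOOK_EVICT_CLAIM = /usr/local/libexec/db_evict_claim
FetchWorkDelay = ifThenElse(State == "Claimed", 3600, 60)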
HIBERNATE_CHECK_INTERVAL
An integer number of seconds that determines how often the condor_startd
checks to see if the machine is ready to enter a low power state. The default value is 0
, which disables the check. If not 0
, the HIBERNATE
expression is evaluated within the context of each slot at the given interval. If used, a value 300
(5 minutes) is recommended.
HIBERNATE
The HIBERNATE
expression is written in terms of the S-states as defined in the Advanced Configuration and Power Interface (ACPI) specification. The S-states take the form Sn
, where n
is an integer in the range 0 to 5, inclusive. The number that results from evaluating the expression determines which S-state to enter. The n
from Sn
notation was adopted because it is currently the standard naming scheme for power states on several popular operating systems, including various versions of Windows and Linux distributions. The other strings ("RAM", "DISK", etc.) are provided for ease of configuration.
If HIBERNATE
in one slot evaluates to "NONE" or "0", then the machine will not be placed into a low power state. On the other hand, if all slots evaluate to a non-zero value, but differ in value, then the largest value is used as the representative power state.
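A sketch that suspends a machine to RAM once every slot has been unclaimed and keyboard-idle for over two hours; the thresholds and the helper macros TimeToWait and ShouldHibernate are illustrative, and the sketch reuses HOUR, StartIdleTime and StateTimer from the default global configuration file:
HIBERNATE_CHECK_INTERVAL = 300
TimeToWait = (2 * $(HOUR))
ShouldHibernate = ( (KeyboardIdle > $(StartIdleTime)) \
                    && (State == "Unclaimed") \
                    && ($(StateTimer) > $(TimeToWait)) )
HIBERNATE = ifThenElse( $(ShouldHibernate), "RAM", "NONE" )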
LINUX_HIBERNATION_METHOD
To explicitly choose the hibernation method used on Linux platforms, set LINUX_HIBERNATION_METHOD
to one of the defined strings.
OFFLINE_LOG
The full path and file name of a file that stores machine ClassAds for every hibernating machine. This forms persistent storage of these ClassAds in case the condor_collector
daemon crashes.
To prevent condor_preen from removing this log, place it in a directory other than the directory defined by SPOOL
. Alternatively, if this log file is to go in the directory defined by SPOOL
, add the file to the list given by VALID_SPOOL_FILES
.
OFFLINE_EXPIRE_ADS_AFTER
condor_schedd
Configuration File Entries
SHADOW
This macro determines the full path of the condor_shadow
binary that the condor_schedd
spawns. It is normally defined in terms of $(SBIN)
.
START_LOCAL_UNIVERSE
A boolean expression that defaults to True
. The condor_schedd
uses this macro to determine whether to start a local universe job. At intervals determined by SCHEDD_INTERVAL
, the condor_schedd
daemon evaluates this macro for each idle local universe job that it has. For each job, if the START_LOCAL_UNIVERSE
macro is True
, then the job's Requirements
expression is evaluated. If both conditions are met, then the job is allowed to begin execution.
The following example allows at most 10 local universe jobs to execute concurrently. The attribute TotalLocalJobsRunning is supplied by the condor_schedd's ClassAd:
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 10
STARTER_LOCAL
The complete path and executable name of the condor_starter
to run for local universe jobs. This variable's value is defined in the initial configuration provided as:
STARTER_LOCAL = $(SBIN)/condor_starter
START_SCHEDULER_UNIVERSE
A boolean expression that defaults to True
. The condor_schedd
uses this macro to determine whether to start a scheduler universe job. At intervals determined by SCHEDD_INTERVAL
, the condor_schedd
daemon evaluates this macro for each idle scheduler universe job that it has. For each job, if the START_SCHEDULER_UNIVERSE
macro is True
, then the job's Requirements
expression is evaluated. If both conditions are met, then the job is allowed to begin execution.
The following example allows at most 10 scheduler universe jobs to execute concurrently. The attribute TotalSchedulerJobsRunning is supplied by the condor_schedd's ClassAd:
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 10
condor_starter
Configuration File Entries
These macros control the behavior of the condor_starter.
JOB_RENICE_INCREMENT
When the condor_starter
spawns a job, it can set a nice level. This is a mechanism that allows users to assign processes a lower priority. This can mean that those processes do not interfere with interactive use of the machine.
This integer expression is evaluated by the condor_starter daemon just before each job runs. Allowable values are integers from 0 to 19, with 0 being the highest priority and 19 the lowest. If the value is outside this range, a value greater than 19 is reduced to 19 and a value less than 0 is treated as 0. The default value is 10.
STARTER_LOCAL_LOGGING
STARTER_DEBUG
STARTER_UPDATE_INTERVAL
An integer specifying how often (in seconds) the condor_starter daemon sends job status updates to the condor_startd
and condor_shadow
daemons. Defaults to 300 seconds (5 minutes).
STARTER_UPDATE_INTERVAL_TIMESLICE
A floating point value specifying the maximum fraction of time the condor_starter
daemon should spend collecting monitoring information about the job. If monitoring takes a long time, the condor_starter
will monitor less frequently than specified. The default value is 0.1.
USER_JOB_WRAPPER
STARTER_JOB_ENVIRONMENT
JOB_INHERITS_STARTER_ENVIRONMENT
A boolean value that defaults to FALSE. When TRUE, jobs will inherit all environment variables from the condor_starter. When both the user job and STARTER_JOB_ENVIRONMENT define an environment variable, the user's job definition takes precedence. This variable does not apply to standard universe jobs.
STARTER_UPLOAD_TIMEOUT
ENFORCE_CPU_AFFINITY
A boolean value. When FALSE, the CPU affinity of jobs and their descendants is not enforced. When TRUE, CPU affinity will be maintained, and finely tuned affinities can be specified using SLOTX_CPU_AFFINITY. Defaults to FALSE.
SLOTX
_CPU_AFFINITY
A comma-separated list of the cores to which processes running on slot SLOTX
will show affinity. This setting will work only if ENFORCE_CPU_AFFINITY
is set to TRUE
.
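A minimal sketch pinning each slot of a two-slot machine to its own pair of cores (the core numbers are illustrative):
ENFORCE_CPU_AFFINITY = TRUE
SLOT1_CPU_AFFINITY = 0,1
SLOT2_CPU_AFFINITY = 2,3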
SCHEDD_CLUSTER_INITIAL_VALUE
An integer that specifies the initial job cluster ID. It defaults to 1
. If the job cluster ID reaches the value set by SCHEDD_CLUSTER_MAXIMUM_VALUE
and wraps around, the job cluster ID will be reset to the value of SCHEDD_CLUSTER_INITIAL_VALUE
.
If the job_queue.log
file is removed, cluster IDs will be assigned starting from SCHEDD_CLUSTER_INITIAL_VALUE
after system restart.
SCHEDD_CLUSTER_MAXIMUM_VALUE
An integer that specifies an upper bound on job cluster IDs. When set to a value (M
), the maximum job cluster ID assigned to any job will be (M-1
). When the maximum ID is reached, job IDs will wrap around back to SCHEDD_CLUSTER_INITIAL_VALUE
. The default value is 0
, which will not set a maximum cluster ID.
SCHEDD_CLUSTER_INCREMENT_VALUE
A positive integer that defines the stride between successive job cluster IDs. The default value is 1.
For example, if SCHEDD_CLUSTER_INITIAL_VALUE
is set to 2
, and SCHEDD_CLUSTER_INCREMENT_VALUE
is set to 2
, the cluster ID numbers will be {2, 4, 6, ...}
.
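Expressed as configuration, that example would be (a sketch):
SCHEDD_CLUSTER_INITIAL_VALUE = 2
SCHEDD_CLUSTER_INCREMENT_VALUE = 2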
condor_negotiator
Configuration File Macros
HFS_MAX_ALLOCATION_ROUNDS
HFS_MAX_ALLOCATION_ROUNDS defaults to 3. It has a minimum of 1, which is the traditional behavior of a single attempt at negotiation, and a maximum of INT_MAX.
HFS_ROUND_ROBIN_RATE
Set HFS_ROUND_ROBIN_RATE to a small value such as 1.0 to cause the negotiator to use a round robin strategy in negotiating slots from the available accounting groups. This will prevent any single accounting group from being allocated all the slots from an overlapping effective pool and starving other groups.
HFS_ROUND_ROBIN_RATE
defaults to traditional behavior where it attempts to negotiate for everything in one iteration. If it is set to its minimum value (1.0), it will give minimum starvation and maximum preservation of allocation ratios inside overlapping effective pools, but may require many iterations. It can be increased to some value > 1.0 to reduce iterations (and log output), with the possible consequence of increased starvation of some accounting groups.
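A sketch enabling round robin allocation across accounting groups (the values are illustrative):
HFS_ROUND_ROBIN_RATE = 1.0
HFS_MAX_ALLOCATION_ROUNDS = 3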
NEGOTIATOR_UPDATE_AFTER_CYCLE
A boolean value that defaults to False. When True, the condor_negotiator daemon publishes an update to the condor_collector at the end of every negotiation cycle.
The default global configuration file is located at /etc/condor/condor_config
and is usually the same for all installations. Do not change this file. To customize the configuration, edit files in the local configuration directory instead.
###################################################################### ###################################################################### ### ### ### N O T I C E: D O N O T E D I T T H I S F I L E ### ### ### ### Customization should be done via the LOCAL_CONFIG_DIR. ### ### ### ###################################################################### ###################################################################### ###################################################################### ## ## condor_config ## ## This is the global configuration file for condor. Any settings ## found here * * s h o u l d b e c u s t o m i z e d i n ## a l o c a l c o n f i g u r a t i o n f i l e. * * ## ## The local configuration files are located in LOCAL_CONFIG_DIR, set ## below. ## ## For a basic configuration, you may only want to start by ## customizing CONDOR_HOST and DAEMON_LIST. ## ## Note: To double-check where a configuration variable is set from ## you can use condor_config_val -v -config <variable name>, ## e.g. condor_config_val -v -config CONDOR_HOST. ## ## The file is divided into four main parts: ## Part 1: Settings you likely want to customize ## Part 2: Settings you may want to customize ## Part 3: Settings that control the policy of when condor will ## start and stop jobs on your machines ## Part 4: Settings you should probably leave alone (unless you ## know what you're doing) ## ## Please read the INSTALL file (or the Install chapter in the ## Condor Administrator's Manual) for detailed explanations of the ## various settings in here and possible ways to configure your ## pool. ## ## Unless otherwise specified, settings that are commented out show ## the defaults that are used if you don't define a value. Settings ## that are defined here MUST BE DEFINED since they have no default ## value. ## ## Unless otherwise indicated, all settings which specify a time are ## defined in seconds. ## ###################################################################### ###################################################################### ###################################################################### ## ## ###### # ## # # ## ##### ##### ## ## # # # # # # # # # ## ###### # # # # # # ## # ###### ##### # # ## # # # # # # # ## # # # # # # ##### ## ## Part 1: Settings you likely want to customize: ###################################################################### ###################################################################### ## What machine is your central manager? CONDOR_HOST = central-manager-hostname.your.domain ##-------------------------------------------------------------------- ## Pathnames: ##-------------------------------------------------------------------- ## Where have you installed the bin, sbin and lib condor directories? RELEASE_DIR = /usr ## Where is the local condor directory for each host? ## This is where the local config file(s), logs and ## spool/execute directories are located LOCAL_DIR = $(TILDE) #LOCAL_DIR = $(RELEASE_DIR)/hosts/$(HOSTNAME) ## Looking for LOCAL_CONFIG_FILE? You will not find it here. Instead ## put a file in the LOCAL_CONFIG_DIR below. It is a more extensible ## means to manage configuration. The order in which configuration ## files are read from the LOCAL_CONFIG_DIR is lexicographic. For ## instance, config in 00MyConfig will be overridden by config in ## 97MyConfig. ## Where are optional machine-specific local config files located? ## Config files are included in lexicographic order. ## No default. 
LOCAL_CONFIG_DIR = $(ETC)/config.d ## Blacklist for file processing in the LOCAL_CONFIG_DIR LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$ ## If the local config file is not present, is it an error? ## WARNING: This is a potential security issue. ## If not specificed, the default is True #REQUIRE_LOCAL_CONFIG_FILE = TRUE ##-------------------------------------------------------------------- ## Mail parameters: ##-------------------------------------------------------------------- ## When something goes wrong with condor at your site, who should get ## the email? CONDOR_ADMIN = root@$(FULL_HOSTNAME) ## Full path to a mail delivery program that understands that "-s" ## means you want to specify a subject: MAIL = /bin/mail ##-------------------------------------------------------------------- ## Network domain parameters: ##-------------------------------------------------------------------- ## Internet domain of machines sharing a common UID space. If your ## machines don't share a common UID space, set it to ## UID_DOMAIN = $(FULL_HOSTNAME) ## to specify that each machine has its own UID space. UID_DOMAIN = $(FULL_HOSTNAME) ## Internet domain of machines sharing a common file system. ## If your machines don't use a network file system, set it to ## FILESYSTEM_DOMAIN = $(FULL_HOSTNAME) ## to specify that each machine has its own file system. FILESYSTEM_DOMAIN = $(FULL_HOSTNAME) ## This macro is used to specify a short description of your pool. ## It should be about 20 characters long. For example, the name of ## the UW-Madison Computer Science Condor Pool is ``UW-Madison CS''. COLLECTOR_NAME = My Pool - $(CONDOR_HOST) ###################################################################### ###################################################################### ## ## ###### ##### ## # # ## ##### ##### # # ## # # # # # # # # ## ###### # # # # # ##### ## # ###### ##### # # ## # # # # # # # ## # # # # # # ####### ## ## Part 2: Settings you may want to customize: ## (it is generally safe to leave these untouched) ###################################################################### ###################################################################### ## ## The user/group ID <uid>.<gid> of the "Condor" user. ## (this can also be specified in the environment) ## Note: the CONDOR_IDS setting is ignored on Win32 platforms #CONDOR_IDS=x.x ##-------------------------------------------------------------------- ## Flocking: Submitting jobs to more than one pool ##-------------------------------------------------------------------- ## Flocking allows you to run your jobs in other pools, or lets ## others run jobs in your pool. ## ## To let others flock to you, define FLOCK_FROM. ## ## To flock to others, define FLOCK_TO. ## FLOCK_FROM defines the machines where you would like to grant ## people access to your pool via flocking. (i.e. you are granting ## access to these machines to join your pool). FLOCK_FROM = ## An example of this is: #FLOCK_FROM = somehost.friendly.domain, anotherhost.friendly.domain ## FLOCK_TO defines the central managers of the pools that you want ## to flock to. (i.e. you are specifying the machines that you ## want your jobs to be negotiated at -- thereby specifying the ## pools they will run in.) FLOCK_TO = ## An example of this is: #FLOCK_TO = central_manager.friendly.domain, condor.cs.wisc.edu ## FLOCK_COLLECTOR_HOSTS should almost always be the same as ## FLOCK_NEGOTIATOR_HOSTS (as shown below). 
The only reason it would be ## different is if the collector and negotiator in the pool that you are ## flocking too are running on different machines (not recommended). ## The collectors must be specified in the same corresponding order as ## the FLOCK_NEGOTIATOR_HOSTS list. FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO) FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO) ## An example of having the negotiator and the collector on different ## machines is: #FLOCK_NEGOTIATOR_HOSTS = condor.cs.wisc.edu, condor-negotiator.friendly.domain #FLOCK_COLLECTOR_HOSTS = condor.cs.wisc.edu, condor-collector.friendly.domain ##-------------------------------------------------------------------- ## Host/IP access levels ##-------------------------------------------------------------------- ## Please see the administrator's manual for details on these ## settings, what they're for, and how to use them. ## What machines have administrative rights for your pool? This ## defaults to your central manager. You should set it to the ## machine(s) where whoever is the condor administrator(s) works ## (assuming you trust all the users who log into that/those ## machine(s), since this is machine-wide access you're granting). ALLOW_ADMINISTRATOR = $(CONDOR_HOST) ## If there are no machines that should have administrative access ## to your pool (for example, there's no machine where only trusted ## users have accounts), you can uncomment this setting. ## Unfortunately, this will mean that administering your pool will ## be more difficult. #DENY_ADMINISTRATOR = * ## What machines should have "owner" access to your machines, meaning ## they can issue commands that a machine owner should be able to ## issue to their own machine (like condor_vacate). This defaults to ## machines with administrator access, and the local machine. This ## is probably what you want. ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR) ## Read access. Machines listed as allow (and/or not listed as deny) ## can view the status of your pool, but cannot join your pool ## or run jobs. ## NOTE: By default, without these entries customized, you ## are granting read access to the whole world. You may want to ## restrict that to hosts in your domain. If possible, please also ## grant read access to "*.cs.wisc.edu", so the Condor developers ## will be able to view the status of your pool and more easily help ## you install, configure or debug your Condor installation. ## It is important to have this defined. ALLOW_READ = * #ALLOW_READ = *.your.domain, *.cs.wisc.edu #DENY_READ = *.bad.subnet, bad-machine.your.domain, 144.77.88.* ## Write access. Machines listed here can join your pool, submit ## jobs, etc. Note: Any machine which has WRITE access must ## also be granted READ access. Granting WRITE access below does ## not also automatically grant READ access; you must change ## ALLOW_READ above as well. ## ## You must set this to something else before Condor will run. ## This most simple option is: ## ALLOW_WRITE = * ## but note that this will allow anyone to submit jobs or add ## machines to your pool and is a serious security risk. ALLOW_WRITE = $(FULL_HOSTNAME) #ALLOW_WRITE = *.your.domain, your-friend's-machine.other.domain #DENY_WRITE = bad-machine.your.domain ## Are you upgrading to a new version of Condor and confused about ## why the above ALLOW_WRITE setting is causing Condor to refuse to ## start up? If you are upgrading from a configuration that uses ## HOSTALLOW/HOSTDENY instead of ALLOW/DENY we recommend that you ## convert all uses of the former to the latter. 
The syntax of the ## authorization settings is identical. They both support ## unauthenticated IP-based authorization as well as authenticated ## user-based authorization. To avoid confusion, the use of ## HOSTALLOW/HOSTDENY is discouraged. Support for it may be removed ## in the future. ## Negotiator access. Machines listed here are trusted central ## managers. You should normally not have to change this. ALLOW_NEGOTIATOR = $(CONDOR_HOST) ## Now, with flocking we need to let the SCHEDD trust the other ## negotiators we are flocking with as well. You should normally ## not have to change this either. ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS) ## Config access. Machines listed here can use the condor_config_val ## tool to modify all daemon configurations. This level of host-wide ## access should only be granted with extreme caution. By default, ## config access is denied from all hosts. #ALLOW_CONFIG = trusted-host.your.domain ## Flocking Configs. These are the real things that Condor looks at, ## but we set them from the FLOCK_FROM/TO macros above. It is safe ## to leave these unchanged. ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM) ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM) ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM) ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM) ##-------------------------------------------------------------------- ## Security parameters for setting configuration values remotely: ##-------------------------------------------------------------------- ## These parameters define the list of attributes that can be set ## remotely with condor_config_val for the security access levels ## defined above (for example, WRITE, ADMINISTRATOR, CONFIG, etc). ## Please see the administrator's manual for futher details on these ## settings, what they're for, and how to use them. There are no ## default values for any of these settings. If they are not ## defined, no attributes can be set with condor_config_val. ## Do you want to allow condor_config_val -rset to work at all? ## This feature is disabled by default, so to enable, you must ## uncomment the following setting and change the value to "True". ## Note: changing this requires a restart not just a reconfig. #ENABLE_RUNTIME_CONFIG = False ## Do you want to allow condor_config_val -set to work at all? ## This feature is disabled by default, so to enable, you must ## uncomment the following setting and change the value to "True". ## Note: changing this requires a restart not just a reconfig. #ENABLE_PERSISTENT_CONFIG = False ## Directory where daemons should write persistent config files (used ## to support condor_config_val -set). This directory should *ONLY* ## be writable by root (or the user the Condor daemons are running as ## if non-root). There is no default, administrators must define this. ## Note: changing this requires a restart not just a reconfig. #PERSISTENT_CONFIG_DIR = /full/path/to/root-only/local/directory ## Attributes that can be set by hosts with "CONFIG" permission (as ## defined with ALLOW_CONFIG and DENY_CONFIG above). ## The commented-out value here was the default behavior of Condor ## prior to version 6.3.3. If you don't need this behavior, you ## should leave this commented out. 
#SETTABLE_ATTRS_CONFIG = * ## Attributes that can be set by hosts with "ADMINISTRATOR" ## permission (as defined above) #SETTABLE_ATTRS_ADMINISTRATOR = *_DEBUG, MAX_*_LOG ## Attributes that can be set by hosts with "OWNER" permission (as ## defined above) NOTE: any Condor job running on a given host will ## have OWNER permission on that host by default. If you grant this ## kind of access, Condor jobs will be able to modify any attributes ## you list below on the machine where they are running. This has ## obvious security implications, so only grant this kind of ## permission for custom attributes that you define for your own use ## at your pool (custom attributes about your machines that are ## published with the STARTD_ATTRS setting, for example). #SETTABLE_ATTRS_OWNER = your_custom_attribute, another_custom_attr ## You can also define daemon-specific versions of each of these ## settings. For example, to define settings that can only be ## changed in the condor_startd's configuration by hosts with OWNER ## permission, you would use: #STARTD_SETTABLE_ATTRS_OWNER = your_custom_attribute_name ##-------------------------------------------------------------------- ## Network filesystem parameters: ##-------------------------------------------------------------------- ## Do you want to use NFS for file access instead of remote system ## calls? #USE_NFS = False ## Do you want to use AFS for file access instead of remote system ## calls? #USE_AFS = False ##-------------------------------------------------------------------- ## Checkpoint server: ##-------------------------------------------------------------------- ## Do you want to use a checkpoint server if one is available? If a ## checkpoint server isn't available or USE_CKPT_SERVER is set to ## False, checkpoints will be written to the local SPOOL directory on ## the submission machine. #USE_CKPT_SERVER = True ## What's the hostname of this machine's nearest checkpoint server? #CKPT_SERVER_HOST = checkpoint-server-hostname.your.domain ## Do you want the starter on the execute machine to choose the ## checkpoint server? If False, the CKPT_SERVER_HOST set on ## the submit machine is used. Otherwise, the CKPT_SERVER_HOST set ## on the execute machine is used. The default is true. #STARTER_CHOOSES_CKPT_SERVER = True ##-------------------------------------------------------------------- ## Miscellaneous: ##-------------------------------------------------------------------- ## Try to save this much swap space by not starting new shadows. ## Specified in megabytes. #RESERVED_SWAP = 0 ## What's the maximum number of jobs you want a single submit machine ## to spawn shadows for? The default is a function of $(DETECTED_MEMORY) ## and a guess at the number of ephemeral ports available. ## Example 1: #MAX_JOBS_RUNNING = 10000 ## Example 2: ## This is more complicated, but it produces the same limit as the default. ## First define some expressions to use in our calculation. ## Assume we can use up to 80% of memory and estimate shadow private data ## size of 800k. #MAX_SHADOWS_MEM = ceiling($(DETECTED_MEMORY)*0.8*1024/800) ## Assume we can use ~21,000 ephemeral ports (avg ~2.1 per shadow). ## Under Linux, the range is set in /proc/sys/net/ipv4/ip_local_port_range. #MAX_SHADOWS_PORTS = 10000 ## Under windows, things are much less scalable, currently. ## Note that this can probably be safely increased a bit under 64-bit windows. #MAX_SHADOWS_OPSYS = ifThenElse(regexp("WIN.*","$(OPSYS)"),200,100000) ## Now build up the expression for MAX_JOBS_RUNNING. 
This is complicated ## due to lack of a min() function. #MAX_JOBS_RUNNING = $(MAX_SHADOWS_MEM) #MAX_JOBS_RUNNING = \ # ifThenElse( $(MAX_SHADOWS_PORTS) < $(MAX_JOBS_RUNNING), \ # $(MAX_SHADOWS_PORTS), \ # $(MAX_JOBS_RUNNING) ) #MAX_JOBS_RUNNING = \ # ifThenElse( $(MAX_SHADOWS_OPSYS) < $(MAX_JOBS_RUNNING), \ # $(MAX_SHADOWS_OPSYS), \ # $(MAX_JOBS_RUNNING) ) ## Maximum number of simultaneous downloads of output files from ## execute machines to the submit machine (limit applied per schedd). ## The value 0 means unlimited. #MAX_CONCURRENT_DOWNLOADS = 10 ## Maximum number of simultaneous uploads of input files from the ## submit machine to execute machines (limit applied per schedd). ## The value 0 means unlimited. #MAX_CONCURRENT_UPLOADS = 10 ## Condor needs to create a few lock files to synchronize access to ## various log files. Because of problems we've had with network ## filesystems and file locking over the years, we HIGHLY recommend ## that you put these lock files on a local partition on each ## machine. If you don't have your LOCAL_DIR on a local partition, ## be sure to change this entry. Whatever user (or group) condor is ## running as needs to have write access to this directory. If ## you're not running as root, this is whatever user you started up ## the condor_master as. If you are running as root, and there's a ## condor account, it's probably condor. Otherwise, it's whatever ## you've set in the CONDOR_IDS environment variable. See the Admin ## manual for details on this. LOCK = /var/lock/condor ## If you don't use a fully qualified name in your /etc/hosts file ## (or NIS, etc.) for either your official hostname or as an alias, ## Condor wouldn't normally be able to use fully qualified names in ## places that it'd like to. You can set this parameter to the ## domain you'd like appended to your hostname, if changing your host ## information isn't a good option. This parameter must be set in ## the global config file (not the LOCAL_CONFIG_FILE from above). #DEFAULT_DOMAIN_NAME = your.domain.name ## If you don't have DNS set up, Condor will normally fail in many ## places because it can't resolve hostnames to IP addresses and ## vice-versa. If you enable this option, Condor will use ## pseudo-hostnames constructed from a machine's IP address and the ## DEFAULT_DOMAIN_NAME. Both NO_DNS and DEFAULT_DOMAIN must be set in ## your top-level config file for this mode of operation to work ## properly. #NO_DNS = True ## Condor can be told whether or not you want the Condor daemons to ## create a core file if something really bad happens. This just ## sets the resource limit for the size of a core file. By default, ## we don't do anything, and leave in place whatever limit was in ## effect when you started the Condor daemons. If this parameter is ## set and "True", we increase the limit to as large as it gets. If ## it's set to "False", we set the limit at 0 (which means that no ## core files are even created). Core files greatly help the Condor ## developers debug any problems you might be having. #CREATE_CORE_FILES = True ## When Condor daemons detect a fatal internal exception, they ## normally log an error message and exit. If you have turned on ## CREATE_CORE_FILES, in some cases you may also want to turn on ## ABORT_ON_EXCEPTION so that core files are generated when an ## exception occurs. Set the following to True if that is what you ## want. #ABORT_ON_EXCEPTION = False ## Condor Glidein downloads binaries from a remote server for the ## machines into which you're gliding. 
This saves you from manually ## downloading and installing binaries for every architecture you ## might want to glidein to. The default server is one maintained at ## The University of Wisconsin. If you don't want to use the UW ## server, you can set up your own and change the following to ## point to it, instead. GLIDEIN_SERVER_URLS = \ http://www.cs.wisc.edu/condor/glidein/binaries ## List the sites you want to GlideIn to on the GLIDEIN_SITES. For example, ## if you'd like to GlideIn to some Alliance GiB resources, ## uncomment the line below. ## Make sure that $(GLIDEIN_SITES) is included in ALLOW_READ and ## ALLOW_WRITE, or else your GlideIns won't be able to join your pool. ## This is _NOT_ done for you by default, because it is an even better ## idea to use a strong security method (such as GSI) rather than ## host-based security for authorizing glideins. #GLIDEIN_SITES = *.ncsa.uiuc.edu, *.cs.wisc.edu, *.mcs.anl.gov #GLIDEIN_SITES = ## If your site needs to use UID_DOMAIN settings (defined above) that ## are not real Internet domains that match the hostnames, you can ## tell Condor to trust whatever UID_DOMAIN a submit machine gives to ## the execute machine and just make sure the two strings match. The ## default for this setting is False, since it is more secure this ## way. ## Default is False TRUST_UID_DOMAIN = True ## If you would like to be informed in near real-time via condor_q when ## a vanilla/standard/java job is in a suspension state, set this attribute to ## TRUE. However, this real-time update of the condor_schedd by the shadows ## could cause performance issues if there are thousands of concurrently ## running vanilla/standard/java jobs under a single condor_schedd and they ## are allowed to suspend and resume. #REAL_TIME_JOB_SUSPEND_UPDATES = False ## A standard universe job can perform arbitrary shell calls via the ## libc 'system()' function. This function call is routed back to the shadow ## which performs the actual system() invocation in the initialdir of the ## running program and as the user who submitted the job. However, since the ## user job can request ARBITRARY shell commands to be run by the shadow, this ## is a generally unsafe practice. This should only be made available if it is ## actually needed. If this attribute is not defined, then it is the same as ## it being defined to False. Set it to True to allow the shadow to execute ## arbitrary shell code from the user job. #SHADOW_ALLOW_UNSAFE_REMOTE_EXEC = False ## KEEP_OUTPUT_SANDBOX is an optional feature to tell Condor-G to not ## remove the job spool when the job leaves the queue. To use, just ## set to TRUE. Since you will be operating Condor-G in this manner, ## you may want to put leave_in_queue = false in your job submit ## description files, to tell Condor-G to simply remove the job from ## the queue immediately when the job completes (since the output files ## will stick around no matter what). #KEEP_OUTPUT_SANDBOX = False ## This setting tells the negotiator to ignore user priorities. This ## avoids problems where jobs from different users won't run when using ## condor_advertise instead of a full-blown startd (some of the user ## priority system in Condor relies on information from the startd -- ## we will remove this reliance when we support the user priority ## system for grid sites in the negotiator; for now, this setting will ## just disable it). 
#NEGOTIATOR_IGNORE_USER_PRIORITIES = False ## These are the directories used to locate classad plug-in functions #CLASSAD_SCRIPT_DIRECTORY = #CLASSAD_LIB_PATH = ## This setting tells Condor whether to delegate or copy GSI X509 ## credentials when sending them over the wire between daemons. ## Delegation can take up to a second, which is very slow when ## submitting a large number of jobs. Copying exposes the credential ## to third parties if Condor isn't set to encrypt communications. ## By default, Condor will delegate rather than copy. #DELEGATE_JOB_GSI_CREDENTIALS = True ## This setting controls whether Condor delegates a full or limited ## X509 credential for jobs. Currently, this only affects grid-type ## gt2 grid universe jobs. The default is False. #DELEGATE_FULL_JOB_GSI_CREDENTIALS = False ## This setting controls the default behaviour for the spooling of files ## into, or out of, the Condor system by such tools as condor_submit ## and condor_transfer_data. Here is the list of valid settings for this ## parameter and what they mean: ## ## stm_use_schedd_only ## Ask the condor_schedd to solely store/retreive the sandbox ## ## stm_use_transferd ## Ask the condor_schedd for a location of a condor_transferd, then ## store/retreive the sandbox from the transferd itself. ## ## The allowed values are case insensitive. ## The default of this parameter if not specified is: stm_use_schedd_only #SANDBOX_TRANSFER_METHOD = stm_use_schedd_only ##-------------------------------------------------------------------- ## Settings that control the daemon's debugging output: ##-------------------------------------------------------------------- ## ## The flags given in ALL_DEBUG are shared between all daemons. ## ALL_DEBUG = MAX_COLLECTOR_LOG = 1000000 COLLECTOR_DEBUG = MAX_KBDD_LOG = 1000000 KBDD_DEBUG = MAX_NEGOTIATOR_LOG = 1000000 NEGOTIATOR_DEBUG = D_MATCH MAX_NEGOTIATOR_MATCH_LOG = 1000000 MAX_SCHEDD_LOG = 1000000 SCHEDD_DEBUG = D_PID MAX_SHADOW_LOG = 1000000 SHADOW_DEBUG = MAX_STARTD_LOG = 1000000 STARTD_DEBUG = MAX_STARTER_LOG = 1000000 MAX_MASTER_LOG = 1000000 MASTER_DEBUG = ## When the master starts up, should it truncate it's log file? #TRUNC_MASTER_LOG_ON_OPEN = False MAX_JOB_ROUTER_LOG = 1000000 JOB_ROUTER_DEBUG = MAX_ROOSTER_LOG = 1000000 ROOSTER_DEBUG = MAX_HDFS_LOG = 1000000 HDFS_DEBUG = MAX_TRIGGERD_LOG = 1000000 TRIGGERD_DEBUG = # High Availability Logs MAX_HAD_LOG = 1000000 HAD_DEBUG = MAX_REPLICATION_LOG = 1000000 REPLICATION_DEBUG = MAX_TRANSFERER_LOG = 1000000 TRANSFERER_DEBUG = ## The daemons touch their log file periodically, even when they have ## nothing to write. When a daemon starts up, it prints the last time ## the log file was modified. This lets you estimate when a previous ## instance of a daemon stopped running. This paramete controls how often ## the daemons touch the file (in seconds). 
#TOUCH_LOG_INTERVAL = 60 ###################################################################### ###################################################################### ## ## ###### ##### ## # # ## ##### ##### # # ## # # # # # # # # ## ###### # # # # # ##### ## # ###### ##### # # ## # # # # # # # # ## # # # # # # ##### ## ## Part 3: Settings control the policy for running, stopping, and ## periodically checkpointing condor jobs: ###################################################################### ###################################################################### ## This section contains macros are here to help write legible ## expressions: MINUTE = 60 HOUR = (60 * $(MINUTE)) StateTimer = (CurrentTime - EnteredCurrentState) ActivityTimer = (CurrentTime - EnteredCurrentActivity) ActivationTimer = ifThenElse(JobStart =!= UNDEFINED, (CurrentTime - JobStart), 0) LastCkpt = (CurrentTime - LastPeriodicCheckpoint) ## The JobUniverse attribute is just an int. These macros can be ## used to specify the universe in a human-readable way: STANDARD = 1 VANILLA = 5 MPI = 8 VM = 13 IsMPI = (TARGET.JobUniverse == $(MPI)) IsVanilla = (TARGET.JobUniverse == $(VANILLA)) IsStandard = (TARGET.JobUniverse == $(STANDARD)) IsVM = (TARGET.JobUniverse == $(VM)) NonCondorLoadAvg = (LoadAvg - CondorLoadAvg) BackgroundLoad = 0.3 HighLoad = 0.5 StartIdleTime = 15 * $(MINUTE) ContinueIdleTime = 5 * $(MINUTE) MaxSuspendTime = 10 * $(MINUTE) MaxVacateTime = 10 * $(MINUTE) KeyboardBusy = (KeyboardIdle < $(MINUTE)) ConsoleBusy = (ConsoleIdle < $(MINUTE)) CPUIdle = ($(NonCondorLoadAvg) <= $(BackgroundLoad)) CPUBusy = ($(NonCondorLoadAvg) >= $(HighLoad)) KeyboardNotBusy = ($(KeyboardBusy) == False) BigJob = (TARGET.ImageSize >= (50 * 1024)) MediumJob = (TARGET.ImageSize >= (15 * 1024) && TARGET.ImageSize < (50 * 1024)) SmallJob = (TARGET.ImageSize < (15 * 1024)) JustCPU = ($(CPUBusy) && ($(KeyboardBusy) == False)) MachineBusy = ($(CPUBusy) || $(KeyboardBusy)) ## The RANK expression controls which jobs this machine prefers to ## run over others. Some examples from the manual include: ## RANK = TARGET.ImageSize ## RANK = (Owner == "coltrane") + (Owner == "tyner") \ ## + ((Owner == "garrison") * 10) + (Owner == "jones") ## By default, RANK is always 0, meaning that all jobs have an equal ## ranking. #RANK = 0 ##################################################################### ## This where you choose the configuration that you would like to ## use. It has no defaults so it must be defined. We start this ## file off with the UWCS_* policy. ###################################################################### ## Also here is what is referred to as the TESTINGMODE_*, which is ## a quick hardwired way to test Condor with a simple no-preemption policy. ## Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode. ## For example: ## WANT_SUSPEND = $(UWCS_WANT_SUSPEND) ## becomes ## WANT_SUSPEND = $(TESTINGMODE_WANT_SUSPEND) # When should we only consider SUSPEND instead of PREEMPT? WANT_SUSPEND = $(UWCS_WANT_SUSPEND) # When should we preempt gracefully instead of hard-killing? WANT_VACATE = $(UWCS_WANT_VACATE) ## When is this machine willing to start a job? START = $(UWCS_START) ## When should a local universe job be allowed to start? #START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 200 ## When should a scheduler universe job be allowed to start? #START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 200 ## When to suspend a job? SUSPEND = $(UWCS_SUSPEND) ## When to resume a suspended job? 
CONTINUE = $(UWCS_CONTINUE) ## When to nicely stop a job? ## (as opposed to killing it instantaneously) PREEMPT = $(UWCS_PREEMPT) ## When to instantaneously kill a preempting job ## (e.g. if a job is in the pre-empting stage for too long) KILL = $(UWCS_KILL) PERIODIC_CHECKPOINT = $(UWCS_PERIODIC_CHECKPOINT) PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS) PREEMPTION_RANK = $(UWCS_PREEMPTION_RANK) NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK) NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK) MaxJobRetirementTime = $(UWCS_MaxJobRetirementTime) CLAIM_WORKLIFE = $(UWCS_CLAIM_WORKLIFE) ##################################################################### ## This is the UWisc - CS Department Configuration. ##################################################################### # When should we only consider SUSPEND instead of PREEMPT? # Only when SUSPEND is True and one of the following is also true: # - the job is small # - the keyboard is idle # - it is a vanilla universe job UWCS_WANT_SUSPEND = ( $(SmallJob) || $(KeyboardNotBusy) || $(IsVanilla) ) && \ ( $(SUSPEND) ) # When should we preempt gracefully instead of hard-killing? UWCS_WANT_VACATE = ( $(ActivationTimer) > 10 * $(MINUTE) || $(IsVanilla) ) # Only start jobs if: # 1) the keyboard has been idle long enough, AND # 2) the load average is low enough OR the machine is currently # running a Condor job # (NOTE: Condor will only run 1 job at a time on a given resource. # The reasons Condor might consider running a different job while # already running one are machine Rank (defined above), and user # priorities.) UWCS_START = ( (KeyboardIdle > $(StartIdleTime)) \ && ( $(CPUIdle) || \ (State != "Unclaimed" && State != "Owner")) ) # Suspend jobs if: # 1) the keyboard has been touched, OR # 2a) The cpu has been busy for more than 2 minutes, AND # 2b) the job has been running for more than 90 seconds UWCS_SUSPEND = ( $(KeyboardBusy) || \ ( (CpuBusyTime > 2 * $(MINUTE)) \ && $(ActivationTimer) > 90 ) ) # Continue jobs if: # 1) the cpu is idle, AND # 2) we've been suspended more than 10 seconds, AND # 3) the keyboard hasn't been touched in a while UWCS_CONTINUE = ( $(CPUIdle) && ($(ActivityTimer) > 10) \ && (KeyboardIdle > $(ContinueIdleTime)) ) # Preempt jobs if: # 1) The job is suspended and has been suspended longer than we want # 2) OR, we don't want to suspend this job, but the conditions to # suspend jobs have been met (someone is using the machine) UWCS_PREEMPT = ( ((Activity == "Suspended") && \ ($(ActivityTimer) > $(MaxSuspendTime))) \ || (SUSPEND && (WANT_SUSPEND == False)) ) # Maximum time (in seconds) to wait for a job to finish before kicking # it off (due to PREEMPT, a higher priority claim, or the startd # gracefully shutting down). This is computed from the time the job # was started, minus any suspension time. Once the retirement time runs # out, the usual preemption process will take place. The job may # self-limit the retirement time to _less_ than what is given here. # By default, nice user jobs and standard universe jobs set their # MaxJobRetirementTime to 0, so they will not wait in retirement. UWCS_MaxJobRetirementTime = 0 ## If you completely disable preemption of claims to machines, you ## should consider limiting the timespan over which new jobs will be ## accepted on the same claim. See the manual section on disabling ## preemption for a comprehensive discussion. Since this example ## configuration does not disable preemption of claims, we leave ## CLAIM_WORKLIFE undefined (infinite). 
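## As a minimal sketch only (values illustrative, not part of this example
## policy), a configuration that disables preemption entirely but still
## recycles claims every 20 minutes might combine:
##   RANK = 0
##   PREEMPT = False
##   PREEMPTION_REQUIREMENTS = False
##   CLAIM_WORKLIFE = 1200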
#UWCS_CLAIM_WORKLIFE = 1200 # Kill jobs if they have taken too long to vacate gracefully UWCS_KILL = $(ActivityTimer) > $(MaxVacateTime) ## Only define vanilla versions of these if you want to make them ## different from the above settings. #SUSPEND_VANILLA = ( $(KeyboardBusy) || \ # ((CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90) ) #CONTINUE_VANILLA = ( $(CPUIdle) && ($(ActivityTimer) > 10) \ # && (KeyboardIdle > $(ContinueIdleTime)) ) #PREEMPT_VANILLA = ( ((Activity == "Suspended") && \ # ($(ActivityTimer) > $(MaxSuspendTime))) \ # || (SUSPEND_VANILLA && (WANT_SUSPEND == False)) ) #KILL_VANILLA = $(ActivityTimer) > $(MaxVacateTime) ## Checkpoint every 3 hours on average, with a +-30 minute random ## factor to avoid having many jobs hit the checkpoint server at ## the same time. UWCS_PERIODIC_CHECKPOINT = $(LastCkpt) > (3 * $(HOUR) + \ $RANDOM_INTEGER(-30,30,1) * $(MINUTE) ) ## You might want to checkpoint a little less often. A good ## example of this is below. For jobs smaller than 60 megabytes, we ## periodic checkpoint every 6 hours. For larger jobs, we only ## checkpoint every 12 hours. #UWCS_PERIODIC_CHECKPOINT = \ # ( (TARGET.ImageSize < 60000) && \ # ($(LastCkpt) > (6 * $(HOUR) + $RANDOM_INTEGER(-30,30,1))) ) || \ # ( $(LastCkpt) > (12 * $(HOUR) + $RANDOM_INTEGER(-30,30,1)) ) ## The rank expressions used by the negotiator are configured below. ## This is the order in which ranks are applied by the negotiator: ## 1. NEGOTIATOR_PRE_JOB_RANK ## 2. rank in job ClassAd ## 3. NEGOTIATOR_POST_JOB_RANK ## 4. cause of preemption (0=user priority,1=startd rank,2=no preemption) ## 5. PREEMPTION_RANK ## The NEGOTIATOR_PRE_JOB_RANK expression overrides all other ranks ## that are used to pick a match from the set of possibilities. ## The following expression matches jobs to unclaimed resources ## whenever possible, regardless of the job-supplied rank. UWCS_NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED ## The NEGOTIATOR_POST_JOB_RANK expression chooses between ## resources that are equally preferred by the job. ## The following example expression steers jobs toward ## faster machines and tends to fill a cluster of multi-processors ## breadth-first instead of depth-first. It also prefers online ## machines over offline (hibernating) ones. In this example, ## the expression is chosen to have no effect when preemption ## would take place, allowing control to pass on to ## PREEMPTION_RANK. UWCS_NEGOTIATOR_POST_JOB_RANK = \ (RemoteOwner =?= UNDEFINED) * (KFlops - SlotID - 1.0e10*(Offline=?=True)) ## The negotiator will not preempt a job running on a given machine ## unless the PREEMPTION_REQUIREMENTS expression evaluates to true ## and the owner of the idle job has a better priority than the owner ## of the running job. This expression defaults to true. UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (1 * $(HOUR)) && \ RemoteUserPrio > SubmitterUserPrio * 1.2 ) || (MY.NiceUser == True) ## The PREEMPTION_RANK expression is used in a case where preemption ## is the only option and all other negotiation ranks are equal. For ## example, if the job has no preference, it is usually preferable to ## preempt a job with a small ImageSize instead of a job with a large ## ImageSize. The default is to rank all preemptable matches the ## same. However, the negotiator will always prefer to match the job ## with an idle machine over a preemptable machine, if all other ## negotiation ranks are equal. 
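## As an illustrative alternative (not the expression used below), a pool
## that only cares about memory footprint when choosing a victim could
## rank preemptable matches purely by image size:
##   PREEMPTION_RANK = -TARGET.ImageSize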
UWCS_PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize ##################################################################### ## This is a Configuration that will cause your Condor jobs to ## always run. This is intended for testing only. ###################################################################### ## This mode will cause your jobs to start on a machine an will let ## them run to completion. Condor will ignore all of what is going ## on in the machine (load average, keyboard activity, etc.) TESTINGMODE_WANT_SUSPEND = False TESTINGMODE_WANT_VACATE = False TESTINGMODE_START = True TESTINGMODE_SUSPEND = False TESTINGMODE_CONTINUE = True TESTINGMODE_PREEMPT = False TESTINGMODE_KILL = False TESTINGMODE_PERIODIC_CHECKPOINT = False TESTINGMODE_PREEMPTION_REQUIREMENTS = False TESTINGMODE_PREEMPTION_RANK = 0 # Prevent machine claims from being reused indefinitely, since # preemption of claims is disabled in the TESTINGMODE configuration. TESTINGMODE_CLAIM_WORKLIFE = 1200 ###################################################################### ###################################################################### ## ## ###### # ## # # ## ##### ##### # # ## # # # # # # # # # ## ###### # # # # # # # ## # ###### ##### # ####### ## # # # # # # # ## # # # # # # # ## ## Part 4: Settings you should probably leave alone: ## (unless you know what you're doing) ###################################################################### ###################################################################### ###################################################################### ## Daemon-wide settings: ###################################################################### ## Pathnames LOG = /var/log/condor SPOOL = $(LOCAL_DIR)/spool EXECUTE = $(LOCAL_DIR)/execute BIN = $(RELEASE_DIR)/bin LIB = $(RELEASE_DIR)/lib64/condor INCLUDE = $(RELEASE_DIR)/include/condor SBIN = $(RELEASE_DIR)/sbin LIBEXEC = $(RELEASE_DIR)/libexec/condor SHARE = $(RELEASE_DIR)/share/condor RUN = /var/run/condor DATA = $(SPOOL) ETC = /etc/condor ## If you leave HISTORY undefined (comment it out), no history file ## will be created. HISTORY = $(SPOOL)/history ## Log files COLLECTOR_LOG = $(LOG)/CollectorLog KBDD_LOG = $(LOG)/KbdLog MASTER_LOG = $(LOG)/MasterLog NEGOTIATOR_LOG = $(LOG)/NegotiatorLog NEGOTIATOR_MATCH_LOG = $(LOG)/MatchLog SCHEDD_LOG = $(LOG)/SchedLog SHADOW_LOG = $(LOG)/ShadowLog STARTD_LOG = $(LOG)/StartLog STARTER_LOG = $(LOG)/StarterLog JOB_ROUTER_LOG = $(LOG)/JobRouterLog ROOSTER_LOG = $(LOG)/RoosterLog SHARED_PORT_LOG = $(LOG)/SharedPortLog TRIGGERD_LOG = $(LOG)/TriggerLog # High Availability Logs HAD_LOG = $(LOG)/HADLog REPLICATION_LOG = $(LOG)/ReplicationLog TRANSFERER_LOG = $(LOG)/TransfererLog HDFS_LOG = $(LOG)/HDFSLog ## Lock files SHADOW_LOCK = $(LOCK)/ShadowLock ## This setting controls how often any lock files currently in use have their ## timestamp updated. Updating the timestamp prevents administrative programs ## like 'tmpwatch' from deleting long lived lock files. The parameter is ## an integer in seconds with a minimum of 60 seconds. The default if not ## specified is 28800 seconds, or 8 hours. ## This attribute only takes effect on restart of the daemons or at the next ## update time. # LOCK_FILE_UPDATE_INTERVAL = 28800 ## This setting primarily allows you to change the port that the ## collector is listening on. 
By default, the collector uses port ## 9618, but you can set the port with a ":port", such as: ## COLLECTOR_HOST = $(CONDOR_HOST):1234 COLLECTOR_HOST = $(CONDOR_HOST) ## The NEGOTIATOR_HOST parameter has been deprecated. The port where ## the negotiator is listening is now dynamically allocated and the IP ## and port are now obtained from the collector, just like all the ## other daemons. However, if your pool contains any machines that ## are running version 6.7.3 or earlier, you can uncomment this ## setting to go back to the old fixed-port (9614) for the negotiator. #NEGOTIATOR_HOST = $(CONDOR_HOST) ## How long are you willing to let daemons try their graceful ## shutdown methods before they do a hard shutdown? (30 minutes) #SHUTDOWN_GRACEFUL_TIMEOUT = 1800 ## How much disk space would you like reserved from Condor? In ## places where Condor is computing the free disk space on various ## partitions, it subtracts the amount it really finds by this ## many megabytes. (If undefined, defaults to 0). RESERVED_DISK = 5 ## If your machine is running AFS and the AFS cache lives on the same ## partition as the other Condor directories, and you want Condor to ## reserve the space that your AFS cache is configured to use, set ## this to true. #RESERVE_AFS_CACHE = False ## By default, if a user does not specify "notify_user" in the submit ## description file, any email Condor sends about that job will go to ## "username@UID_DOMAIN". If your machines all share a common UID ## domain (so that you would set UID_DOMAIN to be the same across all ## machines in your pool), *BUT* email to user@UID_DOMAIN is *NOT* ## the right place for Condor to send email for your site, you can ## define the default domain to use for email. A common example ## would be to set EMAIL_DOMAIN to the fully qualified hostname of ## each machine in your pool, so users submitting jobs from a ## specific machine would get email sent to user@machine.your.domain, ## instead of user@your.domain. In general, you should leave this ## setting commented out unless two things are true: 1) UID_DOMAIN is ## set to your domain, not $(FULL_HOSTNAME), and 2) email to ## user@UID_DOMAIN won't work. #EMAIL_DOMAIN = $(FULL_HOSTNAME) ## Should Condor daemons create a UDP command socket (for incomming ## UDP-based commands) in addition to the TCP command socket? By ## default, classified ad updates sent to the collector use UDP, in ## addition to some keep alive messages and other non-essential ## communication. However, in certain situations, it might be ## desirable to disable the UDP command port (for example, to reduce ## the number of ports represented by a GCB broker, etc). If not ## defined, the UDP command socket is enabled by default, and to ## modify this, you must restart your Condor daemons. Also, this ## setting must be defined machine-wide. For example, setting ## "STARTD.WANT_UDP_COMMAND_SOCKET = False" while the global setting ## is "True" will still result in the startd creating a UDP socket. #WANT_UDP_COMMAND_SOCKET = True ## If your site needs to use TCP updates to the collector, instead of ## UDP, you can enable this feature. HOWEVER, WE DO NOT RECOMMEND ## THIS FOR MOST SITES! In general, the only sites that might want ## this feature are pools made up of machines connected via a ## wide-area network where UDP packets are frequently or always ## dropped. 
## If you enable this feature, you *MUST* turn on the
## COLLECTOR_SOCKET_CACHE_SIZE setting at your collector, and each
## entry in the socket cache uses another file descriptor. If not
## defined, this feature is disabled by default.
#UPDATE_COLLECTOR_WITH_TCP = True

## HIGHPORT and LOWPORT let you set the range of ports that Condor
## will use. This may be useful if you are behind a firewall. By
## default, Condor uses port 9618 for the collector, 9614 for the
## negotiator, and system-assigned (apparently random) ports for
## everything else. HIGHPORT and LOWPORT only affect these
## system-assigned ports, but will restrict them to the range you
## specify here. If you want to change the well-known ports for the
## collector or negotiator, see COLLECTOR_HOST or NEGOTIATOR_HOST.
## Note that both LOWPORT and HIGHPORT must be at least 1024 if you
## are not starting your daemons as root. You may also specify
## different port ranges for incoming and outgoing connections by
## using IN_HIGHPORT/IN_LOWPORT and OUT_HIGHPORT/OUT_LOWPORT.
#HIGHPORT = 9700
#LOWPORT = 9600

## If a daemon doesn't respond for too long, do you want to generate
## a core file? This basically controls the type of the signal
## sent to the child process, and mostly affects the Condor Master.
#NOT_RESPONDING_WANT_CORE = False

######################################################################
## Daemon-specific settings:
######################################################################

##--------------------------------------------------------------------
## condor_master
##--------------------------------------------------------------------
## Daemons you want the master to keep running for you:
DAEMON_LIST = MASTER, STARTD, SCHEDD

## Which daemons use the Condor DaemonCore library (i.e., not the
## checkpoint server or custom user daemons)?
#DC_DAEMON_LIST = \
#MASTER, STARTD, SCHEDD, KBDD, COLLECTOR, NEGOTIATOR, EVENTD, \
#VIEW_SERVER, CONDOR_VIEW, VIEW_COLLECTOR, HAWKEYE, CREDD, HAD, \
#DBMSD, QUILL, JOB_ROUTER, ROOSTER, LEASEMANAGER, HDFS, SHARED_PORT, TRIGGERD

## Where are the binaries for these daemons?
MASTER = $(SBIN)/condor_master
STARTD = $(SBIN)/condor_startd
SCHEDD = $(SBIN)/condor_schedd
KBDD = $(SBIN)/condor_kbdd
NEGOTIATOR = $(SBIN)/condor_negotiator
COLLECTOR = $(SBIN)/condor_collector
STARTER_LOCAL = $(SBIN)/condor_starter
JOB_ROUTER = $(LIBEXEC)/condor_job_router
ROOSTER = $(LIBEXEC)/condor_rooster
HDFS = $(LIBEXEC)/condor_hdfs
SHARED_PORT = $(LIBEXEC)/condor_shared_port
TRIGGERD = $(SBIN)/condor_triggerd

## When the master starts up, it can place its address (IP and port)
## into a file. This way, tools running on the local machine don't
## need to query the central manager to find the master. This
## feature can be turned off by commenting out this setting.
MASTER_ADDRESS_FILE = $(LOG)/.master_address

## Where should the master find the condor_preen binary? If you don't
## want preen to run at all, just comment out this setting.
PREEN = $(SBIN)/condor_preen

## How do you want preen to behave? The "-m" means you want email
## about files preen finds that it thinks it should remove. The "-r"
## means you want preen to actually remove these files. If you don't
## want either of those things to happen, just remove the appropriate
## one from this setting.
PREEN_ARGS = -m -r

## How often should the master start up condor_preen? (once a day)
#PREEN_INTERVAL = 86400

## If a daemon dies an unnatural death, do you want email about it?
#PUBLISH_OBITUARIES = True ## If you're getting obituaries, how many lines of the end of that ## daemon's log file do you want included in the obituary? #OBITUARY_LOG_LENGTH = 20 ## Should the master run? #START_MASTER = True ## Should the master start up the daemons you want it to? #START_DAEMONS = True ## How often do you want the master to send an update to the central ## manager? #MASTER_UPDATE_INTERVAL = 300 ## How often do you want the master to check the timestamps of the ## daemons it's running? If any daemons have been modified, the ## master restarts them. #MASTER_CHECK_NEW_EXEC_INTERVAL = 300 ## Once you notice new binaries, how long should you wait before you ## try to execute them? #MASTER_NEW_BINARY_DELAY = 120 ## What's the maximum amount of time you're willing to give the ## daemons to quickly shutdown before you just kill them outright? #SHUTDOWN_FAST_TIMEOUT = 120 ###### ## Exponential backoff settings: ###### ## When a daemon keeps crashing, we use "exponential backoff" so we ## wait longer and longer before restarting it. This is the base of ## the exponent used to determine how long to wait before starting ## the daemon again: #MASTER_BACKOFF_FACTOR = 2.0 ## What's the maximum amount of time you want the master to wait ## between attempts to start a given daemon? (With 2.0 as the ## MASTER_BACKOFF_FACTOR, you'd hit 1 hour in 12 restarts...) #MASTER_BACKOFF_CEILING = 3600 ## How long should a daemon run without crashing before we consider ## it "recovered". Once a daemon has recovered, we reset the number ## of restarts so the exponential backoff stuff goes back to normal. #MASTER_RECOVER_FACTOR = 300 ##-------------------------------------------------------------------- ## condor_collector ##-------------------------------------------------------------------- ## Address to which Condor will send a weekly e-mail with output of ## condor_status. ## Default is condor-admin@cs.wisc.edu CONDOR_DEVELOPERS = NONE ## Global Collector to periodically advertise basic information about ## your pool. ## Default is condor.cs.wisc.edu CONDOR_DEVELOPERS_COLLECTOR = NONE ##-------------------------------------------------------------------- ## condor_negotiator ##-------------------------------------------------------------------- ## Determine if the Negotiator will honor SlotWeight attributes, which ## may be used to give a slot greater weight when calculating usage. ## Default: True NEGOTIATOR_USE_SLOT_WEIGHTS = False ## How often the Negotiator starts a negotiation cycle, defined in ## seconds. #NEGOTIATOR_INTERVAL = 60 ##-------------------------------------------------------------------- ## condor_startd ##-------------------------------------------------------------------- ## Where are the various condor_starter binaries installed? STARTER_LIST = STARTER, STARTER_STANDARD STARTER = $(SBIN)/condor_starter STARTER_STANDARD = $(SBIN)/condor_starter.std STARTER_LOCAL = $(SBIN)/condor_starter ## When the startd starts up, it can place it's address (IP and port) ## into a file. This way, tools running on the local machine don't ## need to query the central manager to find the startd. This ## feature can be turned off by commenting out this setting. STARTD_ADDRESS_FILE = $(LOG)/.startd_address ## When a machine is claimed, how often should we poll the state of ## the machine to see if we need to evict/suspend the job, etc? #POLLING_INTERVAL = 5 ## How often should the startd send updates to the central manager? 
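## For example (value illustrative, not the default), a very large pool
## that wants to reduce update traffic to the collector could lengthen
## the interval:
##   UPDATE_INTERVAL = 600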
#UPDATE_INTERVAL = 300 ## How long is the startd willing to stay in the "matched" state? #MATCH_TIMEOUT = 300 ## How long is the startd willing to stay in the preempting/killing ## state before it just kills the starter directly? #KILLING_TIMEOUT = 30 ## When a machine unclaimed, when should it run benchmarks? ## LastBenchmark is initialized to 0, so this expression says as soon ## as we're unclaimed, run the benchmarks. Thereafter, if we're ## unclaimed and it's been at least 4 hours since we ran the last ## benchmarks, run them again. The startd keeps a weighted average ## of the benchmark results to provide more accurate values. ## Note, if you don't want any benchmarks run at all, either comment ## RunBenchmarks out, or set it to "False". BenchmarkTimer = (CurrentTime - LastBenchmark) RunBenchmarks : (LastBenchmark == 0 ) || ($(BenchmarkTimer) >= (4 * $(HOUR))) #RunBenchmarks : False ## Normally, when the startd is computing the idle time of all the ## users of the machine (both local and remote), it checks the utmp ## file to find all the currently active ttys, and only checks access ## time of the devices associated with active logins. Unfortunately, ## on some systems, utmp is unreliable, and the startd might miss ## keyboard activity by doing this. So, if your utmp is unreliable, ## set this setting to True and the startd will check the access time ## on all tty and pty devices. #STARTD_HAS_BAD_UTMP = False ## This entry allows the startd to monitor console (keyboard and ## mouse) activity by checking the access times on special files in ## /dev. Activity on these files shows up as "ConsoleIdle" time in ## the startd's ClassAd. Just give a comma-separated list of the ## names of devices you want considered the console, without the ## "/dev/" portion of the pathname. CONSOLE_DEVICES = mouse, console ## The STARTD_ATTRS (and legacy STARTD_EXPRS) entry allows you to ## have the startd advertise arbitrary attributes from the config ## file in its ClassAd. Give the comma-separated list of entries ## from the config file you want in the startd ClassAd. ## NOTE: because of the different syntax of the config file and ## ClassAds, you might have to do a little extra work to get a given ## entry into the ClassAd. In particular, ClassAds require double ## quotes (") around your strings. Numeric values can go in ## directly, as can boolean expressions. For example, if you wanted ## the startd to advertise its list of console devices, when it's ## configured to run benchmarks, and how often it sends updates to ## the central manager, you'd have to define the following helper ## macro: #MY_CONSOLE_DEVICES = "$(CONSOLE_DEVICES)" ## Note: this must come before you define STARTD_ATTRS because macros ## must be defined before you use them in other macros or ## expressions. ## Then, you'd set the STARTD_ATTRS setting to this: #STARTD_ATTRS = MY_CONSOLE_DEVICES, RunBenchmarks, UPDATE_INTERVAL ## ## STARTD_ATTRS can also be defined on a per-slot basis. The startd ## builds the list of attributes to advertise by combining the lists ## in this order: STARTD_ATTRS, SLOTx_STARTD_ATTRS. In the below ## example, the startd ad for slot1 will have the value for ## favorite_color, favorite_season, and favorite_movie, and slot2 ## will have favorite_color, favorite_season, and favorite_song. ## #STARTD_ATTRS = favorite_color, favorite_season #SLOT1_STARTD_ATTRS = favorite_movie #SLOT2_STARTD_ATTRS = favorite_song ## ## Attributes in the STARTD_ATTRS list can also be on a per-slot basis. 
## For example, the following configuration: ## #favorite_color = "blue" #favorite_season = "spring" #SLOT2_favorite_color = "green" #SLOT3_favorite_season = "summer" #STARTD_ATTRS = favorite_color, favorite_season ## ## will result in the following attributes in the slot classified ## ads: ## ## slot1 - favorite_color = "blue"; favorite_season = "spring" ## slot2 - favorite_color = "green"; favorite_season = "spring" ## slot3 - favorite_color = "blue"; favorite_season = "summer" ## ## Finally, the recommended default value for this setting, is to ## publish the COLLECTOR_HOST setting as a string. This can be ## useful using the "$$(COLLECTOR_HOST)" syntax in the submit file ## for jobs to know (for example, via their environment) what pool ## they're running in. COLLECTOR_HOST_STRING = "$(COLLECTOR_HOST)" STARTD_ATTRS = COLLECTOR_HOST_STRING ## When the startd is claimed by a remote user, it can also advertise ## arbitrary attributes from the ClassAd of the job its working on. ## Just list the attribute names you want advertised. ## Note: since this is already a ClassAd, you don't have to do ## anything funny with strings, etc. This feature can be turned off ## by commenting out this setting (there is no default). STARTD_JOB_EXPRS = ImageSize, ExecutableSize, JobUniverse, NiceUser ## If you want to "lie" to Condor about how many CPUs your machine ## has, you can use this setting to override Condor's automatic ## computation. If you modify this, you must restart the startd for ## the change to take effect (a simple condor_reconfig will not do). ## Please read the section on "condor_startd Configuration File ## Macros" in the Condor Administrators Manual for a further ## discussion of this setting. Its use is not recommended. This ## must be an integer ("N" isn't a valid setting, that's just used to ## represent the default). #NUM_CPUS = N ## If you never want Condor to detect more the "N" CPUs, uncomment this ## line out. You must restart the startd for this setting to take ## effect. If set to 0 or a negative number, it is ignored. ## By default, it is ignored. Otherwise, it must be a positive ## integer ("N" isn't a valid setting, that's just used to ## represent the default). #MAX_NUM_CPUS = N ## Normally, Condor will automatically detect the amount of physical ## memory available on your machine. Define MEMORY to tell Condor ## how much physical memory (in MB) your machine has, overriding the ## value Condor computes automatically. For example: #MEMORY = 128 ## How much memory would you like reserved from Condor? By default, ## Condor considers all the physical memory of your machine as ## available to be used by Condor jobs. If RESERVED_MEMORY is ## defined, Condor subtracts it from the amount of memory it ## advertises as available. #RESERVED_MEMORY = 0 ###### ## SMP startd settings ## ## By default, Condor will evenly divide the resources in an SMP ## machine (such as RAM, swap space and disk space) among all the ## CPUs, and advertise each CPU as its own slot with an even share of ## the system resources. If you want something other than this, ## there are a few options available to you. Please read the section ## on "Configuring The Startd for SMP Machines" in the Condor ## Administrator's Manual for full details. The various settings are ## only briefly listed and described here. ###### ## The maximum number of different slot types. #MAX_SLOT_TYPES = 10 ## Use this setting to define your own slot types. This ## allows you to divide system resources unevenly among your CPUs. 
## You must use a different setting for each different type you ## define. The "<N>" in the name of the macro listed below must be ## an integer from 1 to MAX_SLOT_TYPES (defined above), ## and you use this number to refer to your type. There are many ## different formats these settings can take, so be sure to refer to ## the section on "Configuring The Startd for SMP Machines" in the ## Condor Administrator's Manual for full details. In particular, ## read the section titled "Defining Slot Types" to help ## understand this setting. If you modify any of these settings, you ## must restart the condor_start for the change to take effect. #SLOT_TYPE_<N> = 1/4 #SLOT_TYPE_<N> = cpus=1, ram=25%, swap=1/4, disk=1/4 # For example: #SLOT_TYPE_1 = 1/8 #SLOT_TYPE_2 = 1/4 ## If you define your own slot types, you must specify how ## many slots of each type you wish to advertise. You do ## this with the setting below, replacing the "<N>" with the ## corresponding integer you used to define the type above. You can ## change the number of a given type being advertised at run-time, ## with a simple condor_reconfig. #NUM_SLOTS_TYPE_<N> = M # For example: #NUM_SLOTS_TYPE_1 = 6 #NUM_SLOTS_TYPE_2 = 1 ## The number of evenly-divided slots you want Condor to ## report to your pool (if less than the total number of CPUs). This ## setting is only considered if the "type" settings described above ## are not in use. By default, all CPUs are reported. This setting ## must be an integer ("N" isn't a valid setting, that's just used to ## represent the default). #NUM_SLOTS = N ## How many of the slots the startd is representing should ## be "connected" to the console (in other words, notice when there's ## console activity)? This defaults to all slots (N in a ## machine with N CPUs). This must be an integer ("N" isn't a valid ## setting, that's just used to represent the default). #SLOTS_CONNECTED_TO_CONSOLE = N ## How many of the slots the startd is representing should ## be "connected" to the keyboard (for remote tty activity, as well ## as console activity). Defaults to 1. #SLOTS_CONNECTED_TO_KEYBOARD = 1 ## If there are slots that aren't connected to the ## keyboard or the console (see the above two settings), the ## corresponding idle time reported will be the time since the startd ## was spawned, plus the value of this parameter. It defaults to 20 ## minutes. We do this because, if the slot is configured ## not to care about keyboard activity, we want it to be available to ## Condor jobs as soon as the startd starts up, instead of having to ## wait for 15 minutes or more (which is the default time a machine ## must be idle before Condor will start a job). If you don't want ## this boost, just set the value to 0. If you change your START ## expression to require more than 15 minutes before a job starts, ## but you still want jobs to start right away on some of your SMP ## nodes, just increase this parameter. #DISCONNECTED_KEYBOARD_IDLE_BOOST = 1200 ###### ## Settings for computing optional resource availability statistics: ###### ## If STARTD_COMPUTE_AVAIL_STATS = True, the startd will compute ## statistics about resource availability to be included in the ## classad(s) sent to the collector describing the resource(s) the ## startd manages. The following attributes will always be included ## in the resource classad(s) if STARTD_COMPUTE_AVAIL_STATS = True: ## AvailTime = What proportion of the time (between 0.0 and 1.0) ## has this resource been in a state other than "Owner"? 
## LastAvailInterval = What was the duration (in seconds) of the ## last period between "Owner" states? ## The following attributes will also be included if the resource is ## not in the "Owner" state: ## AvailSince = At what time did the resource last leave the ## "Owner" state? Measured in the number of seconds since the ## epoch (00:00:00 UTC, Jan 1, 1970). ## AvailTimeEstimate = Based on past history, this is an estimate ## of how long the current period between "Owner" states will ## last. #STARTD_COMPUTE_AVAIL_STATS = False ## If STARTD_COMPUTE_AVAIL_STATS = True, STARTD_AVAIL_CONFIDENCE sets ## the confidence level of the AvailTimeEstimate. By default, the ## estimate is based on the 80th percentile of past values. #STARTD_AVAIL_CONFIDENCE = 0.8 ## STARTD_MAX_AVAIL_PERIOD_SAMPLES limits the number of samples of ## past available intervals stored by the startd to limit memory and ## disk consumption. Each sample requires 4 bytes of memory and ## approximately 10 bytes of disk space. #STARTD_MAX_AVAIL_PERIOD_SAMPLES = 100 ## CKPT_PROBE is the location of a program which computes aspects of the ## CheckpointPlatform classad attribute. By default the location of this ## executable will be here: $(LIBEXEC)/condor_ckpt_probe CKPT_PROBE = $(LIBEXEC)/condor_ckpt_probe ##-------------------------------------------------------------------- ## condor_schedd ##-------------------------------------------------------------------- ## Where are the various shadow binaries installed? SHADOW_LIST = SHADOW, SHADOW_STANDARD SHADOW = $(SBIN)/condor_shadow SHADOW_STANDARD = $(SBIN)/condor_shadow.std ## When the schedd starts up, it can place it's address (IP and port) ## into a file. This way, tools running on the local machine don't ## need to query the central manager to find the schedd. This ## feature can be turned off by commenting out this setting. SCHEDD_ADDRESS_FILE = $(SPOOL)/.schedd_address ## Additionally, a daemon may store its ClassAd on the local filesystem ## as well as sending it to the collector. This way, tools that need ## information about a daemon do not have to contact the central manager ## to get information about a daemon on the same machine. ## This feature is necessary for Quill to work. SCHEDD_DAEMON_AD_FILE = $(SPOOL)/.schedd_classad ## How often should the schedd send an update to the central manager? #SCHEDD_INTERVAL = 300 ## How long should the schedd wait between spawning each shadow? #JOB_START_DELAY = 2 ## How many concurrent sub-processes should the schedd spawn to handle ## queries? (Unix only) #SCHEDD_QUERY_WORKERS = 3 ## How often should the schedd send a keep alive message to any ## startds it has claimed? (5 minutes) #ALIVE_INTERVAL = 300 ## This setting controls the maximum number of times that a ## condor_shadow processes can have a fatal error (exception) before ## the condor_schedd will simply relinquish the match associated with ## the dying shadow. #MAX_SHADOW_EXCEPTIONS = 5 ## Estimated virtual memory size of each condor_shadow process. ## Specified in kilobytes. # SHADOW_SIZE_ESTIMATE = 800 ## The condor_schedd can renice the condor_shadow processes on your ## submit machines. How "nice" do you want the shadows? (1-19). ## The higher the number, the lower priority the shadows have. # SHADOW_RENICE_INCREMENT = 0 ## The condor_schedd can renice scheduler universe processes ## (e.g. DAGMan) on your submit machines. How "nice" do you want the ## scheduler universe processes? (1-19). The higher the number, the ## lower priority the processes have. 
# SCHED_UNIV_RENICE_INCREMENT = 0

## By default, when the schedd fails to start an idle job, it will
## not try to start any other idle jobs in the same cluster during
## that negotiation cycle. This makes negotiation much more
## efficient for large job clusters. However, in some cases other
## jobs in the cluster can be started even though an earlier job
## can't. For example, the jobs' requirements may differ, because of
## different disk space, memory, or operating system requirements.
## Or, machines may be willing to run only some jobs in the cluster,
## because their requirements reference the jobs' virtual memory size
## or other attribute. Setting NEGOTIATE_ALL_JOBS_IN_CLUSTER to True
## will force the schedd to try to start all idle jobs in each
## negotiation cycle. This will make negotiation cycles last longer,
## but it will ensure that all jobs that can be started will be
## started.
#NEGOTIATE_ALL_JOBS_IN_CLUSTER = False

## This setting controls how often, in seconds, the schedd considers
## periodic job actions given by the user in the submit file.
## (Currently, these are periodic_hold, periodic_release, and periodic_remove.)
#PERIODIC_EXPR_INTERVAL = 60

######
## Queue management settings:
######

## How often should the schedd truncate its job queue transaction
## log? (Specified in seconds, once a day is the default.)
#QUEUE_CLEAN_INTERVAL = 86400

## How often should the schedd commit "wall clock" run time for jobs
## to the queue, so run time statistics remain accurate when the
## schedd crashes? (Specified in seconds, once per hour is the
## default. Set to 0 to disable.)
#WALL_CLOCK_CKPT_INTERVAL = 3600

## What users do you want to grant super user access to this job
## queue? (These users will be able to remove other users' jobs).
## By default, this only includes root.
QUEUE_SUPER_USERS = root, condor

##--------------------------------------------------------------------
## condor_shadow
##--------------------------------------------------------------------
## If the shadow is unable to read a checkpoint file from the
## checkpoint server, it keeps trying only if the job has accumulated
## more than MAX_DISCARDED_RUN_TIME seconds of CPU usage. Otherwise,
## the job is started from scratch. Defaults to 1 hour. This
## setting is only used if USE_CKPT_SERVER (from above) is True.
#MAX_DISCARDED_RUN_TIME = 3600

## Should periodic checkpoints be compressed?
#COMPRESS_PERIODIC_CKPT = False

## Should vacate checkpoints be compressed?
#COMPRESS_VACATE_CKPT = False

## Should we commit the application's dirty memory pages to swap
## space during a periodic checkpoint?
#PERIODIC_MEMORY_SYNC = False

## Should we write vacate checkpoints slowly? If nonzero, this
## parameter specifies the speed at which vacate checkpoints should
## be written, in kilobytes per second.
#SLOW_CKPT_SPEED = 0

## How often should the shadow update the job queue with job
## attributes that periodically change? Specified in seconds.
#SHADOW_QUEUE_UPDATE_INTERVAL = 15 * 60

## Should the shadow wait to update certain job attributes for the
## next periodic update, or should it update these attributes
## immediately as they change? Due to performance concerns of
## aggressive updates to a busy condor_schedd, the default is True.
#SHADOW_LAZY_QUEUE_UPDATE = TRUE ##-------------------------------------------------------------------- ## condor_starter ##-------------------------------------------------------------------- ## The condor_starter can renice the processes from remote Condor ## jobs on your execute machines. If you want this, uncomment the ## following entry and set it to how "nice" you want the user ## jobs. (1-19) The larger the number, the lower priority the ## process gets on your machines. ## Note on Win32 platforms, this number needs to be greater than ## zero (i.e. the job must be reniced) or the mechanism that ## monitors CPU load on Win32 systems will give erratic results. #JOB_RENICE_INCREMENT = 10 ## Should the starter do local logging to its own log file, or send ## debug information back to the condor_shadow where it will end up ## in the ShadowLog? #STARTER_LOCAL_LOGGING = TRUE ## If the UID_DOMAIN settings match on both the execute and submit ## machines, but the UID of the user who submitted the job isn't in ## the passwd file of the execute machine, the starter will normally ## exit with an error. Do you want the starter to just start up the ## job with the specified UID, even if it's not in the passwd file? #SOFT_UID_DOMAIN = FALSE ##-------------------------------------------------------------------- ## condor_procd ##-------------------------------------------------------------------- ## # the path to the procd binary # PROCD = $(SBIN)/condor_procd # the path to the procd "address" # - on UNIX this will be a named pipe; we'll put it in the # $(LOCK) directory by default (note that multiple named pipes # will be created in this directory for when the procd responds # to its clients) # - on Windows, this will be a named pipe as well (but named pipes on # Windows are not even close to the same thing as named pipes on # UNIX); the name will be something like: # \\.\pipe\condor_procd # PROCD_ADDRESS = $(RUN)/procd_pipe # The procd currently uses a very simplistic logging system. Since this # log will not be rotated like other Condor logs, it is only recommended # to set PROCD_LOG when attempting to debug a problem. In other Condor # daemons, turning on D_PROCFAMILY will result in that daemon logging # all of its interactions with the ProcD. # #PROCD_LOG = $(LOG)/ProcLog # This is the maximum period that the procd will use for taking # snapshots (the actual period may be lower if a condor daemon registers # a family for which it wants more frequent snapshots) # PROCD_MAX_SNAPSHOT_INTERVAL = 60 # On Windows, we send a process a "soft kill" via a WM_CLOSE message. # This binary is used by the ProcD (and other Condor daemons if PRIVSEP # is not enabled) to help when sending soft kills. WINDOWS_SOFTKILL = $(SBIN)/condor_softkill ##-------------------------------------------------------------------- ## condor_submit ##-------------------------------------------------------------------- ## If you want condor_submit to automatically append an expression to ## the Requirements expression or Rank expression of jobs at your ## site, uncomment these entries. #APPEND_REQUIREMENTS = (expression to append job requirements) #APPEND_RANK = (expression to append job rank) ## If you want expressions only appended for either standard or ## vanilla universe jobs, you can uncomment these entries. If any of ## them are defined, they are used for the given universe, instead of ## the generic entries above. 
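## For instance (an illustrative sketch, not a default), a site that wants
## every vanilla universe job restricted to 64-bit Linux execute machines
## could use:
##   APPEND_REQ_VANILLA = (Arch == "X86_64" && OpSys == "LINUX")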
#APPEND_REQ_VANILLA = (expression to append to vanilla job requirements)
#APPEND_REQ_STANDARD = (expression to append to standard job requirements)
#APPEND_RANK_STANDARD = (expression to append to standard job rank)
#APPEND_RANK_VANILLA = (expression to append to vanilla job rank)

## This can be used to define a default value for the rank expression
## if one is not specified in the submit file.
#DEFAULT_RANK = (default rank expression for all jobs)

## If you want universe-specific defaults, you can use the following
## entries:
#DEFAULT_RANK_VANILLA = (default rank expression for vanilla jobs)
#DEFAULT_RANK_STANDARD = (default rank expression for standard jobs)

## If you want condor_submit to automatically append expressions to
## the job ClassAds it creates, you can uncomment and define the
## SUBMIT_EXPRS setting. It works just like the STARTD_EXPRS
## described above with respect to ClassAd vs. config file syntax,
## strings, etc. One common use would be to have the full hostname
## of the machine where a job was submitted placed in the job
## ClassAd. You would do this by uncommenting the following lines:
#MACHINE = "$(FULL_HOSTNAME)"
#SUBMIT_EXPRS = MACHINE

## Condor keeps a buffer of recently-used data for each file an
## application opens. This macro specifies the default maximum number
## of bytes to be buffered for each open file at the executing
## machine.
#DEFAULT_IO_BUFFER_SIZE = 524288

## Condor will attempt to consolidate small read and write operations
## into large blocks. This macro specifies the default block size
## Condor will use.
#DEFAULT_IO_BUFFER_BLOCK_SIZE = 32768

##--------------------------------------------------------------------
## condor_preen
##--------------------------------------------------------------------
## Who should condor_preen send email to?
#PREEN_ADMIN = $(CONDOR_ADMIN)

## What files should condor_preen leave in the spool directory?
VALID_SPOOL_FILES = job_queue.log, job_queue.log.tmp, history, \
                    Accountant.log, Accountantnew.log, \
                    local_univ_execute, .quillwritepassword, \
                    .pgpass, \
                    .schedd_address, .schedd_classad

## What files should condor_preen remove from the log directory?
INVALID_LOG_FILES = core

##--------------------------------------------------------------------
## Java parameters:
##--------------------------------------------------------------------
## If you would like this machine to be able to run Java jobs,
## then set JAVA to the path of your JVM binary. If you are not
## interested in Java, there is no harm in leaving this entry
## empty or incorrect.
JAVA = /usr/bin/java

## Some JVMs need to be told the maximum amount of heap memory
## to offer to the process. If your JVM supports this, give
## the argument here, and Condor will fill in the memory amount.
## If left blank, your JVM will choose some default value,
## typically 64 MB. The default (-Xmx) works with the Sun JVM.
JAVA_MAXHEAP_ARGUMENT = -Xmx

## JAVA_CLASSPATH_DEFAULT gives the default set of paths in which
## Java classes are to be found. Each path is separated by spaces.
## If your JVM needs to be informed of additional directories, add
## them here. However, do not remove the existing entries, as Condor
## needs them.
JAVA_CLASSPATH_DEFAULT = $(SHARE) $(SHARE)/scimark2lib.jar .
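## For example (the extra path below is purely illustrative), a site with
## its own helper classes could append another directory to the default
## classpath list:
##   JAVA_CLASSPATH_DEFAULT = $(SHARE) $(SHARE)/scimark2lib.jar /usr/local/site/java-classes .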
## JAVA_CLASSPATH_ARGUMENT describes the command-line parameter
## used to introduce a new classpath:
JAVA_CLASSPATH_ARGUMENT = -classpath

## JAVA_CLASSPATH_SEPARATOR describes the character used to mark
## one path element from another:
JAVA_CLASSPATH_SEPARATOR = :

## JAVA_BENCHMARK_TIME describes the number of seconds for which
## to run Java benchmarks. A longer time yields a more accurate
## benchmark, but consumes more otherwise useful CPU time.
## If this time is zero or undefined, no Java benchmarks will be run.
JAVA_BENCHMARK_TIME = 2

## If your JVM requires any special arguments not mentioned in
## the options above, then give them here.
JAVA_EXTRA_ARGUMENTS =

##
##--------------------------------------------------------------------
## Condor-G settings
##--------------------------------------------------------------------
## Where is the GridManager binary installed?
GRIDMANAGER = $(SBIN)/condor_gridmanager
GT2_GAHP = $(SBIN)/gahp_server
GRID_MONITOR = $(SBIN)/grid_monitor.sh

##--------------------------------------------------------------------
## Settings that control the daemon's debugging output:
##--------------------------------------------------------------------
##
## Note that the Gridmanager runs as the user, not as a Condor daemon, so
## all users must have write permission to the directory that the
## Gridmanager will use for its log file. Our suggestion is to create a
## directory called GridLogs in $(LOG) with UNIX permissions 1777
## (just like /tmp).
## Another option is to use /tmp as the location of the GridManager log.
##
MAX_GRIDMANAGER_LOG = 1000000
GRIDMANAGER_DEBUG =
GRIDMANAGER_LOG = $(LOG)/GridmanagerLog.$(USERNAME)
GRIDMANAGER_LOCK = $(LOCK)/GridmanagerLock.$(USERNAME)

##--------------------------------------------------------------------
## Various other settings that Condor-G can use.
##--------------------------------------------------------------------
## For grid-type gt2 jobs (pre-WS GRAM), limit the number of jobmanager
## processes the gridmanager will let run on the headnode. Letting too
## many jobmanagers run causes severe load on the headnode.
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10

## If we're talking to a Globus 2.0 resource, Condor-G will use the new
## version of the GRAM protocol. The first option controls how often to
## check the proxy on the submit side. If the GridManager discovers a new
## proxy, it will restart itself and use the new proxy for all future
## jobs launched. In seconds; defaults to 10 minutes.
#GRIDMANAGER_CHECKPROXY_INTERVAL = 600

## The GridManager will shut things down 3 minutes before losing contact
## because of an expired proxy.
## In seconds; defaults to 3 minutes.
#GRIDMANAGER_MINIMUM_PROXY_TIME = 180

## Condor requires that each submitted job be designated to run under a
## particular "universe".
##
## If no universe is specified in the submit file, Condor must pick one
## for the job to use. By default, it chooses the "vanilla" universe.
## The default can be overridden in the config file with the DEFAULT_UNIVERSE
## setting, which is a string to insert into a job submit description if the
## job does not define its own universe.
##
#DEFAULT_UNIVERSE = vanilla

#
# The CRED_MIN_TIME_LEFT setting is a first pass at making sure that Condor-G
# does not submit your job without it having enough time left for the
# job to finish.
For example, if you have a job that runs for 20 minutes, and # you might spend 40 minutes in the queue, it's a bad idea to submit with less # than an hour left before your proxy expires. # 2 hours seemed like a reasonable default. # CRED_MIN_TIME_LEFT = 120 ## ## The GridMonitor allows you to submit many more jobs to a GT2 GRAM server ## than is normally possible. #ENABLE_GRID_MONITOR = TRUE ## ## When an error occurs with the GridMonitor, how long should the ## gridmanager wait before trying to submit a new GridMonitor job? ## The default is 1 hour (3600 seconds). #GRID_MONITOR_DISABLE_TIME = 3600 ## ## The location of the wrapper for invoking ## Condor GAHP server ## CONDOR_GAHP = $(SBIN)/condor_c-gahp CONDOR_GAHP_WORKER = $(SBIN)/condor_c-gahp_worker_thread ## ## The Condor GAHP server has it's own log. Like the Gridmanager, the ## GAHP server is run as the User, not a Condor daemon, so all users must ## have write permssion to the directory used for the logfile. Our ## suggestion is to create a directory called GridLogs in $(LOG) with ## UNIX permissions 1777 (just like /tmp ) ## Another option is to use /tmp as the location of the CGAHP log. ## MAX_C_GAHP_LOG = 1000000 #C_GAHP_LOG = $(LOG)/GridLogs/CGAHPLog.$(USERNAME) C_GAHP_LOG = /tmp/CGAHPLog.$(USERNAME) C_GAHP_LOCK = /tmp/CGAHPLock.$(USERNAME) C_GAHP_WORKER_THREAD_LOG = /tmp/CGAHPWorkerLog.$(USERNAME) C_GAHP_WORKER_THREAD_LOCK = /tmp/CGAHPWorkerLock.$(USERNAME) ## ## The location of the wrapper for invoking ## GT4 GAHP server ## GT4_GAHP = $(SBIN)/gt4_gahp ## ## The location of GT4 files. This should normally be lib/gt4 ## GT4_LOCATION = $(LIB)/gt4 ## ## The location of the wrapper for invoking ## GT4 GAHP server ## GT42_GAHP = $(SBIN)/gt42_gahp ## ## The location of GT4 files. This should normally be lib/gt4 ## GT42_LOCATION = $(LIB)/gt42 ## ## gt4 gram requires a gridftp server to perform file transfers. ## If GRIDFTP_URL_BASE is set, then Condor assumes there is a gridftp ## server set up at that URL suitable for its use. Otherwise, Condor ## will start its own gridftp servers as needed, using the binary ## pointed at by GRIDFTP_SERVER. GRIDFTP_SERVER_WRAPPER points to a ## wrapper script needed to properly set the path to the gridmap file. ## #GRIDFTP_URL_BASE = gsiftp://$(FULL_HOSTNAME) GRIDFTP_SERVER = $(LIBEXEC)/globus-gridftp-server GRIDFTP_SERVER_WRAPPER = $(LIBEXEC)/gridftp_wrapper.sh ## ## Location of the PBS/LSF gahp and its associated binaries ## GLITE_LOCATION = $(LIB)/glite PBS_GAHP = $(GLITE_LOCATION)/bin/batch_gahp LSF_GAHP = $(GLITE_LOCATION)/bin/batch_gahp ## ## The location of the wrapper for invoking the Unicore GAHP server ## UNICORE_GAHP = $(SBIN)/unicore_gahp ## ## The location of the wrapper for invoking the NorduGrid GAHP server ## NORDUGRID_GAHP = $(SBIN)/nordugrid_gahp ## The location of the CREAM GAHP server CREAM_GAHP = $(SBIN)/cream_gahp ## Condor-G and CredD can use MyProxy to refresh GSI proxies which are ## about to expire. #MYPROXY_GET_DELEGATION = /path/to/myproxy-get-delegation ## ## EC2: Universe = Grid, Grid_Resource = Amazon ## ## The location of the amazon_gahp program, required AMAZON_GAHP = $(SBIN)/amazon_gahp ## Location of log files, useful for debugging, must be in ## a directory writable by any user, such as /tmp #AMAZON_GAHP_DEBUG = D_FULLDEBUG AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME) ## The number of seconds between status update requests to EC2. 
## You can make this short (5 seconds) if you want Condor to respond quickly to
## instances as they terminate, or you can make it long (300 seconds = 5
## minutes) if you know your instances will run for a while and don't mind a
## delay between when they stop and when Condor responds to them
## stopping.
GRIDMANAGER_JOB_PROBE_INTERVAL = 300

## As of this writing Amazon EC2 has a hard limit of 20 concurrently
## running instances, so a limit of 20 is imposed so the GridManager
## does not waste its time sending requests that will be rejected.
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20

##
##--------------------------------------------------------------------
## condor_credd credential management daemon
##--------------------------------------------------------------------
## Where is the CredD binary installed?
CREDD = $(SBIN)/condor_credd

## When the credd starts up, it can place its address (IP and port)
## into a file. This way, tools running on the local machine don't
## need an additional "-n host:port" command line option. This
## feature can be turned off by commenting out this setting.
CREDD_ADDRESS_FILE = $(LOG)/.credd_address

## Specify a remote credd server here,
#CREDD_HOST = $(CONDOR_HOST):$(CREDD_PORT)

## CredD startup arguments
## Start the CredD on a well-known port. Uncomment to simplify
## connecting to a remote CredD. Note that this interface may change
## in a future release.
CREDD_PORT = 9620
CREDD_ARGS = -p $(CREDD_PORT) -f

## CredD daemon debugging log
CREDD_LOG = $(LOG)/CredLog
CREDD_DEBUG = D_FULLDEBUG
MAX_CREDD_LOG = 4000000

## The credential owner submits the credential. This list specifies
## other users who are also permitted to see all credentials. Defaults
## to root on Unix systems, and Administrator on Windows systems.
#CRED_SUPER_USERS =

## Credential storage location. This directory must exist
## prior to starting condor_credd. It is highly recommended to
## restrict access permissions to _only_ the directory owner.
CRED_STORE_DIR = $(LOCAL_DIR)/cred_dir

## Index file path of saved credentials.
## This file will be automatically created if it does not exist.
#CRED_INDEX_FILE = $(CRED_STORE_DIR)/cred-index

## condor_credd will attempt to refresh credentials when their
## remaining lifespan is less than this value. Units = seconds.
#DEFAULT_CRED_EXPIRE_THRESHOLD = 3600

## condor_credd periodically checks remaining lifespan of stored
## credentials, at this interval.
#CRED_CHECK_INTERVAL = 60

##
##--------------------------------------------------------------------
## Stork data placement server
##--------------------------------------------------------------------
## Where is the Stork binary installed?
STORK = $(SBIN)/stork_server

## When Stork starts up, it can place its address (IP and port)
## into a file. This way, tools running on the local machine don't
## need an additional "-n host:port" command line option. This
## feature can be turned off by commenting out this setting.
STORK_ADDRESS_FILE = $(LOG)/.stork_address

## Specify a remote Stork server here,
#STORK_HOST = $(CONDOR_HOST):$(STORK_PORT)

## STORK_LOG_BASE specifies the basename for heritage Stork log files.
## Stork uses this macro to create the following output log files:
##   $(STORK_LOG_BASE): Stork server job queue classad collection
##     journal file.
##   $(STORK_LOG_BASE).history: Used to track completed jobs.
##   $(STORK_LOG_BASE).user_log: User level log, also used by DAGMan.
STORK_LOG_BASE = $(LOG)/Stork

## Modern Condor DaemonCore logging feature.
STORK_LOG = $(LOG)/StorkLog STORK_DEBUG = D_FULLDEBUG MAX_STORK_LOG = 4000000 ## Stork startup arguments ## Start Stork on a well-known port. Uncomment to to simplify ## connecting to a remote Stork. Note: that this interface may change ## in a future release. #STORK_PORT = 34048 STORK_PORT = 9621 STORK_ARGS = -p $(STORK_PORT) -f -Serverlog $(STORK_LOG_BASE) ## Stork environment. Stork modules may require external programs and ## shared object libraries. These are located using the PATH and ## LD_LIBRARY_PATH environments. Further, some modules may require ## further specific environments. By default, Stork inherits a full ## environment when invoked from condor_master or the shell. If the ## default environment is not adequate for all Stork modules, specify ## a replacement environment here. This environment will be set by ## condor_master before starting Stork, but does not apply if Stork is ## started directly from the command line. #STORK_ENVIRONMENT = TMP=/tmp;CONDOR_CONFIG=/special/config;PATH=/lib ## Limits the number of concurrent data placements handled by Stork. #STORK_MAX_NUM_JOBS = 5 ## Limits the number of retries for a failed data placement. #STORK_MAX_RETRY = 5 ## Limits the run time for a data placement job, after which the ## placement is considered failed. #STORK_MAXDELAY_INMINUTES = 10 ## Temporary credential storage directory used by Stork. #STORK_TMP_CRED_DIR = /tmp ## Directory containing Stork modules. #STORK_MODULE_DIR = $(LIBEXEC) ## ##-------------------------------------------------------------------- ## Quill Job Queue Mirroring Server ##-------------------------------------------------------------------- ## Where is the Quill binary installed and what arguments should be passed? QUILL = $(SBIN)/condor_quill #QUILL_ARGS = # Where is the log file for the quill daemon? QUILL_LOG = $(LOG)/QuillLog # The identification and location of the quill daemon for local clients. QUILL_ADDRESS_FILE = $(LOG)/.quill_address # If this is set to true, then the rest of the QUILL arguments must be defined # for quill to function. If it is Fase or left undefined, then quill will not # be consulted by either the scheduler or the tools, but in the case of a # remote quill query where the local client has quill turned off, but the # remote client has quill turned on, things will still function normally. #QUILL_ENABLED = TRUE # # If Quill is enabled, by default it will only mirror the current job # queue into the database. For historical jobs, and classads from other # sources, the SQL Log must be enabled. #QUILL_USE_SQL_LOG=FALSE # # The SQL Log can be enabled on a per-daemon basis. For example, to collect # historical job information, but store no information about execute machines, # uncomment these two lines #QUILL_USE_SQL_LOG = FALSE #SCHEDD.QUILL_USE_SQL_LOG = TRUE # This will be the name of a quill daemon using this config file. This name # should not conflict with any other quill name--or schedd name. #QUILL_NAME = quill@postgresql-server.machine.com # The Postgreql server requires usernames that can manipulate tables. This will # be the username associated with this instance of the quill daemon mirroring # a schedd's job queue. Each quill daemon must have a unique username # associated with it otherwise multiple quill daemons will corrupt the data # held under an indentical user name. #QUILL_DB_NAME = name_of_db # The required password for the DB user which quill will use to read # information from the database about the queue. 
#QUILL_DB_QUERY_PASSWORD = foobar # What kind of database server is this? # For now, only PGSQL is supported #QUILL_DB_TYPE = PGSQL # The machine and port of the postgres server. # Although this says IP Addr, it can be a DNS name. # It must match whatever format you used for the .pgpass file, however #QUILL_DB_IP_ADDR = machine.domain.com:5432 # The login to use to attach to the database for updating information. # There should be an entry in file $SPOOL/.pgpass that gives the password # for this login id. #QUILL_DB_USER = quillwriter # Polling period, in seconds, for when quill reads transactions out of the # schedd's job queue log file and puts them into the database. #QUILL_POLLING_PERIOD = 10 # Allows or disallows a remote query to the quill daemon and database # which is reading this log file. Defaults to true. #QUILL_IS_REMOTELY_QUERYABLE = TRUE # Add debugging flags to here if you need to debug quill for some reason. #QUILL_DEBUG = D_FULLDEBUG # Number of seconds the master should wait for the Quill daemon to respond # before killing it. This number might need to be increased for very # large logfiles. # The default is 3600 (one hour), but kicking it up to a few hours won't hurt #QUILL_NOT_RESPONDING_TIMEOUT = 3600 # Should Quill hold open a database connection to the DBMSD? # Each open connection consumes resources at the server, so large pools # (100 or more machines) should set this variable to FALSE. Note the # default is TRUE. #QUILL_MAINTAIN_DB_CONN = TRUE ## ##-------------------------------------------------------------------- ## Database Management Daemon settings ##-------------------------------------------------------------------- ## Where is the DBMSd binary installed and what arguments should be passed? DBMSD = $(SBIN)/condor_dbmsd DBMSD_ARGS = -f # Where is the log file for the quill daemon? DBMSD_LOG = $(LOG)/DbmsdLog # Interval between consecutive purging calls (in seconds) #DATABASE_PURGE_INTERVAL = 86400 # Interval between consecutive database reindexing operations # This is only used when dbtype = PGSQL #DATABASE_REINDEX_INTERVAL = 86400 # Number of days before purging resource classad history # This includes things like machine ads, daemon ads, submitters #QUILL_RESOURCE_HISTORY_DURATION = 7 # Number of days before purging job run information # This includes job events, file transfers, matchmaker matches, etc # This does NOT include the final job ad. condor_history does not need # any of this information to work. #QUILL_RUN_HISTORY_DURATION = 7 # Number of days before purging job classad history # This is the information needed to run condor_history #QUILL_JOB_HISTORY_DURATION = 3650 # DB size threshold for warning the condor administrator. This is checked # after every purge. The size is given in gigabytes. #QUILL_DBSIZE_LIMIT = 20 # Number of seconds the master should wait for the DBMSD to respond before # killing it. This number might need to be increased for very large databases # The default is 3600 (one hour). #DBMSD_NOT_RESPONDING_TIMEOUT = 3600 ## ##-------------------------------------------------------------------- ## VM Universe Parameters ##-------------------------------------------------------------------- ## Where is the Condor VM-GAHP installed? (Required) VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp ## If the VM-GAHP is to have its own log, define ## the location of log file. ## ## Optionally, if you do NOT define VM_GAHP_LOG, logs of VM-GAHP will ## be stored in the starter's log file. 
## However, on Windows machine you must always define VM_GAHP_LOG. # VM_GAHP_LOG = $(LOG)/VMGahpLog MAX_VM_GAHP_LOG = 1000000 #VM_GAHP_DEBUG = D_FULLDEBUG ## What kind of virtual machine program will be used for ## the VM universe? ## The two options are vmware and xen. (Required) #VM_TYPE = vmware ## How much memory can be used for the VM universe? (Required) ## This value is the maximum amount of memory that can be used by the ## virtual machine program. #VM_MEMORY = 128 ## Want to support networking for VM universe? ## Default value is FALSE #VM_NETWORKING = FALSE ## What kind of networking types are supported? ## ## If you set VM_NETWORKING to TRUE, you must define this parameter. ## VM_NETWORKING_TYPE = nat ## VM_NETWORKING_TYPE = bridge ## VM_NETWORKING_TYPE = nat, bridge ## ## If multiple networking types are defined, you may define ## VM_NETWORKING_DEFAULT_TYPE for default networking type. ## Otherwise, nat is used for default networking type. ## VM_NETWORKING_DEFAULT_TYPE = nat #VM_NETWORKING_DEFAULT_TYPE = nat #VM_NETWORKING_TYPE = nat ## In default, the number of possible virtual machines is same as ## NUM_CPUS. ## Since too many virtual machines can cause the system to be too slow ## and lead to unexpected problems, limit the number of running ## virtual machines on this machine with #VM_MAX_NUMBER = 2 ## When a VM universe job is started, a status command is sent ## to the VM-GAHP to see if the job is finished. ## If the interval between checks is too short, it will consume ## too much of the CPU. If the VM-GAHP fails to get status 5 times in a row, ## an error will be reported to startd, and then startd will check ## the availability of VM universe. ## Default value is 60 seconds and minimum value is 30 seconds #VM_STATUS_INTERVAL = 60 ## How long will we wait for a request sent to the VM-GAHP to be completed? ## If a request is not completed within the timeout, an error will be reported ## to the startd, and then the startd will check ## the availability of vm universe. Default value is 5 mins. #VM_GAHP_REQ_TIMEOUT = 300 ## When VMware or Xen causes an error, the startd will disable the ## VM universe. However, because some errors are just transient, ## we will test one more ## whether vm universe is still unavailable after some time. ## In default, startd will recheck vm universe after 10 minutes. ## If the test also fails, vm universe will be disabled. #VM_RECHECK_INTERVAL = 600 ## Usually, when we suspend a VM, the memory being used by the VM ## will be saved into a file and then freed. ## However, when we use soft suspend, neither saving nor memory freeing ## will occur. ## For VMware, we send SIGSTOP to a process for VM in order to ## stop the VM temporarily and send SIGCONT to resume the VM. ## For Xen, we pause CPU. Pausing CPU doesn't save the memory of VM ## into a file. It only stops the execution of a VM temporarily. #VM_SOFT_SUSPEND = TRUE ## If Condor runs as root and a job comes from a different UID domain, ## Condor generally uses "nobody", unless SLOTx_USER is defined. ## If "VM_UNIV_NOBODY_USER" is defined, a VM universe job will run ## as the user defined in "VM_UNIV_NOBODY_USER" instead of "nobody". ## ## Notice: In VMware VM universe, "nobody" can not create a VMware VM. ## So we need to define "VM_UNIV_NOBODY_USER" with a regular user. ## For VMware, the user defined in "VM_UNIV_NOBODY_USER" must have a ## home directory. So SOFT_UID_DOMAIN doesn't work for VMware VM universe job. 
## If neither "VM_UNIV_NOBODY_USER" nor "SLOTx_VMUSER"/"SLOTx_USER" is defined, ## VMware VM universe job will run as "condor" instead of "nobody". ## As a result, the preference of local users for a VMware VM universe job ## which comes from the different UID domain is ## "VM_UNIV_NOBODY_USER" -> "SLOTx_VMUSER" -> "SLOTx_USER" -> "condor". #VM_UNIV_NOBODY_USER = login name of a user who has home directory ## If Condor runs as root and "ALWAYS_VM_UNIV_USE_NOBODY" is set to TRUE, ## all VM universe jobs will run as a user defined in "VM_UNIV_NOBODY_USER". #ALWAYS_VM_UNIV_USE_NOBODY = FALSE ##-------------------------------------------------------------------- ## VM Universe Parameters Specific to VMware ##-------------------------------------------------------------------- ## Where is perl program? (Required) VMWARE_PERL = perl ## Where is the Condor script program to control VMware? (Required) VMWARE_SCRIPT = $(SBIN)/condor_vm_vmware.pl ## Networking parameters for VMware ## ## What kind of VMware networking is used? ## ## If multiple networking types are defined, you may specify different ## parameters for each networking type. ## ## Examples ## (e.g.) VMWARE_NAT_NETWORKING_TYPE = nat ## (e.g.) VMWARE_BRIDGE_NETWORKING_TYPE = bridged ## ## If there is no parameter for specific networking type, VMWARE_NETWORKING_TYPE is used. ## #VMWARE_NAT_NETWORKING_TYPE = nat #VMWARE_BRIDGE_NETWORKING_TYPE = bridged VMWARE_NETWORKING_TYPE = nat ## The contents of this file will be inserted into the .vmx file of ## the VMware virtual machine before Condor starts it. #VMWARE_LOCAL_SETTINGS_FILE = /path/to/file ##-------------------------------------------------------------------- ## VM Universe Parameters common to libvirt controlled vm's (kvm & xen) ##-------------------------------------------------------------------- ## Where is the Condor script program to control KVM & Xen? (Required) VM_SCRIPT = $(SBIN)/condor_vm_xen.sh ## Networking parameters for KVM & Xen ## ## This is the path to the XML helper command; the libvirt_simple_script.awk ## script just reproduces what Condor already does for the kvm/xen VM ## universe LIBVIRT_XML_SCRIPT = $(LIBEXEC)/libvirt_simple_script.awk ## This is the optional debugging output file for the xml helper ## script. Scripts that need to output debugging messages should ## write them to the file specified by this argument, which will be ## passed as the second command line argument when the script is ## executed #LIBVRT_XML_SCRIPT_ARGS = /dev/stderr ##-------------------------------------------------------------------- ## VM Universe Parameters Specific to Xen ##-------------------------------------------------------------------- ## Where is bootloader for Xen domainU? (Required) ## ## The bootloader will be used in the case that a kernel image includes ## a disk image #XEN_BOOTLOADER = /usr/bin/pygrub ## The contents of this file will be added to the Xen virtual machine ## description that Condor writes. #XEN_LOCAL_SETTINGS_FILE = /path/to/file ## ##-------------------------------------------------------------------- ## condor_lease_manager lease manager daemon ##-------------------------------------------------------------------- ## Where is the LeaseManager binary installed? LeaseManager = $(SBIN)/condor_lease_manager # Turn on the lease manager #DAEMON_LIST = $(DAEMON_LIST), LeaseManager # The identification and location of the lease manager for local clients. 
LeaseManger_ADDRESS_FILE = $(LOG)/.lease_manager_address ## LeaseManager startup arguments #LeaseManager_ARGS = -local-name generic ## LeaseManager daemon debugging log LeaseManager_LOG = $(LOG)/LeaseManagerLog LeaseManager_DEBUG = D_FULLDEBUG MAX_LeaseManager_LOG = 1000000 # Basic parameters LeaseManager.GETADS_INTERVAL = 60 LeaseManager.UPDATE_INTERVAL = 300 LeaseManager.PRUNE_INTERVAL = 60 LeaseManager.DEBUG_ADS = False LeaseManager.CLASSAD_LOG = $(SPOOL)/LeaseManagerState #LeaseManager.QUERY_ADTYPE = Any #LeaseManager.QUERY_CONSTRAINTS = target.MyType == "SomeType" #LeaseManager.QUERY_CONSTRAINTS = target.TargetType == "SomeType" ## ##-------------------------------------------------------------------- ## KBDD - keyboard activity detection daemon ##-------------------------------------------------------------------- ## When the KBDD starts up, it can place it's address (IP and port) ## into a file. This way, tools running on the local machine don't ## need an additional "-n host:port" command line option. This ## feature can be turned off by commenting out this setting. KBDD_ADDRESS_FILE = $(LOG)/.kbdd_address ## ##-------------------------------------------------------------------- ## condor_ssh_to_job ##-------------------------------------------------------------------- # NOTE: condor_ssh_to_job is not supported under Windows. # Tell the starter (execute side) whether to allow the job owner or # queue super user on the schedd from which the job was submitted to # use condor_ssh_to_job to access the job interactively (e.g. for # debugging). TARGET is the job; MY is the machine. #ENABLE_SSH_TO_JOB = true # Tell the schedd (submit side) whether to allow the job owner or # queue super user to use condor_ssh_to_job to access the job # interactively (e.g. for debugging). MY is the job; TARGET is not # defined. #SCHEDD_ENABLE_SSH_TO_JOB = true # Command condor_ssh_to_job should use to invoke the ssh client. # %h --> remote host # %i --> ssh key file # %k --> known hosts file # %u --> remote user # %x --> proxy command # %% --> % #SSH_TO_JOB_SSH_CMD = ssh -oUser=%u -oIdentityFile=%i -oStrictHostKeyChecking=yes -oUserKnownHostsFile=%k -oGlobalKnownHostsFile=%k -oProxyCommand=%x %h # Additional ssh clients may be configured. They all have the same # default as ssh, except for scp, which omits the %h: #SSH_TO_JOB_SCP_CMD = scp -oUser=%u -oIdentityFile=%i -oStrictHostKeyChecking=yes -oUserKnownHostsFile=%k -oGlobalKnownHostsFile=%k -oProxyCommand=%x # Path to sshd #SSH_TO_JOB_SSHD = /usr/sbin/sshd # Arguments the starter should use to invoke sshd in inetd mode. # %f --> sshd config file # %% --> % #SSH_TO_JOB_SSHD_ARGS = "-i -e -f %f" # sshd configuration template used by condor_ssh_to_job_sshd_setup. SSH_TO_JOB_SSHD_CONFIG_TEMPLATE = $(ETC)/condor_ssh_to_job_sshd_config_template # Path to ssh-keygen #SSH_TO_JOB_SSH_KEYGEN = /usr/bin/ssh-keygen # Arguments to ssh-keygen # %f --> key file to generate # %% --> % #SSH_TO_JOB_SSH_KEYGEN_ARGS = "-N '' -C '' -q -f %f -t rsa" ###################################################################### ## ## Condor HDFS ## ## This is the default local configuration file for configuring Condor ## daemon responsible for running services related to hadoop ## distributed storage system.You should copy this file to the ## appropriate location and customize it for your needs. ## ## Unless otherwise specified, settings that are commented out show ## the defaults that are used if you don't define a value. 
Settings ## that are defined here MUST BE DEFINED since they have no default ## value. ## ###################################################################### ###################################################################### ## FOLLOWING MUST BE CHANGED ###################################################################### ## The location for hadoop installation directory. The default location ## is under 'libexec' directory. The directory pointed by HDFS_HOME ## should contain a lib folder that contains all the required Jars necessary ## to run HDFS name and data nodes. #HDFS_HOME = $(RELEASE_DIR)/libexec/hdfs ## The host and port for hadoop's name node. If this machine is the ## name node (see HDFS_SERVICES) then the specified port will be used ## to run name node. HDFS_NAMENODE = example.com:9000 HDFS_NAMENODE_WEB = example.com:8000 ## You need to pick one machine as name node by setting this parameter ## to HDFS_NAMENODE. The remaining machines in a storage cluster will ## act as data nodes (HDFS_DATANODE). HDFS_SERVICES = HDFS_DATANODE ## The two set of directories that are required by HDFS are for name ## node (HDFS_NAMENODE_DIR) and data node (HDFS_DATANODE_DIR). The ## directory for name node is only required for a machine running ## name node service and is used to store critical meta data for ## files. The data node needs its directory to store file blocks and ## their replicas. HDFS_NAMENODE_DIR = /tmp/hadoop_name HDFS_DATANODE_DIR = /scratch/tmp/hadoop_data ## Unlike name node address settings (HDFS_NAMENODE), that needs to be ## well known across the storage cluster, data node can run on any ## arbitrary port of given host. #HDFS_DATANODE_ADDRESS = 0.0.0.0:0 #################################################################### ## OPTIONAL ##################################################################### ## Sets the log4j debug level. All the emitted debug output from HDFS ## will go in 'hdfs.log' under $(LOG) directory. #HDFS_LOG4J=DEBUG ## The access to HDFS services both name node and data node can be ## restricted by specifying IP/host based filters. By default settings ## from ALLOW_READ/ALLOW_WRITE and DENY_READ/DENY_WRITE ## are used to specify allow and deny list. The below two parameters can ## be used to override these settings. Read the Condor manual for ## specification of these filters. ## WARN: HDFS doesn't make any distinction between read or write based connection. #HDFS_ALLOW=* #HDFS_DENY=* #Fully qualified name for Name node and Datanode class. #HDFS_NAMENODE_CLASS=org.apache.hadoop.hdfs.server.namenode.NameNode #HDFS_DATANODE_CLASS=org.apache.hadoop.hdfs.server.datanode.DataNode ## In case an old name for hdfs configuration files is required. #HDFS_SITE_FILE = hadoop-site.xml
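The macros above are reproduced from the default global configuration file. Site-specific changes normally go into a small file in the local configuration directory rather than into the global file itself. The fragment below is a minimal sketch of such an override file; the file name and every value shown are illustrative assumptions, not shipped defaults.

## Hypothetical override file: /etc/condor/config.d/60-site-overrides.config
## Poll EC2 jobs more often than the default 300 seconds
GRIDMANAGER_JOB_PROBE_INTERVAL = 30
## Run the credential daemon on this host (comma-separated style follows
## the DAEMON_LIST examples shown above)
DAEMON_LIST = $(DAEMON_LIST), CREDD
## Enable the VM universe with Xen and a 512 MB memory ceiling
VM_TYPE = xen
VM_MEMORY = 512
VM_NETWORKING = TRUE
VM_NETWORKING_TYPE = nat
## Point this node at the site's HDFS name node and run it as a data node
HDFS_NAMENODE = namenode.example.com:9000
HDFS_SERVICES = HDFS_DATANODE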
Code | Universe | Details |
---|---|---|
5 | Vanilla universe | Single process, non-relinked jobs |
7 | Scheduler universe | Jobs run under the schedd |
9 | Grid universe | Jobs managed by the condor_gridmanager |
10 | Java universe | Jobs for the Java Virtual Machine |
11 | Parallel universe | General parallel jobs |
12 | Local universe | A job run under the schedd using a starter |
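These codes are stored in the JobUniverse attribute of each job's ClassAd (the attribute name is assumed from the standard ClassAd schema rather than stated above), so they can be used directly in queries. A minimal sketch:

# List only vanilla universe (code 5) jobs currently queued
condor_q -constraint 'JobUniverse == 5'
# List only grid universe (code 9) jobs
condor_q -constraint 'JobUniverse == 9'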
Code | Short Description | Long Description |
---|---|---|
1 | I | Idle |
2 | R | Running |
3 | X | Removed |
4 | C | Completed |
5 | H | Held |
6 | E | Submission Error |
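The numeric code is what is stored in the JobStatus attribute of the job ClassAd; the single-letter form is what condor_q prints in its status column. Assuming the standard attribute name, a couple of quick queries:

# Jobs currently held (code 5)
condor_q -constraint 'JobStatus == 5'
# Completed jobs (code 4) recorded in the history file
condor_history -constraint 'JobStatus == 4'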
Code | Frequency |
---|---|
0 | Never |
1 | Always |
2 | Complete |
3 | Error |
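These values correspond to the notification command in a submit description file, which controls how often the job's owner is emailed about it. A short, illustrative submit fragment (the executable name is a placeholder):

# Only send mail if the job encounters an error
executable = analysis.sh
notification = Error
queue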
The following exit codes are returned by condor_shadow when exiting:
Code | Command | Description |
---|---|---|
4 | JOB_EXCEPTION | The job exited with an exception |
44 | DPRINTF_ERROR | There is a fatal error with dprintf() |
100 | JOB_EXITED | The job exited |
102 | JOB_KILLED | The job was killed |
103 | JOB_COREDUMPED | The job was killed and a core file produced |
105 | JOB_NO_MEM | There was not enough memory to start condor_shadow |
106 | JOB_SHADOW_USAGE | Incorrect arguments were provided to condor_shadow |
107 | JOB_SHOULD_REQUEUE | Requeue the job to be run again |
108 | JOB_NOT_STARTED | Cannot connect to condor_startd, or the request was refused |
109 | JOB_BAD_STATUS | The job status was something other than RUNNING when it was started |
110 | JOB_EXEC_FAILED | Execution failed for an unknown reason |
112 | JOB_SHOULD_HOLD | Put the job on hold |
113 | JOB_SHOULD_REMOVE | Remove the job |
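Each condor_shadow reports one of these codes as its own exit status, and the value is normally recorded at the end of its log output. A rough way to survey recent shadow exits, assuming the default ShadowLog location and the usual daemon shutdown message format:

grep "EXITING WITH STATUS" /var/log/condor/ShadowLog | tail -n 20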
Code | Error | Description |
---|---|---|
0 | Unspecified | This error code is being deprecated |
1 | UserRequest | The user put the job on hold with condor_hold |
3 | JobPolicy | The periodic hold expression evaluated to TRUE |
4 | CorruptedCredential | The credentials for the job were invalid |
5 | JobPolicyUndefined | A job policy expression (such as PeriodicHold) evaluated to UNDEFINED |
6 | FailedToCreateProcess | The condor_starter could not start the executable |
7 | UnableToOpenOutput | The standard output file for the job could not be opened |
8 | UnableToOpenInput | The standard input file for the job could not be opened |
9 | UnableToOpenOutputStream | The standard output stream for the job could not be opened |
10 | UnableToOpenInputStream | The standard input stream for the job could not be opened |
11 | InvalidTransferAck | An internal protocol error was encountered when transferring files |
12 | DownloadFileError | The condor_starter could not download the input files |
13 | UploadFileError | The condor_starter could not upload the output files |
14 | IwdError | The initial working directory of the job cannot be accessed |
15 | SubmittedOnHold | The user requested the job be submitted on hold |
16 | SpoolingInput | Input files are being spooled |
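For a held job the numeric code is stored in the HoldReasonCode attribute, next to the human-readable HoldReason string (both attribute names are assumed from the standard ClassAd schema). One way to see why jobs are held, provided the installed condor_q supports -autoformat:

condor_q -constraint 'JobStatus == 5' -autoformat ClusterId ProcId HoldReasonCode HoldReason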
Feature | Conflicts | Depends | Included |
---|---|---|---|
AviaryScheduler | None | BaseScheduler | Axis2Home |
Axis2Home | None | None | None |
BaseJobExecuter | None | None | None |
BaseScheduler | None | Master, NodeAccess | BaseJobExecuter |
CentralManager | None | NodeAccess | Collector, Negotiator |
Collector | None | Master, NodeAccess | None |
CommonUIDDomain | None | None | None |
ConcurrencyLimits | None | None | Negotiator |
ConsoleCollector | None | QMF | Collector |
ConsoleScheduler | None | QMF, BaseScheduler | None |
ConsoleExecuteNode | None | QMF, ExecuteNode | None |
ConsoleMaster | None | QMF, Master | None |
ConsoleNegotiator | None | QMF, Negotiator | None |
DedicatedResource | None | None | ExecuteNode |
DedicatedScheduler | None | None | Scheduler |
DynamicSlots | None | None | ExecuteNode |
EC2 | None | None | ExecuteNode |
EC2Enhanced | None | None | JobRouter |
ExecuteNode | None | Master | BaseJobExecuter |
ExecuteNodeDedicatedPreemption | None | None | DedicatedResource |
ExecuteNodeTriggerData | None | None | ExecuteNode |
HACentralManager | None | None | CentralManager |
HAScheduler | Scheduler | None | JobQueueLocation, BaseScheduler |
JobHooks | None | None | None |
JobQueueLocation | None | None | None |
JobRouter | None | Master, BaseScheduler | None |
JobServer | None | Master, QMF, JobQueueLocation | None |
KeyboardMonitor | None | Master | None |
LowLatency | None | JobHooks | ExecuteNode |
Master | None | NodeAccess | None |
Negotiator | None | Master, NodeAccess | None |
NodeAccess | None | None | None |
PowerManagementCollector | None | None | Collector |
PowerManagementNode | Collector, Negotiator, Scheduler | None | ExecuteNode |
PowerManagementSubnetManager | PowerManagementNode | None | None |
QMF | None | None | None |
QueryServer | None | Master, JobQueueLocation | Axis2Home |
Scheduler | HAScheduler | None | JobQueueLocation, BaseScheduler |
SchedulerDedicatedPreemption | None | None | DedicatedScheduler |
SharedFileSystem | None | None | None |
SharedPort | None | None | None |
TriggerService | None | Master, QMF | None |
VMUniverse | None | None | ExecuteNode |
The BaseJobExecuter and BaseScheduler features are not intended to be installed alone. They must be installed with a feature that depends on them.
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and
distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright
owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all other entities
that control, are controlled by, or are under common control with that entity.
For the purposes of this definition, "control" means (i) the power, direct or
indirect, to cause the direction or management of such entity, whether by
contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity exercising
permissions granted by this License.
"Source" form shall mean the preferred form for making modifications, including
but not limited to software source code, documentation source, and configuration
files.
"Object" form shall mean any form resulting from mechanical transformation or
translation of a Source form, including but not limited to compiled object code,
generated documentation, and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or Object form, made
available under the License, as indicated by a copyright notice that is included
in or attached to the work (an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object form, that
is based on (or derived from) the Work and for which the editorial revisions,
annotations, elaborations, or other modifications represent, as a whole, an
original work of authorship. For the purposes of this License, Derivative Works
shall not include works that remain separable from, or merely link (or bind by
name) to the interfaces of, the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including the original version
of the Work and any modifications or additions to that Work or Derivative Works
thereof, that is intentionally submitted to Licensor for inclusion in the Work
by the copyright owner or by an individual or Legal Entity authorized to submit
on behalf of the copyright owner. For the purposes of this definition,
"submitted" means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems, and
issue tracking systems that are managed by, or on behalf of, the Licensor for
the purpose of discussing and improving the Work, but excluding communication
that is conspicuously marked or otherwise designated in writing by the copyright
owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity on behalf
of whom a Contribution has been received by Licensor and subsequently
incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of this
License, each Contributor hereby grants to You a perpetual, worldwide,
non-exclusive, no-charge, royalty-free, irrevocable copyright license to
reproduce, prepare Derivative Works of, publicly display, publicly perform,
sublicense, and distribute the Work and such Derivative Works in Source or
Object form.
3. Grant of Patent License. Subject to the terms and conditions of this License,
each Contributor hereby grants to You a perpetual, worldwide, non-exclusive,
no-charge, royalty-free, irrevocable (except as stated in this section) patent
license to make, have made, use, offer to sell, sell, import, and otherwise
transfer the Work, where such license applies only to those patent claims
licensable by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s) with the Work
to which such Contribution(s) was submitted. If You institute patent litigation
against any entity (including a cross-claim or counterclaim in a lawsuit)
alleging that the Work or a Contribution incorporated within the Work
constitutes direct or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate as of the date
such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the Work or
Derivative Works thereof in any medium, with or without modifications, and in
Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of
this License; and
You must cause any modified files to carry prominent notices stating that
You changed the files; and
You must retain, in the Source form of any Derivative Works that You
distribute, all copyright, patent, trademark, and attribution notices from the
Source form of the Work, excluding those notices that do not pertain to any part
of the Derivative Works; and
If the Work includes a "NOTICE" text file as part of its distribution, then
any Derivative Works that You distribute must include a readable copy of the
attribution notices contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works, in at least one of the
following places: within a NOTICE text file distributed as part of the
Derivative Works; within the Source form or documentation, if provided along
with the Derivative Works; or, within a display generated by the Derivative
Works, if and wherever such third-party notices normally appear. The contents of
the NOTICE file are for informational purposes only and do not modify the
License. You may add Your own attribution notices within Derivative Works that
You distribute, alongside or as an addendum to the NOTICE text from the Work,
provided that such additional attribution notices cannot be construed as
modifying the License. You may add Your own copyright statement to Your
modifications and may provide additional or different license terms and
conditions for use, reproduction, or distribution of Your modifications, or for
any such Derivative Works as a whole, provided Your use, reproduction, and
distribution of the Work otherwise complies with the conditions stated in this
License.
5. Submission of Contributions. Unless You explicitly state otherwise, any
Contribution intentionally submitted for inclusion in the Work by You to the
Licensor shall be under the terms and conditions of this License, without any
additional terms or conditions. Notwithstanding the above, nothing herein shall
supersede or modify the terms of any separate license agreement you may have
executed with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade names,
trademarks, service marks, or product names of the Licensor, except as required
for reasonable and customary use in describing the origin of the Work and
reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or agreed to in
writing, Licensor provides the Work (and each Contributor provides its
Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied, including, without limitation, any warranties
or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any risks
associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory, whether in
tort (including negligence), contract, or otherwise, unless required by
applicable law (such as deliberate and grossly negligent acts) or agreed to in
writing, shall any Contributor be liable to You for damages, including any
direct, indirect, special, incidental, or consequential damages of any character
arising as a result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill, work stoppage,
computer failure or malfunction, or any and all other commercial damages or
losses), even if such Contributor has been advised of the possibility of such
damages.
9. Accepting Warranty or Additional Liability. While redistributing the Work or
Derivative Works thereof, You may choose to offer, and charge a fee for,
acceptance of support, warranty, indemnity, or other liability obligations
and/or rights consistent with this License. However, in accepting such
obligations, You may act only on Your own behalf and on Your sole
responsibility, not on behalf of any other Contributor, and only if You agree to
indemnify, defend, and hold each Contributor harmless for any liability incurred
by, or claims asserted against, such Contributor by reason of your accepting any
such warranty or additional liability.
Revision History
Revision | Date |
---|---|
Revision 1-11 | Wed Sep 7 2011 |
Revision 1-10 | Mon Sep 5 2011 |
Revision 1-9 | Thu Sep 1 2011 |
Revision 1-8 | Wed Aug 31 2011 |
Revision 1-7 | Fri Aug 26 2011 |
Revision 1-5 | Mon Aug 22 2011 |
Revision 1-4 | Tue Aug 16 2011 |
Revision 1-3 | Fri Aug 12 2011 |
Revision 1-2 | Wed Aug 10 2011 |
Revision 1-1 | Tue Aug 09 2011 |
Revision 1-0 | Thu Jun 23 2011 |
Revision 0.1-17 | Fri Jun 17 2011 |
Revision 0.1-16 | Thu Jun 16 2011 |
Revision 0.1-15 | Wed Jun 15 2011 |
Revision 0.1-14 | Wed Jun 15 2011 |
Revision 0.1-13 | Tue Jun 14 2011 |
Revision 0.1-12 | Tue Jun 14 2011 |
Revision 0.1-11 | Wed Jun 08 2011 |
Revision 0.1-10 | Tue Jun 07 2011 |
Revision 0.1-9 | Thu Jun 02 2011 |
Revision 0.1-8 | Wed Jun 01 2011 |
Revision 0.1-7 | Tue May 31 2011 |
Revision 0.1-6 | Tue May 17 2011 |
Revision 0.1-5 | Tue May 10 2011 |
Revision 0.1-4 | Thu Apr 28 2011 |
Revision 0.1-3 | Wed Apr 13 2011 |
Revision 0.1-2 | Wed Mar 30 2011 |
Revision 0.1-1 | Tue Mar 01 2011 |
Revision 0.1-0 | Tue Feb 22 2011 |