Project: Integration of NAMIC Kit and the 'Grid'
- Provide a tool for the NA-MIC kit to enable distributed execution of programs
- Supply interfaces to these tools via the Slicer3 Execution Model
- Identify constraints and requirements for Slicer3 for cluster-based execution
The process of running executables on a distributed computing platform requires several steps.
- Collect credential information from users
- Determine accessible compute resources
- Compute required workload
  - What is the executable, and what options does it support?
  - What are the input and output files?
- Prepare an application schedule, batching individual tasks as necessary
- Transfer application schedule and run on remote computing resource
- Poll for job status
- Report job completion, with any errors
Solution: Grid Wizard (aka GWiz) is an open-source application scheduler aimed at making your life easier. It lets you run tens of thousands of commands simultaneously (well, more or less) on multiple clusters of computers by typing a single command, without changing your code or writing scripts. It can be used by itself or as part of a web-based portal environment. It is written in Java and works well with clusters based on Sun Grid Engine and Condor, with other resource managers soon to follow.
Status: A beta release is available at the NA-MIC 2007 Summer Programmer's Week. At the same event, a set of use cases was developed for the Grid Interface (Slicer3:Grid_Interface_UseCases). These use cases define concrete projects that will use the grid interface in NA-MIC-related analyses.
Additionally, an initial integration with the Slicer3 environment has been accomplished (see image thumbnail) via the Slicer Execution Model and with the BIRN Portal environment. This is an example of another NA-MIC deliverable (i.e., in addition to the Slicer environment itself) that has been integrated with the BIRN infrastructure.
In order to run an application on a remote cluster, one needs to describe the cluster. This is done via an XML file, as follows:
<?xml version="1.0"?>
<cluster-resources>
  <cluster name="fwgrid" priority="1">
    <connection>
      <ssh username="ncjones" hostname="fwg-cs0.ucsd.edu" priv-key-file="/Users/njones/.ssh/gwiz" />
    </connection>
    <resource-scheduler type="sge" />
    <properties>
      <key>user_home_path</key> <val>/home/ncjones</val>
      <key>job_spool_path</key> <val>/home/ncjones/gwiz/spool</val>
      <key>cluster_tools_path</key> <val>/home/ncjones/gwiz/bin</val>
      <key>compute_node_local_storage_path</key> <val>/state/partition1/ncjones</val>
      <key>cluster_shared_storage_path</key> <val>/home/ncjones/gwiz/data</val>
      <key>max_num_jobs</key> <val>32</val>
      <key>num_compute_nodes</key> <val>124</val>
      <key>num_processors_per_node</key> <val>2</val>
      <key>num_free_slots</key> <val>2</val>
      <key>average_processors_occupied</key> <val>37</val>
      <key>each_node_is_a_slot</key> <val>true</val>
    </properties>
  </cluster>
</cluster-resources>
This example describes a cluster called the FWGrid, which happens to be in the Computer Science department at UCSD. It has 124 compute nodes, each with two processors. The cluster is accessible via SSH through the user account ncjones, using an SSH private/public key pair named "gwiz". The GridWizard tool will run an SSH agent in order to connect to this resource.
A number of additional pieces of information are required:
- The local storage path per node (most clusters have node-local storage via a fast(ish) direct-attached drive) is /state/partition1/ncjones
- Shared storage, on the other hand, is in /home/ncjones/gwiz/data (most clusters have an NFS-mounted shared drive, which home directories are frequently stored on)
- The cluster tools (described below) are installed in /home/ncjones/gwiz/bin
- Job artifacts (the script files that are actually run to accomplish the work) are transferred to /home/ncjones/gwiz/spool
- An application schedule cannot contain more than 32 - 2 = 30 jobs on this particular cluster (the system administrator has configured SGE to reject more than 32 simultaneously queued requests, and I always want an extra node or two for other work)
- Each compute node, as opposed to each compute processor, is considered a slot in the resource manager
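As a rough illustration of how these properties might be consumed, the following Python sketch reads an abbreviated copy of the descriptor above and derives the job limit described in the list. It is not GridWizard's own code (GridWizard is written in Java). The helper names read_properties, schedulable_jobs, and total_slots are hypothetical, and both the use of num_free_slots as a reserve and the slot calculation are assumptions based on the description above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the cluster descriptor shown above.
CLUSTER_XML = """<?xml version="1.0"?>
<cluster-resources>
  <cluster name="fwgrid" priority="1">
    <resource-scheduler type="sge" />
    <properties>
      <key>max_num_jobs</key> <val>32</val>
      <key>num_compute_nodes</key> <val>124</val>
      <key>num_processors_per_node</key> <val>2</val>
      <key>num_free_slots</key> <val>2</val>
      <key>each_node_is_a_slot</key> <val>true</val>
    </properties>
  </cluster>
</cluster-resources>
"""


def read_properties(cluster):
    """Pair each <key> element with the <val> element that follows it."""
    props = cluster.find("properties")
    keys = [k.text.strip() for k in props.findall("key")]
    vals = [v.text.strip() for v in props.findall("val")]
    return dict(zip(keys, vals))


def schedulable_jobs(props):
    """Assumption: num_free_slots is held back from max_num_jobs."""
    return int(props["max_num_jobs"]) - int(props["num_free_slots"])


def total_slots(props):
    """Assumption: a slot is a node or a processor, per each_node_is_a_slot."""
    nodes = int(props["num_compute_nodes"])
    if props["each_node_is_a_slot"] == "true":
        return nodes
    return nodes * int(props["num_processors_per_node"])


root = ET.fromstring(CLUSTER_XML)
cluster = root.find("cluster")
props = read_properties(cluster)
print(cluster.get("name"), schedulable_jobs(props), total_slots(props))
```

Pairing keys and values by position relies on every key being followed by exactly one val, as in the example; a real reader would probably want to validate that.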
Describing the Program
Running an executable on a grid requires that the executable be described in some fashion. There are probably many ways of doing this, but the handiest is through the Slicer Execution Model. However, a small amount of additional information is necessary for remote execution: a channel selector of "input" or "output" is not sufficient, because there may be files that are both input and output, so an "inout" value is needed. In a local execution this is only a semantic difference, but in a remote execution it determines whether a file gets copied from a data grid location, to it, or both.
Further, several additional attributes need to be added to the executable tag: location, to represent where the executable is installed (this is potentially ambiguous in a remote execution model, which will be the subject of additional research before release); script-engine, which can be used if the program itself is an input to another program, like perl, matlab, or python; and script-options, which are arguments to the scripting engine. For example, to run R programs remotely, one would use
<executable script-engine="R" script-options="cmd batch" location="myprog.R"> ... </executable>
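To make the role of the three attributes concrete, here is a minimal Python sketch of how a command line prefix could be assembled from an executable tag. The function name command_prefix and the ordering (engine, then options, then location) are assumptions made for illustration; the source does not specify how GridWizard composes the command.

```python
import shlex
import xml.etree.ElementTree as ET


def command_prefix(executable):
    """Build the leading part of the command line for an <executable> element.

    Assumed ordering: script engine, its options, then the program location.
    Without a script-engine attribute, the location is run directly.
    """
    location = executable.get("location")
    engine = executable.get("script-engine")
    if engine is None:
        return [location]
    options = shlex.split(executable.get("script-options", ""))
    return [engine, *options, location]


scripted = ET.fromstring(
    '<executable script-engine="R" script-options="cmd batch" location="myprog.R"/>'
)
native = ET.fromstring('<executable location="/home/ncjones/nrse/get-tuples"/>')

print(command_prefix(scripted))
print(command_prefix(native))
```

Task arguments such as "-d 4" would then be appended to this prefix for each task.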
The following example execution model is for a program completely unrelated to Slicer; rather, it is a small tool in a larger pipeline for comparative sequence genomics.
<?xml version="1.0"?>
<executable location="/home/ncjones/nrse/get-tuples">
  <category>Comparative Genomics</category>
  <title>get-tuples</title>
  <description>
    Given a fasta-formatted file of DNA sequences (1 for each species), this will
    identify all motifs of a given length and divergence that are common to all
    species. Note that for certain values (e.g., length 4, divergence 3) pretty
    much every pattern in the file is going to be conserved. However, for sensible
    values like l=20 and d=4, one finds (in mammals, anyway) some surprising results.
  </description>
  <version>1.0</version>
  <documentationurl>http://nrse.bioprojects.org</documentationurl>
  <license>BSD</license>
  <contributor>Neil Jones</contributor>
  <parameters>
    <label>Program parameters</label>
    <description/>
    <integer>
      <name>Distance</name>
      <flag>d</flag>
      <description> Hamming distance between species. </description>
      <label>Distance</label>
    </integer>
    <integer>
      <name>Length</name>
      <flag>l</flag>
      <description> Length of motif to find </description>
      <label>Length</label>
    </integer>
    <file>
      <name>Input</name>
      <flag>f</flag>
      <description> The FASTA-format file to process </description>
      <channel>input</channel>
      <label>Input file</label>
    </file>
    <file>
      <name>Output</name>
      <flag>o</flag>
      <description> The FASTA-format file to produce. </description>
      <channel>output</channel>
      <label>Output file</label>
    </file>
  </parameters>
</executable>
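The channel values are what tell a remote scheduler which files to copy in which direction. The Python sketch below, using a trimmed copy of the model above, shows one way the file parameters could be sorted into stage-in and stage-out lists, including the "inout" value discussed earlier. The function name staging_plan is hypothetical, and this is only an illustration of the idea, not GridWizard's implementation.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the get-tuples execution model shown above.
MODEL_XML = """<executable location="/home/ncjones/nrse/get-tuples">
  <parameters>
    <integer><name>Distance</name><flag>d</flag></integer>
    <integer><name>Length</name><flag>l</flag></integer>
    <file><name>Input</name><flag>f</flag><channel>input</channel></file>
    <file><name>Output</name><flag>o</flag><channel>output</channel></file>
  </parameters>
</executable>"""


def staging_plan(model):
    """Return (stage_in, stage_out) lists of flags for <file> parameters.

    "input" files are copied to the compute node, "output" files are copied
    back, and "inout" files are copied in both directions.
    """
    stage_in, stage_out = [], []
    for param in model.iter("file"):
        flag = "-" + param.findtext("flag").strip()
        channel = (param.findtext("channel") or "").strip()
        if channel in ("input", "inout"):
            stage_in.append(flag)
        if channel in ("output", "inout"):
            stage_out.append(flag)
    return stage_in, stage_out


model = ET.fromstring(MODEL_XML)
print(staging_plan(model))
```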
Describing the Job
A job is a set of tasks, each task an independent command-line invocation of the program described through the execution model. However, tasks are generally not listed explicitly, but are described implicitly through the use of parameter lists, file globs, and a special format of templates. The design goal here was to make it as easy as possible for a researcher to describe what they wanted done, without requiring that they write a program to list all the tasks. For example,
<task>
  <arg name="-f">ssh://firstname.lastname@example.org/home/ncjones/nrse/data/samples/*.fa</arg>
  <arg name="-d">4:6</arg>
  <arg name="-l">20:30:5</arg>
  <arg name="-o">ssh://email@example.com/home/ncjones/nrse/data/samples/out/$base(-f)$.$-d$.$-l$.out</arg>
</task>
The input files are FASTA-formatted sequence files located on a host named dooku.ucsd.edu that is accessible via sftp. Referring to the execution model example above, "-f" is an input file, so the matching files will be copied to the compute nodes. "-d" is the distance, an integer parameter that here takes the values 4, 5, and 6. "-l" is the length of the target to search for, and takes the values 20, 25, and 30. The output files will be moved to /home/ncjones/nrse/data/samples/out (also on dooku) when each job completes. Each output file is named according to the template $base(-f)$.$-d$.$-l$.out, that is, from the input file's base name and the distance and length values used.
In this example, there are 100 input files. With 3 distance values and 3 length values for each file, this gives 100 × 3 × 3 = 900 tasks and a total of 900 output files.
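The expansion from one task element to 900 tasks can be sketched in Python as follows. This is an illustration under stated assumptions, not GridWizard's actual expansion code: the range syntax is read as inclusive start:end or start:end:step (consistent with the values listed above), $base(-f)$ is assumed to mean the input file name without directory or extension, and the input file names sample000.fa through sample099.fa are made up to stand in for the result of the remote glob.

```python
from itertools import product
from pathlib import PurePosixPath


def expand_range(spec):
    """Expand 'start:end' or 'start:end:step' (inclusive) into a list of ints."""
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 1:
        return parts
    start, end = parts[0], parts[1]
    step = parts[2] if len(parts) == 3 else 1
    return list(range(start, end + 1, step))


def render(template, values):
    """Fill in $-flag$ and $base(-flag)$ placeholders for one task.

    Assumption: base() strips the directory and the file extension.
    """
    out = template
    for flag, value in values.items():
        out = out.replace(f"$base({flag})$", PurePosixPath(str(value)).stem)
        out = out.replace(f"${flag}$", str(value))
    return out


# Hypothetical stand-in for the files matched by the *.fa glob.
inputs = [f"/home/ncjones/nrse/data/samples/sample{i:03d}.fa" for i in range(100)]
distances = expand_range("4:6")      # [4, 5, 6]
lengths = expand_range("20:30:5")    # [20, 25, 30]
template = "$base(-f)$.$-d$.$-l$.out"

# One task per combination of input file, distance, and length.
tasks = [
    {"-f": f, "-d": d, "-l": l, "-o": render(template, {"-f": f, "-d": d, "-l": l})}
    for f, d, l in product(inputs, distances, lengths)
]

print(len(tasks))        # 900
print(tasks[0]["-o"])    # sample000.4.20.out, under the base() assumption above
```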
Standard input, output, and error streams can be redirected to files (using the same templating language) by using the elements <stdin>, <stdout> and <stderr>. <stdin> defines a stage-in file, <stdout> and <stderr> define stage-out files.
The cluster XML is stored in a file, say, cluster.xml. The execution model and task lists are stored in another file, say, job.xml, as
<?xml version="1.0" ?>
<task-set>
  <executable> ... </executable>
  <task>
    <arg>...</arg>
    ...
  </task>
  ...
</task-set>
The program gwiz-run actually performs the execution. Other command-line interfaces are possible. Clearly, those command-line tools will need to be documented through the Slicer execution model as well.
In order for the application to run remotely, a number of supporting tasks are performed by a separate application suite called cluster-tools. These programs are fairly small, come bundled as a tarball, and do not require root permissions to install. The intended use is that a user wishing to use a cluster as part of a gwiz "grid" simply downloads the cluster-tools tarball, untars it into his or her home directory, and then generates an XML descriptor of the cluster.
The cluster tools programs include facilities for logging, staging file transfers among the local host and a number of data grids, and monitoring of executables. They are, however, not in a releasable state; a fairly difficult problem is how credentials can be distributed among the remote clusters so that file transfers can take place.
We have implemented an interface to the Sun Grid Engine, and are currently working on a Condor interface.
A number of scheduling algorithms are possible, and a fairly rich literature describes them and their respective tradeoffs. The algorithm we currently use is the simplest one possible: divvy up all of the tasks among the total number of resources. While this is simple and appears to work reliably, it is decidedly suboptimal when compared with even a simple greedy algorithm. Additional work is needed to add grid-specific application scheduling algorithms.
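The gap between the two approaches can be seen in a small Python sketch. It assumes the simple scheme hands out tasks round-robin, and it uses "longest task first, onto the least-loaded resource" as one example of a simple greedy algorithm; neither is confirmed as what GridWizard does. The task durations are invented, and in practice a greedy scheduler would need runtime estimates that may not be available.

```python
import heapq


def even_split(durations, n_resources):
    """Deal tasks out round-robin, ignoring how long each one takes."""
    batches = [[] for _ in range(n_resources)]
    for i, d in enumerate(durations):
        batches[i % n_resources].append(d)
    return batches


def greedy_split(durations, n_resources):
    """Assign the longest remaining task to the currently least-loaded resource."""
    batches = [[] for _ in range(n_resources)]
    heap = [(0, r) for r in range(n_resources)]  # (current load, resource index)
    heapq.heapify(heap)
    for d in sorted(durations, reverse=True):
        load, r = heapq.heappop(heap)
        batches[r].append(d)
        heapq.heappush(heap, (load + d, r))
    return batches


def makespan(batches):
    """Time until the most heavily loaded resource finishes."""
    return max(sum(b) for b in batches)


# Hypothetical task durations, in arbitrary time units.
durations = [8, 1, 1, 1, 7, 1, 1, 1]

print(makespan(even_split(durations, 2)))    # 17
print(makespan(greedy_split(durations, 2)))  # 11
```

With these made-up durations, the round-robin split puts both long tasks on the same resource, while the greedy split separates them.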