Debian Clusters for Education and Research: The Missing Manual

Using a Scheduler and Queue

The scheduler and the queue are two essential parts of a cluster. Together, they transform a group of networked machines into a cluster, or at least something closer to one. They're what allow users, working only on the head node, to submit "jobs" to the cluster. These jobs are transparently assigned to different worker nodes, and then, without the user needing to know where the jobs ran, the results are deposited back into the user's home directory.

This process requires software in two different roles: the resource manager, responsible for accepting jobs to the queue and running jobs on worker nodes, and the scheduler, responsible for deciding when and where jobs in the queue should be run in order to optimize resources. I'll be using Torque for the resource manager and Maui for the scheduler. Both of these are open source projects.

Note: Torque has a built-in scheduler that can be used instead of Maui. However, Maui integrates seamlessly and provides more options and customization than Torque's scheduler.

Installation

Before setting up Torque and Maui, DNS must be working. If that's not an option, this requirement can be worked around by setting up /etc/hosts on the head node with an entry for each of the nodes and then copying this file out to each of the worker nodes. (See the Cluster Time-saving Tricks page for help with the copying.)
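As a rough illustration of the /etc/hosts workaround, here is a minimal Python sketch that renders one hosts entry per node. The hostnames and addresses (head, node01, 192.168.1.x) and the function name hosts_entries are made up for the example; substitute your own cluster's values.

```python
def hosts_entries(nodes):
    """Render /etc/hosts text from a mapping of hostname -> IPv4 address."""
    # Keep the loopback entry so local lookups still work on every machine.
    lines = ["127.0.0.1\tlocalhost"]
    for name, ip in nodes.items():
        lines.append(f"{ip}\t{name}")
    return "\n".join(lines) + "\n"


# Hypothetical cluster layout: one head node and two workers.
CLUSTER = {
    "head": "192.168.1.1",
    "node01": "192.168.1.101",
    "node02": "192.168.1.102",
}
```

Calling hosts_entries(CLUSTER) returns text that could be written to /etc/hosts on the head node and then copied, unchanged, to each worker.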

Torque needs to be installed in two parts. First, a pbs_server is set up on the head node and configured to know where all of the worker nodes are. Then, each of the worker nodes is set up to run pbs_mom, a sort of client that accepts jobs from the pbs_server and runs them on the worker node. A basic queue for Torque also needs to be configured.
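To give a feel for what "knowing where the worker nodes are" and "a basic queue" look like, here is a hedged Python sketch that generates the two pieces of text involved. It assumes the usual Torque conventions: a server-side nodes file with one "hostname np=N" line per worker, and queue setup through qmgr commands. The node names, processor counts, queue name "batch", and both function names are assumptions for illustration, and the location of the nodes file depends on how Torque was installed.

```python
def nodes_file(workers):
    """Render the pbs_server nodes file.

    workers maps each worker hostname to its processor count (np).
    """
    return "".join(f"{host} np={count}\n" for host, count in workers.items())


def basic_queue_commands(queue="batch"):
    """Return qmgr commands that create one execution queue and make it the default.

    Each string is meant to be run on the head node as: qmgr -c "<command>"
    """
    return [
        f"create queue {queue} queue_type=execution",
        f"set queue {queue} enabled=true",
        f"set queue {queue} started=true",
        f"set server default_queue={queue}",
        "set server scheduling=true",
    ]
```

For example, nodes_file({"node01": 2, "node02": 2}) gives two lines, one per worker, and basic_queue_commands() gives five commands to feed to qmgr in order.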

Maui is installed only on the head node, and needs to be set up to interact with the pbs_server. It does not communicate directly with the worker nodes, but instead reaches them by way of the server.
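The link between Maui and the pbs_server is set in Maui's configuration file (maui.cfg). The sketch below renders a minimal set of settings, assuming the standard Maui parameters SERVERHOST, ADMIN1, RMCFG and RMPOLLINTERVAL. The resource manager label "base", the 30-second polling interval, the hostname "head", and the function name maui_config are illustrative assumptions, not values taken from this manual.

```python
def maui_config(server_host, admin_user="root", poll_interval="00:00:30"):
    """Render a minimal maui.cfg that points Maui at a Torque/PBS server."""
    lines = [
        f"SERVERHOST {server_host}",         # machine running Maui (the head node)
        f"ADMIN1 {admin_user}",              # user allowed to administer Maui
        "RMCFG[base] TYPE=PBS",              # use the PBS (Torque) resource manager
        f"RMPOLLINTERVAL {poll_interval}",   # how often Maui polls the resource manager
    ]
    return "\n".join(lines) + "\n"
```

Note that nothing here names a worker node: Maui only needs to find the server, which matches the description above.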

After installing both, it's wise to try submitting a simple job before moving on to more complex configuration options. I recommend going through pages of interest in this order -
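For the simple test job mentioned above, something like the following Python sketch would do. It builds a tiny job script using standard #PBS directives and hands it to qsub on standard input. The job name "hello", the one-minute walltime, and the function names job_script and submit are assumptions for the example; submit only works on a machine where Torque's qsub is installed and the server is running.

```python
import subprocess


def job_script(name="hello", walltime="00:01:00"):
    """Build a minimal PBS job script that reports which node ran it."""
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -N {name}",               # job name
        "#PBS -l nodes=1",               # ask for a single node
        f"#PBS -l walltime={walltime}",  # maximum run time
        "#PBS -j oe",                    # merge stderr into stdout
        "cd $PBS_O_WORKDIR",             # start in the directory qsub was run from
        "hostname",                      # the actual work: print the worker's name
    ]) + "\n"


def submit(script_text):
    """Send the script to qsub on stdin and return the job ID it prints."""
    result = subprocess.run(
        ["qsub"], input=script_text, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Running submit(job_script()) as a regular user on the head node should print a job ID. Once the job finishes, its output file should contain the name of whichever worker node ran it, which is a quick check that the server, the scheduler, and at least one pbs_mom are all talking to each other.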

Use and Features

After both are installed and working properly, you might want to look at

