Computer Cluster

Posted by Jim Turner

In my lab at Duke University, we had a lot of old computers from prior research projects that were no longer being used. I volunteered to put them together into a cluster for the lab to use for computationally-intensive tasks. I didn’t know anything about cluster computing before this project, so it was a great experience learning how to put together and use a computer cluster.

If you’re new to cluster computing and are interested in setting up your own small computer cluster, the following overview may be helpful.

Hardware & Network

The cluster has seven x86-64 desktop computers of varying age with a range of processors and memory capacities. They are all connected with a single 8-port unmanaged network switch that is connected to Duke’s network. This is a photograph of the cluster:

alt="Photograph of seven desktop computers of different types on the floor of the lab, connected with a single network switch."> <figcaption> Photograph of the computer cluster. Image © 2016 Jim Turner and licensed under <a title="Creative Commons Attribution-ShareAlike 4.0 International License" href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY‑SA 4.0</a>. </figcaption>

Six of the computers (dsg01, dsg03, dsg04, …, dsg07) are compute nodes, and the remaining one (dsg02) is the login node, SLURM controller, and file server. This is the network topology:

[Figure: the nodes are connected to a single switch, which is connected to Duke’s network. Duke’s network is separated from the Internet by a firewall; users connect to it directly or, from elsewhere on the Internet, through Duke’s VPN.] Network topology of the cluster and users. Image © 2016 Jim Turner, licensed under CC BY-SA 4.0.

Software

The hardest part of setting up the cluster was figuring out what software to use and how to configure it. Since I was unfamiliar with cluster computing, I strongly favored projects with good documentation that were fairly easy to set up. The key choices were SLURM for job scheduling and Gluster for the shared home directories.

I also installed additional software for users to develop and run their programs.
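To give a sense of how the controller/compute split described above shows up in SLURM’s configuration, here is a minimal slurm.conf-style sketch. The hostnames match the cluster, but the CPU counts and memory sizes are placeholders rather than the real hardware:

    # Minimal slurm.conf fragment (illustrative values only).
    ControlMachine=dsg02
    NodeName=dsg[01,03-07] CPUs=4 RealMemory=8000 State=UNKNOWN
    PartitionName=main Nodes=dsg[01,03-07] Default=YES MaxTime=INFINITE State=UP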

Usage

If you’re unfamiliar with computer clusters, it’s helpful to know how they work from the user’s perspective. This is how the small cluster I built is set up:

The user has access to his/her home directory and the /tmp directory on each node. The user’s home directory is shared across the nodes with Gluster, so all programs and input/output files in the user’s home directory are available on all nodes. To run a job on the cluster (a concrete example session is sketched after this list):

  1. The user transfers his/her program and input data to the login node with SFTP.

  2. The user SSHes into the cluster’s login node. He/she can run inexpensive tasks on the login node, such as compiling small programs. However, for computationally-intensive tasks, the user should submit a job with SLURM to run on the compute nodes.

  3. On the login node, the user can use the following SLURM commands:

    • srun to run a single job and wait for it to complete,
    • salloc to allocate resources (primarily for an interactive job), or
    • sbatch to schedule a batch job for execution.
  4. When the necessary resources (i.e. processors and memory) become available on the compute nodes, SLURM starts the job on the available compute nodes.

  5. The user can cancel the job with scancel or check its status with squeue.

  6. If the user submitted a batch job, SLURM saves the standard output and standard error from the job to the specified location (typically the user would specify files in his/her home directory). The program being run can also save output files itself to the user’s home directory, because the user’s home directory is transparently synchronized between the nodes with Gluster.

  7. When the job is complete, the user can download the output files from the login node with SFTP.
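To make the workflow above concrete, here is a rough example session. The login node’s fully-qualified hostname, the username, the file names, and the resource requests are all hypothetical placeholders:

    # 1. Copy the program and input data to the login node with SFTP.
    sftp user@dsg02.example.edu
    # sftp> put simulate
    # sftp> put input.dat
    # sftp> exit

    # 2. Log in to the login node.
    ssh user@dsg02.example.edu

    # 3. Describe the job's resource needs in a batch script...
    cat > job.sh <<'EOF'
    #!/bin/bash
    #SBATCH --job-name=simulate
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=4G
    #SBATCH --output=simulate-%j.out
    ./simulate input.dat
    EOF

    # 4. ...and submit it. SLURM starts the job once processors and memory are free.
    sbatch job.sh

    # 5. Check on the job, or cancel it by the job ID shown by squeue.
    squeue
    scancel 1234

    # 7. When the job finishes, download the results (written to the shared home
    #    directory) from the login node with SFTP.
    sftp user@dsg02.example.edu
    # sftp> get simulate-1234.out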

Configuration Management & Testing

One of my goals was to automate the installation and configuration of the cluster as much as possible in order to simplify maintenance and to keep the configuration under version control.

Since users could be running jobs on the cluster, I needed a way to test changes that didn’t interfere with the actual cluster, so I test the configuration against a network of virtual machines on my laptop before rolling changes out to the real hardware.
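For illustration, the test loop might look roughly like this, assuming a Vagrant-style setup with the VirtualBox provider (an assumption for the sake of the example; the virtual machine name is also hypothetical):

    # Bring up the virtual test nodes defined in the project's Vagrantfile.
    vagrant up

    # Log in to the virtual login node and try out the configuration.
    vagrant ssh login

    # Re-apply the provisioning after changing the configuration.
    vagrant provision

    # Tear the virtual network down when finished.
    vagrant destroy -f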

Documentation & Sustainability

One of my goals when building the cluster was to make it sustainable after I leave Duke. As a result, I automated as much of the configuration as possible and documented everything. I’m using Sphinx for documentation, and I’m keeping the configuration and documentation on Duke’s GitLab instance.
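As a rough sketch of the documentation workflow (the project directory name and commit message are made up):

    # One-time setup of a Sphinx project for the cluster documentation.
    sphinx-quickstart cluster-docs

    # Build the HTML documentation; the output lands in _build/html/.
    cd cluster-docs && make html

    # Commit the documentation alongside the configuration and push it to GitLab.
    git add . && git commit -m "Update cluster documentation" && git push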

Other Resources

If you’d like to set up your own small cluster, the following resources may be helpful:


  1. To generate an initial configuration, use one of the configuration builders, which are available at /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html and /usr/share/doc/slurmctld/slurm-wlm-configurator.html once you have slurmctld installed. See the slurm.conf(5) man page for more information about the options.
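For example, generating and sanity-checking a configuration might look something like this (the exact file locations vary between SLURM packages and versions):

    # Open the configuration builder shipped with the slurmctld package in a browser.
    xdg-open /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html

    # Print the hardware SLURM detects on this node; useful for the node entries.
    slurmd -C

    # After installing the generated slurm.conf, check what the daemons actually see.
    scontrol show config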