The Talos HPC cluster
The Talos High Performance Compute (HPC) cluster is a typical computer cluster that uses "poor man's parallelization" on relatively cheap commodity hardware: the total workload is split into many small jobs (a.k.a. tasks) that each process a chunk of data. The jobs are submitted to a workload manager, which distributes them efficiently over the compute nodes.
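In practice each chunk of the workload is described by a small batch script that gets submitted to the workload manager. A minimal sketch of such a Slurm job script (the job name, resource values, and input/output paths are illustrative assumptions, not Talos specifics):

```shell
#!/bin/bash
#SBATCH --job-name=chunk_42        # one job per chunk of data
#SBATCH --cpus-per-task=1          # resources requested per job
#SBATCH --mem=1G
#SBATCH --time=00:30:00            # walltime limit
#SBATCH --output=chunk_42.out      # where stdout/stderr of this job go

# Process one chunk; the tool and paths below are hypothetical examples.
my_analysis --in data/chunk_42.txt --out results/chunk_42.result
```

Submitting many of these scripts, one per chunk, lets the workload manager pack them onto free cores across the compute nodes.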
Key Features
The key features of the Talos cluster include:
- Linux OS: Rocky version 9.5.
- Completely virtualised on an OpenStack cloud.
- Deployment of HPC cluster with Ansible playbooks under version control in a Git repo: league-of-robots
- Job scheduling: Slurm Workload Manager
- Account management:
- Local admin users+groups provisioned using Ansible.
- Regular users+groups in a dedicated LDAP for this cluster, provisioned either with an Ansible playbook too or using info from a federated AAIM.
- Module system: Lmod
- Deployment of (Bioinformatics) software using EasyBuild
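Software deployed with EasyBuild is made available to users through Lmod environment modules. A typical session might look like the following (the module name and version are illustrative assumptions, not a guaranteed install on Talos):

```shell
module avail            # list software deployed under /apps
module load BWA/0.7.17  # load an example module into the environment
module list             # verify which modules are currently loaded
module purge            # unload everything again
```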
Cluster Components
Talos consists of various types of servers and storage systems. Some of these can be accessed directly by users, whereas others cannot.
- Jumphost:
- For all users
- Security hardened machine for multi-hop SSH access to UI, DAI & SAI.
- Nothing else: no job management, no mounts of shared file systems, no homes, no /apps, etc.
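Multi-hop SSH access via the jumphost can be configured once on the client side, so logging in to the UI becomes a single command. A sketch of a client-side ~/.ssh/config, assuming hostnames from the naming scheme in this document and a placeholder domain and account name:

```shell
# ~/.ssh/config -- the domain and user name below are assumptions for illustration.
Host reception
    HostName reception.example.org
    User your-account

Host talos
    HostName talos
    User your-account
    ProxyJump reception   # hop transparently through the jumphost
```

With this in place, `ssh talos` first connects to the jumphost and then forwards the session to the UI.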
- User Interface (UI):
- Logins for all users (via the jumphost).
- Slurm tools/commands installed for job management: submitting batch jobs, canceling batch jobs, interactive jobs, job profiling, etc.
- Read-only access to software, modules and reference data deployed with EasyBuild + Lmod in /apps/…
- Both tmp and prm folders from large, shared, parallel file systems mounted (with root squash) for data transfers/staging.
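Day-to-day job management on the UI uses the standard Slurm commands. For example (the script name and job ID below are placeholders):

```shell
sbatch myjob.sh          # submit a batch job; prints the assigned job ID
squeue -u "$USER"        # list your own pending and running jobs
scancel 12345            # cancel a job by its ID (example ID)
srun --pty bash -i       # start an interactive shell on a compute node
```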
- Deploy Admin Interface (DAI):
- Logins only for deploy admins (via the jumphost).
- For deployment of centrally managed software or reference data sets using: Ansible playbooks + Lmod + EasyBuild
- No Slurm tools/commands installed, so no accidental job management.
- Read-write access to software, modules and reference data deployed with EasyBuild+Lmod in /apps/...
- No access to large, shared, parallel file systems.
- Sys Admin Interface (SAI):
- Logins only for sys admins (via the jumphost).
- Used to manage and monitor the cluster components: generate quota and Slurm usage reports, run cron jobs, etc.
- Runs Slurm scheduling daemon that determines when jobs will be executed on which nodes.
- Access to (complete) large shared parallel file systems (without root squash).
- Compute Nodes:
- No direct logins.
- Crunch batch jobs submitted to the Slurm workload manager.
Naming Themes
Talos is part of the League of Robots, a collection of HPC clusters that use the same themes for naming various components:
- Cluster itself and UIs are named after robots.
  Production clusters are named after robots from the Futurama scifi sitcom. Test/development clusters are named after other robots.
  E.g.: Talos UI = talos
- Jumphosts are named after rooms preceding other rooms.
  E.g.: Talos Jumphost = reception
- Other machines that are part of the cluster and only accessible using internal network interfaces (schedulers, compute nodes, account servers, etc.) use a two-character prefix tl followed by a dash and the function of the machine.
  E.g.: Talos compute node = tl-node-a01
- SAIs & DAIs may be named after root/carrot varieties or simply use the two-character prefix tl plus the function of the machine.
  E.g.: Talos DAI = tl-dai
  E.g.: Talos SAI = tl-sai