Closing on: Mar 31, 2025
Level of effort: Full-time
Type of engagement: Ottawa Remote
Duration: 5 years
Sector: Public
Language: English
A Public Sector client is seeking a System Administrator to manage a high performance computing (HPC) cluster (HPC administrator role) and to support users (HPC analyst role) with the installation, execution and debugging of research applications and code on HPC clusters.
Mandatory Requirements
- Must have 5 years’ experience within the last 10 years in administering HPC (High Performance Computing) systems and performing HPC analyst tasks, as per the general tasks listed below;
- Must have 3 distinct HPC system administration projects or HPC analyst projects on which the candidate has worked for more than 12 months;
- Relevant education diplomas/certificates.
Optional Requirements
- Must have 9 to 10 years’ experience in HPC cluster administration and analyst functions, within the last 15 years;
- Must have experience in HPC administration within a research environment, including direct interaction with scientists or research officers, within the last 10 years;
- Must have experience in HPC system administration and/or analyst functions for the Federal Government of Canada within the last 10 years, including direct interaction with Shared Services Canada;
- Must have a minimum of 3 years’ experience working with the tasks listed in the “General Tasks” section;
- Must have a minimum of 3 years’ experience working with the technologies listed in the “Technologies Associated with Tasks” section.
General Tasks
- Maintain an HPC cluster (hardware, image management, local networking, scheduler, backups);
- Troubleshoot the environment when an incident occurs to ensure a quick return to normal operations;
- Meet with scientists and evaluate their requirements for HPC support;
- Develop a task plan to meet scientists’ needs and consult the technical authority for approval;
- Build and install applications and troubleshoot runtime issues (GNU, Intel, Fortran, Nvidia);
- Support for open-source and commercial off-the-shelf (COTS) software, including:
o Python and Anaconda installs;
o Bash scripts, build/make tools, EasyBuild, and Spack;
o MPI implementations (MPICH, OpenMPI, IntelMPI, HPMPI);
- Assist with in-house developed applications (compilation and runtime);
- Management of:
o Operating system (patching schedule, reliability for Linux distributions);
o Accounts (creation, deletion);
o Configuration via Git, MS DevOps, Ansible Playbooks;
o RPM/DEB Packages;
o Environment modules;
o ThinLinc troubleshooting;
- Troubleshooting jobs on schedulers (PBS Pro/Torque, SLURM, SGE);
- Ensure reliable CUDA installs, troubleshoot GPU failures and other CUDA software/driver issues;
- Hardware support (memory upgrades, storage arrays, power and network cabling, iLO);
- Document each process for every task to ensure enterprise knowledge continuity.
Technologies Associated with Tasks
- Compilers and interpreters (e.g. GNU GCC, Intel oneAPI, Python)
- Anaconda
- Bash scripts
- Build/make tools (e.g. GNU Make Tools)
- EasyBuild
- Spack
- MPI implementations (MPICH, OpenMPI, IntelMPI, HPMPI)
- Git
- MS DevOps
- Ansible Playbooks
- RPM/DEB packages
- Environment modules
- ThinLinc
- PBS Pro/Torque
- SLURM
- SGE
- CUDA