Qlustar Feature Update 11/25
A large number of new features have been completed and published since the Qlustar 14 release. Here is an
overview.
QluMan
QluMan versions up to 14.0.7 have added the following:
Access to cluster web interfaces via QluMan GUI
- A network proxy to the managed clusters was implemented that allows a Firefox web browser
(now embedded in the QluMan Singularity image) to easily connect to cluster-internal web
interfaces with the click of a menu entry. This works even through ssh-tunneled connections between
the GUI and the cluster head-node.
- Supports access to the node’s BMC web interface including remote console.
- Allows convenient access to all Prometheus and Grafana web interfaces.
Redfish BIOS and Power management
Redfish BIOS and Power management via the QluMan GUI was substantially improved:
- Allows removing and reordering BIOS boot devices.
- Checks BIOS boot order on write and synchronizes changes if necessary.
- Moves configured BIOS boot devices to the front of the boot order if they exist.
- Added a filter to limit the BIOS settings displayed in the GUI, which makes it easier to find relevant
entries.
- Improved the display of BIOS options: shows the real BIOS text rather than cryptic option keys.
- Implemented power status, power cycle and power reset via Redfish.
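The boot-order and power items above map onto standard Redfish concepts: the `Boot.BootOrder` property of a ComputerSystem resource and the `ComputerSystem.Reset` action. The following is a minimal sketch of that mapping, not QluMan's actual implementation. The function names, the boot option identifiers and the system id are hypothetical, and the code only builds request descriptions without contacting a BMC.

```python
from typing import Dict, List

# Subset of the ResetType values defined by the Redfish ComputerSystem schema.
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "ForceRestart", "PowerCycle"}


def prioritize_boot_devices(boot_order: List[str], preferred: List[str]) -> List[str]:
    """Move preferred devices that exist in boot_order to the front.

    Preferred entries missing from boot_order are ignored; all other
    entries keep their relative order.
    """
    present = [dev for dev in dict.fromkeys(preferred) if dev in boot_order]
    rest = [dev for dev in boot_order if dev not in present]
    return present + rest


def boot_order_patch(current: List[str], preferred: List[str]) -> Dict:
    """Return a PATCH body for the system resource, or {} if nothing changes."""
    wanted = prioritize_boot_devices(current, preferred)
    if wanted == current:
        return {}
    return {"Boot": {"BootOrder": wanted}}


def reset_request(system_id: str, reset_type: str) -> Dict:
    """Describe the POST request that triggers a ComputerSystem.Reset action."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported reset type: {reset_type}")
    return {
        "method": "POST",
        "path": f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset",
        "body": {"ResetType": reset_type},
    }
```

For example, `prioritize_boot_devices(["Boot0003", "Boot0001", "Boot0002"], ["Boot0002", "Boot0009"])` moves `Boot0002` to the front and ignores `Boot0009` because it is not present in the current order. Returning an empty patch when the order already matches is one way to avoid unnecessary BIOS writes.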
LDAP User Management
The LDAP user creation process was significantly improved by adding a new pop-up
window that provides transparent info and error resiliency for all sub-tasks involved when a user
is added.
- Provides a new Retry button for Slurm account creation in case errors are detected.
- Color-codes table entries for pending, done and error states.
- Handles errors when slurmctld or slurmdbd is offline.
- Added Run, Run without Slurm and Save and Sync Node buttons.
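The pending, done and error states with a retry option can be pictured as a small state tracker per sub-task. The sketch below is purely illustrative and assumes nothing about QluMan's internals: the class, function and sub-task names are hypothetical, and each sub-task is modeled as a callable that raises an exception on failure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class State(Enum):
    PENDING = "pending"
    DONE = "done"
    ERROR = "error"


@dataclass
class SubTask:
    name: str
    action: Callable[[], None]  # raises an exception on failure
    state: State = State.PENDING
    error: str = ""


def run_pending(tasks: List[SubTask]) -> None:
    """Run every sub-task that is not done yet and record its outcome.

    Calling this again acts as a retry: finished sub-tasks are skipped,
    failed ones are attempted once more.
    """
    for task in tasks:
        if task.state is State.DONE:
            continue
        try:
            task.action()
            task.state, task.error = State.DONE, ""
        except Exception as exc:  # keep going so other sub-tasks still run
            task.state, task.error = State.ERROR, str(exc)


def summary(tasks: List[SubTask]) -> Dict[str, str]:
    """Map each sub-task name to its current state label."""
    return {task.name: task.state.value for task in tasks}
```

In this model, a sub-task that fails because a daemon is offline stays in the error state together with its message, and a second call to `run_pending` retries only that sub-task once the daemon is reachable again.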
Miscellaneous enhancements
- Support for the transfer of binary files from qlumand to qluman-execd was implemented. This
allows binary files to be included in root-fs customizations, which enhances their usability and
removes previous limitations.
- Added the possibility to ignore MACs in the ‘New Hosts’ widget.
- Added a Slurm daemon restart menu to the Slurmd LED at the bottom of the GUI. It allows
restarting slurmctld, slurmdbd and qluman-slurmd.
NVIDIA GPU support
NVIDIA enterprise GPU support was greatly enhanced in the Qlustar NVIDIA image modules. This
enables running DGX/HGX nodes out-of-the-box including full-featured Kubernetes support:
- Added auto-starting NVIDIA Fabric Manager to automatically support NVSwitch systems.
- Added auto-starting NVIDIA Datacenter GPU Manager.
- Implemented a new Qlustar image modules structure for the NVIDIA GPU stack.
- Provides separate image modules for different NVIDIA release series, starting with the
570 series. This will allow legacy hardware to be supported in the future. All NVIDIA components
that must match the driver version are also integrated in exactly this version to guarantee
flawless operation.
- Switched to NVIDIA open GPU kernel module drivers.
- Added an nvidia-extra image module that contains everything that doesn't depend on a specific
driver version (e.g. container toolkit and Kubernetes device plugin).
- Introduced a new Depends and Provides mechanism for image modules, so that the
driver-version independent image modules like nvidia-extra can be configured safely with any
nvidia-xyz image module.
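A Depends and Provides mechanism of this kind can be thought of as a capability check over the selected image modules. The following sketch only illustrates the idea and is not Qlustar's implementation: the metadata layout, the module name `nvidia-570`, and the capability strings are assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical module metadata; names and capability strings are illustrative only.
MODULES: Dict[str, Dict[str, Set[str]]] = {
    "nvidia-570": {"provides": {"nvidia-driver"}, "depends": set()},
    "nvidia-extra": {"provides": {"nvidia-tools"}, "depends": {"nvidia-driver"}},
}


def unmet_dependencies(
    selected: List[str],
    modules: Dict[str, Dict[str, Set[str]]] = MODULES,
) -> Dict[str, Set[str]]:
    """Return, per selected module, the capabilities it needs but nobody provides."""
    provided: Set[str] = set()
    for name in selected:
        provided |= modules[name]["provides"]

    unmet: Dict[str, Set[str]] = {}
    for name in selected:
        missing = modules[name]["depends"] - provided
        if missing:
            unmet[name] = missing
    return unmet
```

Because `nvidia-extra` depends on a capability rather than on a concrete module name, any driver-series module that provides the same capability would satisfy it, which is the property the bullet above describes.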