Qlustar

Contact Info

Legal Information

Qlustar Feature Update 11/25

A large number of new features have been completed and published since the Qlustar 14 release. Here is an overview.

QluMan

QluMan versions up to 14.0.7 have added the following:

Access to cluster web interfaces via QluMan GUI

  • A network proxy to the managed clusters was implemented. It allows a Firefox web browser (now embedded in the QluMan Singularity image) to connect to cluster-internal web interfaces with a single click on a menu entry. This works even through ssh-tunneled connections between the GUI and the cluster head-node.
  • Supports access to the node’s BMC web interface including remote console.
  • Allows convenient access to all Prometheus and Grafana web interfaces.

Redfish BIOS and Power management

Redfish BIOS and Power management via the QluMan GUI was substantially improved:

  • Allows removing and reordering BIOS boot devices.
  • Checks BIOS boot order on write and synchronizes changes if necessary.
  • Moves configured BIOS boot devices to the front of the boot order if they exist.
  • Added a filter to limit the BIOS settings displayed in the GUI, which makes it easier to find relevant entries.
  • Improved the display of BIOS options: the real BIOS text is shown rather than cryptic option keys.
  • Implemented power status, power cycle and power reset via Redfish.
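
The boot-order rule and the Redfish power actions listed above can be illustrated with a minimal Python sketch. It is not QluMan's implementation: the function names and the mapping from short action names to Redfish `ResetType` values are assumptions made for illustration. The `ComputerSystem.Reset` action and the `PowerCycle` and `ForceRestart` values come from the Redfish schema, but a real client should read the action target from the system resource instead of assuming the path used here.

```python
from typing import Dict, List

# Assumed mapping from short action names to Redfish ResetType values.
RESET_TYPES: Dict[str, str] = {
    "cycle": "PowerCycle",
    "reset": "ForceRestart",
}


def prioritize_boot_devices(boot_order: List[str], configured: List[str]) -> List[str]:
    """Move configured devices that exist in boot_order to the front.

    Configured devices that are absent from boot_order are ignored, and the
    relative order of all remaining devices is preserved.
    """
    front: List[str] = []
    for dev in configured:
        if dev in boot_order and dev not in front:
            front.append(dev)
    rest = [dev for dev in boot_order if dev not in front]
    return front + rest


def reset_request(system_id: str, action: str) -> Dict[str, object]:
    """Build (but do not send) a Redfish reset request for one system."""
    return {
        "method": "POST",
        # Conventional action path; vendors may expose a different target.
        "path": f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset",
        "json": {"ResetType": RESET_TYPES[action]},
    }
```

For example, `prioritize_boot_devices(["Boot0001", "Boot0002", "Boot0003"], ["Boot0003", "Boot0009"])` moves `Boot0003` to the front and skips `Boot0009` because it does not exist on the node.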

LDAP User Management

The LDAP user creation process was significantly improved by adding a new pop-up window that provides transparent info and error resiliency for all sub-tasks involved when a user is added.

  • Provides a new Retry button for Slurm account creation in case errors were detected.
  • Color-codes table entries for pending, done and error states.
  • Handles errors when slurmctld or slurmdbd is offline.
  • Adds Run, Run without Slurm and Save and Sync Node buttons.
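
The per-sub-task states and the retry behavior described above can be pictured with a small Python sketch. All names (`State`, `UserCreation`, the task keys) are hypothetical and do not reflect how QluMan is built. The sketch only shows the general idea: a failing sub-task is marked as an error without aborting the others, and a retry re-runs only what is not yet done.

```python
from enum import Enum
from typing import Callable, Dict


class State(Enum):
    PENDING = "pending"
    DONE = "done"
    ERROR = "error"


class UserCreation:
    """Tracks the sub-tasks of one user creation (hypothetical model)."""

    def __init__(self, tasks: Dict[str, Callable[[], None]]) -> None:
        self.tasks = tasks
        self.states: Dict[str, State] = {name: State.PENDING for name in tasks}
        self.errors: Dict[str, str] = {}

    def run(self) -> None:
        """Run every sub-task that is not done; calling run() again is a retry."""
        for name, task in self.tasks.items():
            if self.states[name] is State.DONE:
                continue
            try:
                task()
            except Exception as exc:
                # Record the failure and continue with the remaining sub-tasks.
                self.states[name] = State.ERROR
                self.errors[name] = str(exc)
            else:
                self.states[name] = State.DONE
                self.errors.pop(name, None)
```

If a Slurm-related sub-task raises because a daemon is offline, its state becomes `State.ERROR` while the other sub-tasks still complete. A second `run()` call then retries only the failed one.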

Miscellaneous enhancements

  • Support for the transfer of binary files from qlumand to qluman-execd was implemented. This allows binary files to be included in root-fs customizations, which enhances their usability and removes previous limitations.
  • Added the possibility to ignore MACs in the ‘New Hosts’ widget.
  • Added a Slurm daemon restart menu in the Slurmd LED at the bottom of the GUI. It allows restarting slurmctld, slurmdbd and qluman-slurmd.

NVIDIA GPU support

NVIDIA enterprise GPU support was greatly enhanced in the Qlustar NVIDIA image modules. This enables running DGX/HGX nodes out of the box, including full-featured Kubernetes support:

  • Added auto-starting NVIDIA Fabric Manager to automatically support NVSwitch systems.
  • Added auto-starting NVIDIA Datacenter GPU Manager.
  • Implemented a new Qlustar image modules structure for the NVIDIA GPU stack.
    • Separate image modules are provided for different NVIDIA release series, starting with the 570 series. This will allow legacy hardware to be supported in the future. All NVIDIA components that must match the driver version are integrated in exactly that version to guarantee flawless operation.
    • Switched to NVIDIA open GPU kernel module drivers.
    • Added the nvidia-extra image module, which contains everything that doesn’t depend on a specific driver version (e.g. the container toolkit and the Kubernetes device plugin).
    • Introduced a new Depends and Provides mechanism for image modules, so that driver-version-independent image modules like nvidia-extra can be configured safely with any nvidia-xyz image module.
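
To illustrate how a Depends and Provides check of this kind can work, here is a minimal Python sketch. The metadata layout, the capability name `nvidia-driver` and the module name `nvidia-570` are assumptions made for the example, not Qlustar's actual definitions. The idea is that a selection of image modules is consistent when every declared dependency is provided by some selected module.

```python
from typing import Dict, List, Set

# Hypothetical metadata; real module names and capability strings may differ.
MODULES: Dict[str, Dict[str, Set[str]]] = {
    "nvidia-570": {"provides": {"nvidia-driver"}, "depends": set()},
    "nvidia-extra": {"provides": set(), "depends": {"nvidia-driver"}},
}


def unmet_dependencies(
    selected: List[str],
    modules: Dict[str, Dict[str, Set[str]]] = MODULES,
) -> Dict[str, Set[str]]:
    """Return, per selected module, the capabilities it depends on but nobody provides."""
    provided: Set[str] = set()
    for name in selected:
        provided |= modules[name]["provides"]

    missing: Dict[str, Set[str]] = {}
    for name in selected:
        lacking = modules[name]["depends"] - provided
        if lacking:
            missing[name] = lacking
    return missing
```

With this model, selecting only `nvidia-extra` reports the missing `nvidia-driver` capability, while selecting it together with any module that provides `nvidia-driver` yields an empty result. Because the dependency names a capability and not a specific driver series, the version-independent module stays compatible with future driver modules.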