Qlustar

The HA Stack

Ensuring a cluster's availability with minimal downtime is a top priority for all admins and users. While issues with single compute or GPU nodes reduce the available compute power only slightly, problems with head, login, or storage nodes can render the entire system unusable.

The Qlustar HA Stack (QHAS) is designed to mitigate such risks by offering a robust framework for setting up cluster head nodes and storage nodes in a high-availability (HA) configuration. With QHAS, Qlustar services are fully protected from single points of failure (SPOFs), and proven HA configurations can be customized to your requirements.

The QluMan-based QHAS is the result of decades of experience working with many different HA systems. Until recently, QHAS was based on corosync + pacemaker, but it eventually became apparent that their design is not well suited to HPC and storage cluster setups. QluMan HA is tuned for these use cases and offers a simpler, more scalable, and more reliable alternative, including a powerful GUI component.

Reinventing Cluster HA with QluMan

The QluMan-based QHAS redefines cluster high availability by addressing the shortcomings of traditional HA solutions. Designed specifically for HPC, AI, and storage clusters, QHAS offers the following key advantages:

  • Fine-tunable automation levels
  • Manual mode for efficient debugging
  • Dependency-based start and stop mechanism
  • Fast startup and migrations
  • Adjustable graphical action log
  • Fewer unnecessary failovers and less downtime
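The idea behind a dependency-based start and stop mechanism can be pictured with a minimal sketch. This is not QHAS code and does not reflect the QluMan API: the resource names and the `start_order` / `stop_order` helpers are hypothetical, chosen only to show that resources start after the things they depend on and stop in the reverse order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical resources, each mapped to the resources it depends on.
# These names are illustrative assumptions, not QHAS resource types.
DEPENDENCIES = {
    "virtual-ip": set(),
    "storage-pool": set(),
    "filesystem": {"storage-pool"},
    "nfs-export": {"filesystem", "virtual-ip"},
}


def start_order(deps):
    """Return an order in which every resource starts after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps):
    """Return the reverse of the start order, so dependents stop first."""
    return list(reversed(start_order(deps)))


if __name__ == "__main__":
    print("start:", start_order(DEPENDENCIES))
    print("stop: ", stop_order(DEPENDENCIES))
```

In this sketch, a resource is never started before everything it needs is running, and it is always stopped before the resources it relies on. Resources with no dependency relationship between them may appear in either order.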

With its simplified architecture, QHAS ensures reliability, stability, and efficiency, making it the ideal HA solution for mission-critical workloads where uptime is non-negotiable.

Key Features of QHAS

The Qlustar HA Stack provides a set of advanced features to enhance high-availability clusters:

  • Automated failover with minimal downtime
  • Redundant system setups for reliability
  • Support for Lustre and BeeGFS parallel file systems
  • Optimized failover management for head and storage nodes
  • Full-featured CLI as an alternative to the GUI
  • Tailored for HPC and AI workloads
  • Simplified configuration and operation with the QluMan GUI
  • Flexible automation settings to prevent unnecessary failovers
  • Reliable monitoring of critical cluster services