Ensuring a cluster's availability with minimal downtime is a top priority for all admins and users. While issues with single compute or GPU nodes reduce the available compute power only slightly, problems with head-, login-, or storage-nodes can render the entire system unusable.
The Qlustar HA Stack (QHAS) is designed to mitigate such risks by offering a robust framework to set up cluster head-nodes and storage-nodes in a high-availability (HA) configuration. With QHAS, Qlustar services are fully protected from single points of failure (SPOFs), and proven HA configurations can be customized to your requirements.
The QluMan-based QHAS is the result of decades of experience working with many different HA systems. Until recently, QHAS was based on corosync + pacemaker, but it eventually became apparent that their design is not well suited for HPC and storage cluster setups. QluMan HA is tuned for these use cases and offers a simpler, scalable, and more reliable alternative, including a powerful GUI component.
The QluMan-based QHAS redefines cluster high availability by addressing the shortcomings of traditional HA solutions. Designed specifically for HPC, AI, and storage clusters, QHAS offers the following key advantages:
With its simplified architecture, QHAS ensures reliability, stability, and efficiency, making it the ideal HA solution for mission-critical workloads where uptime is non-negotiable.
The Qlustar HA Stack provides a set of advanced features to enhance high-availability clusters: