A new approach to Cluster High-Availability

It’s been a bit quiet recently concerning new features in Qlustar and QluMan and you might have wondered whether the developers were enjoying a longer (certainly well-deserved) break at the beach. Wrong guess :) The reason is that we quietly started off an ambitious new project about two years ago: QluMan HA. Today we’re very proud to announce its first public release and we believe that the final result might actually deserve the attribute revolutionary. But you’ll be your own judge.

Let’s elaborate a little and put our approach into historical context: For more than two decades, HA systems have been at the heart of mission-critical infrastructure. Their history began well before Linux made any inroads into the datacenter (think of Sun Cluster, which pioneered the concept back in the 90s).

A number of Linux solutions in the field have come and gone (SGI FailSafe, Heartbeat, OpenAIS, … did I forget one?), and the corosync + pacemaker combo has been pretty much established as the de-facto standard for quite a while now. However, one shortcoming remains: Most admins responsible for HA clusters still have a hard time learning the complicated intricacies of this traditional type of HA system and how to configure and operate it.

The expected reward for all this effort is the peace of mind that on day X, when a failure scenario kicks in, everything will continue to run smoothly because HA automation will trigger a clean and fast service migration and save the day. Unfortunately, reality often looks different: Unwanted node resets occur or automatic service recovery fails and, even worse, manual recovery of such an HA system is a lot more complicated and time-consuming than for a non-HA setup. This situation has caused quite a bit of justified skepticism towards traditional HA solutions among admins.

With the advent of container-based computing, a different approach to HA emerged. The keyword here is Kubernetes, the standard for clustered container orchestration with built-in high-availability, and it’s no exaggeration to call this approach a revolution. In such an environment, the task of making a system highly available is actually quite a bit simpler than for non-container HA: All you need to ensure is that a container can run on multiple nodes in a cluster, and there is no need to worry about what’s actually being done inside the container. If a container fails to run or crashes on one node, Kubernetes notices and simply starts it up on another node.

But there is still a world outside of containers for which HA solutions are desperately needed as well: Think about HPC head-nodes or storage clusters for parallel filesystems, for example. For such use-cases, one usually either prefers to avoid the additional complexity of a container-based setup, or containers just make little sense at all, e.g. for software that runs in the Linux kernel, like Lustre.

Implemented on top of general Qlustar functionality, QluMan HA was specifically designed for HPC, AI and storage cluster solutions where a scalable number of compute and storage nodes are controlled and managed by a pair of redundant head-nodes. And just like compute nodes in a standard HPC cluster are booted, configured and operated via a traditional single cluster head-node, additional HA nodes (for example the ones of a Lustre cluster) are managed and monitored via the QluMan HA controllers running on the head-node pair.

The beauty of it is that the complexity of the HA logic is well hidden behind an intuitive user interface. Hence, once the HA setup is configured, an admin doesn’t need to know terribly much about its internals to be able to operate it. Using the QluMan HA GUI, there isn’t really a lot more to it than clicking on start/stop buttons and selecting the desired automation levels. The GUI is supplemented by a CLI tool that allows the HA stack to be operated entirely from the command line.

If you’re ready to dig deeper, under the hood of QluMan HA you will discover powerful and sophisticated features. Among them are

Fine-tunable automation levels that let you progress from a purely manual control mode, which is useful in the initial production phase, to a fully automatic level that you can enable once you completely trust your implementation and settings.
A dependency-based start and stop mechanism for HA resources that dramatically speeds up the start, stop, migration and fail-over times of services as compared to a corosync/pacemaker implementation.
A graphical action log that transparently shows all HA actions including their dependencies and allows you to select the desired log level even for past actions.
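
To give a rough idea of why a dependency-based mechanism can speed things up, here is a minimal Python sketch. It is not QluMan HA code, and the resource names are made up for illustration. It only shows the general technique: once the dependencies between resources are known, everything whose prerequisites are already up can be started in parallel, instead of working through one long sequential list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def start_waves(dependencies):
    """Group resources into waves; all resources in a wave can start in parallel.

    `dependencies` maps each resource to the set of resources it needs first.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave came up successfully
    return waves


def stop_waves(dependencies):
    """A valid stop order: the start waves in reverse, dependents first."""
    return list(reversed(start_waves(dependencies)))


# Hypothetical HA resources, not taken from a real QluMan HA configuration.
deps = {
    "virtual-ip": set(),
    "storage-pool": set(),
    "filesystem-mount": {"storage-pool"},
    "nfs-export": {"filesystem-mount", "virtual-ip"},
}

for number, wave in enumerate(start_waves(deps), start=1):
    print(f"start wave {number}: {', '.join(wave)}")
```

In this toy example, the virtual IP and the storage pool do not depend on each other, so they come up together in the first wave rather than one after the other. The more independent resources a setup has, the more such a scheme can save.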

To find out more about QluMan HA, have a look at its comprehensive documentation: Learn about its concepts and components, study its architecture, figure out how to configure a Qlustar HA cluster or get to know how to operate it.

Finally, if you think your clusters need HA protection, but lack the resources and/or experience to design an efficient and reliable HA setup yourself, don’t hesitate to contact us for help.
