QluMan is Qlustar’s control center. Apart from the initial base installation of the head-node(s), cluster configuration, management and operation are mainly performed via the easy-to-use QluMan GUI, which is based on the Qt toolkit.
The QluMan GUI is multi-user as well as multi-cluster capable: different users can work with the GUI simultaneously, and changes made by one user become visible in real time in the windows opened by all other users. In addition, a virtually unlimited number of clusters can be managed at the same time within a single instance of the QluMan GUI. Each cluster is shown in a tab or in a separate main window.
QluMan has a distributed architecture. It is designed for limitless scalability and efficiency without compromise. Its configuration space is backed by a MariaDB database. The central switchboard is qlumand, the admin service, to which all other components connect at least initially. All QluMan node operations are executed via the execd component, an instance of which runs on every node. Whenever necessary and possible, QluMan’s components operate asynchronously.
One of the most important design goals for QluMan’s communication platform is reliable and scalable messaging with guaranteed delivery. To prevent an inconsistent cluster state, messages exchanged between different QluMan components must not be lost, even in the presence of node failures and network packet loss. Furthermore, if communication channels (network connections) are blocked, we want to be notified and to resume message transfer as soon as the channels are functional again.
Here are some of the reasons why these requirements are so important for cluster consistency and reliability:
Communication between different QluMan components uses intelligent heartbeat and message acknowledgment techniques with adaptive timeouts and resend intervals to detect failure of nodes and/or to ensure delivery of messages.
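The acknowledgment and resend idea described above can be illustrated with a small sketch. It is not QluMan's implementation: the class names are hypothetical, and the adaptive timeout borrows the smoothed round-trip estimator familiar from TCP (with exponential backoff on resends) purely as a stand-in, since the source does not describe the actual algorithm. Time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveTimeout:
    """TCP-style smoothed round-trip estimator (an assumption, not QluMan's algorithm)."""
    srtt: float = 1.0     # smoothed round-trip time in seconds
    rttvar: float = 0.5   # smoothed round-trip deviation
    alpha: float = 0.125  # weight of a new sample in srtt
    beta: float = 0.25    # weight of a new sample in rttvar

    def observe(self, rtt: float) -> None:
        # Update the deviation first, then the smoothed round-trip time.
        self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
        self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

    @property
    def timeout(self) -> float:
        return self.srtt + 4 * self.rttvar


@dataclass
class ReliableSender:
    """Tracks unacknowledged messages and decides when to resend them."""
    estimator: AdaptiveTimeout = field(default_factory=AdaptiveTimeout)
    pending: dict = field(default_factory=dict)  # msg_id -> bookkeeping entry
    next_id: int = 0

    def send(self, payload, now: float) -> int:
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = {"payload": payload, "sent": now, "attempts": 1}
        return msg_id

    def ack(self, msg_id: int, now: float) -> bool:
        entry = self.pending.pop(msg_id, None)
        if entry is None:
            return False  # unknown or duplicate acknowledgment
        # Only sample the round-trip time of messages that were never resent,
        # because a resent message makes the measurement ambiguous.
        if entry["attempts"] == 1:
            self.estimator.observe(now - entry["sent"])
        return True

    def due_for_resend(self, now: float) -> list:
        due = []
        for msg_id, entry in self.pending.items():
            # Double the waiting time with every resend (exponential backoff).
            deadline = entry["sent"] + self.estimator.timeout * 2 ** (entry["attempts"] - 1)
            if now >= deadline:
                entry["sent"] = now
                entry["attempts"] += 1
                due.append(msg_id)
        return due
```

A message stays in `pending` until it is acknowledged, so nothing is silently dropped; a heartbeat check could reuse the same `timeout` value to decide when a silent node should be considered failed.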
Transport type | Reliable Transport | Scalable | Reliable Messaging | Guaranteed Delivery |
---|---|---|---|---|
UDP | | | | |
TCP | | | | |
ZeroMQ | | | | |
QluMan/ZeroMQ | | | | |
QluMan secures its communication with CurveZMQ, a ZeroMQ security mechanism based on CurveCP (invented by D.J. Bernstein). Public/private key certificates are used for encryption between admind, execd and the QluMan GUI. No security-relevant communication between any two QluMan components is unencrypted!
QluMan technology provides perfect forward secrecy, since key certificates are unique for each session. QluMan user authentication also works with public keys, so there is no "weak password problem". In particular, QluMan communications are protected by design against replay, man-in-the-middle and hijacking attacks, an absolute requirement for secure cloud computing.
A simple and flexible yet highly precise mechanism to configure cluster nodes and group them is one of the most crucial components a smart cluster management system must provide. To keep the administrative effort as low as possible, it should allow generalizing as much as possible while still making it possible to assign specific configurations to individual nodes when needed. The design goal is to support arbitrarily complex cluster configurations, each with a minimal number of settings to be defined by an admin.
QluMan meets this design goal by a) providing three different configuration property categories:
and b) by supplementing these with a hierarchical configuration space consisting of four hierarchy levels ranging from least to most specific:
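How a hierarchy ordered from least to most specific can keep the number of settings small is sketched below. The level names (`cluster_wide`, `template`, `group`, `host`), the property keys and the rule that a more specific level overrides a less specific one are all assumptions made for illustration; they are not taken from QluMan.

```python
from typing import Any, Mapping, Sequence


def resolve(levels: Sequence[Mapping[str, Any]]) -> dict:
    """Merge settings ordered from least to most specific; later levels win."""
    effective: dict = {}
    for level in levels:
        effective.update(level)
    return effective


# Hypothetical four-level hierarchy, least specific first.
cluster_wide = {"scheduler": "slurm", "image": "default", "ib": False}
template = {"image": "gpu-image"}
group = {"ib": True}
host = {"image": "gpu-image-test"}

effective = resolve([cluster_wide, template, group, host])
print(effective)
```

Only the values that differ from the more general level need to be stated at the more specific one, which is what keeps the per-node configuration short.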
Given all the options and tweaks QluMan offers to create an optimal cluster setup, it can take less experienced cluster admins a while to grasp some of the ideas and concepts involved in advanced configurations. QluMan’s setup wizards were created for exactly this reason.
Using auto-detected node properties, the node setup and GPU wizards, for example, guide newcomers through the creation of host templates and the corresponding property sets, so that machines with different hardware and functionality are grouped and configured in the most effective way. Whenever several alternative configuration options are available, the wizards suggest which ones make the most sense in the given context. As a consequence, admins can bootstrap and expand arbitrarily complex clusters in hours rather than days or weeks.
QluMan provides a powerful graphical remote command execution engine (RXEngine). It allows running shell commands on any number of hosts in parallel and analyzing their output and status in real time. Commands fall into two categories: pre-defined commands and custom commands.
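The general pattern of running one command on many hosts in parallel and evaluating the results as they arrive can be sketched as follows. This is not the RXEngine itself: `run_on_hosts`, `group_by_result` and `fake_runner` are hypothetical names, the runner is a stand-in that never contacts a real host, and grouping hosts by identical status and output is an assumed way of analyzing the results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_hosts(hosts, command, runner, max_workers=32):
    """Call runner(host, command) for every host in parallel.

    The runner must return a (status, output) tuple.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, host, command): host for host in hosts}
        # Results are collected in completion order, not submission order.
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:
                # A failing runner is recorded instead of aborting the whole run.
                results[host] = (255, f"runner failed: {exc}")
    return results


def group_by_result(results):
    """Collect hosts that produced the same (status, output) pair."""
    groups = defaultdict(list)
    for host, result in results.items():
        groups[result].append(host)
    return {result: sorted(names) for result, names in groups.items()}


def fake_runner(host, command):
    """Stand-in for real remote execution; pretends that node03 fails."""
    if host == "node03":
        return (1, "command failed")
    return (0, "ok")


results = run_on_hosts(["node01", "node02", "node03"], "uptime", fake_runner)
print(group_by_result(results))
```

Grouping identical results keeps the overview readable on large clusters, where most hosts are expected to answer in the same way and only the outliers need attention.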
Large clusters are often managed by a group of administrators with different skill levels. To prevent damage caused by wrong configurations or other operational mistakes, it is important to be able to limit the actions that less experienced admins are allowed to perform.
QluMan has all the features required to handle such scenarios. It comes with multi-user support and provides an interface to set up its users together with their permissions within the QluMan framework. Note that QluMan users are not connected to system users in any way.
Permissions are managed efficiently through the concept of user roles. A user role is a collection of permissions to perform various QluMan operations. Once created, it can be assigned to a user. User roles can also be assigned to RXEngine commands, thereby restricting the set of users who are permitted to execute a specific command on cluster nodes.
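The relationship between roles, users and commands described above can be modeled in a few lines. All class names, permission strings and the rule that a command without any assigned role cannot be executed by anyone are assumptions made for this sketch; they are not QluMan's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A named collection of permissions."""
    name: str
    permissions: frozenset


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)

    def can(self, permission: str) -> bool:
        # A user holds a permission if any assigned role grants it.
        return any(permission in role.permissions for role in self.roles)


@dataclass
class Command:
    name: str
    allowed_roles: set = field(default_factory=set)

    def may_execute(self, user: User) -> bool:
        # Assumed rule: the user needs at least one role assigned to the command.
        return bool(self.allowed_roles & user.roles)


# Hypothetical roles and permission names.
operator = Role("operator", frozenset({"view_nodes", "reboot_node"}))
admin = Role("admin", frozenset({"view_nodes", "reboot_node", "edit_config"}))

alice = User("alice", {admin})
bob = User("bob", {operator})

reimage = Command("reimage-node", allowed_roles={admin})
print(reimage.may_execute(alice), reimage.may_execute(bob))
```

Assigning roles rather than individual permissions means that a change to a role takes effect for every user and command that references it.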