Data Center Manager Leveraging OpenStack

这是去年的一个基于OpenStack的数据中心管理软件的想法。

Abstract

OpenStack facilates users to provision and manage cloud services in a convenient way, including compute instances, storage and network. Meanwhile, data center requires a converged, uniformed management solution to provision, monitor, manage and diagnostic servers, and even collaborate seamlessly with other existing IT solution.This data center manager addresses this requirement by providing an open-source, easy-to-customizing, converged management solution based on OpenStack technologies, plugin mechanisms.

Problem statement

There are many softwares managing machines in data center but it seems that none of them provide a solution perfectly resolving all the requirements in data center: management softwares provided by hardware vendors always focus on machines manufactured by them; the existing third-party softwares only provides high-level management due to no knowledges of specific hardwares.  

As a result, IT department has to use several management sofwares, which brings about disadvantages: increasing learning cost; struggling with different software intefaces; no easy second-development or integration with the existing IT systems.

Our solution

The solution propose a new data center manager(henceforth, called DCM), which is customizable, modular and supporting plugins provided by third-party or hardware vendors and also leverages Ceilometer (telemetering), Ironic (baremetal), Glance (Image), Heat (collaboration) of OpenStack.

Customizable

First, this DCM provides RESTful API, besides traditional CLI and Web UI, which facilitates easy collaboration with other IT system, such as machine lifecycle management. And even user can develop mobile app to manage data center. Second, user can define his monitoring target, event trigger threshold and select corrective actions for this event if necessary in a template. Last but not least important, IT department, hardware vendor or third-party can easily add their specific solution based on their specific requirement or their specific hardware capabilities.

Modular

As OpenStack always does, the communication between components in a project is via AMQP or OSLO messages and that between different project via RESTful API. It will provide good isolation so that the failure of one component doesn’t affect other components in the same project or in other projects.

Pluginable

Plugin interface facilitates hardware vendor to implement failure handling or diagnostics solution based on their own hardware. And it also allow to integrated the brilliant features of current system management software available.

This data center manager provides 3 basic functionalities:

Deploy and Configure

Currently, 2 ways to deploy and configure are in plan: golden images and Puppet plus OpenStack Heat. The former one facilitates users to deploy on a large number of the identical hardwares with their own customized golden image, which leverages Glance; the latter is the current Puppet way and it can be integrated into this data center manager through OpenStack Heat. Also, we are watching for the development of OpenStack TripleO and plan to integrate TripleO into DCM.

Monitor/Manage

All the information and status of hardware devices are displayed and once any failure and the sign of failure occurs, DCM will notify user of this event and take corrective actions (such as retiring or rebooting this machine)  or preventative actions (such as marking the bad page unavailable in DIMM if there are a number of sticky bits) based on user’s customized policy. Meanwhile, critical software or services running on these physical machines are also be monitored in the same way. User can define their monitoring target/threshold/action in the template. DCM also provides some default templates based on the role of physical machines, such as mysql database server or apache web server. These funcationalities utilize OpenStack Ceilometer and Ironic.

Diagnostics

Sometimes user is notified a hardware error event and he might want to know the possible root cause and the status of the related hardware. Diagnostics functionality serves this purpose. Online diagnostics will do this without impacting  your business or Operating system’s running and offlne diagnostics will automatically rebooting this machine and enter into UEFI and execute hardware tests which might erasing the user data.

Here is the architecture diagram. 

 

Evidence the solution works

OpenStack are used to provision, manage, monitor virtual machines to provide cloud services, which manifest that these technologies can work with virtual machines perfectly and as long as there is an interface which works with physical machines, these technologies should be work with physical machines. And OpenStack Ironic can work as this role. Although the original purpose of Ironic is to provision Cloud services on physical machine, it can be used as an interface to work with physical machines.

A working prototype has been implemented which address the monitoring/manage functionality, which works in our local machine room. But obviously, it need to be extended to address all the features mentioned above.

Competitive approaches

Currently, there are 2 kinds of system management softwares: provided by hardware vendors and provided by third-party software vendors, but neither of them are considered as perfectly addressing data center’s requirement.

Softwares provided by hardware vendors might work well with their own physical hardwares but they don’t work well with, or even don’t support physical machines of other vendors and in most cases, one data centers hosts physical machines provided by different vendors.

Softwares provided by third-party software vendors might work with all physical machines but they just provide the general or high-level features because they don’t know the details or the specifics of hardwares provided by one vendors.

The biggest advantage of this DCM is software-defined in all level. User can specify the monitor target and the threshold via template mechanism; user can also define how they are monitored and the corresponding corrective or preventive actions once an event occurs via plugin mechanism. In addition, plugin mechanism allows hardware vendors to provide their specific diagnostic solution based on their own hardware or firmware features.

 

References

Ceilometer architecture:   http://docs.openstack.org/developer/ceilometer/architecture.html

Ironic architecture:  http://docs.openstack.org/developer/ironic/dev/architecture.html

 

你可能感兴趣的:(openstack)