Building Hadoop on OpenStack

Savanna 0.2 released: Bring on the new Hadoop on OpenStack features!

by Sergey Lukjanov

Savanna, the Hadoop on OpenStack project started by Mirantis and now also with contributions from Red Hat and Hortonworks, has now seen version 0.2 released into the wild.

The Savanna project is designed to provide users with a simple means to provision a Hadoop cluster on OpenStack by specifying just a few parameters, such as the Hadoop version, the cluster topology, and node hardware details. In just a few minutes, you have a working Hadoop cluster to use, with OpenStack as its infrastructure.

With that in mind, and given the growing number of people interested in Savanna, we are happy to announce the release of Savanna 0.2. In this post, we’ll explain the new features and improvements in this release, which benefit Savanna both as an OpenStack component and as a standalone project.

You can see most of the new features we’d like to share in this video by my colleague, Dmitry Mescheryakov:

Brand new Savanna 0.2 version features

Here are some of the new features you can use right now in Savanna 0.2.

Pluggable Provisioning Mechanism

Savanna now supports integration with third-party management tools. This integration is made possible by an extension mechanism for provisioning providers. In short, responsibilities are divided between the Savanna core and the plugin as follows:

  • Savanna interacts with the user and provisions different infrastructure resources, such as the virtual machines, block storage, public IPs and so on.
  • The plugin installs and configures the Hadoop cluster on the already launched virtual machines. Optionally, the plugin can deploy management and monitoring tools for the cluster and expose endpoints so that the user can work with them. Additionally, Savanna includes some tools designed to help the plugin communicate with virtual machines and other potential infrastructure resources.

With this pluggable provisioning mechanism implemented, Savanna can be extended with future plugins for third party management tools such as Apache Ambari, Cloudera Management Console, Intel Hadoop and so on. (Hortonworks Data Platform plugin integration will be supported in the next minor release, Savanna 0.2.1.)
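
The split of responsibilities described above can be sketched in a few lines of Python. This is only an illustration: the class and method names (ProvisioningPlugin, configure_cluster, start_cluster, provision) are hypothetical and are not taken from the Savanna code base, and the "virtual machines" are plain strings standing in for resources the core would launch through OpenStack.

```python
from abc import ABC, abstractmethod


class ProvisioningPlugin(ABC):
    """Hypothetical plugin contract; not Savanna's real interface."""

    @abstractmethod
    def configure_cluster(self, instances):
        """Install and configure Hadoop on already launched instances."""

    @abstractmethod
    def start_cluster(self, instances):
        """Start Hadoop services and return management endpoints, if any."""


class EchoPlugin(ProvisioningPlugin):
    """Toy plugin that only records what it was asked to do."""

    def __init__(self):
        self.log = []

    def configure_cluster(self, instances):
        self.log.append(("configure", list(instances)))

    def start_cluster(self, instances):
        self.log.append(("start", list(instances)))
        return {}  # this toy plugin exposes no management console


def provision(plugin, instance_count):
    """Core side: create infrastructure, then hand over to the plugin."""
    # Stand-in for launching VMs, volumes and public IPs via OpenStack.
    instances = [f"vm-{i}" for i in range(instance_count)]
    plugin.configure_cluster(instances)
    endpoints = plugin.start_cluster(instances)
    return instances, endpoints
```

In this sketch the core never needs to know which Hadoop distribution or management tool is being installed, which is the property that would let a different plugin be swapped in.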

Vanilla Hadoop plugin

The Vanilla Hadoop plugin is a reference plugin implementation that enables the user to launch Apache Hadoop clusters without any management consoles. This plugin is based on images with Apache Hadoop pre-installed. The Vanilla plugin supports almost all Hadoop cluster topologies, along with the ability to manually scale existing clusters and use OpenStack Swift as input/output for MapReduce jobs.
To make things even easier, we’ve created diskimage-builder elements to automate Hadoop image creation, with support for both Ubuntu and Fedora.

Hadoop cluster scaling

The mechanism for cluster scaling is designed to enable the user to change the number of running instances without creating a new cluster. The user may change the number of instances in existing Node Groups, or add new Node Groups. If for any reason the cluster fails to scale properly, all changes will be rolled back, preserving the integrity of the cluster. (Currently only the Vanilla plugin supports this feature.)
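
The all-or-nothing behaviour described above can be illustrated with a minimal sketch. It is not Savanna's implementation: scale_cluster, ScalingError and the apply_change callback are hypothetical names, and node groups are reduced to a name-to-count mapping.

```python
import copy


class ScalingError(Exception):
    """Raised when a requested change cannot be applied."""


def scale_cluster(node_groups, changes, apply_change):
    """Return the new node group counts, or the original ones on failure.

    node_groups:  dict mapping node group name -> instance count
    changes:      dict mapping node group name -> new instance count
                  (names not yet present are added as new node groups)
    apply_change: callable(name, old_count, new_count) that does the
                  real work and raises ScalingError on failure
    """
    snapshot = copy.deepcopy(node_groups)
    current = dict(node_groups)
    applied = []
    try:
        for name, new_count in changes.items():
            old_count = current.get(name, 0)
            apply_change(name, old_count, new_count)
            applied.append((name, old_count, new_count))
            current[name] = new_count
    except ScalingError:
        # Undo the steps that succeeded, in reverse order.
        for name, old_count, new_count in reversed(applied):
            apply_change(name, new_count, old_count)
        return snapshot
    return current
```

The sketch assumes that undoing a step cannot itself fail; a real system would need to handle that case as well.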

Cinder supported as a block storage provider

OpenStack Cinder is a block storage service that can be used as an alternative to ephemeral drives, which are in fact just files on the same host as the virtual machine and are vulnerable if there’s a problem with the host. Using Cinder volumes instead of ephemeral storage increases both the reliability of data and the I/O performance, both of which are crucial for the HDFS service.

The user can set the number of volumes to be attached to each node and the size of each volume for both cluster creation and scaling operations.
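
As a rough illustration of how these two settings combine, the sketch below computes the total Cinder capacity attached to a group of nodes. VolumeSpec and its field names are made up for this example and should not be read as Savanna's actual parameter names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeSpec:
    """Hypothetical per-node volume settings (names are illustrative)."""
    volumes_per_node: int
    volume_size_gb: int


def raw_capacity_gb(node_count, spec):
    """Total size of all Cinder volumes attached across the nodes."""
    return node_count * spec.volumes_per_node * spec.volume_size_gb
```

For example, four nodes with two 100 GB volumes each give 800 GB of raw attached storage, and scaling the same group to six nodes gives 1200 GB. HDFS replication reduces the usable capacity further.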

Anti-affinity supported for Hadoop processes

One of the problems with virtualized Hadoop is that there is no inherent ability to control where the machine is actually running. This is important because in this situation, we cannot be sure that two new virtual machines are started on different physical machines. As a result, any replication within the cluster is not reliable because all replicas may turn up on the same physical host, which defeats the purpose of replicating in the first place.

We’re happy to say that this is fixed in Savanna 0.2! The anti-affinity feature lets you explicitly tell Savanna to run specified processes on different compute nodes. This is especially useful for the Hadoop DataNode process when it comes to making HDFS replicas reliable. This feature is implemented on the Savanna core side, so there’s no need to support it in a plugin.
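
To make the constraint concrete, here is a toy first-fit placement in Python. It is not how Savanna or the OpenStack scheduler actually places instances; place_instances and its arguments are hypothetical, and the only point is to show the rule that two instances running an anti-affinity process never share a host.

```python
def place_instances(instances, hosts, anti_affinity):
    """Assign each instance to a host using a simple first-fit rule.

    instances:     list of (instance_name, set_of_process_names)
    hosts:         list of compute host names
    anti_affinity: set of process names that must not share a host

    Returns a dict mapping instance_name -> host. Raises ValueError
    when this greedy strategy finds no host that fits.
    """
    # Anti-affinity processes already placed on each host.
    used = {host: set() for host in hosts}
    placement = {}
    for name, processes in instances:
        constrained = set(processes) & set(anti_affinity)
        for host in hosts:
            if not (used[host] & constrained):
                used[host] |= constrained
                placement[name] = host
                break
        else:
            raise ValueError(f"no host satisfies anti-affinity for {name}")
    return placement
```

With two hosts and anti-affinity set for the DataNode process, two DataNode instances end up on different hosts, while a third DataNode cannot be placed at all. Instances without a constrained process can go anywhere.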

OpenStack Dashboard plugin

Savanna Dashboard is a plugin for the OpenStack Dashboard. It supports almost all operations exposed through the Savanna REST API, including template management, cluster creation, and scaling, as you can see in the video above.

Conclusion

The release of Savanna 0.2 is a big step for Savanna and Mirantis, and we believe this project has a great future as a part of both the OpenStack and Hadoop communities. As with any open source project, we want you to use the software, and we welcome any contributions you would like to make. Users, developers, and deployment specialists can all find documentation tailored to their needs at these links:

Savanna docs: https://savanna.readthedocs.org/en/0.2/index.html
Savanna wiki: https://wiki.openstack.org/wiki/Savanna
Launchpad project: https://launchpad.net/savanna

Enjoy!

3 Responses

  1. savant

    “Using Cinder volumes instead of ephemeral storage increases both the reliability of data and the I/O performance, both of which are absolutely crucial for HDFS service.”

    Hadoop prefers local disks, and it expects the hosts to fail once in a while. That’s why there is a replication factor.

    Is Cinder the preferred storage for Hadoop on OpenStack?

    July 19, 2013 21:51
    • Sergey Lukjanov

      Thank you for the question.

      The default block storage in OpenStack is the ephemeral drive. As I said before, it’s just a file on the same host as the virtual machine, so dozens of ephemeral drives could be located on the same HDD. That is a performance problem, and Cinder solves it well. Another problem is that several replicas could be stored on ephemeral drives (of different virtual machines with DataNodes) that are backed by one HDD. That is a reliability problem, and we have two solutions for it: Cinder and anti-affinity.

      BTW, Cinder volumes can be backed by different drivers, for example NAS or a local HDD per volume (we’ll support this in future releases).

      July 20, 2013 21:53
  2. jason

    Savanna is very nice.
    I need test clusters, and I can have one in minutes and tear it down when done.
    I can also use OpenStack security groups to create fine-grained network access controls for the clusters, and change them on demand.
