This chapter covers
■ The origins of Hadoop, HBase, and NoSQL
■ Common use cases for HBase
■ A basic HBase installation
■ Storing and querying data with HBase
HBase is a database: the Hadoop database. It’s often described as a sparse, distributed, persistent, multidimensional sorted map, which is indexed by rowkey, column key, and timestamp. You’ll hear people refer to it as a key-value store, a column family-oriented database, and sometimes a database storing versioned maps of maps. All these descriptions are correct. But fundamentally, it’s a platform for storing and retrieving data with random access, meaning you can write data as you like and read it back again as you need it. HBase stores structured and semistructured data naturally, so you can load it with tweets, parsed log files, and a catalog of all your products right along with their customer reviews. It can store unstructured data too, as long as it’s not too large. It doesn’t care about types and allows for a dynamic and flexible data model that doesn’t constrain the kind of data you store.
HBase isn’t a relational database like the ones to which you’re likely accustomed. It doesn’t speak SQL or enforce relationships within your data. It doesn’t allow interrow transactions, and it doesn’t mind storing an integer in one row and a string in another for the same column.
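To make the "sorted map of maps" description more concrete, here is a minimal in-memory sketch in Python. It is not HBase code and does not use the HBase client API: the helpers `put`, `get`, and `scan`, the sample rowkeys, and the `info:name` / `info:age` column keys are all assumptions made up for illustration. It only shows how a value can be addressed by rowkey, column key, and timestamp, how cells that were never written take no space, and how the same column can hold differently typed values in different rows.

```python
# A toy model of the "sorted map of maps" description. NOT the HBase API:
# put/get/scan and all sample keys below are hypothetical illustrations.

table = {}  # rowkey -> column key -> timestamp -> value (raw bytes, untyped)


def put(rowkey, column, timestamp, value):
    """Store one versioned cell; unwritten cells simply have no entry (sparse)."""
    table.setdefault(rowkey, {}).setdefault(column, {})[timestamp] = value


def get(rowkey, column):
    """Return the newest version of a cell, or None if it was never written."""
    versions = table.get(rowkey, {}).get(column)
    if not versions:
        return None
    return versions[max(versions)]


def scan():
    """Yield cells ordered by rowkey, then column key, newest timestamp first."""
    for rowkey in sorted(table):
        for column in sorted(table[rowkey]):
            for ts in sorted(table[rowkey][column], reverse=True):
                yield rowkey, column, ts, table[rowkey][column][ts]


put(b"user2", b"info:name", 1, b"Mark")
put(b"user1", b"info:name", 1, b"Tina")
put(b"user1", b"info:name", 2, b"Tina T.")          # newer version of the same cell
put(b"user1", b"info:age", 1, b"\x00\x00\x00\x1f")  # an integer stored as raw bytes
put(b"user2", b"info:age", 1, b"thirty")            # same column, a string this time

rows = list(scan())
for row in rows:
    print(row)
```

Rows come back in rowkey order regardless of insertion order, and `get` returns the most recent version of a cell. A real HBase table adds distribution and persistence on top of this logical model, which the sketch leaves out entirely.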
HBase is designed to run on a cluster of computers instead of a single computer. The cluster can be built using commodity hardware; HBase scales horizontally as you add more machines to the cluster. Each node in the cluster provides a bit of storage, a bit of cache, and a bit of computation as well. This makes HBase incredibly flexible and forgiving. No node is unique, so if one of those machines breaks down, you simply replace it with another. This adds up to a powerful, scalable approach to data that, until now, hasn’t been commonly available to mere mortals.
Join the community
Unfortunately, no official public numbers specify the largest HBase clusters running in production. This kind of information easily falls under the realm of business confidential and isn’t often shared. For now, the curious must rely on footnotes in publications, bullets in presentations, and the friendly, unofficial chatter you’ll find at user groups, meet-ups, and conferences.
So participate! It’s good for you, and it’s how we became involved as well. HBase is an open source project in an extremely specialized space. It has well-financed competition from some of the largest software companies on the planet. It’s the community that created HBase and the community that keeps it competitive and innovative.
Plus, it’s an intelligent, friendly group. The best way to get started is to join the mailing lists.[1] You can follow the features, enhancements, and bugs being currently worked on using the JIRA site.[2] It’s open source and collaborative, and users like yourself drive the project’s direction and development.
Step up, say hello, and tell them we sent you!
Given that HBase has a different design and different goals as compared to traditional database systems, building applications using HBase involves a different approach as well. This book is geared toward teaching you how to effectively use the features HBase has to offer in building applications that are required to work with large amounts of data. Before you set out on the journey of learning how to use HBase, let’s get some historical perspective on how HBase came into being and the motivations behind it. We’ll then touch on use cases people have successfully solved using HBase. If you’re like us, you’ll want to play with HBase before going much further. We’ll wrap up by walking through installing HBase on your laptop, tossing in some data, and pulling it out. Context is important, so let’s start at the beginning.
[1] HBase project mailing lists: http://hbase.apache.org/mail-lists.html.
[2] HBase JIRA site: https://issues.apache.org/jira/browse/HBASE.