Big Data, Crystal Balls and Looking Glasses: Reviewing 2016, predicting 2017

End-of-year reviews are boring -- and everyone does them. Predictions are boring -- and they are hard. Of course, this is different -- because big data.

How do big data people go about making end-of-year reviews and predictions? Using data is the obvious answer, but there are a few issues with that approach: there is no synthesis in data alone -- you have to find the story behind the data, pick an angle and seek meaning. In addition, that approach does not account for subtle hints, industry knowledge, and big ideas.

To paraphrase Carl Sagan, "we wish to find the truth, no matter where it lies. But to find the truth we need imagination and data both. We will not be afraid to speculate, but we will be careful to distinguish speculation from fact." In this spirit, let's keep things equally opinionated and objective in 2017.

It's the end of Hadoop as we know it, and I feel fine

Hadoop turned 10 in 2016. It's come a long way from a pet project named after a toy elephant to the (metaphorical) stampeding beast now in most every CXO's name-dropping list. The latest Big Data maturity survey showed that 73 percent of respondents are now in production with Hadoop (vs. 65 percent last year). And yet we're here to tell you Hadoop as we know it is dead. And that's not even news.

Hadoop has been constantly evolving, expanding, and re-inventing itself throughout its lifetime. A massive ecosystem has been developing around the initial bare-bones offering, and today Hadoop is more of a platform than "just" a storage and compute framework. The introduction of YARN was a game changer, enabling Hadoop to become a Big Data OS and to break away from its batch-oriented MapReduce origins.

In 2016, data and stories from the trenches all pointed in the same direction: batch, MapReduce Hadoop is dead, long live real-time, Spark Hadoop. 25 percent of organizations are using Spark in production today, with an additional 33 percent using it in development, and all major Hadoop vendors are involved in it. Adding those figures up suggests that by the end of 2017 up to 50 percent of organizations could be using Spark in production.
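For readers who want to check the arithmetic behind that projection, here is a minimal sketch. The 25 and 33 percent figures come from the text above; the conversion rate (the share of organizations now in development that reach production) is a hypothetical parameter introduced for illustration, not a survey figure.

```python
# Figures quoted in the text above.
SPARK_IN_PRODUCTION = 0.25   # share of organizations running Spark in production
SPARK_IN_DEVELOPMENT = 0.33  # share of organizations using Spark in development


def projected_production_share(conversion_rate: float) -> float:
    """Projected production share if `conversion_rate` of the development group ships.

    `conversion_rate` is an assumption made for this sketch, not a measured value.
    """
    if not 0.0 <= conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be between 0 and 1")
    return SPARK_IN_PRODUCTION + conversion_rate * SPARK_IN_DEVELOPMENT


def conversion_needed(target_share: float) -> float:
    """Conversion rate of the development group needed to reach `target_share`."""
    return (target_share - SPARK_IN_PRODUCTION) / SPARK_IN_DEVELOPMENT


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.75, 1.0):
        share = projected_production_share(rate)
        print(f"conversion {rate:.0%}: {share:.1%} in production")
    print(f"conversion needed for 50%: {conversion_needed(0.50):.0%}")
```

Under these assumptions the ceiling is 58 percent if every development project ships, and roughly three out of four would have to ship to reach 50 percent, which is why "up to" is doing real work in that sentence.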

But it's not necessarily a Spark-or-bust future: Spark is not the only streaming game in town, nor is Hadoop the only Big Data platform. Alternatives do exist, and users may migrate or leapfrog to them, skipping Spark or Hadoop altogether, the same way they are now migrating from or skipping MapReduce.
The Big Data landscape is host to a multitude of different approaches. But more and more it looks like everyone is adding everyone else's features. Convergence or me-too? Image: Martin Kleppmann.

Becoming all things to all men to save some
Spark can do both streaming and batch processing. And it can also do SQL, and graphs. And of course on Hadoop you can also do SQL and/or NoSQL in a number of other ways, utilizing a wide choice of tools. That's what being an ecosystem is all about, right? But then again, everyone seems to be at it these days.

NoSQL databases like Cassandra / DataStax Enterprise can now also do graph, in addition to key-value, tabular and document. What about the iconic NoSQL document store, MongoDB? Well, besides document, you can now also do SQL. Microsoft's SQL Server? Your average SQL server no more: it can run on Linux, and it supports R, in-memory processing and column store. MariaDB, the poor man's SQL server, also has its column store now.

Neo4J, the iconic graph store? It's going ACID. Google's BigQuery now supports standard SQL, joining Amazon Redshift, which has had it for a while as it's based on Postgres. Of course, analytics-oriented column stores have long supported SQL. And traditional relational database vendors like Oracle and IBM have been adding features like in-memory processing and column store for a while as well. Key-stores do it, document-stores do it, graph-stores do it, even SQL incumbents do it.

The boundaries are blurring, as more and more data platforms try to be more things to more people. Doing most everything on the same platform is good for vendors that want to increase their retention and good for users who don't want to have to mix and match disparate platforms to get things done. But it's not a sheer land-ho of opportunity: threats lie ahead too, most notably vendor lock-in, half-baked features, and half-hearted users.
Some are trying to get the basics right, while some are after up in the sky goals. Yet, there's a place for everyone under Big Data. Image: Martin Kleppmann

