Learn about BigBench, the new industrywide effort to create a sorely needed Big Data benchmark.
Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end Big Data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)
BigBench is a specification-based benchmark with an open-source reference implementation kit, which sets it apart from its predecessors. As a specification-based benchmark, it is technology-agnostic and provides the formalism and flexibility needed to support multiple implementations. As a “kit”, it lowers the barrier to entry by providing a readily available reference implementation as a starting point. As open source, it allows multiple implementations to coexist in one place and be reused by different vendors, while providing the consistency needed for meaningful comparisons.
The BigBench specification comprises two key components: a data model specification and a workload/query specification. The structured part of the BigBench data model is adopted from the TPC-DS data model, which depicts a product retailer selling products to customers via physical and online stores. BigBench’s schema uses the data of the store and web sales distribution channels and augments it with semi-structured and unstructured data, as shown in Figure 1.
Figure 1: BigBench data model specification
The data model specification is implemented by a data generator based on an extension of PDGF. Plugins enable PDGF to generate data for an arbitrary schema; using the BigBench plugin, data can be generated for all three parts of the schema: structured, semi-structured, and unstructured.
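PDGF itself is a Java-based generator, and its plugin API is not shown here. The following purely illustrative Python sketch demonstrates the key property the BigBench plugin must preserve: the semi-structured click logs and unstructured reviews reference the same keys as the structured sales rows. All names are hypothetical.

```python
import json
import random

random.seed(42)  # deterministic output, mirroring PDGF's repeatable generation

def gen_structured(n):
    """Structured part: rows for a TPC-DS-style store_sales table."""
    return [{"item_id": i, "customer_id": random.randrange(1000),
             "quantity": random.randrange(1, 10)} for i in range(n)]

def gen_semi_structured(sales):
    """Semi-structured part: web-click logs keyed to the same customers."""
    return [json.dumps({"customer_id": s["customer_id"],
                        "page": f"/item/{s['item_id']}",
                        "dwell_ms": random.randrange(100, 5000)})
            for s in sales]

def gen_unstructured(sales):
    """Unstructured part: free-text product reviews referencing real items."""
    moods = ["great", "poor", "average"]
    return [f"Item {s['item_id']} was {random.choice(moods)}." for s in sales]

sales = gen_structured(5)
clicks = gen_semi_structured(sales)
reviews = gen_unstructured(sales)
print(sales[0], clicks[0], reviews[0], sep="\n")
```

Referential consistency across the three parts is what lets BigBench queries join click logs and reviews back to the structured sales data.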
The BigBench 1.0 workload specification consists of 30 queries/workloads. Ten of these queries were taken from the TPC-DS workload and run against the structured part of the schema. The remaining 20 were adapted from a McKinsey report on Big Data use cases and opportunities; of these, seven run against the semi-structured portion of the schema and five against the unstructured portion. The reference implementation of the workload specification is available here.
The BigBench 1.0 specification includes a set of metrics (focused on execution-time measurement) and multiple execution modes. The metrics can be reported for the end-to-end execution pipeline as well as for each individual workload/query. The benchmark also defines a model for submitting concurrent workload streams in parallel, which can be extended to simulate multi-user scenarios.
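As a rough illustration (not the specification's formal metric definition), the following Python sketch shows both reporting granularities, per query and end to end, along with a simple thread-based model of concurrent workload streams; `run_query` is a hypothetical stand-in for submitting a query to the engine under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(query_id):
    """Stand-in for submitting one BigBench query to the engine under test."""
    start = time.perf_counter()
    time.sleep(0.01 * query_id)  # placeholder for real query execution
    return query_id, time.perf_counter() - start

def run_stream(stream_id, query_ids):
    """One workload stream executes all queries sequentially."""
    return stream_id, [run_query(q) for q in query_ids]

queries = range(1, 31)  # the 30 BigBench queries
streams = 4             # simulated concurrent users

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=streams) as pool:
    results = list(pool.map(lambda s: run_stream(s, queries), range(streams)))
end_to_end = time.perf_counter() - wall_start

# Per-query times roll up to per-stream totals and one end-to-end number.
for stream_id, timings in results:
    total = sum(secs for _, secs in timings)
    print(f"stream {stream_id}: {total:.2f}s across {len(timings)} queries")
print(f"end-to-end (all {streams} streams): {end_to_end:.2f}s")
```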
BigBench has some way to go before it can be declared complete. A work-in-progress paper about BigBench, co-authored by various industry and academic experts, discusses the reference implementation, community feedback on what is done well, and shortcomings of the 1.0 specification and implementation. These concerns are addressed in the form of proposed extensions for BigBench 2.0, some of which are described below.
The current specification, while representative of a wide variety of Big Data use cases, falls short of being complete, primarily because it is structured-data-intensive: semi-structured and unstructured content has only limited representation, and even that content is converted into structured form before being processed. Adding more procedural and analytic workloads that perform complex operations directly on unstructured data would eliminate this lopsidedness. Closely related is a gap in the data model specification, which currently does not state the rate of input-data generation and refresh, thereby excluding streaming workloads from the current specification.
The specification also needs to be extended to require that all file formats be flexible enough to be created, read, and written by multiple popular data processing engines (MapReduce, Apache Spark, Apache Hive, and so on). This capability would ensure that all data is immediately queryable, with no ETL or format-conversion delays.
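As a minimal sketch of that round-trip property, assuming a local PySpark installation and a hypothetical /tmp/store_sales path: one engine writes a shared columnar format, and any other engine pointed at the same files sees identical rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-roundtrip").getOrCreate()

# Write sales data as Parquet, a format that Hive, Spark, and MapReduce
# input formats can all read without any conversion step.
df = spark.createDataFrame(
    [(1, 100, 3), (2, 101, 1)], ["item_id", "customer_id", "quantity"])
df.write.mode("overwrite").parquet("/tmp/store_sales")

# Any engine pointed at the same directory sees the same rows; e.g. Hive
# could expose it via CREATE EXTERNAL TABLE ... STORED AS PARQUET.
print(spark.read.parquet("/tmp/store_sales").count())
spark.stop()
```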
The current set of metrics excludes performance per cost. As our experiments show, this metric is critical for comparing software systems in the presence of hardware diversity. Performance in the presence of failures is another important metric currently missing from the specification. On the implementation side, the kit needs to be enhanced with implementations of existing and proposed queries on popular data processing engines such as Impala, Spark, and MapReduce, in addition to Hive.
Intel has evaluated a subset of BigBench 1.0 workloads against multiple hardware configurations. The goals of these experiments were to validate the reference kit on real clusters and to understand the price/performance tradeoffs of different cluster configurations.
We selected four queries (2, 16, 20, 30), each representing a different processing model (for query descriptions, see https://github.com/intel-hadoop/Big-Bench). A common input dataset of 2TB was used; the input sizes for queries 2, 16, 20, and 30 were 244GB, 206GB, 144GB, and 244GB, respectively. We used Intel’s Performance Analysis Tool (PAT) to collect and analyze data. (PAT automates benchmark execution and data collection and makes use of Linux performance counters.)
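PAT automates this style of collection across cluster nodes; the simplified Python stand-in below (not PAT's actual interface) shows the idea of sampling Linux counters while a workload runs. The driver script name is hypothetical.

```python
import subprocess
import time

def sample_counters():
    """One sample of CPU and disk utilization via sar/iostat, similar to
    the Linux performance counters PAT aggregates."""
    cpu = subprocess.run(["sar", "-u", "1", "1"],
                         capture_output=True, text=True).stdout
    disk = subprocess.run(["iostat", "-dx", "1", "1"],
                          capture_output=True, text=True).stdout
    return time.time(), cpu, disk

# Launch the workload under test (run_query.sh is a hypothetical driver
# script), then sample counters until it exits.
proc = subprocess.Popen(["./run_query.sh", "16"])
samples = []
while proc.poll() is None:
    samples.append(sample_counters())
print(f"collected {len(samples)} samples; exit code {proc.returncode}")
```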
The table below shows the three configurations under experimentation. Each cluster contained 10 nodes, and each node used an 800GB Intel DC3500 series SSD as the boot drive.

| | Configuration 1 | Configuration 2 | Configuration 3 |
|---|---|---|---|
| Processor | Intel Xeon E5-2680 v3 (12 cores, 30MB cache, 2.5GHz, 3.3GHz Turbo, 120W TDP) | Intel Xeon E5-2697 v3 (14 cores, 35MB cache, 2.6GHz, 3.6GHz Turbo, 145W TDP) | Intel Xeon E5-2699 v3 (18 cores, 45MB cache, 2.3GHz, 3.6GHz Turbo, 145W TDP) |
| Memory | 128GB | 128GB | 192GB |
| Primary storage | 12x 2TB SATA HDD | 12x 2TB SATA HDD + 2TB Intel DC3700 SSD | 2x 2TB Intel SSD |
| Relative cost | 1x | ~1.5x | ~2x |

Configuration 2 adds a 2TB Intel DC3700 SSD as primary storage alongside the hard-disk drives. Concerns about the endurance and affordability of large-capacity NAND drives have prevented customers from embracing SSD technology for their Hadoop clusters; this Cloudera blog post explains the merits of using SSDs for performance and their drawbacks in terms of cost. However, SSD technology has advanced significantly since that post was published: capacity and endurance have improved, and the price of the Intel DC3600 SSD is now one-third the price of the SSD reported in that post. Configuration 3 goes further, replacing the hard-disk drives with a second SSD and increasing memory from 128GB to 192GB.
As shown in the table above, Configuration 2 costs about 1.5x as much as Configuration 1, and Configuration 3 costs around 2x as much. A summary of results from the experiments is shown in Figure 2.
Figure 2: BigBench testing results
For workloads #16 and #30, the performance gains from Configurations 2 and 3 are proportional to the cost of the hardware: the customer pays 2x to get 2x performance, so the performance-per-dollar ratio is close to 1. For workloads #2 and #20, however, the performance-per-dollar ratio is less than 1. From these results we conclude that Configuration 1 is a good choice in all cases, whereas the scaled-up Configurations 2 and 3 make sense for certain types of workloads, especially those that are disk-IO intensive.
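As a concrete illustration of the performance-per-dollar calculation (with hypothetical speedup numbers, since Figure 2 holds the measured values), the metric is simply relative speedup divided by relative cost.

```python
# Relative costs come from the configuration table above; the speedups
# below are hypothetical placeholders for the measured values in Figure 2.
relative_cost = {"config1": 1.0, "config2": 1.5, "config3": 2.0}
speedup = {
    "q16": {"config1": 1.0, "config2": 1.5, "config3": 2.0},  # scales with cost
    "q2":  {"config1": 1.0, "config2": 1.2, "config3": 1.4},  # scales below cost
}

for query, configs in speedup.items():
    for cfg, s in configs.items():
        perf_per_dollar = s / relative_cost[cfg]
        print(f"{query} on {cfg}: perf/$ = {perf_per_dollar:.2f}")
```

A ratio of 1 means the extra hardware pays for itself; below 1, the cheaper baseline configuration wins on price/performance.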
We monitored CPU utilization, disk bandwidth, network IO, and memory usage for each workload (using PAT). Most of the gains come from the use of SSDs. With SSDs in place, the workloads tend to become CPU-bound, at which point an increase in CPU cores and frequency starts to help. For in-memory data processing engines (Spark and Impala), an increase in memory size is likely to be the most important factor. We hope to cover that issue in a future study.
BigBench is an industrywide effort to create a comprehensive and standardized Big Data benchmark. Intel and Cloudera are working on defining and implementing extensions to BigBench (in the form of BigBench 2.0). A preliminary validation against multiple cluster configurations using the latest Intel hardware shows that, from a price-performance viewpoint, scaled-up configurations (using SSDs and high-end Intel processors) are beneficial for workloads that are disk-IO bound.
BigBench is a joint effort with partners in industry and academia. The authors would like to thank Chaitan Baru, Milind Bhandarkar, Alain Crolotte, Carlo Curino, Manuel Danisch, Michael Frank, Ahmed Ghazal, Minqing Hu, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar, Raghu Nambiar, Meikel Poess, Francois Raab, Tilmann Rabl, Kai Sachs, Saptak Sen, Lan Yi, and Choonhan Youn. We invite the rest of the community to participate in the development of the BigBench 2.0 kit.