
The Data Scientist Guide with Links

Frameworks
Apache Hadoop	framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system)	Apache Hadoop
Distributed Programming
AddThis Hydra	Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).	Github
Akela	Mozilla’s utility library for Hadoop, HBase, Pig, etc.	Website
AMPLab SIMR	Apache Spark was developed with Apache YARN in mind. However, until now it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time-consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes.	SIMR on GitHub
AMPLab Succinct	Enabling Queries on Compressed Data	Website
Apache Crunch	is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.	Website
Apache DataFu	DataFu provides a collection of Hadoop MapReduce jobs and functions in higher-level languages built on MapReduce to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. Its collection of Pig UDFs was originally developed at LinkedIn.	1. DataFu Apache Incubator 2. LinkedIn DataFu
Apache Flink	high-performance runtime, and automatic program optimization	Website
Apache Gora	framework for in-memory data model and persistence	Apache Gora
Apache Hama	Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques, such as machine learning and graph algorithms, require iterative computations; this is where the Bulk Synchronous Parallel model can be more effective than “plain” MapReduce.	Hama site
Apache MapReduce	MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from Google’s paper “MapReduce: Simplified Data Processing on Large Clusters”. The current Apache MapReduce version is built on the Apache YARN framework. YARN stands for “Yet-Another-Resource-Negotiator”. It is a newer framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN’s execution model is more generic than the earlier MapReduce implementation: YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.	1. Apache MapReduce 2. Google MapReduce paper 3. Writing YARN applications
Apache Pig	Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.	1. pig.apache.org 2. Pig examples by Alan Gates
Apache S4	S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.	Apache S4
Apache Spark	Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier to use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous generation systems like Hadoop MapReduce for certain applications.	Apache Incubator Spark
Apache Spark Streaming	framework for stream processing, part of Spark	Apache Spark Streaming
Apache Storm	Storm is a complex event processor and distributed computation framework written predominantly in the Clojure programming language. It is a distributed real-time computation system for processing fast, large streams of data. Storm’s architecture is based on the master-workers paradigm, so a Storm cluster mainly consists of a master and worker nodes, with coordination done by ZooKeeper.	1. Storm Project 2. Storm-on-YARN
Apache Tez	Tez is a proposal to develop a generic application which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN.	Apache Tez
Apache Twill	Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster.	Apache Twill Incubator
Cascalog	data processing and querying library	Cascalog
Cheetah	High Performance, Custom Data Warehouse on Top of MapReduce	Paper
Concurrent Cascading	Application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.	Cascading
Damballa Parkour	Library for developing MapReduce programs using the Lisp-like language Clojure. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce.	Parkour GitHub Project
Datasalt Pangool	A new MapReduce paradigm. A new API for MR jobs, in higher level than Java.	Website
DataTorrent StrAM	real-time engine is designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.	Website
DistributedR	scalable high-performance platform for the R language	Website
eBay Oink	REST based interface for PIG execution	1. Website 2. Website
Facebook Corona	“The next version of Map-Reduce” from Facebook, based on its own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly customised nature of the company’s deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona’s case).	Website
Facebook Peregrine	Map Reduce framework	Facebook Peregrine
Facebook Scuba	distributed in-memory datastore	Website
Geotrellis	geographic data processing engine for high performance applications	1. Website 2. Website
GIS Tools for Hadoop	Big Data Spatial Analytics for the Hadoop Framework	Website
Google Dataflow	create data pipelines to help them ingest, transform and analyze data	Website
Google MapReduce	map reduce framework	Website
Google MillWheel	fault tolerant stream processing framework	Website
HParser	data parsing transformation environment optimized for Hadoop	Website
IBM Streams	advanced analytic platform that allows user-developed applications to quickly ingest, analyze and correlate information as it arrives from thousands of real-time sources	Website
JAQL	JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A “SQL within JAQL” capability lets programmers work with structured SQL data while employing a JSON data model that’s less restrictive than its Structured Query Language counterparts.	1. JAQL in Google Code 2. What is Jaql? by IBM
Kite	is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.	Website
Kryo	Java serialization and cloning: fast, efficient, automatic	Website
Lipstick	Pig workflow visualization tool	Website
Metamarkets Druid	Realtime analytical data store.	Druid
Netflix Aegisthus	Bulk Data Pipeline out of Cassandra. implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family	Website
Netflix Lipstick	Pig Visualization framework	Website
Netflix Mantis	Event Stream Processing System	Website
Netflix PigPen	PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming language created by Rich Hickey; it is a functional general-purpose language that runs on the Java Virtual Machine, Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool is open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.	PigPen on GitHub
Netflix STAASH	language-agnostic as well as storage-agnostic web interface for storing data into persistent storage systems	Website
Netflix Zeno	Netflix’s In-Memory Data Propagation Framework	Website
Nokia Disco	MapReduce framework developed by Nokia	Nokia Disco
PigPen	PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it	Website
Pinterest Pinlater	asynchronous job execution system	Website
Pydoop	Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third-party modules, some of which may not be available otherwise.	1. SF Pydoop site 2. Pydoop GitHub Project
ScaleOut hServer	fast, scalable in-memory data grid for Hadoop	Website
SeqPig	Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop	Website
SigmoidAnalytics Spork	Pig on Apache Spark	Website
SpatialHadoop	SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.	Website
Spring for Apache Hadoop	unified configuration model and easy to use APIs for using HDFS, MapReduce, Pig, and Hive	Website
SQLStream Blaze	stream processing platform	Website
Stratio Streaming	the union of a real-time messaging bus with a complex event processing engine using Spark Streaming	Website
Stratosphere	Stratosphere is a general-purpose cluster computing framework. It is compatible with the Hadoop ecosystem: Stratosphere can access data stored in HDFS and runs with Hadoop’s new cluster manager YARN. The common input formats of Hadoop are supported as well. Stratosphere does not use Hadoop’s MapReduce implementation: it is a completely new system that brings its own runtime. The new runtime allows you to define more advanced operations that include more transformations than just map and reduce. Additionally, Stratosphere allows you to express analysis jobs using advanced data flow graphs, which are able to represent common data analysis tasks more naturally.	Stratosphere site
Streamdrill	useful for counting activities of event streams over different time windows and finding the most active one	Website
Teradata QueryGrid	data-access layer that can orchestrate multiple modes of analysis across multiple databases plus Hadoop	Website
TIBCO ActiveSpaces	in-memory data grid	Website
Torch	Scientific computing for LuaJIT	Website
Twitter Scalding	Scala library for Map Reduce jobs, built on Cascading	Twitter Scalding
Twitter Summingbird	a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird.	Summingbird
Twitter TSAR	TimeSeries AggregatoR by Twitter	1. Website 2. Website
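The MapReduce programming model described in the Apache MapReduce entry above can be illustrated with a small single-process sketch. This is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and a real framework would run the map and reduce steps in parallel across a cluster and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be distributed without the user code changing, which is the property that tools in this section such as Pig, Cascading and Scalding build on.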
Distributed Filesystem
Apache HDFS	The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.	1. hadoop.apache.org 2. Google FileSystem – GFS Paper 3. Cloudera Why HDFS 4. Hortonworks Why HDFS
BeeGFS	formerly FhGFS, parallel distributed file system	Website
Ceph Filesystem	Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph’s main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely-available. The data is replicated, making it fault tolerant. The problem right now is Ceph currently requires Hadoop 1.1.X stable series.	1. Ceph Filesystem site 2. Ceph and Hadoop 3. HADOOP-6253
Disco DDFS	distributed filesystem	Website
Facebook Haystack	object storage system	Facebook Haystack
Google Colossus	distributed filesystem (GFS2)	Website
Google GFS	distributed filesystem	Website
Google Megastore	scalable, highly available storage	Website
GridGain	GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and Map/Reduce by bringing both data and computations into memory. This work is done with GGFS, a Hadoop-compliant in-memory file system. For I/O-intensive jobs, GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon: GGFS allows read-through and write-through to/from underlying HDFS or any other Hadoop-compliant file system with zero code change. Essentially GGFS entirely removes the ETL step from integration. GGFS has the ability to pick and choose what folders stay in memory, what folders stay on disk, and what folders get synchronized with the underlying (HD)FS either synchronously or asynchronously. GridGain is working on adding a native MapReduce component which will provide complete native Hadoop integration without changes in API, like Spark currently forces you to do. Essentially GridGain MR+GGFS will allow bringing Hadoop completely or partially in-memory in Plug-n-Play fashion without any API changes.	GridGain site
HDFS-DU	HDFS-DU is an interactive visualization of the Hadoop distributed file system.	Website
Lustre file system	The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI.	1. wiki.lustre.org/ 2. Hadoop with Lustre 3. Intel HPC Hadoop
Netflix S3mper	library that provides an additional layer of consistency checking on top of Amazon’s S3 index through use of a consistent, secondary index	Website
Quantcast File System QFS	(QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction as its method for assuring reliable access to data.	1. QFS site 2. GitHub QFS 3. HADOOP-8885
Red Hat GlusterFS	GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially supported integration of GlusterFS with Red Hat Enterprise Linux. Gluster File System is now known as Red Hat Storage Server.	1. www.gluster.org 2. Red Hat Hadoop Plugin
Tachyon	Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.	Tachyon site
Key-Map Data Model
Actian Vector	column-oriented analytic database	Actian website
Apache Accumulo	A distributed key/value store that provides robust, scalable, high-performance data storage and retrieval. Apache Accumulo is based on Google’s BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Accumulo was created by the NSA and includes security features.	Apache Accumulo
Apache Cassandra	Distributed NoSQL DBMS; it is a BDDB (Big Data Data Base). MapReduce can retrieve data from Cassandra. This BDDB can run without HDFS, or on top of HDFS (DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper Google published in 2003, and the BigTable paper published in 2006). Cassandra, on the other hand, is a recent open source fork of a standalone database system initially coded by Facebook, which, while implementing the BigTable data model, uses a system inspired by Amazon’s Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).	Apache Cassandra
Apache HBase	Inspired by Google BigTable. Non-relational distributed database. Random, real-time read/write operations on very large column-oriented tables (BDDB: Big Data Data Base). It is the Hadoop database and serves as the backing system for the outputs of Hadoop MapReduce jobs, which can be stored in Apache HBase tables.	Apache HBase
Facebook HydraBase	Evolution of HBase made by Facebook	Blog Post on Facebook engineer
Google BigTable	column-oriented distributed datastore	Google BigTable
Google Cloud Datastore	is a fully managed, schemaless database for storing non-relational data built on top of Google’s BigTable infrastructure	1. Google Cloud Datastore site 2. Google App Engine Datastore 3. Mastering Datastore
Hypertable	Database system inspired by publications on the design of Google’s BigTable. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sponsored by Baidu, the Chinese search engine.	HyperTable
InfiniDB	is accessed through a MySQL interface and use massive parallel processing to parallelize queries	Website
Netflix Priam	Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra	Website
OhmData C5	improved version of HBase	OhmData website
Sqrrl	NoSQL databases on top of Apache Accumulo	Website
Tephra	Transactions for HBase	Website
Twitter Manhattan	real-time, multi-tenant distributed database for Twitter scale	Blog post on Twitter Engineering blog
Document Data Model
Actian Versant	commercial object-oriented database management systems	Website
Crate Data	is an open source massively scalable data store. It requires zero administration.	Website
Facebook Apollo	Facebook’s Paxos-like NoSQL database	1. infoQ post 2. Website
jumboDB	document oriented datastore over Hadoop	jumboDB
LinkedIn Espresso	horizontally scalable document-oriented NoSQL data store	LinkedIn Espresso
MarkLogic	Schema-agnostic Enterprise NoSQL database technology	Website
Microsoft DocumentDB	fully-managed, highly-scalable, NoSQL document database service	Website
MongoDB	Document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a “classical” relational database, MongoDB stores structured data as JSON-like documents	Mongodb site
RavenDB	A transactional, open-source Document Database	Website
RethinkDB	RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to setup and learn.	RethinkDB site
TokuMX	High-Performance MongoDB Distribution	Website
Key-value Data Model
Aerospike	NoSQL flash-optimized, in-memory. Open source and “Server code in ‘C’ (not Java or Erlang) precisely tuned to avoid context switching and memory copies.”	Website
Amazon DynamoDB	distributed key/value store, implementation of Dynamo	Amazon DynamoDB
Edis	Edis is a protocol-compatible Server replacement for Redis, written in Erlang. Edis’s goal is to be a drop-in replacement for Redis when persistence is more important than holding the dataset in-memory. Edis (currently) uses Google’s leveldb as a backend. Future plans call for a multi-master clustering model. Near term goals are to act as a read-slave for existing Redis servers.	Website
ElephantDB	Distributed database specialized in exporting data from Hadoop	ElephantDB
EventStore	An open-source, functional database with support for Complex Event Processing. It provides a persistence engine for applications using event-sourcing, or for storing time-series data. The Event Store server is written in C# and C++ and runs on Mono or the .NET CLR, on Linux or Windows. Applications using Event Store can be written in JavaScript.	1. EventStore 2. Website
HyperDex	next generation key-value store	Website
LinkedIn Krati	is a simple persistent data store with very low latency and high throughput. It is designed for easy integration with read-write-intensive applications with little effort in tuning configuration, performance and JVM garbage collection.	Website
LinkedIn Voldemort	Distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage.	LinkedIn Voldemort
Oracle NoSQL Database	distributed key-value database by Oracle Corporation	Website
Redis	Redis is an open-source, networked, in-memory, key-value data store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Pivotal and VMware. It’s BSD licensed.	1. Redis.io 2. Website
Redis Sentinel	system designed to help managing Redis instances	Website
Riak	a decentralized datastore.	Website
Storehaus	library to work with asynchronous key value stores, by Twitter	Storehaus
Tarantool	an efficient NoSQL database and a Lua application server.	Website
TreodeDB	key-value store that’s replicated and sharded and provides atomic multirow writes	Website
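The Redis entry above describes a data model that is a dictionary mapping keys to values, where values may be abstract data types rather than only strings. A toy sketch of that idea follows. The class `TinyKeyValueStore` and its method names are hypothetical and are not the Redis API or protocol; the sketch has no persistence, networking or type-conflict handling.

```python
class TinyKeyValueStore:
    """Toy in-memory dictionary of keys to typed values (illustration only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        """Store a plain string value under a key."""
        self._data[key] = str(value)

    def get(self, key):
        """Return the value for a key, or None if the key is absent."""
        return self._data.get(key)

    def push(self, key, item):
        """Append to a list value, creating it on first use; return the new length."""
        items = self._data.setdefault(key, [])
        items.append(item)
        return len(items)

    def add_member(self, key, member):
        """Add to a set value, creating it on first use; return True if the member was new."""
        members = self._data.setdefault(key, set())
        is_new = member not in members
        members.add(member)
        return is_new


store = TinyKeyValueStore()
store.set("greeting", "hello")
store.push("queue", "job-1")
queue_length = store.push("queue", "job-2")
first_add = store.add_member("tags", "hadoop")
second_add = store.add_member("tags", "hadoop")
print(store.get("greeting"), queue_length, first_add, second_add)
```

Keeping the type-specific operations on the server side is what lets a client append to a list or add to a set in one call instead of reading, modifying and rewriting the whole value.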
Graph Data Model
Apache Giraph	Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google	Apache Giraph
Apache Spark Bagel	implementation of Pregel, part of Spark	Apache Spark Bagel
ArangoDB	An open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient sql-like query language or JavaScript extensions.	ArangoDB site
Facebook TAO	TAO is the distributed data store that is widely used at Facebook to store and serve the social graph. The entire architecture is highly read-optimized, supports a graph data model and works across multiple geographical regions.	Post about TAO
Faunus	Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster	Website
Google Cayley	open-source graph database.	Website
Google Pregel	graph processing framework	Website
GraphLab PowerGraph	a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. In addition, we are actively developing new interfaces to allow users to leverage the GraphLab API from other languages and technologies.	Graphlab website
GraphX	A Resilient Distributed Graph System on Spark	GraphX
Gremlin	graph traversal Language.	Website
InfiniteGraph	distributed graph database	Website
Infovore	RDF-centric Map/Reduce framework	Website
Intel GraphBuilder	library which provides tools to construct large-scale graphs on top of Apache Hadoop	Website
MapGraph	Massively Parallel Graph processing on GPUs	Website
Neo4j	An open-source graph database written entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.	Neo4j site
OrientDB	It is an Open Source NoSQL DBMS with the features of both Document and Graph DBMSs. Written in Java, it is incredibly fast: it can store up to 150,000 records per second on common hardware.	OrientDB site
Phoebus	framework for large scale graph processing	Phoebus
Sparksee	scalable high-performance graph database	Website
Titan	distributed graph database, built over Cassandra	Titan
Twitter FlockDB	distributed graph database	Twitter FlockDB
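Several entries in this section (Apache Giraph, Apache Spark Bagel, Google Pregel) refer to Pregel-style iterative graph processing. A minimal single-process sketch of that vertex-centric, superstep-based style is shown below, using the common teaching example of propagating the maximum value through a graph. The function `pregel_max` and its parameters are assumptions made for illustration and do not correspond to the Giraph or Pregel APIs; the sketch also assumes every vertex appears as a key in `graph` and that edges are listed in both directions.

```python
def pregel_max(graph, values, max_supersteps=30):
    """Propagate the largest vertex value to every reachable vertex.

    graph:  dict mapping each vertex to a list of neighbor vertices
    values: dict mapping each vertex to its initial numeric value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to all neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    supersteps = 1
    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: the vertex stays halted
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(best)
        inbox = outbox
        supersteps += 1
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = pregel_max(graph, {"a": 3, "b": 6, "c": 2})
print(result)
```

The computation stops when no vertex sends a message in a superstep, which is the usual termination rule in this model; distributed systems apply the same loop with vertices partitioned across machines and a barrier between supersteps.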
NewSQL Databases
Actian Ingres	commercially supported, open-source SQL relational database management system	Website
BayesDB	BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.	BayesDB site
Cockroach	Scalable, Geo-Replicated, Transactional Datastore	Website
Datomic	distributed database designed to enable scalable, flexible and intelligent applications.	Website
FoundationDB	distributed database, inspired by F1, acquired Akiban Server	1. FoundationDB 2. Akiban Server
Google F1	distributed SQL database built on Spanner	Website
Google Spanner	globally distributed semi-relational database	Website
H-Store	is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications. It is a highly distributed, row-store-based relational database that runs on a cluster of shared-nothing, main-memory executor nodes.	Brown project website
HandlerSocket	HandlerSocket is a NoSQL plugin for MySQL/MariaDB. It works as a daemon inside the mysqld process, accepting TCP connections, and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead.	Website
IBM DB2	object-relational database management system	Website
InfiniSQL	infinitely scalable RDBMS	InfiniSQL
MemSQL	in-memory SQL database with optimized columnar storage on flash	MemSQL site
NuoDB	SQL/ACID compliant distributed database	NuoDB
Oracle Database	object-relational database management system	Website
Oracle TimesTen in-Memory Database	in-memory, relational database management system with persistence and recoverability	Website
Pivotal GemFire XD	Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.	Website
SAP HANA	is an in-memory, column-oriented, relational database management system	Website
SenseiDB	Open-source, distributed, realtime, semi-structured database. Some Features: Full-text search, Fast realtime updates, Structured and faceted search, BQL: SQL-like query language, Fast key-value lookup, High performance under concurrent heavy update and query volumes, Hadoop integration	SenseiDB site
Sky	Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.	SkyDB site
SymmetricDS	SymmetricDS is open source software for both file and database synchronization with support for multi-master replication, filtered synchronization, and transformation across the network in a heterogeneous environment. It supports multiple subscribers with one direction or bi-directional, asynchronous data replication. It uses web and database technologies to replicate data as a scheduled or near real-time operation. The software was designed to scale for a large number of nodes, work across low-bandwidth connections, and withstand periods of network outage. It works with most operating systems, file systems, and databases, including Oracle, MySQL, MariaDB, PostgreSQL, MS SQL Server (including Azure), IBM DB2, H2, HSQLDB, Derby, Firebird, Interbase, Informix, Greenplum, SQLite (including Android), Sybase ASE, and Sybase ASA (SQL Anywhere) databases.	SymmetricDS
Teradata Database	complete relational database management system	Website
VoltDB	in-memory NewSQL database	Website
Columnar Databases
Amazon RedShift	data warehouse service, based on PostgreSQL	Amazon RedShift
C-Store	column oriented DBMS	Website
Google BigQuery	framework for interactive analysis, implementation of Dremel	Google BigQuery
Google Dremel	framework for interactive analysis	Dremel Paper
MonetDB	column store database	Website
Parquet	columnar storage format for Hadoop.	Parquet
Pivotal Greenplum	purpose-built, dedicated analytic data warehouse	Website
Vertica	The grid-based, column-oriented, Vertica Analytics Platform is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses and other query-intensive applications. The product claims to drastically improve query performance over traditional relational database systems, provide high-availability, and petabyte scalability on commodity enterprise servers.	Website
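The entries in this section describe column-oriented storage without showing what it means for data layout. The sketch below is a rough illustration under simple assumptions: the sample records and the helper `to_columns` are invented for this example and do not reflect the on-disk format of any product listed here. It pivots row-oriented records into one list per column, so an aggregate over a single column touches only that column's values.

```python
rows = [
    {"user": "ann", "country": "DE", "amount": 12.5},
    {"user": "bob", "country": "US", "amount": 7.0},
    {"user": "cy", "country": "DE", "amount": 3.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict holding one list per column."""
    columns = {}
    for name in records[0]:
        columns[name] = [record[name] for record in records]
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's list,
# instead of visiting every field of every row.
total = sum(columns["amount"])
print(columns["country"], total)
```

Storing the values of one column together also tends to make compression more effective, since neighboring values share a type and often repeat, which is one reason analytic workloads favor this layout.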
Time-Series Databases
Cube	uses MongoDB to store time series data	Website
InfluxDB	InfluxDB is an open source distributed time series database with no external dependencies. It’s useful for recording metrics, events, and performing analytics. It has a built-in HTTP API so you don’t have to write any server side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out. It aims to answer queries in real-time. That means every data point is indexed as it comes in and is immediately available in queries that should return in	Website
KairosDB	similar to OpenTSDB but can use Cassandra as its datastore	Website
OpenTSDB	OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.	1. OpenTSDB site 2. Website
SQL-like processing
Actian SQL for Hadoop	high performance interactive SQL access to all Hadoop data	Website
AMPLAB Shark	Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. Shark is built on top of Spark	AMPLAB on GitHub Shark
Apache Drill	Drill is the open source version of Google’s Dremel system which is available as an infrastructure service called Google BigQuery. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need	Apache Drill
Apache HCatalog	HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive. Only old versions are separated for download.	Apache HCatalog
Apache Hive	Data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides a SQL-like language (not SQL-92 compliant): HiveQL.	Apache Hive
Apache Optiq	framework that allows efficient translation of queries involving heterogeneous and federated data	Website
Apache Phoenix	Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.	Apache Phoenix site
BlinkDB	massively parallel, approximate query engine	BlinkDB
Cloudera Impala	The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It is a clone of Google Dremel (the system behind Google BigQuery).	1. Website 2. Cloudera Impala
Concurrent Lingual	Open source project enabling fast and simple Big Data application development on Apache Hadoop. It delivers ANSI-standard SQL technology to easily build new applications and integrate existing ones onto Hadoop.	Cascading Lingual
Datasalt Splout SQL	Splout allows serving an arbitrarily big dataset with high QPS rates and at the same time provides full SQL query syntax.	Website
Facebook PrestoDB	Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.	Facebook PrestoDB
JethroData	index-based SQL engine for Hadoop	Website
Metanautix Quest	data compute engine	Website
Pivotal HAWQ	SQL-like data warehouse system for Hadoop	Pivotal HAWQ
RainstorDB	database for storing petabyte-scale volumes of structured and semi-structured data	Website
Spark Catalyst	Catalyst is a Query Optimization Framework for Spark and Shark	Github sub page
SparkSQL	Manipulating Structured Data Using Spark	Databricks blog post
Splice Machine	a full-featured SQL-on-Hadoop RDBMS with ACID transactions	Website
Stinger	interactive query for Hive	Stinger
Tajo	Tajo is a distributed data warehouse system on Hadoop that provides low-latency and scalable ad-hoc queries and ETL on large-data sets stored on HDFS and other data sources.	Tajo site
Trafodion	enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads	Website
Integrated Development Environments
R-Studio	IDE for R.	Website
Data Ingestion
Amazon Kinesis	Real-time processing of streaming data at massive scale	Amazon Kinesis
Apache Chukwa	Large scale log aggregator, and analytics.	Apache Chukwa
Apache Flume	Unstructured data aggregator that delivers to HDFS.	Apache Flume
Apache Samza	Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Developed at LinkedIn (see http://www.linkedin.com/in/jaykreps).	Apache Samza
Apache Sqoop	System for bulk data transfer between HDFS and structured datastores such as RDBMSs. Similar in role to Flume, but moves data between HDFS and an RDBMS.	Apache Sqoop
Apache UIMA	Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user	Website
Cloudera Morphlines	framework that helps with ETL into Solr, HBase and HDFS.	Website
Facebook Scribe	Real-time log aggregator. It is an Apache Thrift service.	Facebook Scribe
Fluentd	tool to collect events and logs	Fluentd
Google Photon	geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency	Website
Heka	open source stream processing software system.	Website
HIHO	This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMS and file systems, so that data can be loaded to Hadoop and unloaded from Hadoop	Website
LinkedIn Databus	stream of change capture events for a database	LinkedIn Databus
LinkedIn Kamikaze	utility package for compressing sorted integer arrays	LinkedIn Kamikaze
LinkedIn White Elephant	log aggregator and dashboard	LinkedIn White Elephant
Logstash	a tool for managing events and logs.	Website
Netflix Suro	Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is a log aggregator, in the same space as Storm and Samza.	Website
Pinterest Secor	is a service implementing Kafka log persistence	Github
Record Breaker	Automatic structure for your text-formatted data	Website
TIBCO Enterprise Message Service	standards-based messaging middleware	Website
Twitter Zipkin	distributed tracing system that helps us gather timing data for all the disparate services at Twitter	Website
Vibe Data Stream	streaming data collection for real-time Big Data analytics	Website
Message-oriented middleware
ActiveMQ	open source messaging and Integration Patterns server	Website
Amazon Simple Queue Service	fast, reliable, scalable, fully managed queue service	Website
Apache Kafka	Distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a Message Queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, it has the interesting ability for clients to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (which was acquired by Twitter a year ago), is more about transforming a stream of messages into new streams.	Apache Kafka
Apache Qpid	messaging tools that speak AMQP and support many languages and platforms	Website
Apollo	ActiveMQ’s next generation of messaging	Website
Beanstalkd	simple, fast work queue	Website
Bit.ly NSQ	realtime distributed message processing at scale	Website
Celery	Distributed Task Queue	Website
Crossroads I/O	library for building scalable and high performance distributed applications	Website
Darner	simple, lightweight message queue	Website
Gearman	Job Server	Website
HornetQ	open source project to build a multi-protocol, embeddable, very high performance, clustered, asynchronous messaging system	Website
IronMQ	easy-to-use highly available message queuing service	Website
Kestrel	distributed message queue system	Kestrel
Marconi	queuing and notification service made by and for OpenStack, but not only for it	Website
RabbitMQ	Robust messaging for applications	Website
RestMQ	message queue which uses HTTP as transport, JSON to format a minimalist protocol and is organized as REST resources	Website
RQ	simple Python library for queueing jobs and processing them in the background with workers	Website
Sidekiq	Simple, efficient background processing for Ruby	Website
ZeroMQ	The Intelligent Transport Layer	Website
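The Apache Kafka entry above notes that, because messages are persisted, clients can rewind a stream and consume the messages again. A minimal sketch of that idea follows. It is not Kafka's API: `MessageLog`, `Consumer`, `poll` and `seek` are hypothetical names chosen for illustration, and an in-memory list stands in for the on-disk log.

```python
class MessageLog:
    """Append-only log. Messages stay in the log after they are read."""

    def __init__(self):
        # Assumption: a Python list stands in for disk persistence.
        self._messages = []

    def append(self, message):
        """Store a message and return its offset (position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own read position, so it can rewind independently."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        """Read the next batch and advance the offset past it."""
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the read position, for example back to 0 to replay."""
        self.offset = offset
```

Because reading never removes anything from the log, a consumer that calls `seek(0)` and polls again receives the same messages in the same order.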
Service Programming
Akka Toolkit	Akka is an open-source toolkit and runtime simplifying the construction of concurrent applications on the Java platform.	Website
Apache Avro	Apache Avro is a framework for modeling, serializing and making Remote Procedure Calls (RPC). Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation. This framework can compete with other similar tools like: Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.	Apache Avro
Apache Curator	Curator is a set of Java libraries that make using Apache ZooKeeper much easier.	Website
Apache Karaf	Apache Karaf is an OSGi runtime that runs on top of any OSGi framework and provides you a set of services, a powerful provisioning concept, an extensible shell & more.	Website
Apache Thrift	A cross-language RPC framework for service creations. It’s the service base for Facebook technologies (the original Thrift contributor). Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application that is written in a language that there are Thrift bindings for. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code – you can just get straight to your service logic. Thrift uses TCP and so a given service is bound to a particular port.	Apache Thrift
Apache Zookeeper	It is a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps the most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on “Chubby”, a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.	1. Apache Zookeeper 2. Google Chubby paper
Google Chubby	a lock service for loosely-coupled distributed systems	Paper
LinkedIn Norbert	Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly distribute a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic. Implemented in Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers for transport to make it easy to build a cluster-aware application. A Java API is provided and pluggable load balancing strategies are supported, with round robin and consistent hash strategies provided out of the box.	1. LinkedIn Project 2. GitHub source code
MPICH	high performance and widely portable implementation of the Message Passing Interface (MPI) standard	Website
OpenMPI	message passing framework	OpenMPI
Serf	decentralized solution for service discovery and orchestration	Serf
Spotify Luigi	a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.	Website
Spring XD	Spring XD (Xtreme Data) is an evolution of the Spring Java application development framework, by Pivotal, to help build Big Data applications. SpringSource was the company created by the founders of the Spring Framework. SpringSource was purchased by VMware, where it was maintained for some time as a separate division. Later VMware and its parent company, EMC Corporation, formally created a joint venture called Pivotal. Spring XD is more than a development framework library: it is a distributed and extensible system for data ingestion, real-time analytics, batch processing, and data export. It can be considered an alternative to Apache Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal Spring for Apache Hadoop (SHDP). SHDP, integrated with Spring, Spring Batch and Spring Data, is part of the Spring IO Platform as a foundational library. Building on top of, and extending, this foundation, the Spring IO platform provides Spring XD as a big data runtime. Spring for Apache Hadoop (SHDP) aims to help simplify the development of Hadoop-based applications by providing a consistent configuration and API across a wide range of Hadoop ecosystem projects such as Pig, Hive, and Cascading, in addition to providing extensions to Spring Batch for orchestrating Hadoop-based workflows.	Spring XD on GitHub
Twitter Elephant Bird	Elephant Bird is a project that provides utilities (libraries) for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers, Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and HBase miscellanea. This open source library is used extensively at Twitter.	Elephant Bird GitHub
Twitter Finagle	Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.	Website
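The LinkedIn Norbert entry above mentions a consistent hash load balancing strategy. The sketch below shows the general technique only. It is not Norbert's implementation or API: the class name `ConsistentHashRing`, the `replicas` parameter and the choice of MD5 as the hash function are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes so that removing a node only remaps its own keys."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring at `replicas` pseudo-random points
        # (virtual nodes) to spread keys more evenly.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Assumption: MD5 is used only as a stable, well-spread hash.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first ring point after hash(key)."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        index = bisect.bisect_right(self._points, self._hash(key))
        # Wrap around to the start of the ring when past the last point.
        return self._ring[index % len(self._ring)][1]
```

The useful property is stability: if a node leaves the cluster, only the keys it owned move to other nodes, while every other key keeps its previous owner. A round robin strategy, by contrast, ignores the key and simply cycles through the nodes.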
Scheduling
Apache Aurora	is a service scheduler that runs on top of Apache Mesos	Apache Incubator
Apache Falcon	Apache™ Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon’s simplification of data management is quite useful to anyone building apps on Hadoop. Data Management on Hadoop encompasses data motion, process orchestration, lifecycle management, data discovery, etc. among other concerns that are beyond ETL. Falcon is a new data processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem (ex. Apache Oozie, Apache Hadoop DistCp etc.) without reinventing the wheel.	Apache Falcon
Apache Oozie	Workflow scheduler system for MR jobs using DAGs (Directed Acyclic Graphs). Oozie Coordinator can trigger jobs by time (frequency) and data availability.	Apache Oozie
Chronos	distributed and fault-tolerant scheduler	Chronos
LinkedIn Azkaban	Hadoop workflow management. A batch job scheduler that can be seen as a combination of the cron and make Unix utilities with a friendly UI.	LinkedIn Azkaban
Pinterest Pinball	customizable platform for creating workflow managers	Website
Sparrow	Sparrow is a high throughput, low latency, and fault-tolerant distributed cluster scheduler. Sparrow is designed for applications that require resource allocations frequently for very short jobs, such as analytics frameworks. Sparrow schedules from a distributed set of schedulers that maintain no shared state. Instead, to schedule a job, a scheduler obtains instantaneous load information by sending probes to a subset of worker machines. The scheduler places the job’s tasks on the least loaded of the probed workers. This technique allows Sparrow to schedule in milliseconds, two orders of magnitude faster than existing approaches. Sparrow also handles failures: if a scheduler fails, a client simply directs scheduling requests to an alternate scheduler.	1. Github 2. Paper
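The Sparrow entry above describes probing a subset of workers and placing a job's tasks on the least loaded of them. A minimal sketch of that probe-and-place idea follows. It is not Sparrow's code: the function name `schedule_job`, the `probe_ratio` parameter and the use of a plain dictionary of queue lengths as "load" are assumptions for illustration.

```python
import random


def schedule_job(num_tasks, worker_loads, probe_ratio=2, rng=None):
    """Place a job's tasks on the least loaded of a random sample of workers.

    worker_loads maps a worker id to its current queue length and is
    updated in place. Returns the list of workers chosen, one per task.
    """
    rng = rng or random.Random()
    workers = list(worker_loads)
    if num_tasks > 0 and not workers:
        raise ValueError("no workers available")

    # Probe only a subset of the cluster: probe_ratio probes per task,
    # capped at the cluster size. No global state is consulted.
    num_probes = min(len(workers), probe_ratio * num_tasks)
    probed = rng.sample(workers, num_probes)

    placements = []
    for _ in range(num_tasks):
        # Pick the probed worker with the shortest queue right now.
        target = min(probed, key=lambda w: worker_loads[w])
        worker_loads[target] += 1
        placements.append(target)
    return placements
```

Sampling a few workers per task keeps each scheduling decision cheap and independent of cluster size, which is the trade-off the entry describes: slightly less informed placement in exchange for very low scheduling latency.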
Machine Learning
Apache Mahout	Machine learning library and math library, on top of MapReduce.	Apache Mahout
Ayasdi Core	tool for topological data analysis	Website
brain	Neural networks in JavaScript.	Website
Cloudera Oryx	The Oryx open source project provides simple, real-time large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering.	1. Oryx at GitHub 2. Cloudera forum for Machine Learning
Concurrent Pattern	Machine Learning for Cascading on Apache Hadoop through an API, and standards based PMML	Cascading Pattern
convnetjs	Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.	Website
Decider	Flexible and Extensible Machine Learning in Ruby.	Website
etcML	text classification with machine learning
Etsy Conjecture	Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL. The goal of this project is to enable the development of statistical models as viable components in a wide range of product settings. Applications include classification and categorization, recommender systems, ranking, filtering, and regression (predicting real-valued numbers). Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs. Integration with Hadoop and scalding enable seamless handling of extremely large data volumes, and integration with established ETL processes. Predicted labels can either be consumed directly by the web stack using the dataset loader, or models can be deployed and consumed by live web code. Currently, binary classification (assigning one of two possible labels to input data points) is the most mature component of the Conjecture package.	Github
Google Sibyl	System for Large Scale Machine Learning at Google	1. Website 2. Website 3. Website
H2O	statistical, machine learning and math runtime for Hadoop	H2O
IBM Watson	cognitive computing system	Website
MLbase	distributed machine learning libraries for the BDAS stack	MLbase
MLPNeuralNet	Fast multilayer perceptron neural network library for iOS and Mac OS X.	Website
nupic	Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.	Website
PredictionIO	machine learning server built on Hadoop, Mahout and Cascading	PredictionIO
scikit-learn	scikit-learn: machine learning in Python.	Website
Spark MLlib	a Spark implementation of some common machine learning (ML) functionality	Spark Documentation
Sparkling Water	combine H2O’s Machine Learning capabilities with the power of the Spark platform	1. Website 2. Website
Vahara	Machine learning and natural language processing with Apache Pig	Website
Viv	global platform that enables developers to plug into and create an intelligent, conversational interface to anything	Website
Vowpal Wabbit	learning system sponsored by Microsoft and Yahoo!	Vowpal Wabbit
WEKA	Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.	Website
Wit	Natural Language for the Internet of Things	Website
Wolfram Alpha	computational knowledge engine	Website
Benchmarking
Apache Hadoop Benchmarking	There are two main JAR files in Apache Hadoop for benchmarking. These JARs contain micro-benchmarks for testing particular parts of the infrastructure; for instance, TestDFSIO analyzes the disk system, TeraSort evaluates MapReduce tasks, WordCount measures cluster performance, etc. The micro-benchmarks are packaged in the tests and examples JAR files, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments. For the Apache Hadoop 2.2.0 stable version, the micro-benchmarks are bundled in these JAR files: hadoop-mapreduce-examples-2.2.0.jar, hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.	MAPREDUCE-3561 umbrella ticket to track all the issues related to performance
Berkeley SWIM Benchmark	The SWIM benchmark (Statistical Workload Injector for MapReduce) is a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems, based on real industry workloads.	GitHub SWIM
Big-Bench	Big Bench Workload Development	Website
Hive-benchmarks	some benchmarking queries for Apache Hive	Website
Hive-testbench	Testbench for experimenting with Apache Hive at any data scale.	Website
Intel HiBench	HiBench is a Hadoop benchmark suite.	Website
Netflix Inviso	performance focused Big Data tool	Website
PUMA Benchmarking	Benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, of which Tera-Sort, Word-Count, and Grep are from the Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are also slightly modified to take the number of reduce tasks as input from the user and to generate final completion-time statistics for jobs.	1. MAPREDUCE-5116 2. Faraz Ahmad researcher 3. PUMA Docs
Yahoo Gridmix3	Hadoop cluster benchmarking from the Yahoo engineering team.	Website
Security
Apache Argus	framework to enable, monitor and manage comprehensive data security across the Hadoop platform	Website
Apache Knox Gateway	System that provides a single point of secure access for Apache Hadoop clusters. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serve one or more Hadoop clusters.	Website
Apache Sentry	Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop™. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets. Sentry was a Cloudera development.	Website
PacketPig	Open Source Big Data Security Analytics	Website
Voltage SecureData	data protection framework	Website
System Deployment
Ankush	A big data cluster management tool that creates and manages clusters of different technologies.	Website
Apache Ambari	Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Apache Ambari was donated by the Hortonworks team to the ASF. It is a powerful and pleasant interface for Hadoop and other typical applications from the Hadoop ecosystem. Apache Ambari is under heavy development and will incorporate new features in the near future. For example, Ambari is able to deploy a complete Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop system that is already running. The ability to provision the operating system could be a good addition, but it is probably not on the roadmap.	Apache Ambari
Apache Bigtop	Bigtop was originally developed and released as an open source packaging infrastructure by Cloudera. Bigtop is used by some vendors to build their own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel’s distribution); however, Apache Bigtop does many more tasks, such as continuous integration testing (with Jenkins, Maven, …), and is useful for packaging (RPM and DEB), deployment with Puppet, and so on. Apache Bigtop can be considered a community effort with one main focus: to assemble all the pieces of the Hadoop ecosystem as a whole, rather than as individual projects.	Apache Bigtop
Apache Helix	Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Originally developed by LinkedIn, it is now an incubator project at Apache. Helix is built on top of ZooKeeper for coordination tasks.	Apache Helix
Apache Mesos	Mesos is a cluster manager that provides resource sharing and isolation across cluster applications, similar to what HTCondor, SGE or Torque can do. However, Mesos has a Hadoop-centred design.	Apache Mesos
Apache Slider	Slider is a YARN application to deploy existing distributed applications on YARN, monitor them, and make them larger or smaller as desired, even while the cluster is running.	Github page
Apache Whirr	Apache Whirr is a set of libraries for running cloud services. It allows you to use simple commands to boot clusters of distributed systems for testing and experimentation. Apache Whirr makes booting clusters easy.	Apache Whirr
Apache YARN	Apache Hadoop YARN is a sub-project of Hadoop at the Apache Software Foundation introduced in Hadoop 2.0 that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce.	Apache YARN
Brooklyn	Brooklyn is a library that simplifies application deployment and management. For deployment, it is designed to tie in with other tools, giving single-click deploy and adding the concepts of manageable clusters and fabrics. Many common software entities are available out of the box. It integrates with Apache Whirr (and thereby Chef and Puppet) to deploy well-known services such as Hadoop and elasticsearch, or you can use POBS (plain-old-bash-scripts). PaaS offerings such as OpenShift can be used alongside self-built clusters for maximum flexibility.	Github
Buildoop	Buildoop is an open source project licensed under Apache License 2.0, based on the Apache BigTop idea. Buildoop is a collaboration project that provides templates and tools to help you create custom Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language and is not based on a mixture of tools as BigTop is (Makefile, Gradle, Groovy, Maven), so it is probably easier to program than BigTop. The design is focused on the basic ideas behind the buildroot Yocto Project. The project is in the early stages of development right now.	Buildoop
Cloudera HUE	Web application for interacting with Apache Hadoop.	Website
Deimos	Mesos containerizer hooks for Docker	Website
Develoop	tool for provisioning, managing and monitoring Apache Hadoop	Website
Facebook Autoscale	the load balancer will concentrate workload to a server until it has at least a medium-level workload	Website
Facebook Prism	multi datacenters replication system	Website
Ganglia Monitoring System	scalable distributed monitoring system for high-performance computing systems such as clusters and Grids	Website
Genie	Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.	Website
Google Borg	job scheduling and monitoring system	Wired article
Google Omega	job scheduling and monitoring system	Talk
Hannibal	Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.	Website
Hortonworks HOYA	HOYA is defined as “running HBase On YARN”. The Hoya tool is a Java tool, and is currently CLI driven. It takes in a cluster specification – in terms of the number of regionservers, the location of HBASE_HOME, the ZooKeeper quorum hosts, the configuration that the new HBase cluster instance should use and so on.	Hortonworks Blog
Jumbune	Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs.	1. Website 2. Github
Marathon	Marathon is a Mesos framework for long-running services. Given that you have Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.	Website
Applications
Adobe Spindle	Next-generation web analytics processing with Scala, Spark, and Parquet	Website
Apache Kiji	Build Real-time Big Data Applications on Apache HBase.	Website
Apache Nutch	Highly extensible and scalable open source web crawler software project. A search engine based on Lucene: A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.	Website
Apache OODT	OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA’s scientific archives	Website
Apache Tika	Toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.	Apache Tika
Domino	Run, scale, share, and deploy models without any infrastructure.	Website
Eclipse BIRT	BIRT is an open source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports.	Website
Eventhub	open source event analytics platform.	Website
HIPI Library	HIPI is a library for Hadoop’s MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.	Website
Hunk	Splunk analytics for Hadoop	Hunk
MADlib	The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data. Its aim is to integrate statistical data analysis into databases. The project describes itself as "Big Data Machine Learning in SQL for Data Scientists". MADlib began as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).	MADlib Community
PivotalR	PivotalR is a package that enables users of R, the most popular open source statistical programming language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal HD / HAWQ and the open-source database PostgreSQL for Big Data analytics. R is a programming language and data analysis software: you do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language, designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one.	Website
Qubole	auto-scaling Hadoop cluster, built-in data connectors.	Website
Sense	Cloud Platform for Data Science and Big Data Analytics	Website
Snowplow	enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.	Website
SparkR	R frontend for Spark	AMPLab extras
Splunk	analyzer for machine-generated data	Splunk
Talend	Talend is an open source software vendor that provides data integration, data management, enterprise application integration and big data software and solutions.	Website
Data Warehouse
Google Mesa	highly scalable analytic data warehousing system	Website
IBM BigInsights	data processing, warehousing and analytics	Website
Microsoft Cosmos	Microsoft’s internal BigData analysis platform	Website
Search engine and framework
Apache Lucene	Search engine library	Apache Lucene
Apache Solr	Search platform for Apache Lucene	Apache Solr
ElasticSearch	Search and analytics engine based on Apache Lucene	ElasticSearch
Elasticsearch Hadoop	Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.	Website
Enigma.io	Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web	Website
Facebook Unicorn	social graph search platform	Website
Google Caffeine	continuous indexing system	Google blog post
Google Percolator	continuous indexing system	Paper
TeraGoogle	large search index
Haeinsa	Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. Use Haeinsa if you need strong ACID semantics on your HBase cluster. It is based on the Google Percolator concept.	Website
HBase Coprocessor	implementation of Percolator, part of HBase	HBase Coprocessor
hIndex	Secondary Index for HBase	Website
Lily HBase Indexer	quickly and easily search for any content stored in HBase	Website
LinkedIn Bobo	is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.	Github Page
LinkedIn Cleo	Cleo is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search. It is suitable for data sets of varying sizes and types. Cleo has been used extensively to power LinkedIn typeahead search covering professional network connections, companies, groups, questions, skills and other site features.	Github
LinkedIn Galene	search architecture at LinkedIn	Blog post on LinkedIn Engineering
LinkedIn Zoie	Zoie is a realtime search/indexing system written in Java.	Github
Sphinx Search Server	Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server.	Sphinx
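The LinkedIn Cleo entry above describes typeahead search that is partial (prefixes match), out-of-order (query terms may appear in any order) and real-time (new entries are searchable immediately). The following is a minimal Python sketch of that idea only. It is not Cleo's implementation or API; the class name `TypeaheadIndex` and its methods are hypothetical, and a production system would use a far more compact index than one set per prefix.

```python
from collections import defaultdict


class TypeaheadIndex:
    """Illustrative prefix index; hypothetical, not Cleo's actual design."""

    def __init__(self):
        self._by_prefix = defaultdict(set)  # token prefix -> ids of matching entries
        self._docs = {}                     # id -> original text

    def add(self, doc_id, text):
        """Index an entry; it is searchable as soon as this returns."""
        self._docs[doc_id] = text
        for token in text.lower().split():
            # Register every prefix of every token: "ada" -> "a", "ad", "ada".
            for end in range(1, len(token) + 1):
                self._by_prefix[token[:end]].add(doc_id)

    def search(self, query):
        """Return entries in which every query term is a prefix of some token."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = None
        for term in terms:
            ids = self._by_prefix.get(term, set())
            # Intersecting per-term matches makes term order irrelevant.
            hits = set(ids) if hits is None else hits & ids
        return sorted(self._docs[i] for i in hits)
```

Storing every prefix trades memory for constant-time lookups per term, which is acceptable for a sketch but is an assumption rather than a recommendation for large data sets.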
MySQL forks and evolutions
Amazon RDS	MySQL databases in Amazon’s cloud	Amazon RDS
Drizzle	Drizzle is a re-designed version of the MySQL v6.0 codebase, built around the central concept of a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, following the general theme of "pluggable storage engines" introduced in MySQL 5. It supports PAM, LDAP, and HTTP AUTH for authentication via plugins it ships. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relational database that supports transactions via an MVCC design.	Website
Google Cloud SQL	MySQL databases in Google’s cloud	Google Cloud SQL
MariaDB	enhanced, drop-in replacement for MySQL	MariaDB
MySQL Cluster	MySQL implementation using NDB Cluster storage engine	MySQL Cluster
Percona Server	enhanced, drop-in replacement for MySQL	Percona Server
ProxySQL	High Performance Proxy for MySQL	ProxySQL
TokuDB	TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL.	Website
WebScaleSQL	is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale, and seek greater performance from a database technology tailored for their needs.	Website
PostgreSQL forks and evolutions
HadoopDB	hybrid of MapReduce and DBMS	HadoopDB
IBM Netezza	high-performance data warehouse appliances	Website
Postgres-XL	Scalable Open Source PostgreSQL-based Database Cluster	Website
RecDB	Open Source Recommendation Engine Built Entirely Inside PostgreSQL	Website
Stado	open source MPP database system solely targeted at data warehousing and data mart applications	Website
Yahoo Everest	multi-petabyte database / MPP derived from PostgreSQL	Website
Memcached forks and evolutions
Facebook McDipper	key/value cache for flash storage	Facebook McDipper
Facebook Memcached	fork of Memcached	Facebook Memcached
Twemproxy	A fast, light-weight proxy for memcached and redis	Github
Twitter Fatcache	key/value cache for flash storage	Twitter Fatcache
Twitter Twemcache	fork of Memcached	Twitter Twemcache
Embedded Databases
Actian PSQL	ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications	Website
BerkeleyDB	a software library that provides a high-performance embedded database for key/value data	Oracle website
HamsterDB	transactional key-value database	Website
HanoiDB	HanoiDB implements an indexed, key/value storage engine. The primary index is a log-structured merge tree (LSM-BTree) implemented using 'doubling sizes' persistent ordered sets of key/value pairs, similar in some regards to LevelDB. HanoiDB includes a visualizer which, when used to watch a living database, resembles the 'Towers of Hanoi' puzzle game, which inspired the name of this database.	Github
LevelDB	a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.	Google code website
LMDB	ultra-fast, ultra-compact key-value embedded data store developed by Symas	Symas website
RocksDB	RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database but our current focus is on embedded workloads.	RocksDB site
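The HanoiDB entry above mentions a log-structured merge tree built from 'doubling sizes' ordered sets. As a rough illustration of that general technique (not HanoiDB's actual code, which is not shown in this list), the Python sketch below keeps level i either empty or holding one sorted run of at most 2**i entries. A write that finds a level occupied merges with it and carries the result upward, much like incrementing a binary counter. The class name `DoublingLSM` and its methods are hypothetical, and everything lives in memory rather than on disk.

```python
from bisect import bisect_left


class DoublingLSM:
    """Toy in-memory LSM store with doubling level sizes (illustrative only)."""

    def __init__(self):
        # levels[i] is None (empty) or a sorted list of (key, value) pairs
        # holding at most 2**i entries.
        self.levels = []

    def put(self, key, value):
        run = [(key, value)]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = run
                return
            # Level occupied: merge it into the incoming run and carry upward.
            run = self._merge(run, self.levels[i])
            self.levels[i] = None
            i += 1

    @staticmethod
    def _merge(newer, older):
        merged = dict(older)
        merged.update(newer)  # entries from the newer run win on duplicate keys
        return sorted(merged.items())

    def get(self, key, default=None):
        # Lower levels always hold newer data, so search them first.
        for run in self.levels:
            if run is None:
                continue
            j = bisect_left(run, (key,))
            if j < len(run) and run[j][0] == key:
                return run[j][1]
        return default
```

Because each merge at most doubles the size of the run it produces, an entry is rewritten only a logarithmic number of times, which is the usual argument for why this layout suits write-heavy workloads.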
Business Intelligence
ActivePivot	Java In-Memory OLAP cube stored in columns, with clearly decoupled pre/post processing	Website
Adatao	business intelligence and data science platform	Website
Apama analytics	platform for streaming analytics and intelligent automated action	Website
Atigeo xPatterns	data analytics platform	Website
BIME Analytics	business intelligence platform in the cloud	Website
Chartio	lean business intelligence platform to visualize and explore your data.	Website
Datapine	self-service business intelligence tool in the cloud	Website
Jaspersoft	powerful business intelligence suite.	Website
Jedox Palo	Palo Suite combines all core applications — OLAP Server, Palo Web, Palo ETL Server and Palo for Excel — into one comprehensive and customisable Business Intelligence platform. The platform is completely based on Open Source products representing a high-end Business Intelligence solution which is available entirely free of any license fees.	Website
Microsoft	business intelligence software and platform.	Website
Microstrategy	software platforms for business intelligence, mobile intelligence, and network applications.	Website
Pentaho	business intelligence platform.	Website
Qlik	business intelligence and analytics platform.	Website
SpagoBI	SpagoBI is an Open Source Business Intelligence suite, belonging to the free/open source SpagoWorld initiative, founded and supported by Engineering Group. It offers a large range of analytical functions, a highly functional semantic layer often absent in other open source platforms and projects, and a respectable set of advanced data visualization features including geospatial analytics	Website
Spotfire	business intelligence platform	Website
Tableau	business intelligence platform.	Website
Teradata Aster	Big Data Analytics	Website
Tessera	Environment for Deep Analysis of Large Complex Data	Website
Zeppelin	open source data analysis environment on top of Hadoop.	Website
Zoomdata	Big Data Analytics	Website
Data Visualization
Arbor	graph visualization library using web workers and jQuery.	Website
CartoDB	open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API	Website
Chart.js	open source HTML5 Charts visualizations.	Website
Crossfilter	JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js	Website
Cubism	JavaScript library for time series visualization.	Website
Cytoscape	open source software platform for visualizing complex networks and integrating these with any type of attribute data	1. Website 2. Website
D3	JavaScript library for manipulating documents.	Website
DC.js	Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3	Website
Envisionjs	dynamic HTML5 visualization.	Website
Freeboard	open source real-time dashboard builder for IoT and other web mashups.	Website
Gephi	An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It’s like Photoshop, but for graphs. Available for Windows and Mac OS X.	Website
Google Charts	simple charting API.	Website
Grafana	graphite dashboard frontend, editor and graph composer.	Website
Graphite	scalable Realtime Graphing.	Website
Highcharts	simple and flexible charting API.	Website
IPython	provides a rich architecture for interactive computing	Website
Keylines	toolkit for visualizing the networks in your data	Website
Matplotlib	plotting with Python.	Website
NVD3	chart components for d3.js.	Website
Peity	Progressive SVG bar, line and pie charts.	Website
Plot.ly	Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly’s online spreadsheet. Fork others’ plots.	Website
Recline	simple but powerful library for building data applications in pure JavaScript and HTML.	Website
Redash	open-source platform to query and visualize data.	Website
Sigma.js	JavaScript library dedicated to graph drawing.	Website
Vega	a visualization grammar.	Website
Internet of things and sensor data
TempoIQ	Cloud-based sensor analytics	Website
