Apache Spark ecosystem

References

1. Apache Spark Ecosystem – Complete Spark Components Guide

2. Apache Spark Ecosystem

3. Edureka posts about Spark

4. Spark SQL Tutorial – Understanding Spark SQL With Examples

5. Spark Streaming Tutorial – Sentiment Analysis Using Apache Spark

6. Spark MLlib – Machine Learning Library Of Apache Spark

7. Spark GraphX Tutorial – Graph Analytics In Apache Spark

8. Spark Tutorial: Real Time Cluster Computing Framework
Notes:
Real Time Processing Framework
Real Time Analytics
Why Spark when Hadoop is already there?
What is Apache Spark?
Spark Features
Getting Started with Spark (installing Spark 2.0 on Ubuntu)
Using Spark with Hadoop
Main concepts of Spark: SparkSession (the unified entry point in Spark 2.0, which subsumes the older SparkContext), data sources, RDDs (Resilient Distributed Datasets), DataFrames (conceptually equivalent to a table in a relational database), and other libraries (see the sketch after these notes).
Spark Components
Use Case: Earthquake Detection using Spark
This blog is the first in an upcoming Apache Spark series that will include Spark Streaming, Spark Interview Questions, Spark MLlib, and others.
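As a quick illustration of the core concepts listed above, here is a minimal Scala sketch, assuming a local Spark 2.x setup; the object name, app name, and sample data are illustrative, not taken from the blog:

```scala
import org.apache.spark.sql.SparkSession

object SparkConceptsDemo {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point (it wraps the older SparkContext)
    val spark = SparkSession.builder()
      .appName("spark-concepts-demo")
      .master("local[*]") // run locally for this sketch
      .getOrCreate()

    // RDD: a low-level, resilient distributed collection
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    println(s"RDD sum = ${rdd.reduce(_ + _)}")

    // DataFrame: conceptually a table in a relational database
    import spark.implicits._
    val df = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    df.filter($"age" > 30).show()

    spark.stop()
  }
}
```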
We can see that Real Time Processing of Big Data is ingrained in every aspect of our lives. From fraud detection in banking to live surveillance systems in government, automated machines in healthcare to live prediction systems in the stock market, everything around us revolves around processing big data in near real time.

Hadoop ===> batch processing: Hadoop is based on batch processing of big data. Data is stored over a period of time and then processed in bulk.
Spark ===> real-time processing framework: Spark can process data as it is being generated over time.
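To make the contrast concrete, here is a minimal Structured Streaming sketch in Scala. The built-in "rate" source (which simply generates rows continuously) stands in for a real data feed so the example is self-contained; names and options are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-sketch")
      .master("local[*]")
      .getOrCreate()

    // The rate source emits (timestamp, value) rows as they are "generated"
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Print each micro-batch to the console as it arrives
    val query = stream.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```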

Figure: Spark Tutorial – Differences between Hadoop and Spark



Figure: Spark Tutorial – Spark Features


Hadoop Integration:

Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can also run on top of an existing Hadoop cluster, using YARN for resource scheduling.


Figure: Spark Tutorial – Spark Features 

Hadoop components can be used alongside Spark in the following ways (see the sketch after this list):
HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
YARN: Spark applications can be made to run on YARN (Hadoop NextGen).
Batch & Real Time Processing: MapReduce and Spark are used together, with MapReduce handling batch processing and Spark handling real-time processing.
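As a rough sketch of the HDFS and YARN points, here is a word count that reads its input from HDFS. The namenode address and input path are hypothetical placeholders; the master (e.g. YARN) would be supplied at submit time via spark-submit rather than hard-coded, which is why no .master(...) call appears:

```scala
import org.apache.spark.sql.SparkSession

object HdfsWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hdfs-word-count")
      .getOrCreate() // master (e.g. yarn) is supplied at submit time

    // hdfs://namenode:8020/data/input.txt is a hypothetical path
    val lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/input.txt")

    // Classic word count over the distributed, replicated file
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(20).foreach(println)
    spark.stop()
  }
}
```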


Figure: Use Case – Flow diagram of Earthquake Detection using Apache Spark
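The flow diagram itself is not reproduced here, but to give a flavour of the idea, below is a rough Structured Streaming sketch that flags readings above a magnitude threshold. The socket source, the (station, magnitude) schema, and the 6.0 threshold are all illustrative assumptions, not the blog's actual pipeline:

```scala
import org.apache.spark.sql.SparkSession

object QuakeAlertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quake-alert-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Lines like "STATION42,6.3" arriving on a local socket (test feed)
    val raw = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Parse into (station, magnitude) rows; this schema is an assumption
    val readings = raw.as[String]
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))
      .toDF("station", "magnitude")

    // Keep only readings above an illustrative alert threshold
    val alerts = readings.filter($"magnitude" >= 6.0)

    alerts.writeStream.format("console").start().awaitTermination()
  }
}
```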
