YARN
It supports the classic MapReduce framework.
It also supports other open-source and commercial applications running on it, such as Impala and Storm, without requiring any changes to them.
It also supports user-developed applications.
It also enables frameworks such as Tez and Spark.
Execution Frameworks: YARN, Tez, Spark
Support DAGs (directed acyclic graphs) of tasks.
In-memory caching of data.
MapReduce
Application engine.
Applications that fit the MapReduce paradigm: the data is split into distributed chunks that can be processed independently of each other (map), and a shuffle step then feeds the intermediate data into the reduce step.
Applications that do not fit the MapReduce paradigm:
Interactive data exploration - load data into memory to avoid reading it from disk again and again.
Iterative data processing - machine learning algorithms.
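The map, shuffle and reduce steps described above can be sketched in plain Python. This is only an illustration of the paradigm on a single machine, not Hadoop code; the function names and the sample chunks are made up for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit (word, 1) for every word in one independent chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # Each chunk is mapped on its own; no chunk needs data from another.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    return reduce_phase(shuffle_phase(mapped))

# Hypothetical input, already split into chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(chunks)
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because every chunk is mapped independently, the map calls could run on different machines; only the shuffle needs to bring values with the same key together.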
Tez
Application engine.
Features:
Handles dataflow graphs with an expressive API.
Supports custom data types and custom application logic, so it does not have the restrictions of the MapReduce framework.
Can run complex DAGs of tasks.
Dynamic DAG changes.
Reuses resources (containers) to avoid the cost of container startup, which makes it more efficient.
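To make the idea of running a DAG of tasks concrete, here is a small sketch using Python's standard graphlib module. It is not the Tez API; the task names and the run_dag helper are hypothetical, and the tasks do no real work.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "scan_a": set(),
    "scan_b": set(),
    "join_ab": {"scan_a", "scan_b"},
    "aggregate": {"join_ab"},
}

def run_dag(graph):
    """Run tasks in dependency order; return the order in which they ran."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    order = []
    while sorter.is_active():
        # Tasks returned together are independent and could run in parallel.
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        for task in ready:
            order.append(task)
            sorter.done(task)  # unblocks tasks that were waiting on this one
    return order

order = run_dag(dag)
print(order)  # prints: ['scan_a', 'scan_b', 'join_ab', 'aggregate']
```

The two scans become ready at the same time, which is the kind of parallelism a DAG engine can exploit without forcing every step into a separate map and reduce pair.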
Compare MapReduce and Tez on:
Use case:
SELECT a.vendor, COUNT(*), AVG(c.cost) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemid = c.itemid) GROUP BY a.vendor
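To see what this query computes, it can be run against a tiny in-memory SQLite database. This is a sketch, not Hive, MapReduce or Tez: the table layouts and rows are invented, and it assumes the first join condition is a.id = b.id and the second join targets table c. The ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas and rows; only the columns the query uses are defined.
conn.executescript("""
    CREATE TABLE a (id INTEGER, itemid INTEGER, vendor TEXT);
    CREATE TABLE b (id INTEGER);
    CREATE TABLE c (itemid INTEGER, cost REAL);
    INSERT INTO a VALUES (1, 10, 'acme'), (2, 11, 'acme'), (3, 12, 'globex');
    INSERT INTO b VALUES (1), (2), (3);
    INSERT INTO c VALUES (10, 2.0), (11, 4.0), (12, 5.0);
""")

query = """
    SELECT a.vendor, COUNT(*), AVG(c.cost)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemid = c.itemid)
    GROUP BY a.vendor
    ORDER BY a.vendor
"""
rows = conn.execute(query).fetchall()
print(rows)  # prints: [('acme', 2, 3.0), ('globex', 1, 5.0)]
conn.close()
```

The query has three logical stages: two joins and one aggregation. With classic MapReduce these are typically executed as a chain of separate jobs with intermediate results written out between them, whereas Tez can express them as a single DAG.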
Spark
Application engine.
It can run directly on HDFS without needing YARN. It can also run on other storage systems.
Features:
Advanced DAG execution engine - data can be shared across DAGs and reused between iterations, which can make it much faster than other DAG engines for such workloads.
Supports cyclic data flow.
In-memory computing. When the data does not fit in memory, it spills over to disk gracefully.
Can be accessed from Java, Scala, Python and R.
Existing optimized libraries.
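Why reusing data between iterations matters can be shown without Spark at all. The sketch below is plain Python, with a hypothetical CountingSource class standing in for an expensive read from disk; in Spark the comparable step would be keeping a dataset in memory (for example with cache()) before iterating over it.

```python
class CountingSource:
    """Stand-in for a slow data source; counts how often it is read."""

    def __init__(self, records):
        self.records = records
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self.records)

def iterate(get_data, steps=5, rate=0.5):
    """Gradient descent toward the mean; every step needs the full dataset."""
    estimate = 0.0
    for _ in range(steps):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= rate * gradient
    return estimate

# Without caching: the data is reloaded on every iteration.
uncached = CountingSource([1.0, 2.0, 3.0, 4.0])
iterate(uncached.load)

# With caching: load once, keep the result in memory, reuse it each step.
cached_source = CountingSource([1.0, 2.0, 3.0, 4.0])
in_memory = cached_source.load()
result = iterate(lambda: in_memory)

print(uncached.loads, cached_source.loads)  # prints: 5 1
```

Both runs compute the same estimate, but the cached run touches the source once instead of once per iteration, and that gap grows with the number of iterations.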
Hadoop Resource Scheduling
Schedulers:
FIFO (default)
Fair share - balances resources between applications; the default resource is memory, but CPU can be added as a resource too.
Balances out resource allocation among apps over time.
Can be organized into queues/sub-queues.
Guarantees minimum shares.
Weighted app priorities
Capacity - guarantees resources for each application
Queues and sub-queues
Capacity Guarantee with elasticity
ACLs for security
Runtime changes/draining apps
Resource-based scheduling
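The fair share ideas listed above (balancing between apps, minimum shares, weights) can be sketched as a small allocation function. This is a simplified illustration, not the algorithm Hadoop's schedulers actually implement: it grants the minimum shares first and then splits what is left by weight, and it assumes the minimum shares together fit within the capacity. The function name, app names and numbers are all hypothetical.

```python
def fair_shares(capacity, demands, weights=None, min_shares=None):
    """Split `capacity` among apps by weight, never exceeding an app's demand."""
    weights = weights or {app: 1.0 for app in demands}
    min_shares = min_shares or {}
    # Step 1: grant guaranteed minimums (never more than demanded).
    alloc = {app: float(min(demands[app], min_shares.get(app, 0.0)))
             for app in demands}
    remaining = capacity - sum(alloc.values())
    active = {app for app in demands if alloc[app] < demands[app]}
    # Step 2: hand out the rest in rounds, proportionally to weight.
    # Apps whose demand is met drop out and their leftover is redistributed.
    while remaining > 1e-9 and active:
        total_weight = sum(weights[app] for app in active)
        satisfied, spent = set(), 0.0
        for app in active:
            offer = remaining * weights[app] / total_weight
            grant = min(offer, demands[app] - alloc[app])
            alloc[app] += grant
            spent += grant
            if demands[app] - alloc[app] <= 1e-9:
                satisfied.add(app)
        remaining -= spent
        active -= satisfied
        if not satisfied:
            break  # every active app took its full offer; nothing is left
    return alloc

# Hypothetical cluster with 12 GB of memory and three apps.
demands = {"etl": 10, "adhoc": 2, "ml": 8}
equal = fair_shares(12, demands)
weighted = fair_shares(12, demands, weights={"etl": 2, "adhoc": 1, "ml": 1})
print(equal)
print(weighted)
```

With equal weights the small app gets its full demand of 2 and the other two split the rest evenly at 5 each. Under FIFO, by contrast, whichever app arrived first could take its full demand before the others received anything.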