Spark (1) Introduction and Installation
1. Introduction
1.1 MapReduce Model
Map -- read the input and convert it to key-value pairs
Reduce -- calculate over the values grouped by key
A MapReduce job involves 4 classes of work:
read and convert the input data to key-value pairs, Map, Reduce, then convert and write the key-value output to the output data.
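As a quick sketch, the classic word count expresses the two phases in plain Scala (the input list is just sample data):

val lines = List("a b a", "b c")
val mapped = lines.flatMap(_.split(" ")).map(w => (w, 1)) // Map: convert the input to (word, 1) key-value pairs
val reduced = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) } // Reduce: sum the values per key
// reduced == Map("a" -> 2, "b" -> 2, "c" -> 1)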
1.2 Apache mesos
Mesos and YARN both control cluster resources; Mesos is a resource sharing system.
Hadoop scheduler, MPI scheduler, Spark (the frameworks)
Mesos master, Standby master, … (controlled by ZooKeeper)
Mesos slave, Mesos slave, Mesos slave, … (execute Hadoop executor tasks, MPI executor tasks, …)
Mesos-master: manages the frameworks and slaves, and hands resources from the slaves to the frameworks
Mesos-slave: runs the mesos-tasks assigned to it
Framework: Hadoop, Spark, …
Executor: the framework-specific process launched on a slave to run the framework's tasks
1.3 Spark Introduction
Spark is implemented in Scala and runs on top of Mesos.
It can work with Hadoop and EC2, and read data directly from HDFS or S3.
The stack, from top to bottom:
Bagel, Shark
Spark (RDD, MapReduce, FP)
Mesos
HDFS, AWS s3n
So Spark combines the MapReduce model with functional programming, and builds on Mesos, HDFS, and S3.
Spark Terms
RDD - Resilient Distributed Datasets
Local mode and Mesos Mode
Transformations and Actions -
A transformation returns a new RDD;
an action returns a Scala collection, a single value, or null.
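For example, assuming a SparkContext named sc and the 0.7-era API (the HDFS paths are placeholders):

val lines = sc.textFile("hdfs://localhost:9000/data.txt") // an RDD of lines
val lengths = lines.map(_.length) // transformation: returns a new RDD, evaluated lazily
val total = lengths.reduce(_ + _) // action: returns an Int value to the driver
lines.saveAsTextFile("hdfs://localhost:9000/out") // action: writes the RDD out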
Spark on Mesos
RDD + Job (tasks) ---> Spark Scheduler ---> Mesos Master ---> Mesos Slave, Mesos Slave, … (Spark executors run the tasks)
1.4 HDFS Introduction
Hadoop Distributed File System ---- NameNode (only one) ------> DataNodes
Block: 64 MB is the default block size of a file.
NameNode: keeps the file names, the directory tree, the namespace image, and the edit log; it knows how many blocks each file has and where they are on the DataNodes.
DataNode: clients or the NameNode read and write the actual data blocks on the DataNodes.
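To make this concrete, the usual way to talk to HDFS is the hadoop fs shell; the paths here are just examples:

>hadoop fs -mkdir /user/carl
>hadoop fs -put local.txt /user/carl/local.txt
>hadoop fs -cat /user/carl/local.txt

The -put command splits the local file into blocks, stores them on the DataNodes, and registers the metadata with the NameNode.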
1.5 Zookeeper
Configuration Management
Cluster Management
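For example, the zkCli.sh shell that ships with ZooKeeper can keep a small piece of configuration in a znode (the server address, znode path, and value are placeholders):

>bin/zkCli.sh -server 127.0.0.1:2181
create /myapp-config "timeout=30"
get /myapp-config

Clients can set watches on such znodes and get notified when the value changes, which is the basis of both configuration management and cluster management.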
1.6 NFS Introduction
NFS - Network File System
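A quick sketch of how it is used: the server exports a directory in /etc/exports and the client mounts it over the network (hostnames, paths, and the subnet are placeholders):

/opt/share 192.168.1.0/24(rw,sync,no_subtree_check)

>sudo mount -t nfs server1:/opt/share /mnt/share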
2. Installation of Spark
Since version 0.6, Spark can run without Mesos, so we can ignore Mesos at first.
Get the source codes
>git clone https://github.com/mesos/spark.git
My Scala version is 2.10.0, so I just try the command
>sudo sbt/sbt package
It works.
I also tried to build with Maven, but it does not seem to work. Since I already have SCALA_HOME set, I will run the examples directly.
Syntax: ./run <class> <params>
>./run spark.examples.SparkLR local[2]
Or
>./run spark.examples.SparkPi local
I try to run Spark to verify my environment, but it does not work, seemingly because of SCALA_HOME.
Error Message:
Exception in thread "main" java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
at spark.examples.SparkPi.main(SparkPi.scala)
Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
Solution:
>cd examples
>sudo mvn eclipse:eclipse
>cd ..
>sudo mvn eclipse:eclipse
Then I import the examples and the spark project into my Eclipse to read the source code.
Read the code in spark-examples/src/main/scala/spark/examples/SparkPi.scala
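Roughly, SparkPi estimates Pi with a Monte Carlo method: it throws random points into a square and counts how many fall inside the unit circle. A sketch of the idea (not the exact source):

import spark._
import scala.math.random

object SparkPi {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "SparkPi") // args(0) is the master, e.g. "local"
    val n = 100000
    val count = sc.parallelize(1 to n).map { _ =>
      val x = random * 2 - 1 // random point in the 2x2 square
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0 // inside the unit circle?
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    System.exit(0)
  }
}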
Run this again
>sudo ./run spark.examples.SparkPi local
Still not working; it tells me SCALA_HOME is not set. But I am sure it is set.
>wget http://www.spark-project.org/files/spark-0.7.0-sources.tgz
Unzip and put it in the working directory
>sudo ln -s /Users/carl/tool/spark-0.7.0 /opt/spark-0.7.0
>sudo ln -s /opt/spark-0.7.0 /opt/spark
Compile the source codes
>sudo sbt/sbt compile
>sudo sbt/sbt package
>sudo sbt/sbt assembly
>sudo ./run spark.examples.SparkPi local
Error is still there, SCALA_HOME is not set.
Finally, I found the reason: I should change conf/spark-env.sh.
>cd conf
>cp spark-env.sh.template spark-env.sh
And be careful: do not use Scala version 2.10.0 there. I should use 2.9.2:
export SCALA_HOME=/opt/scala2.9.2
This time, everything goes well.
>sudo ./run spark.examples.SparkPi local
>sudo ./run spark.examples.SparkLR local[2]
local[2] means run locally with 2 CPU cores.
References:
Spark
http://www.ibm.com/developerworks/cn/opensource/os-spark/
http://spark-project.org/documentation/
http://rdc.taobao.com/team/jm/archives/tag/spark
http://rdc.taobao.com/team/jm/archives/2043
http://spark-project.org/examples/
http://rdc.taobao.com/team/jm/archives/1871
http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/
http://run-xiao.iteye.com/blog/1835707
http://www.yiihsia.com/2011/12/%E5%88%9D%E5%A7%8Bspark-%E5%9F%BA%E6%9C%AC%E6%A6%82%E5%BF%B5%E5%92%8C%E4%BE%8B%E5%AD%90/
http://www.cnblogs.com/jerrylead/archive/2012/08/13/2636115.html
http://blog.csdn.net/macyang/article/details/7100523
Git resource
https://github.com/mesos/spark
HDFS
http://www.cnblogs.com/forfuture1978/archive/2010/03/14/1685351.html
Hadoop
http://blog.csdn.net/robertleepeak/article/details/6001369
Mesos
http://dongxicheng.org/mapreduce-nextgen/mesos_vs_yarn/
ZooKeeper
http://rdc.taobao.com/team/jm/archives/665