Start HDFS and YARN
start-dfs.sh
start-yarn.sh
Change the hostname
su root
cd
hostname localhost
Check the background daemons
jps
29456 NameNode
29863 SecondaryNameNode
30220 ResourceManager
30718 Jps
29548 DataNode
30307 NodeManager
spark-shell --driver-memory 512M --executor-memory 512M
Driver and executor memory; the default is 1G each.
spark-submit --class <fully qualified class name inside the jar> [--executor-memory 256M] <jar file>
For example:
spark-submit --master yarn --class sparkstreaming.SparkSteaming SparkStreaming.jar
spark-shell --master yarn --driver-memory 128M --executor-memory 128M
Track the submitted application in the YARN web UI: http://localhost:8088/proxy/application_1510121456507_0002/
map -> shuffle -> reduce. Operations such as groupByKey and reduceByKey trigger a shuffle; because a shuffle transfers data between servers over the network, it is expensive.
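The three phases above can be sketched with plain Scala collections (no cluster needed); this is only an illustration of the data movement, with `groupBy` standing in for the network shuffle that Spark performs between executors:

```scala
// Sample input standing in for lines of a text file (hypothetical data).
val lines = Seq("a b a", "b c")

// map phase: runs locally on each partition, no data movement.
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))

// shuffle phase: records with the same key are brought together.
// In Spark this step crosses the network, which is why it is slow.
val grouped = pairs.groupBy(_._1)

// reduce phase: per-key aggregation, like reduceByKey(_ + _).
val counts = grouped.map { case (w, ps) => (w, ps.map(_._2).sum) }
```

Here `counts` maps each word to its total, mirroring what `reduceByKey(_+_)` produces on an RDD.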
hadoop fs -mkdir /sougou
hadoop fs -put Sogou01.txt /sougou
hadoop fs -ls /sougou
Transformations:
map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy
Actions:
count, collect, reduce, lookup, save
SparkContext is the driver's abstraction inside the program
var rdd=sc.textFile("file:///home/hadoop/derby.log")
var wordcount=rdd.flatMap(_.split(" ")).map(x=>(x,1)).reduceByKey(_+_)
wordcount.take(10)
map(x=>(x,1)): narrow dependency
reduceByKey: wide dependency
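The distinction: in a narrow dependency each output partition reads from at most one parent partition, so the work stays local; in a wide dependency an output partition needs records from many parent partitions, forcing a shuffle. A plain-Scala sketch with two hypothetical input partitions:

```scala
// Two input partitions (hypothetical data).
val partitions = Seq(Seq("a", "b"), Seq("a", "c"))

// Narrow dependency: each partition is mapped independently,
// no data crosses partition boundaries.
val mapped = partitions.map(_.map(w => (w, 1)))

// Wide dependency: the same key ("a") lives in both partitions,
// so records must be regrouped across partitions before reducing.
val combined = mapped.flatten.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
```

Because "a" appears in both partitions, only the regrouping step can compute its total count, which is why reduceByKey needs a shuffle boundary.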
Edit hadoop/...spark.../conf/hive-site.xml and append one property at the end
Copy this file to hadoop/apache...hive...bin/conf/
Run on the command line:
nohup hive --service metastore>metastore.log 2>&1 &
jps
/home/hadoop/spark-1.5.1-bin-hadoop2.4/sbin/start-thriftserver.sh
/home/hadoop/spark-1.5.1-bin-hadoop2.4/bin/beeline
!connect jdbc:hive2://localhost:10000