Spark version 2.2.3: basic concepts and methods explained
1. Code + worked examples: the most complete guide to processing big data with Spark (Part 1)
https://www.jianshu.com/p/826c16298ca6
2. Code + worked examples: the most complete guide to processing big data with Spark (Part 2)
https://zhuanlan.zhihu.com/p/95022557
For Spark deployment and startup, see:
https://github.com/heibaiying/BigData-Notes
Spark local mode vs. cluster mode:
https://blog.csdn.net/learn_tech/article/details/83654290
spark-shell --master spark://server01:7077 --total-executor-cores 3 --executor-memory 1g
--master spark://server01:7077: specifies the machine where the master process runs
--total-executor-cores 3: specifies the total number of executor cores to use across the cluster
--executor-memory 1g: specifies the amount of memory allocated to each executor
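Once the shell is running, these settings can be checked from inside spark-shell. A minimal sketch (the configuration keys below are standard Spark properties; the values in the comments assume the launch command above):
// Inspect the connection and resource settings from within spark-shell
sc.master                                           // e.g. "spark://server01:7077", or "local[*]" in local mode
sc.getConf.get("spark.executor.memory", "not set")  // "1g" when launched with --executor-memory 1g
sc.getConf.get("spark.cores.max", "not set")        // "3" when launched with --total-executor-cores 3 (standalone mode)
sc.defaultParallelism                               // default number of partitions for e.g. sc.parallelize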
Common spark-shell commands
See the original article for more details: https://www.jianshu.com/p/826c16298ca6
1. In the Ubuntu file manager (Files), press Ctrl+L to display the path of the currently open folder; this is how to find the correct local file path.
2. Load local files on Ubuntu and work with them:
val userRDD=sc.textFile("file:///home/fgq/Downloads/u.user");
val movieRDD=sc.textFile("file:///home/fgq/Downloads/u.item");
val ratingRDD=sc.textFile("file:///home/fgq/Downloads/u.data");
userRDD.first();
userRDD.count();
userRDD.take(1);
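As a quick sanity check on these actions, here is a hedged sketch assuming u.user follows the standard MovieLens 100k layout (pipe-separated: user_id|age|gender|occupation|zip_code); the field indexes are an assumption, not something stated above:
// Sketch: simple exploration of userRDD, assuming the MovieLens 100k u.user layout
val userFields = userRDD.map(line => line.split('|'));
userFields.map(f => f(3)).countByValue();   // action: number of users per occupation (returns a local Map)
userFields.map(f => f(1).toInt).mean();     // action: average user age (mean() comes from DoubleRDDFunctions)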
Split on the tab ("\t") delimiter and take the fields at the given indexes:
# Create an RDD from ratingRDD that only contains the two columns of interest, i.e. movie_id and rating.
val rdd_movid_rating=ratingRDD.map(x=>(x.split("\t")(1),x.split("\t")(2)));
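To see how the indexes map to fields, a small hedged check (assuming the standard u.data layout user_id<TAB>movie_id<TAB>rating<TAB>timestamp, so index 1 is the movie id and index 2 is the rating):
ratingRDD.first();        // e.g. "196\t242\t3\t881250949"
rdd_movid_rating.first(); // e.g. (242,3)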
Split on the "|" delimiter and take the fields at the given indexes:
# Create an RDD from movieRDD that only contains the two columns of interest, i.e. movie_id and title.
val rdd_movid_title=movieRDD.map(x=>(x.split('|')(0),x.split('|')(1)));
Using leftOuterJoin
# Merge these two pair RDDs on movie_id. For this we use the leftOuterJoin() transformation. See the transformations documentation.
val rdd_movid_title_rating=rdd_movid_rating.leftOuterJoin(rdd_movid_title);
Example element: Array((736,(4,Some(Shadowlands (1993))))); the value inside the Some is accessed with t._2._2
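To make the leftOuterJoin result type concrete, here is a tiny sketch on made-up pairs: the right-hand value comes back wrapped in an Option, which is why the title above appears as Some(...). As a consequence, the keys of rdd_title_rating built below are Option[String] values.
// Sketch with made-up data: leftOuterJoin keeps every left-side key and wraps the
// right-side value in an Option (Some(...) if matched, None otherwise)
val left=sc.parallelize(Seq(("1","5"),("2","3")));       // (movie_id, rating)
val right=sc.parallelize(Seq(("1","Toy Story (1995)"))); // (movie_id, title)
left.join(right).collect();          // e.g. Array((1,(5,Toy Story (1995)))) -- unmatched keys are dropped
left.leftOuterJoin(right).collect(); // e.g. Array((1,(5,Some(Toy Story (1995)))), (2,(3,None)))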
# Use the RDD from the previous step to create a (movie_title, 1) tuple pair RDD
val rdd_title_rating=rdd_movid_title_rating.map(t=>(t._2._2,1));
# Use the reduceByKey transformation to reduce on the basis of movie_title
val rdd_title_ratingcnt=rdd_title_rating.reduceByKey((x,y)=>x+y);
# Get the final answer: swap to (count, title) pairs and take the 25 most-rated titles with top()
val finalResultRDD=rdd_title_ratingcnt.map(x=>(x._2,x._1));
finalResultRDD.top(25);
takeOrdered and top are opposites of each other:
top sorts the RDD's elements in descending order and returns the first N;
takeOrdered sorts the RDD's elements in ascending order and returns the first N.
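A minimal sketch on a throwaway RDD makes the difference concrete:
val nums=sc.parallelize(Seq(5,1,4,2,3));
nums.top(3);                                // Array(5, 4, 3)  -- descending
nums.takeOrdered(3);                        // Array(1, 2, 3)  -- ascending
nums.takeOrdered(3)(Ordering[Int].reverse); // Array(5, 4, 3), equivalent to top(3)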
In Spark, you can perform two different types of operations on an RDD: transformations and actions (see the sketch after this list).
1. Transformations: create a new dataset from an existing RDD
map, filter, distinct, flatMap, reduceByKey, groupByKey
2. Actions: mechanisms for returning results from Spark to the driver
collect, reduce, take, takeOrdered
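A short end-to-end sketch (on made-up strings, not the MovieLens files) showing several of these transformations and actions chained together:
// Transformations build new RDDs lazily; actions trigger computation and return results to the driver
val lines=sc.parallelize(Seq("spark makes big data simple","big data with spark"));
val counts=lines.flatMap(_.split(" "))  // transformation: split lines into words
  .map(w=>(w,1))                        // transformation: pair each word with 1
  .reduceByKey(_+_);                    // transformation: sum the counts per word
counts.collect();                       // action: bring all (word, count) pairs to the driver
counts.map(_._2).reduce(_+_);           // action: total number of words
counts.take(2);                         // action: first two elements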
Creating & operating on Spark DataFrames: https://www.jianshu.com/p/009126dec52f
val ratings=spark.read.format("csv").load("file:///home/fgq/Downloads/u.data");
or
val ratings= spark.read.option("header","false").option("inferSchema","false").csv("file:///home/fgq/Downloads/u.data");
To use the first row as the header and infer the column types:
val ratings= spark.read.option("header","true").option("inferSchema","true").csv("file:///home/fgq/Downloads/u.data");
More ways to create & operate on DataFrames
1. val dfUsers = spark.read.format("csv").option("header", "true").load("file:///root/data/user.csv")
2. scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import spark.implicits._
import spark.implicits._
// Read the file and convert it into an RDD[Row]
scala> val uRdd = spark.sparkContext.textFile("file:///root/data/user.csv")
.map(x => x.split(","))
.mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
.map(Row.fromSeq(_))
uRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[26] at map at <console>:30
// Define the schema
scala> val schema = StructType(Array(StructField("user_id", StringType, true),
StructField("locale", StringType, true),StructField("birthyear", StringType, true),
StructField("gender",StringType, true), StructField("joinedAt", StringType, true),
StructField("location", StringType, true), StructField("timezone", StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(user_id,StringType,true), StructField(locale,StringType,true), StructField(birthyear,StringType,true), StructField(gender,StringType,true), StructField(joinedAt,StringType,true), StructField(location,StringType,true), StructField(timezone,StringType,true))
// Create the DataFrame
scala> val dfUsers = spark.createDataFrame(uRdd, schema)
dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]
scala> dfUsers.printSchema
// root
// |-- user_id: string (nullable = true)
// |-- locale: string (nullable = true)
// |-- birthyear: string (nullable = true)
// |-- gender: string (nullable = true)
// |-- joinedAt: string (nullable = true)
// |-- location: string (nullable = true)
// |-- timezone: string (nullable = true)
scala> dfUsers show 3
Note: since the first line of the file contains the column names, mapPartitionsWithIndex() is used to drop it.
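As an alternative sketch (same user.csv and the schema defined above), the CSV reader can skip the header row itself when a schema is supplied, which avoids the RDD[Row] detour; dfUsers2 is just an illustrative name:
val dfUsers2 = spark.read
  .option("header", "true")   // first line is treated as the header and skipped
  .schema(schema)             // reuse the StructType defined above
  .csv("file:///root/data/user.csv")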
3. scala> val dfUsers = spark.sparkContext.textFile("file:///root/data/users.csv")
.map(_.split(","))
.mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
.map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6)))
.toDF("user_id", "locale", "birthyear", "gender", "joinedAt", "location", "timezone")
dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]
scala> dfUsers show 3
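Since this section covers both creating and operating on DataFrames, here is a short hedged sketch of common operations on dfUsers (the column names come from the schema above; the filter value "male" and the view name "users" are made up for illustration):
// Sketch: typical DataFrame operations and an equivalent SQL query
dfUsers.select("user_id", "locale", "gender").show(3)
dfUsers.filter($"gender" === "male").count()       // illustrative filter value
dfUsers.groupBy("locale").count().orderBy($"count".desc).show(10)
dfUsers.createOrReplaceTempView("users")
spark.sql("SELECT locale, COUNT(*) AS cnt FROM users GROUP BY locale ORDER BY cnt DESC").show(10)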