Spark Command Notes (1): spark-shell

Spark version 2.2.3: basic concepts and methods explained

 1. Code + worked examples: the complete guide to processing big data with Spark (Part 1)
 		https://www.jianshu.com/p/826c16298ca6
 2. Code + worked examples: the complete guide to processing big data with Spark (Part 2)
 		 https://zhuanlan.zhihu.com/p/95022557

For Spark deployment and startup, see

	https://github.com/heibaiying/BigData-Notes

Spark local mode vs. cluster mode

https://blog.csdn.net/learn_tech/article/details/83654290

spark-shell --master spark://server01:7077 --total-executor-cores 3 --executor-memory 1g

--master spark://server01:7077: the machine where the master process runs
--total-executor-cores 3: the total number of CPU cores to use across all executors for this application
--executor-memory 1g: the amount of memory allocated to each executor
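
These flags end up as entries in the application's SparkConf. A minimal sketch (assuming the shell above started successfully) of checking them from inside spark-shell; on the standalone master, --total-executor-cores is exposed as spark.cores.max:

	// sc is the SparkContext that spark-shell builds for you
	sc.master                                  // e.g. "spark://server01:7077"
	sc.getConf.get("spark.executor.memory")    // "1g"
	sc.getConf.get("spark.cores.max")          // "3", set by --total-executor-cores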

Common spark-shell commands

For the original article, see https://www.jianshu.com/p/826c16298ca6 (plenty more there)
1. Ctrl+L in the Ubuntu file manager ("My Computer") shows the path of the currently open folder; use this to get local file paths right.
2. Load local files on Ubuntu and work with them:
  val userRDD=sc.textFile("file:///home/fgq/Downloads/u.user");
  val movieRDD=sc.textFile("file:///home/fgq/Downloads/u.item");
  val ratingRDD=sc.textFile("file:///home/fgq/Downloads/u.data");
  userRDD.first();
  userRDD.count();
  userRDD.take(1);
  
  Split on the tab ("\t") delimiter and take the fields at the given indexes.
  // Create an RDD from ratingRDD that only contains the two columns of interest, i.e. movie_id, rating.
   val rdd_movid_rating=ratingRDD.map(x=>(x.split("\t")(1),x.split("\t")(2)));
  
  Split on the "|" delimiter and take the fields at the given indexes.
  // Create an RDD from movieRDD that only contains the two columns of interest, i.e. movie_id, title.
    val rdd_movid_title=movieRDD.map(x=>(x.split('|')(0),x.split('|')(1)));
  
  Using leftOuterJoin
  // Merge these two pair RDDs on movie_id using the leftOuterJoin() transformation. See the transformations documentation.
    val rdd_movid_title_rating=rdd_movid_rating.leftOuterJoin(rdd_movid_title);
  
  Sample result: Array((736,(4,Some(Shadowlands (1993))))); t._2._2 extracts the Some (title) value.
  // Use the RDD from the previous step to create a (movie, 1) tuple pair RDD.
    val rdd_title_rating=rdd_movid_title_rating.map(t=>(t._2._2,1));
    
  // Use the reduceByKey transformation to reduce on the basis of movie_title.
    val rdd_title_ratingcnt=rdd_title_rating.reduceByKey((x,y)=>x+y);
    
  // Get the final answer: swap each pair to (count, title) and take the 25 largest with the top action.
  	 val finalResultRDD=rdd_title_ratingcnt.map(x=>(x._2,x._1));
     finalResultRDD.top(25);
     
     takeOrdered and top are opposites:
	 top sorts the RDD elements in descending order and returns the first N (the largest values).
	 takeOrdered sorts the RDD elements in ascending order and returns the first N (the smallest values).
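
	 A minimal sketch on a throwaway RDD to make the difference concrete:
	 val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))
	 nums.top(3)          // Array(5, 4, 3): descending order, the 3 largest
	 nums.takeOrdered(3)  // Array(1, 2, 3): ascending order, the 3 smallest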

In Spark, two different kinds of operations can be performed on an RDD: transformations and actions.

1. Transformations: create a new dataset from an existing RDD, e.g.
	map, filter,
	distinct, flatMap,
	reduceByKey, groupByKey
2. Actions: the mechanism for getting results back out of Spark (see the sketch after this list), e.g.
 	collect, reduce, take, takeOrdered
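
A minimal sketch showing how the two fit together: transformations are lazy and only build up the execution plan, while an action triggers the actual job:

	val words = sc.parallelize(Seq("spark", "shell", "spark"))
	val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // transformations: nothing runs yet
	counts.collect()  // action: runs the job, e.g. Array((spark,2), (shell,1))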

Creating & operating on Spark DataFrames: https://www.jianshu.com/p/009126dec52f

val ratings=spark.read.format("csv").load("file:///home/fgq/Downloads/u.data");
or
val ratings= spark.read.option("header","false").option("inferSchema","false").csv("file:///home/fgq/Downloads/u.data");

With header=true the first row is used as column names, and inferSchema=true infers the column types:
val ratings= spark.read.option("header","true").option("inferSchema","true").csv("file:///home/fgq/Downloads/u.data");


More ways to create & operate on DataFrames
1.val dfUsers = spark.read.format("csv").option("header", "true").load("file:///root/data/user.csv")
2.scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import spark.implicits._
import spark.implicits._

// Read the file and convert it into an RDD[Row]
scala> val uRdd = spark.sparkContext.textFile("file:///root/data/user.csv")
                     .map(x => x.split(","))
                     .mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
                     .map(Row.fromSeq(_))
uRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[26] at map at <console>:30

// Define the schema
scala> val schema = StructType(Array(
         StructField("user_id", StringType, true),
         StructField("locale", StringType, true),
         StructField("birthyear", StringType, true),
         StructField("gender", StringType, true),
         StructField("joinedAt", StringType, true),
         StructField("location", StringType, true),
         StructField("timezone", StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(user_id,StringType,true), StructField(locale,StringType,true), StructField(birthyear,StringType,true), StructField(gender,StringType,true), StructField(joinedAt,StringType,true), StructField(location,StringType,true), StructField(timezone,StringType,true))

// Create the DataFrame
scala> val dfUsers = spark.createDataFrame(uRdd, schema)
dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]

scala> dfUsers.printSchema

// root
// |-- user_id: string (nullable = true)
// |-- locale: string (nullable = true)
// |-- birthyear: string (nullable = true)
// |-- gender: string (nullable = true)
// |-- joinedAt: string (nullable = true)
// |-- location: string (nullable = true)
// |-- timezone: string (nullable = true)

scala> dfUsers show 3
Note: since the first line of the file contains the column names, mapPartitionsWithIndex() is used to drop it.
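
A standalone sketch of that header-dropping trick on a throwaway RDD (not from the original article): only the first element of partition 0, i.e. the header line, is removed.

scala> val demo = sc.parallelize(Seq("user_id,locale", "u1,en", "u2,zh"), 1)
scala> demo.mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter).collect()
// returns Array(u1,en, u2,zh) -- the header line is gone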

3.scala> val dfUsers = spark.sparkContext.textFile("file:///root/data/users.csv")
           .map(_.split(","))
           .mapPartitionsWithIndex((index, iter) => if (index == 0) iter.drop(1) else iter)
           .map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6)))
           .toDF("user_id", "locale", "birthyear", "gender", "joinedAt", "location", "timezone")
   dfUsers: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more  fields]

scala> dfUsers show 3
	https://www.jianshu.com/p/009126dec52f
