Using PySpark on the Alibaba Cloud platform

Method: https://yq.aliyun.com/articles/692148

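Smoke test failed: resolved by having ops change the default resource group. To view the SparkPi output, click the task row whose TaskName is master-0, then in the FuxiInstance panel below filter with the All button and click the StdOut button for TempRoot.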

Comparison: https://zhuanlan.zhihu.com/p/34901585

Bugs:

ValueError: Some of types cannot be determined after inferring. Cause: https://blog.csdn.net/loxeed/article/details/53434555
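This error typically shows up when createDataFrame has to infer a schema and one of the columns holds only None in the rows it inspects, so no type can be derived for it. Below is a minimal sketch of one common workaround, passing an explicit schema. The column names `name` and `score` and both helper functions are hypothetical, and pyspark is imported lazily inside the function that needs it.

```python
def columns_all_none(rows):
    """Return the keys whose value is None in every row (dicts sharing the same keys).

    These are the columns whose type cannot be inferred from the data.
    """
    if not rows:
        return []
    return [key for key in rows[0] if all(row[key] is None for row in rows)]


def create_df_with_schema(spark, rows):
    """Build a DataFrame with an explicit schema instead of relying on inference.

    `spark` is an existing SparkSession; `name` and `score` are assumed columns.
    """
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("score", DoubleType(), True),  # nullable, may be all None
    ])
    data = [(row["name"], row["score"]) for row in rows]
    return spark.createDataFrame(data, schema=schema)


sample_rows = [{"name": "a", "score": None}, {"name": "b", "score": None}]
# Columns that would break schema inference for this sample
print(columns_all_none(sample_rows))
```

With a running session, create_df_with_schema(spark, sample_rows) should build the DataFrame even though every score is None, because the type comes from the schema rather than from the data.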

Code

After editing a .py file, always submit it before running it again.

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)

df.count()

SparkFiles.getRootDirectory() returns the path of the root directory that contains the files added through SparkContext.addFile().

----------------------------------------sparkfile.py------------------------------------
from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/hadoop/examples_pyspark/finddistance.R"
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
# Ship the file to every node, then look up its local absolute path
sc.addFile(finddistance)
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
----------------------------------------sparkfile.py------------------------------------

Consider the following StorageLevel example, which uses the storage level MEMORY_AND_DISK_2. This means each RDD partition is kept with a replication factor of 2.

----------------------------------------storagelevel.py------------------------------------
from pyspark import SparkContext
import pyspark

sc = SparkContext("local", "storagelevel app")
rdd1 = sc.parallelize([1, 2])
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
----------------------------------------storagelevel.py------------------------------------
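To make the meaning of MEMORY_AND_DISK_2 concrete, the sketch below spells out the five flags a storage level is built from. The `Level` tuple and the `describe` helper are hypothetical stand-ins written for illustration, not part of pyspark. The flag values given for MEMORY_AND_DISK_2 are assumed to match the PySpark definition (disk and memory enabled, serialized, replication 2).

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields of pyspark.StorageLevel.
Level = namedtuple(
    "Level", ["useDisk", "useMemory", "useOffHeap", "deserialized", "replication"]
)

# Assumed flag values for MEMORY_AND_DISK_2.
MEMORY_AND_DISK_2 = Level(True, True, False, False, 2)


def describe(level):
    """Turn the storage-level flags into a readable summary."""
    parts = []
    if level.useDisk:
        parts.append("disk")
    if level.useMemory:
        parts.append("memory")
    if level.useOffHeap:
        parts.append("off-heap")
    parts.append("deserialized" if level.deserialized else "serialized")
    parts.append("%d replicas" % level.replication)
    return ", ".join(parts)


print(describe(MEMORY_AND_DISK_2))
```

Because describe only reads attributes, it should also accept the object returned by rdd1.getStorageLevel(), which exposes fields with the same names.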

from __future__ import print_function

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

if __name__ == "__main__":
    sc = SparkContext(appName="PySpark mllib Example")
    data = sc.textFile("test.data")
    ratings = data.map(lambda l: l.split(','))\
        .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 10
    model = ALS.train(ratings, rank, numIterations)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (p[0], p[1]))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    # Each joined record is ((user, product), (rating, prediction))
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
    print("Mean Squared Error = " + str(MSE))

    # Save and load model
    model.save(sc, "target/tmp/myCollaborativeFilter")
    sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
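The error computation depends on the shape of the joined records: each element of ratesAndPreds is ((user, product), (rating, prediction)), so the two values to compare are r[1][0] and r[1][1]. The plain-Python sketch below reproduces that shape with dictionaries in place of RDDs. The sample ratings and predictions are made-up numbers.

```python
# Made-up sample data keyed by (user, product), standing in for the two RDDs.
ratings = {(1, 10): 5.0, (1, 20): 3.0, (2, 10): 4.0}
predictions = {(1, 10): 4.5, (1, 20): 3.5, (2, 10): 4.0}

# Equivalent of ratings.join(predictions): an inner join on the key.
rates_and_preds = [
    (key, (rating, predictions[key]))
    for key, rating in ratings.items()
    if key in predictions
]

# Equivalent of .map(lambda r: (r[1][0] - r[1][1]) ** 2).mean()
squared_errors = [(r[1][0] - r[1][1]) ** 2 for r in rates_and_preds]
mse = sum(squared_errors) / len(squared_errors)
print("Mean Squared Error = " + str(mse))
```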

2 # Initialization

from pyspark import SparkConf, SparkContext
sc = SparkContext()

# Create RDDs with the parallelize method
intRDD = sc.parallelize([3, 1, 2, 5, 5])
stringRDD = sc.parallelize(['Apple', 'Orange', 'Grape', 'Banana', 'Apple'])

# Convert an RDD to a Python list
print(intRDD.collect())

# map, filter, distinct
print(intRDD.map(lambda x: x + 1).collect())
print(intRDD.filter(lambda x: x < 3).collect())
print(intRDD.distinct().collect())

# Random split with weights 0.4 and 0.6
sRDD = intRDD.randomSplit([0.4, 0.6])
print(sRDD[0].collect())

# groupBy
result = intRDD.groupBy(lambda x: x % 2).collect()
print(sorted([(x, sorted(y)) for (x, y) in result]))

# Multiple RDDs: union
intRDD1 = sc.parallelize([3, 1, 2, 5, 5])
intRDD2 = sc.parallelize([5, 6])  # assumed example values; not defined in the original notes
intRDD3 = sc.parallelize([2, 7])  # assumed example values; not defined in the original notes
print(intRDD1.union(intRDD2).union(intRDD3).collect())

# Intersection
print(intRDD1.intersection(intRDD2).collect())
# Difference
print(intRDD1.subtract(intRDD2).collect())
# Cartesian product
print(intRDD1.cartesian(intRDD2).collect())

# Further reference: https://www.jianshu.com/p/4cd22eda363f

3

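from pyspark.sql import SparkSession

# Note: despite its name, sc here is a SparkSession, not a SparkContext
sc = SparkSession.builder \
    .appName("Spark_CLOB_Split") \
    .config("hive.metastore.sasl.enabled", "true") \
    .enableHiveSupport() \
    .getOrCreate()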

# Query template: replace the placeholder column and table names with real ones
sc.sql("""select some columns, sum(any column) as col_name
from your_table1 a left join your_table2 b on a.key = b.key
where a.col_name > 0
group by some columns""")

Add a new constant column: use the lit function.

from pyspark.sql.functions import lit
df.withColumn('your_col_name', lit(your_const_var))

Add a new derived column: apply a user-defined function to an existing column.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def func(s):
    return s[:3]

my_func = udf(func, StringType())
df = df.withColumn('new_col_name', my_func('col_name'))
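One thing to watch with a UDF like the one above: Spark passes NULL column values to the Python function as None, and s[:3] on None raises a TypeError. Below is a minimal sketch of a None-safe variant. `first_three` and `add_prefix_column` are hypothetical names, and pyspark is imported lazily so the pure function can be tried on its own.

```python
def first_three(s):
    """Return the first three characters, or None for a NULL input."""
    if s is None:
        return None
    return s[:3]


def add_prefix_column(df, source_col, new_col):
    """Attach first_three as a UDF; df is an existing Spark DataFrame."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    first_three_udf = udf(first_three, StringType())
    return df.withColumn(new_col, first_three_udf(df[source_col]))


print(first_three("pyspark"), first_three("ab"), first_three(None))
```

Keeping the logic in a plain function makes it easy to check locally before wrapping it with udf and running it on the cluster.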

Other

https://zhuanlan.zhihu.com/p/31134940
