Problems encountered while using Spark

1. How to set up a broadcast variable:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window

from pyspark.sql.types import StructField, StructType, StringType, IntegerType
from pyspark.sql import HiveContext
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import partial

spark = (SparkSession
            .builder
            .appName("gps-test-DockerLinuxContainer")
            .enableHiveSupport()
            .config("spark.executor.instances", "500")
            .config("spark.sql.hive.mergeFiles", "true")
            .config("spark.num-executors", "300")
            .config("spark.locality.wait", "0")
            .config("spark.executor.memory", "40g")
            .config("spark.executor.cores", "4")
            .config("spark.driver.memory", "12g")
            .config("spark.sql.shuffle.partitions", "500")
            .config("spark.yarn.appMasterEnv.yarn.nodemanager.container-executor.class", "DockerLinuxContainer")
            .config("spark.executorEnv.yarn.nodemanager.container-executor.class", "DockerLinuxContainer")
            .config("spark.yarn.appMasterEnv.yarn.nodemanager.docker-container-executor.image-name", "bdp-docker.jd.com:5000/wise_mart_bag:latest")
            .config("spark.executorEnv.yarn.nodemanager.docker-container-executor.image-name", "bdp-docker.jd.com:5000/wise_mart_bag:latest")
            .config("spark.sql.adaptive.enabled", "true")
            .config("spark.rpc.message.maxSize", "512")
            .config("spark.sql.adaptive.repartition.enabled", "true")
            .config("spark.yarn.queue", "root.bdp_jmart_scr_union.bdp_jmart_cis_union.bdp_jmart_cis_online_spark")
            .getOrCreate())
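
Note that not every key passed to .config() is necessarily a recognized Spark property (for example, spark.num-executors does not appear to be a standard setting; executor count is normally controlled by spark.executor.instances, which is already set above). A minimal way to check what the running session actually resolved:

# Print the configuration the live session picked up, to confirm each .config() key took effect.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)

# Individual settings can also be read back through the runtime config:
print(spark.conf.get("spark.sql.shuffle.partitions"))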

model = spark.sparkContext.broadcast(value=...)  # replace ... with the object to share, e.g. a trained model or a lookup dict

When using it on the executors, read the shared object back through model.value.
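
As a minimal sketch of how the broadcast handle is typically consumed inside a UDF (the lookup dict, column names, and sample rows here are made-up examples, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

city_names = {"BJ": "Beijing", "SH": "Shanghai"}          # small object that lives on the driver
bc_city_names = spark.sparkContext.broadcast(city_names)  # shipped once to each executor

@F.udf(returnType=StringType())
def expand_city(code):
    # bc_city_names.value is the broadcast dict, read locally on the executor
    return bc_city_names.value.get(code, "unknown")

df = spark.createDataFrame([("BJ",), ("SZ",)], ["city_code"])
df.withColumn("city_name", expand_city("city_code")).show()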

2. Group a PySpark DataFrame by column A, sort within each group by column B, and keep only the top-ranked row among duplicates. Code below (this operates on a DataFrame, not an RDD):

window = Window.partitionBy("A").orderBy(F.col("B").desc())
# With a bucket count far larger than any group, ntile effectively numbers the rows,
# so keeping rank 1 keeps the top row (highest B) of each group.
df_attvalue_process = (df_attvalue_process
                       .withColumn("rank", F.ntile(1000000000).over(window))
                       .filter(F.col("rank") == 1)
                       .drop("rank"))
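
The same keep-one-row-per-group pattern is more commonly written with row_number, which directly assigns 1, 2, 3, ... within each window partition. A minimal equivalent sketch, assuming the same column names A and B:

window = Window.partitionBy("A").orderBy(F.col("B").desc())
df_attvalue_process = (df_attvalue_process
                       .withColumn("rn", F.row_number().over(window))  # 1 = highest B within each A
                       .filter(F.col("rn") == 1)
                       .drop("rn"))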

3. The different ranking functions in PySpark:

RANK: can produce non-consecutive ranks. For example, when ranking by score, if first and second place both have 100 points and third place has 98, then first and second both get rank 1 and the third gets rank 3.

DENSE_RANK: produces consecutive ranks. In the example above, the tied first and second place both get rank 1 and the third gets rank 2.

ROW_NUMBER: as the name suggests, simply numbers the rows. In the example above, first, second, and third place get 1, 2, and 3 respectively.

ntile: can also be used for ranking; you must set n, the number of buckets the rows of each partition are split into.

from pyspark.sql import functions as F

F.ntile(1000000000).over(window)  # with n larger than any partition, ntile behaves like a row number
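
To make the difference concrete, here is a minimal sketch on a made-up scores DataFrame (names and values are illustrative, not from the original post):

from pyspark.sql import functions as F
from pyspark.sql import Window

scores = spark.createDataFrame(
    [("a", 100), ("b", 100), ("c", 98)], ["student", "score"])

# No partitionBy here: fine for a toy example, but it pulls all rows into one partition.
w = Window.orderBy(F.col("score").desc())

(scores
 .withColumn("rank", F.rank().over(w))              # 1, 1, 3  (gap after the tie)
 .withColumn("dense_rank", F.dense_rank().over(w))  # 1, 1, 2  (no gap)
 .withColumn("row_number", F.row_number().over(w))  # 1, 2, 3  (tie broken arbitrarily)
 .show())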
