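Spark study notes

Setting up a cloud server environment (from scratch)

I applied for an Alibaba Cloud server with 1 core and 2 GB of RAM, currently on a one-month free trial.

After login, the default working directory is "/root".

Contents

  • Linux directory structure
  • Java 8 installation
  • ZooKeeper installation
  • Hadoop installation
  • Spark 2.3 HA distributed cluster installation
  • Spark RDD
  • Spark SQL
  • Docker MySQL
  • Spark session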

Linux directory structure

bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

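/bin bin is short for Binaries; this directory holds the most frequently used commands.

/boot holds the core files used to boot Linux, including some link files and image files.

/dev dev is short for Device; this directory holds Linux's external devices. In Linux, devices are accessed the same way files are.

/etc etc is short for Etcetera; this directory holds all the configuration files and subdirectories needed for system administration.

/home holds users' home directories. In Linux every user has a directory of their own, usually named after the user's account, such as alice, bob, and eve in the figure above.

/lib lib is short for Library; this directory holds the system's most basic dynamically linked shared libraries, which play a role similar to DLL files on Windows. Almost every application needs these shared libraries.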

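/media Linux automatically recognizes some devices, such as USB drives and optical drives, and mounts the recognized devices under this directory.
ls /media returns nothing.

/mnt is provided so that users can temporarily mount other file systems. For example, mount an optical drive on /mnt/ and then enter the directory to view its contents.
ls /mnt returns nothing.

/opt opt is short for optional; this is where additional software installed on the host goes. An Oracle database, for example, could be installed here. It is empty by default.
ls /opt returns nothing.
Later I put ZooKeeper and Spark under /opt, along with the downloaded archives.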

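/proc proc is short for Processes. /proc is a pseudo file system (a virtual file system) made of special files that expose the current state of the running kernel. The directory is virtual, a mapping of system memory, and reading it directly is a way to get system information.
Its contents live in memory rather than on disk. Some of its files can also be modified directly; for example, the following command makes the host ignore ping requests so that others cannot ping your machine: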

echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
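Before changing the setting it can help to read the current value. The following is a minimal sketch, assuming a Linux host where /proc is mounted; the variable names are illustrative, and writing to the file requires root.

```shell
#!/bin/sh
# Path taken from the command above.
f=/proc/sys/net/ipv4/icmp_echo_ignore_all

# Read the current value without modifying it; fall back to "unknown"
# when the file is not readable (for example in a restricted container).
if [ -r "$f" ]; then
    state=$(cat "$f")
else
    state=unknown
fi
echo "icmp_echo_ignore_all=$state"

# To answer pings again, write 0 back as root:
#   echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore_all
```

A value written this way lasts only until the next reboot, because /proc lives in memory.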

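/root is the home directory of the system administrator, also called the superuser.
The root user lands in /root on login; it is empty.

/sbin The s stands for Super User; sbin is short for Superuser Binaries. This directory holds the system administration programs used by the system administrator.

/srv holds data that services need to read after they start.
ls /srv returns nothing.

/sys was a major change in the Linux 2.6 kernel. A file system new in the 2.6 kernel, sysfs, is mounted at this directory.
sysfs brings together information from three file systems: the proc file system for process information, the devfs file system for devices, and the devpts file system for pseudo terminals.
It is a direct reflection of the kernel device tree.
When a kernel object is created, the corresponding files and directories are created in the kernel object subsystem as well.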

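/tmp tmp is short for temporary; this directory holds temporary files.

/usr usr is commonly expanded as unix shared resources. This is a very important directory: many user applications and files live here, much like the Program Files directory on Windows.
/usr/bin applications used by system users.
/usr/sbin more advanced administration programs and system daemons used by the superuser.
/usr/src the default location for kernel source code.

/var var is short for variable; this directory holds things that keep growing. By convention, directories that are modified often go here, including the various log files.

/run is a temporary file system that stores information generated since the system booted. Files under it should be deleted or cleared when the system reboots. If the system has a /var/run directory, it should point to /run.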
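The directory list above can be checked against a real host. This is a small sketch, not part of the original notes; report_dirs is a hypothetical helper name, and the result depends on the distribution.

```shell
#!/bin/sh
# Print "present <dir>" or "missing <dir>" for each argument.
report_dirs() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            echo "present $d"
        else
            echo "missing $d"
        fi
    done
}

# The top-level directories described in this section.
report_dirs /bin /boot /dev /etc /home /lib /media /mnt /opt /proc \
            /root /run /sbin /srv /sys /tmp /usr /var

# If /var/run is a symbolic link, show where it points.
if [ -L /var/run ]; then
    echo "/var/run -> $(readlink -f /var/run)"
fi
```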

Java 8 installation

yum -y list java* lists the Java versions available for installation.
Packages with "-devel" in the name are the JDK; the others are the JRE.

yum install -y java-1.8.0-openjdk-devel.x86_64

Find the JDK installation directory:

rpm -ql java-1.8.0-openjdk

The files are under /usr/lib/jvm.

This completes the JDK installation via yum.

The JDK can also be installed from the official installation package.
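Hadoop and Spark usually need JAVA_HOME, which can be derived from the java binary on the PATH. The sketch below assumes the JDK 8 layout under /usr/lib/jvm described above; java_home_from_binary is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# Derive a JAVA_HOME candidate from the resolved path of the java binary.
# Strips a trailing /bin/java and, for JDK 8 layouts, a trailing /jre.
java_home_from_binary() {
    printf '%s\n' "$1" | sed -e 's:/bin/java$::' -e 's:/jre$::'
}

if command -v java >/dev/null 2>&1; then
    # Follow the alternatives symlinks to the real binary.
    java_bin=$(readlink -f "$(command -v java)")
    echo "JAVA_HOME candidate: $(java_home_from_binary "$java_bin")"
else
    echo "java not found on PATH"
fi
```

The printed path can then be exported as JAVA_HOME in /etc/profile.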

ZooKeeper installation

I installed 3.5.9.

wget https://dlcdn.apache.org/zookeeper/zookeeper-3.5.9/apache-zookeeper-3.5.9-bin.tar.gz
tar -zxvf apache-zookeeper-3.5.9-bin.tar.gz
cd apache-zookeeper-3.5.9-bin/conf
cp zoo_sample.cfg zoo.cfg
cd ..
cd bin
sh zkServer.sh start

Check the server status:

sh zkServer.sh status
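The zoo.cfg copied from zoo_sample.cfg is a plain key=value file. The sketch below writes a minimal standalone configuration into a scratch directory and reads one key back. The dataDir value /opt/zookeeper-data is an assumption made for illustration (the sample file points at /tmp/zookeeper), and zk_conf_get is a hypothetical helper.

```shell
#!/bin/sh
# Write a minimal standalone zoo.cfg into a scratch directory,
# not into the real conf directory.
conf_dir=$(mktemp -d)
cat > "$conf_dir/zoo.cfg" <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper-data
clientPort=2181
EOF

# Print the value of one key from a key=value config file.
# Usage: zk_conf_get <file> <key>
zk_conf_get() {
    awk -F= -v key="$2" '$1 == key { print $2 }' "$1"
}

echo "clientPort=$(zk_conf_get "$conf_dir/zoo.cfg" clientPort)"
```

clientPort is the port zkCli.sh connects to by default.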

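Pitfall: from 3.5.5 onward, the archive with "bin" in its name is the one to download. It contains the compiled binaries and can be used directly, whereas the plain tar.gz archive contains only the source code and cannot be run as is.

Start the client: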

sh zkCli.sh

Hadoop installation

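The version I meant to install was 2.7.5.
Configure SSH first.
The Tsinghua mirror only carries 2.10 and 3.3, so the download is 3.3.1:
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.1.tar.gz
tar -zxvf hadoop-3.3.1.tar.gz
Hadoop ships with HDFS and YARN; start them.
With version 3.3.1, starting HDFS and YARN as root requires adding the following to /etc/profile: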

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

Then reload the profile (source /etc/profile) and, from the Hadoop directory, run sbin/start-dfs.sh to start the NameNode, DataNode, and SecondaryNameNode processes.
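Editing /etc/profile by hand makes it easy to add the same export twice. The following sketch appends each line only when it is missing. It writes to a scratch file standing in for /etc/profile, and append_once is a hypothetical helper name.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# Usage: append_once <file> <line>
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Scratch file standing in for /etc/profile in this demo.
profile=$(mktemp)

add_hadoop_users() {
    for var in HDFS_NAMENODE_USER HDFS_DATANODE_USER \
               HDFS_SECONDARYNAMENODE_USER \
               YARN_RESOURCEMANAGER_USER YARN_NODEMANAGER_USER; do
        append_once "$profile" "export $var=root"
    done
}

add_hadoop_users
add_hadoop_users   # the second run adds nothing
cat "$profile"
```

To apply it for real, point profile at /etc/profile and run the script as root.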

Spark 2.3 HA distributed cluster installation

Prerequisites:
Java 8 installed
ZooKeeper installed
Hadoop 2.7.5 HA installed
Scala installed

Only 2.4 is still available for download:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

With Hadoop 3.3.1, download Spark 3.2.0 (built for Hadoop 3.3 and later):

wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
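  1. Start ZooKeeper first
    cd to ZooKeeper/bin and run sh zkServer.sh start
    sh zkServer.sh status shows the running state
    jps should list a QuorumPeerMain process (the ZooKeeper entry class)
    If startup fails, run sh zkServer.sh start-foreground to run in the foreground with log output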

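  2. Start Spark
    Run Hadoop's sbin/start-all.sh to start YARN and HDFS (the start-all.sh in spark/sbin starts the Spark master and workers instead)
    cd to spark/sbin
    Run sh start-master.sh to start the Spark master
    Run sh start-slave.sh spark://localhost:7077 to start the Spark worker
    (Remember to add SPARK_MASTER_HOST=localhost to spark/conf/spark-env.sh)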
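A spark-env.sh for this single small machine might look like the sketch below. Only SPARK_MASTER_HOST comes from the note above; the SPARK_WORKER_CORES and SPARK_WORKER_MEMORY values are assumptions sized for a 1 core, 2 GB server. The demo writes to a scratch file instead of the real spark/conf/spark-env.sh.

```shell
#!/bin/sh
# Scratch file standing in for spark/conf/spark-env.sh.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# From the note above: bind the standalone master to localhost.
SPARK_MASTER_HOST=localhost
# Assumed limits for a 1 core / 2 GB machine.
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
EOF

# Spark's launch scripts source this file; do the same to inspect the values.
. "$conf"
echo "master URL: spark://$SPARK_MASTER_HOST:7077"
echo "worker: $SPARK_WORKER_CORES core(s), $SPARK_WORKER_MEMORY"
```

Port 7077 is the standalone master port used by the spark-submit command later in these notes.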

  3. Spark on YARN
    Test with the Pi example
    Run from the Spark directory:

    bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 500M \
    --executor-memory 500m \
    --executor-cores 1 \
    ./examples/jars/spark-examples_2.11-2.4.8.jar \
    10
    

    Spark does not have enough memory

    Not successful yet. After INFO yarn.Client: Application report for application_1640188927691_0003 (state: ACCEPTED), the client keeps printing INFO yarn.Client: Application report for application_1640188927691_0003 (state: RUNNING). The cause may be too few resources (1 core, 2 GB).
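A rough estimate of what the submit command above asks YARN for supports the resource suspicion. The sketch assumes stock defaults that these notes do not confirm: a memory overhead of max(384 MB, 10% of the heap) per JVM, requests rounded up to a multiple of yarn.scheduler.minimum-allocation-mb (1024), and two executors because --num-executors is not set. container_mb is a hypothetical helper.

```shell
#!/bin/sh
# Estimate the YARN container size (MB) for a Spark JVM with the given heap.
# Usage: container_mb <heap_mb> [min_allocation_mb]
container_mb() {
    heap=$1
    min_alloc=${2:-1024}
    # Assumed default overhead: 10% of the heap, but at least 384 MB.
    overhead=$(( heap / 10 ))
    if [ "$overhead" -lt 384 ]; then
        overhead=384
    fi
    want=$(( heap + overhead ))
    # Round up to the next multiple of the minimum allocation.
    echo $(( (want + min_alloc - 1) / min_alloc * min_alloc ))
}

driver=$(container_mb 500)     # --driver-memory 500M
executor=$(container_mb 500)   # --executor-memory 500m
# Assumed: two executors when --num-executors is not given.
total=$(( driver + 2 * executor ))
echo "driver=${driver}MB executor=${executor}MB total=${total}MB"
```

Under these assumptions each 500 MB JVM becomes a 1024 MB container, so the job asks for about 3 GB, more than the 2 GB this server has.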

  4. Spark on standalone
    4.1. Run example-pi

    bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://localhost:7077 \
    --executor-memory 500m \
    --total-executor-cores 1 \
    ./examples/jars/spark-examples_2.11-2.4.8.jar \
    100
    

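    For Spark 3.2.0 the jar is spark-examples_2.12-3.2.0.jar
    The job first warned that no resources were available, then succeeded after waiting a few minutes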
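The wait shows up in the driver log as one very slow first task. The sketch below finds the slowest "Finished task" entry in a log read from standard input. slowest_task is a hypothetical helper, and the three sample lines are copied from the log that follows.

```shell
#!/bin/sh
# Print "<task id> <duration ms>" for the slowest finished task
# in a Spark driver log read from standard input.
slowest_task() {
    awk '/Finished task/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "task") id = $(i + 1)      # e.g. 0.0
            if ($i == "ms")   ms = $(i - 1) + 0  # duration in ms
        }
        if (ms > max) { max = ms; slow = id }
    }
    END { if (slow != "") print slow, max }'
}

slowest_task <<'EOF'
21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 309907 ms on 172.18.92.61 (executor 0) (1/100)
21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 172 ms on 172.18.92.61 (executor 0) (2/100)
21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 5529 ms on 172.18.92.61 (executor 0) (7/100)
EOF
```

On a saved log the same function can be run as slowest_task < driver.log, where driver.log is a placeholder file name.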

    21/12/23 14:13:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    21/12/23 14:13:57 INFO spark.SparkContext: Running Spark version 2.4.8
    21/12/23 14:13:57 INFO spark.SparkContext: Submitted application: Spark Pi
    21/12/23 14:13:58 INFO spark.SecurityManager: Changing view acls to: root
    21/12/23 14:13:58 INFO spark.SecurityManager: Changing modify acls to: root
    21/12/23 14:13:58 INFO spark.SecurityManager: Changing view acls groups to: 
    21/12/23 14:13:58 INFO spark.SecurityManager: Changing modify acls groups to: 
    21/12/23 14:13:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    21/12/23 14:13:58 INFO util.Utils: Successfully started service 'sparkDriver' on port 33713.
    21/12/23 14:13:58 INFO spark.SparkEnv: Registering MapOutputTracker
    21/12/23 14:13:58 INFO spark.SparkEnv: Registering BlockManagerMaster
    21/12/23 14:13:58 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    21/12/23 14:13:58 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
    21/12/23 14:13:58 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-619fd2dc-0d04-414a-b314-0421d5c00934
    21/12/23 14:13:58 INFO memory.MemoryStore: MemoryStore started with capacity 413.9 MB
    21/12/23 14:13:58 INFO spark.SparkEnv: Registering OutputCommitCoordinator
    21/12/23 14:13:58 INFO util.log: Logging initialized @3145ms to org.spark_project.jetty.util.log.Slf4jLog
    21/12/23 14:13:58 INFO server.Server: jetty-9.4.z-SNAPSHOT; built: unknown; git: unknown; jvm 1.8.0_312-b07
    21/12/23 14:13:58 INFO server.Server: Started @3383ms
    21/12/23 14:13:59 INFO server.AbstractConnector: Started ServerConnector@79ab3a71{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
    21/12/23 14:13:59 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@44ea608c{/jobs,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@515f4131{/jobs/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@74518890{/jobs/job,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f3ddbd9{/jobs/job/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@14c053c6{/stages,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c2d4cc6{/stages/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@30865a90{/stages/stage,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@71b1a49c{/stages/stage/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@73e132e0{/stages/pool,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3773862a{/stages/pool/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2472c7d8{/storage,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@589b028e{/storage/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@22175d4f{/storage/rdd,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@9fecdf1{/storage/rdd/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3b809711{/environment,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3b0f7d9d{/environment/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@236ab296{/executors,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5c84624f{/executors/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63034ed1{/executors/threadDump,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@232024b9{/executors/threadDump/json,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@55a8dc49{/static,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e406694{/,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ab9b447{/api,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@15b986cd{/jobs/job/kill,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6bb7cce7{/stages/stage/kill,null,AVAILABLE,@Spark}
    21/12/23 14:13:59 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://iZwz915iahvm5k8atqdtj2Z:4040
    21/12/23 14:13:59 INFO spark.SparkContext: Added JAR file:/opt/spark-2.4.8-bin-hadoop2.7/./examples/jars/spark-examples_2.11-2.4.8.jar at spark://iZwz915iahvm5k8atqdtj2Z:33713/jars/spark-examples_2.11-2.4.8.jar with timestamp 1640240039193
    21/12/23 14:13:59 INFO client.StandaloneAppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
    21/12/23 14:13:59 INFO client.TransportClientFactory: Successfully created connection to localhost/127.0.0.1:7077 after 67 ms (0 ms spent in bootstraps)
    21/12/23 14:13:59 INFO cluster.StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20211223141359-0000
    21/12/23 14:13:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37767.
    21/12/23 14:13:59 INFO netty.NettyBlockTransferService: Server created on iZwz915iahvm5k8atqdtj2Z:37767
    21/12/23 14:13:59 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    21/12/23 14:13:59 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, iZwz915iahvm5k8atqdtj2Z, 37767, None)
    21/12/23 14:13:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager iZwz915iahvm5k8atqdtj2Z:37767 with 413.9 MB RAM, BlockManagerId(driver, iZwz915iahvm5k8atqdtj2Z, 37767, None)
    21/12/23 14:13:59 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, iZwz915iahvm5k8atqdtj2Z, 37767, None)
    21/12/23 14:13:59 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, iZwz915iahvm5k8atqdtj2Z, 37767, None)
    21/12/23 14:14:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e48e859{/metrics/json,null,AVAILABLE,@Spark}
    21/12/23 14:14:00 INFO cluster.StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
    21/12/23 14:14:01 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 100 output partitions
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Missing parents: List()
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
    21/12/23 14:14:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.0 KB, free 413.9 MB)
    21/12/23 14:14:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1381.0 B, free 413.9 MB)
    21/12/23 14:14:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on iZwz915iahvm5k8atqdtj2Z:37767 (size: 1381.0 B, free: 413.9 MB)
    21/12/23 14:14:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1184
    21/12/23 14:14:01 INFO scheduler.DAGScheduler: Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
    21/12/23 14:14:01 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 100 tasks
    21/12/23 14:14:16 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    21/12/23 14:14:27 INFO client.StandaloneAppClient$ClientEndpoint: Master removed worker worker-20211223114534-172.18.92.61-40105: Not responding for recovery
    21/12/23 14:14:27 INFO cluster.StandaloneSchedulerBackend: Worker worker-20211223114534-172.18.92.61-40105 removed: Not responding for recovery
    21/12/23 14:14:27 INFO scheduler.TaskSchedulerImpl: Handle removed worker worker-20211223114534-172.18.92.61-40105: Not responding for recovery
    21/12/23 14:14:27 INFO scheduler.DAGScheduler: Shuffle files lost for worker worker-20211223114534-172.18.92.61-40105 on host 172.18.92.61
    21/12/23 14:14:27 INFO client.StandaloneAppClient$ClientEndpoint: Executor added: app-20211223141359-0000/0 on worker-20211223141327-172.18.92.61-34607 (172.18.92.61:34607) with 1 core(s)
    21/12/23 14:14:27 INFO cluster.StandaloneSchedulerBackend: Granted executor ID app-20211223141359-0000/0 on hostPort 172.18.92.61:34607 with 1 core(s), 500.0 MB RAM
    21/12/23 14:14:27 INFO client.StandaloneAppClient$ClientEndpoint: Executor updated: app-20211223141359-0000/0 is now RUNNING
    21/12/23 14:14:31 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
    21/12/23 14:14:31 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.18.92.61:40782) with ID 0
    21/12/23 14:14:31 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.18.92.61, executor 0, partition 0, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:14:32 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.18.92.61:35261 with 110.0 MB RAM, BlockManagerId(0, 172.18.92.61, 35261, None)
    21/12/23 14:14:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.18.92.61:35261 (size: 1381.0 B, free: 110.0 MB)
    21/12/23 14:19:36 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.18.92.61, executor 0, partition 1, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 172.18.92.61, executor 0, partition 2, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 172.18.92.61, executor 0, partition 3, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 172.18.92.61, executor 0, partition 4, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 172.18.92.61, executor 0, partition 5, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 172.18.92.61, executor 0, partition 6, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, 172.18.92.61, executor 0, partition 7, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 309907 ms on 172.18.92.61 (executor 0) (1/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 172 ms on 172.18.92.61 (executor 0) (2/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 144 ms on 172.18.92.61 (executor 0) (3/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 108 ms on 172.18.92.61 (executor 0) (4/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 231 ms on 172.18.92.61 (executor 0) (5/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 204 ms on 172.18.92.61 (executor 0) (6/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 172.18.92.61, executor 0, partition 8, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 5529 ms on 172.18.92.61 (executor 0) (7/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 77 ms on 172.18.92.61 (executor 0) (8/100)
    21/12/23 14:19:41 WARN client.StandaloneAppClient$ClientEndpoint: Connection to localhost:7077 failed; waiting for master to reconnect...
    21/12/23 14:19:41 WARN cluster.StandaloneSchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
    21/12/23 14:19:41 WARN client.StandaloneAppClient$ClientEndpoint: Connection to localhost:7077 failed; waiting for master to reconnect...
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 172.18.92.61, executor 0, partition 9, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 46 ms on 172.18.92.61 (executor 0) (9/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 172.18.92.61, executor 0, partition 10, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 49 ms on 172.18.92.61 (executor 0) (10/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, 172.18.92.61, executor 0, partition 11, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 10.0 in stage 0.0 (TID 10) in 50 ms on 172.18.92.61 (executor 0) (11/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 12.0 in stage 0.0 (TID 12, 172.18.92.61, executor 0, partition 12, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 11.0 in stage 0.0 (TID 11) in 20 ms on 172.18.92.61 (executor 0) (12/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 13.0 in stage 0.0 (TID 13, 172.18.92.61, executor 0, partition 13, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 44 ms on 172.18.92.61 (executor 0) (13/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 14.0 in stage 0.0 (TID 14, 172.18.92.61, executor 0, partition 14, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 13.0 in stage 0.0 (TID 13) in 29 ms on 172.18.92.61 (executor 0) (14/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 15.0 in stage 0.0 (TID 15, 172.18.92.61, executor 0, partition 15, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 45 ms on 172.18.92.61 (executor 0) (15/100)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Starting task 16.0 in stage 0.0 (TID 16, 172.18.92.61, executor 0, partition 16, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:41 INFO scheduler.TaskSetManager: Finished task 15.0 in stage 0.0 (TID 15) in 39 ms on 172.18.92.61 (executor 0) (16/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 17.0 in stage 0.0 (TID 17, 172.18.92.61, executor 0, partition 17, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 16.0 in stage 0.0 (TID 16) in 33 ms on 172.18.92.61 (executor 0) (17/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 18.0 in stage 0.0 (TID 18, 172.18.92.61, executor 0, partition 18, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 17.0 in stage 0.0 (TID 17) in 21 ms on 172.18.92.61 (executor 0) (18/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 19.0 in stage 0.0 (TID 19, 172.18.92.61, executor 0, partition 19, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 18.0 in stage 0.0 (TID 18) in 19 ms on 172.18.92.61 (executor 0) (19/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 20.0 in stage 0.0 (TID 20, 172.18.92.61, executor 0, partition 20, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 19.0 in stage 0.0 (TID 19) in 19 ms on 172.18.92.61 (executor 0) (20/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 21.0 in stage 0.0 (TID 21, 172.18.92.61, executor 0, partition 21, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 20.0 in stage 0.0 (TID 20) in 25 ms on 172.18.92.61 (executor 0) (21/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 22.0 in stage 0.0 (TID 22, 172.18.92.61, executor 0, partition 22, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 21.0 in stage 0.0 (TID 21) in 21 ms on 172.18.92.61 (executor 0) (22/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 23.0 in stage 0.0 (TID 23, 172.18.92.61, executor 0, partition 23, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 22 ms on 172.18.92.61 (executor 0) (23/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 24.0 in stage 0.0 (TID 24, 172.18.92.61, executor 0, partition 24, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 23.0 in stage 0.0 (TID 23) in 38 ms on 172.18.92.61 (executor 0) (24/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 25.0 in stage 0.0 (TID 25, 172.18.92.61, executor 0, partition 25, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 76 ms on 172.18.92.61 (executor 0) (25/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 26.0 in stage 0.0 (TID 26, 172.18.92.61, executor 0, partition 26, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 25.0 in stage 0.0 (TID 25) in 29 ms on 172.18.92.61 (executor 0) (26/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 27.0 in stage 0.0 (TID 27, 172.18.92.61, executor 0, partition 27, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 26.0 in stage 0.0 (TID 26) in 16 ms on 172.18.92.61 (executor 0) (27/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 28.0 in stage 0.0 (TID 28, 172.18.92.61, executor 0, partition 28, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 27.0 in stage 0.0 (TID 27) in 25 ms on 172.18.92.61 (executor 0) (28/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 29.0 in stage 0.0 (TID 29, 172.18.92.61, executor 0, partition 29, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 28.0 in stage 0.0 (TID 28) in 30 ms on 172.18.92.61 (executor 0) (29/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 30.0 in stage 0.0 (TID 30, 172.18.92.61, executor 0, partition 30, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 29.0 in stage 0.0 (TID 29) in 34 ms on 172.18.92.61 (executor 0) (30/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 31.0 in stage 0.0 (TID 31, 172.18.92.61, executor 0, partition 31, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 30.0 in stage 0.0 (TID 30) in 16 ms on 172.18.92.61 (executor 0) (31/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 32.0 in stage 0.0 (TID 32, 172.18.92.61, executor 0, partition 32, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 31.0 in stage 0.0 (TID 31) in 34 ms on 172.18.92.61 (executor 0) (32/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 33.0 in stage 0.0 (TID 33, 172.18.92.61, executor 0, partition 33, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 32.0 in stage 0.0 (TID 32) in 14 ms on 172.18.92.61 (executor 0) (33/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 34.0 in stage 0.0 (TID 34, 172.18.92.61, executor 0, partition 34, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 45 ms on 172.18.92.61 (executor 0) (34/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 35.0 in stage 0.0 (TID 35, 172.18.92.61, executor 0, partition 35, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 34.0 in stage 0.0 (TID 34) in 29 ms on 172.18.92.61 (executor 0) (35/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 36.0 in stage 0.0 (TID 36, 172.18.92.61, executor 0, partition 36, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 35.0 in stage 0.0 (TID 35) in 17 ms on 172.18.92.61 (executor 0) (36/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 37.0 in stage 0.0 (TID 37, 172.18.92.61, executor 0, partition 37, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 36.0 in stage 0.0 (TID 36) in 40 ms on 172.18.92.61 (executor 0) (37/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 38.0 in stage 0.0 (TID 38, 172.18.92.61, executor 0, partition 38, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 37.0 in stage 0.0 (TID 37) in 43 ms on 172.18.92.61 (executor 0) (38/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 0.0 (TID 39, 172.18.92.61, executor 0, partition 39, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 38.0 in stage 0.0 (TID 38) in 21 ms on 172.18.92.61 (executor 0) (39/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 40.0 in stage 0.0 (TID 40, 172.18.92.61, executor 0, partition 40, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 39.0 in stage 0.0 (TID 39) in 48 ms on 172.18.92.61 (executor 0) (40/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 41.0 in stage 0.0 (TID 41, 172.18.92.61, executor 0, partition 41, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 40.0 in stage 0.0 (TID 40) in 15 ms on 172.18.92.61 (executor 0) (41/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 42.0 in stage 0.0 (TID 42, 172.18.92.61, executor 0, partition 42, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 41.0 in stage 0.0 (TID 41) in 34 ms on 172.18.92.61 (executor 0) (42/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 43.0 in stage 0.0 (TID 43, 172.18.92.61, executor 0, partition 43, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 42.0 in stage 0.0 (TID 42) in 33 ms on 172.18.92.61 (executor 0) (43/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 44.0 in stage 0.0 (TID 44, 172.18.92.61, executor 0, partition 44, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 43.0 in stage 0.0 (TID 43) in 37 ms on 172.18.92.61 (executor 0) (44/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 45.0 in stage 0.0 (TID 45, 172.18.92.61, executor 0, partition 45, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 44.0 in stage 0.0 (TID 44) in 28 ms on 172.18.92.61 (executor 0) (45/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 46.0 in stage 0.0 (TID 46, 172.18.92.61, executor 0, partition 46, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 45.0 in stage 0.0 (TID 45) in 31 ms on 172.18.92.61 (executor 0) (46/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 47.0 in stage 0.0 (TID 47, 172.18.92.61, executor 0, partition 47, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 46.0 in stage 0.0 (TID 46) in 29 ms on 172.18.92.61 (executor 0) (47/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 48.0 in stage 0.0 (TID 48, 172.18.92.61, executor 0, partition 48, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 47.0 in stage 0.0 (TID 47) in 30 ms on 172.18.92.61 (executor 0) (48/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 49.0 in stage 0.0 (TID 49, 172.18.92.61, executor 0, partition 49, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 48.0 in stage 0.0 (TID 48) in 27 ms on 172.18.92.61 (executor 0) (49/100)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Starting task 50.0 in stage 0.0 (TID 50, 172.18.92.61, executor 0, partition 50, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:42 INFO scheduler.TaskSetManager: Finished task 49.0 in stage 0.0 (TID 49) in 45 ms on 172.18.92.61 (executor 0) (50/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 51.0 in stage 0.0 (TID 51, 172.18.92.61, executor 0, partition 51, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 50.0 in stage 0.0 (TID 50) in 68 ms on 172.18.92.61 (executor 0) (51/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 52.0 in stage 0.0 (TID 52, 172.18.92.61, executor 0, partition 52, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 51.0 in stage 0.0 (TID 51) in 43 ms on 172.18.92.61 (executor 0) (52/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 53.0 in stage 0.0 (TID 53, 172.18.92.61, executor 0, partition 53, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 52.0 in stage 0.0 (TID 52) in 16 ms on 172.18.92.61 (executor 0) (53/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 54.0 in stage 0.0 (TID 54, 172.18.92.61, executor 0, partition 54, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 53.0 in stage 0.0 (TID 53) in 28 ms on 172.18.92.61 (executor 0) (54/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 55.0 in stage 0.0 (TID 55, 172.18.92.61, executor 0, partition 55, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 54.0 in stage 0.0 (TID 54) in 28 ms on 172.18.92.61 (executor 0) (55/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 56.0 in stage 0.0 (TID 56, 172.18.92.61, executor 0, partition 56, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 55.0 in stage 0.0 (TID 55) in 31 ms on 172.18.92.61 (executor 0) (56/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 57.0 in stage 0.0 (TID 57, 172.18.92.61, executor 0, partition 57, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 56.0 in stage 0.0 (TID 56) in 37 ms on 172.18.92.61 (executor 0) (57/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 58.0 in stage 0.0 (TID 58, 172.18.92.61, executor 0, partition 58, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 57.0 in stage 0.0 (TID 57) in 25 ms on 172.18.92.61 (executor 0) (58/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 59.0 in stage 0.0 (TID 59, 172.18.92.61, executor 0, partition 59, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 58.0 in stage 0.0 (TID 58) in 27 ms on 172.18.92.61 (executor 0) (59/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 60.0 in stage 0.0 (TID 60, 172.18.92.61, executor 0, partition 60, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 59.0 in stage 0.0 (TID 59) in 26 ms on 172.18.92.61 (executor 0) (60/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 61.0 in stage 0.0 (TID 61, 172.18.92.61, executor 0, partition 61, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 60.0 in stage 0.0 (TID 60) in 27 ms on 172.18.92.61 (executor 0) (61/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 62.0 in stage 0.0 (TID 62, 172.18.92.61, executor 0, partition 62, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 61.0 in stage 0.0 (TID 61) in 26 ms on 172.18.92.61 (executor 0) (62/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 63.0 in stage 0.0 (TID 63, 172.18.92.61, executor 0, partition 63, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 62.0 in stage 0.0 (TID 62) in 38 ms on 172.18.92.61 (executor 0) (63/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 64.0 in stage 0.0 (TID 64, 172.18.92.61, executor 0, partition 64, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 63.0 in stage 0.0 (TID 63) in 29 ms on 172.18.92.61 (executor 0) (64/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 65.0 in stage 0.0 (TID 65, 172.18.92.61, executor 0, partition 65, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 64.0 in stage 0.0 (TID 64) in 26 ms on 172.18.92.61 (executor 0) (65/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 66.0 in stage 0.0 (TID 66, 172.18.92.61, executor 0, partition 66, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 65.0 in stage 0.0 (TID 65) in 27 ms on 172.18.92.61 (executor 0) (66/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 67.0 in stage 0.0 (TID 67, 172.18.92.61, executor 0, partition 67, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 66.0 in stage 0.0 (TID 66) in 30 ms on 172.18.92.61 (executor 0) (67/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 68.0 in stage 0.0 (TID 68, 172.18.92.61, executor 0, partition 68, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 67.0 in stage 0.0 (TID 67) in 29 ms on 172.18.92.61 (executor 0) (68/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 69.0 in stage 0.0 (TID 69, 172.18.92.61, executor 0, partition 69, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 68.0 in stage 0.0 (TID 68) in 28 ms on 172.18.92.61 (executor 0) (69/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 70.0 in stage 0.0 (TID 70, 172.18.92.61, executor 0, partition 70, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 69.0 in stage 0.0 (TID 69) in 37 ms on 172.18.92.61 (executor 0) (70/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 71.0 in stage 0.0 (TID 71, 172.18.92.61, executor 0, partition 71, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 70.0 in stage 0.0 (TID 70) in 16 ms on 172.18.92.61 (executor 0) (71/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 72.0 in stage 0.0 (TID 72, 172.18.92.61, executor 0, partition 72, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 71.0 in stage 0.0 (TID 71) in 40 ms on 172.18.92.61 (executor 0) (72/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 73.0 in stage 0.0 (TID 73, 172.18.92.61, executor 0, partition 73, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 72.0 in stage 0.0 (TID 72) in 25 ms on 172.18.92.61 (executor 0) (73/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 74.0 in stage 0.0 (TID 74, 172.18.92.61, executor 0, partition 74, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 73.0 in stage 0.0 (TID 73) in 24 ms on 172.18.92.61 (executor 0) (74/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 75.0 in stage 0.0 (TID 75, 172.18.92.61, executor 0, partition 75, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 74.0 in stage 0.0 (TID 74) in 29 ms on 172.18.92.61 (executor 0) (75/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 76.0 in stage 0.0 (TID 76, 172.18.92.61, executor 0, partition 76, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 75.0 in stage 0.0 (TID 75) in 28 ms on 172.18.92.61 (executor 0) (76/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 77.0 in stage 0.0 (TID 77, 172.18.92.61, executor 0, partition 77, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 76.0 in stage 0.0 (TID 76) in 69 ms on 172.18.92.61 (executor 0) (77/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 78.0 in stage 0.0 (TID 78, 172.18.92.61, executor 0, partition 78, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 77.0 in stage 0.0 (TID 77) in 19 ms on 172.18.92.61 (executor 0) (78/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 79.0 in stage 0.0 (TID 79, 172.18.92.61, executor 0, partition 79, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 78.0 in stage 0.0 (TID 78) in 27 ms on 172.18.92.61 (executor 0) (79/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 80.0 in stage 0.0 (TID 80, 172.18.92.61, executor 0, partition 80, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 79.0 in stage 0.0 (TID 79) in 38 ms on 172.18.92.61 (executor 0) (80/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 81.0 in stage 0.0 (TID 81, 172.18.92.61, executor 0, partition 81, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 80.0 in stage 0.0 (TID 80) in 27 ms on 172.18.92.61 (executor 0) (81/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 82.0 in stage 0.0 (TID 82, 172.18.92.61, executor 0, partition 82, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 81.0 in stage 0.0 (TID 81) in 34 ms on 172.18.92.61 (executor 0) (82/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 83.0 in stage 0.0 (TID 83, 172.18.92.61, executor 0, partition 83, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 82.0 in stage 0.0 (TID 82) in 28 ms on 172.18.92.61 (executor 0) (83/100)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Starting task 84.0 in stage 0.0 (TID 84, 172.18.92.61, executor 0, partition 84, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:43 INFO scheduler.TaskSetManager: Finished task 83.0 in stage 0.0 (TID 83) in 29 ms on 172.18.92.61 (executor 0) (84/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 85.0 in stage 0.0 (TID 85, 172.18.92.61, executor 0, partition 85, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 84.0 in stage 0.0 (TID 84) in 28 ms on 172.18.92.61 (executor 0) (85/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 86.0 in stage 0.0 (TID 86, 172.18.92.61, executor 0, partition 86, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 85.0 in stage 0.0 (TID 85) in 31 ms on 172.18.92.61 (executor 0) (86/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 87.0 in stage 0.0 (TID 87, 172.18.92.61, executor 0, partition 87, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 86.0 in stage 0.0 (TID 86) in 40 ms on 172.18.92.61 (executor 0) (87/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 88.0 in stage 0.0 (TID 88, 172.18.92.61, executor 0, partition 88, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 87.0 in stage 0.0 (TID 87) in 15 ms on 172.18.92.61 (executor 0) (88/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 89.0 in stage 0.0 (TID 89, 172.18.92.61, executor 0, partition 89, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 88.0 in stage 0.0 (TID 88) in 37 ms on 172.18.92.61 (executor 0) (89/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 90.0 in stage 0.0 (TID 90, 172.18.92.61, executor 0, partition 90, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 89.0 in stage 0.0 (TID 89) in 25 ms on 172.18.92.61 (executor 0) (90/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 91.0 in stage 0.0 (TID 91, 172.18.92.61, executor 0, partition 91, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 90.0 in stage 0.0 (TID 90) in 27 ms on 172.18.92.61 (executor 0) (91/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 92.0 in stage 0.0 (TID 92, 172.18.92.61, executor 0, partition 92, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 91.0 in stage 0.0 (TID 91) in 23 ms on 172.18.92.61 (executor 0) (92/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 93.0 in stage 0.0 (TID 93, 172.18.92.61, executor 0, partition 93, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 92.0 in stage 0.0 (TID 92) in 24 ms on 172.18.92.61 (executor 0) (93/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 94.0 in stage 0.0 (TID 94, 172.18.92.61, executor 0, partition 94, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 93.0 in stage 0.0 (TID 93) in 33 ms on 172.18.92.61 (executor 0) (94/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 95.0 in stage 0.0 (TID 95, 172.18.92.61, executor 0, partition 95, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 94.0 in stage 0.0 (TID 94) in 27 ms on 172.18.92.61 (executor 0) (95/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 96.0 in stage 0.0 (TID 96, 172.18.92.61, executor 0, partition 96, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 95.0 in stage 0.0 (TID 95) in 29 ms on 172.18.92.61 (executor 0) (96/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 97.0 in stage 0.0 (TID 97, 172.18.92.61, executor 0, partition 97, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 96.0 in stage 0.0 (TID 96) in 24 ms on 172.18.92.61 (executor 0) (97/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 98.0 in stage 0.0 (TID 98, 172.18.92.61, executor 0, partition 98, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 97.0 in stage 0.0 (TID 97) in 38 ms on 172.18.92.61 (executor 0) (98/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Starting task 99.0 in stage 0.0 (TID 99, 172.18.92.61, executor 0, partition 99, PROCESS_LOCAL, 7870 bytes)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 16 ms on 172.18.92.61 (executor 0) (99/100)
    21/12/23 14:19:44 INFO scheduler.TaskSetManager: Finished task 99.0 in stage 0.0 (TID 99) in 94 ms on 172.18.92.61 (executor 0) (100/100)
    21/12/23 14:19:44 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    21/12/23 14:19:44 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 343.331 s
    21/12/23 14:19:44 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 343.675474 s
    Pi is roughly 3.1412163141216314
    21/12/23 14:19:44 INFO server.AbstractConnector: Stopped Spark@79ab3a71{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
    21/12/23 14:19:45 INFO ui.SparkUI: Stopped Spark web UI at http://iZwz915iahvm5k8atqdtj2Z:4040
    21/12/23 14:19:45 INFO cluster.StandaloneSchedulerBackend: Shutting down all executors
    21/12/23 14:19:45 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
    21/12/23 14:19:45 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    21/12/23 14:19:45 INFO memory.MemoryStore: MemoryStore cleared
    21/12/23 14:19:45 INFO storage.BlockManager: BlockManager stopped
    21/12/23 14:19:45 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    21/12/23 14:19:45 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    21/12/23 14:19:45 INFO spark.SparkContext: Successfully stopped SparkContext
    21/12/23 14:19:45 INFO util.ShutdownHookManager: Shutdown hook called
    21/12/23 14:19:45 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a600ff42-96e3-46e7-b644-8d3f369d2948
    21/12/23 14:19:45 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-15c86a48-af3c-4d70-b44b-b1f660c765ce
    

4.2. spark shell

bin/spark-shell \
--master spark://localhost:7077 \
--executor-memory 500m \
--total-executor-cores 1

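Parameters:
--master spark://localhost:7077: the address of the Master.
--executor-memory 500m: the memory available to each executor (500 MB).
--total-executor-cores 1: the total number of CPU cores the application may use across the whole cluster (1).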

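At this point the environment is set up!
I built it first on a 1-core/2 GB instance and then on a 2-core/4 GB one, and the difference was night and day: the 1-core/2 GB machine was barely usable and kept hitting OOM.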

Spark RDD

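RDD overview
An RDD is a collection of data.
An RDD is split into n partitions, and each partition is processed by one task. The number of partitions can be specified when the RDD is created; by default it is the number of CPU cores allocated to the program.

RDD --transformation--> RDD. RDDs therefore form a pipeline-like chain of dependencies. When the data of some partitions is lost, Spark can use these dependencies to recompute only the lost partitions instead of recomputing every partition.

The Partitioner is the RDD's partitioning function: it assigns every value of the parent RDD to a partition of the new RDD according to its key. Spark implements two Partitioners: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for other RDDs the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.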
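The partition count and the Partitioner can be inspected directly. The following is a minimal sketch, assuming Spark 2.x is on the classpath (for example pasted into spark-shell); the app name, the local[2] master and the sample numbers are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partition-sketch").setMaster("local[2]"))

// The partition count is given explicitly at creation time.
val nums = sc.parallelize(1 to 10, 4)
val numPartitions = nums.getNumPartitions       // 4

// A plain (non key-value) RDD carries no Partitioner.
val plainPartitioner = nums.partitioner         // None

// A key-value RDD has one once it has been explicitly partitioned.
val pairs = nums.map(n => (n % 3, n)).partitionBy(new HashPartitioner(3))
val pairPartitioner = pairs.partitioner         // Some(HashPartitioner with 3 partitions)
val pairPartitions = pairs.getNumPartitions     // 3
```

Here the HashPartitioner fixes both which partition a key lands in and how many partitions the shuffled RDD has.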

WordCount rough diagram
[Figure 1: rough WordCount diagram]
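In code, a WordCount pipeline might look roughly like the sketch below. It assumes Spark 2.x is available; the input lines are invented, and a real job would more likely read them with sc.textFile(path).

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]"))

// Hypothetical input lines.
val lines = sc.parallelize(Seq("hello spark", "hello hadoop", "hello spark sql"))

val counts = lines
  .flatMap(_.split(" "))     // split each line into words
  .map(word => (word, 1))    // pair each word with a count of 1
  .reduceByKey(_ + _)        // sum the counts per word (this step shuffles)
  .collectAsMap()            // action: bring the result back to the driver

counts.foreach(println)
```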

Creating an RDD

From a dataset in an external storage system, such as HDFS or HBase.
From a database.
By transforming another RDD.

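RDD programming API
Spark supports two kinds of operators (operations): Transformation and Action.
Transformations are lazy: the code of a Transformation operator is not executed when it is called. It only runs once the program reaches an Action operator. This design lets Spark run more efficiently.
For example, in rdd1.TransformationA().TransformationB().TransformationC().ActionA(), TransformationA(), TransformationB() and TransformationC() only execute when ActionA() is reached.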
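Laziness can be observed by counting how often the function passed to map actually runs. This is a small sketch, assuming Spark 2.x in local mode with no task retries; the accumulator name and the numbers are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-sketch").setMaster("local[2]"))

// Counts how many times the function passed to map is invoked.
val calls = sc.longAccumulator("map calls")

val doubled = sc.parallelize(Seq(1, 2, 3, 4)).map { x =>
  calls.add(1)
  x * 2
}

// Only the lineage has been recorded so far; nothing has executed.
val callsBeforeAction = calls.sum     // 0

// The action triggers the job, and with it the map function.
val total = doubled.reduce(_ + _)     // 2 + 4 + 6 + 8 = 20
val callsAfterAction = calls.sum      // 4, one call per element
```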

Common Transformation operators

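map(func): returns an RDD made of each input element after it has been converted by func (func returns a single element).
filter(func): returns an RDD made of the input elements that pass func (func returns a Boolean; elements mapped to true are kept, elements mapped to false are dropped).
flatMap(func): similar to map(func), but func returns a sequence rather than a single element. With map each input element yields exactly one output element; with flatMap each input element yields 0 to n output elements. flatMap is often used with String.split.
mapPartitions(func): similar to map(func), but func processes the data of a whole partition through an iterator. It is a form of optimization; Spark SQL and DataFrame enable it by default.
mapPartitionsWithIndex(func): similar to mapPartitions(func), but func also receives the current partition id.
sample(withReplacement, fraction, seed): sampling. To be completed when I actually use it.
union(otherDataset): returns an RDD; rdd1.union(rdd2) is the union without deduplication.
intersection(otherDataset): returns an RDD with the intersection. It must be called between two RDDs, not between two Arrays.
distinct([numTasks]): returns an RDD with duplicates removed.
groupByKey([numTasks]): groups the values that share a key; called on a (K, V) RDD and returns an RDD.
groupByKey works on an RDD and needs no key name (there is none); groupBy on a Dataset needs a column name.
reduceByKey(func, [numTasks]): called on a (K, V) RDD and returns an RDD in which the values of each key are aggregated with func. numTasks is the number of reduce tasks.
kvRDD.reduceByKey(_ + _).collect()
is equivalent to
kvRDD.reduceByKey((a, b) => a + b).collect()
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): aggregates within each partition first, then merges the partial results across partitions.
sortByKey([ascending], [numTasks]): returns an RDD sorted by key; the key type must be orderable (for example by implementing Ordered).
sortBy(func, [ascending], [numTasks]): more flexible than sortByKey; func says what to sort by.
join(otherDataset, [numTasks]): takes RDDs of (K, V) and (K, W) and outputs an RDD of (K, (V, W)); equivalent to an inner join.
cogroup(otherDataset, [numTasks]): takes RDDs of (K, V) and (K, W) and returns an RDD of (K, (Iterable[V], Iterable[W])).
For example: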

val rdd1 = sc.parallelize(Array(("aa",1),("bb",2),("cc",6)))
val rdd2 = sc.parallelize(Array(("aa",3),("dd",4),("aa",5)))
// Note: List(...) can be used in place of Array(...) here.
// (Scala's Seq roughly corresponds to Java's List.)
rdd1.cogroup(rdd2).collect().foreach(println)
=========================
(aa,(CompactBuffer(1),CompactBuffer(3, 5)))
(dd,(CompactBuffer(),CompactBuffer(4)))
(bb,(CompactBuffer(2),CompactBuffer()))
(cc,(CompactBuffer(6),CompactBuffer()))

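cartesian(otherDataSet): the Cartesian product of two RDDs.
rdd1 cartesian rdd2
[Figure 2: Cartesian product of two RDDs]
zip: rdd1 = [(1),(2),(3)], rdd2 = [(10),(20),(30)], rdd1 zip rdd2 = [(1,10),(2,20),(3,30)]
pipe(command, [envVars]): lets Spark call an external program, such as a script.
coalesce(numPartitions, [shuffle]): repartitions the RDD. The first argument is the target number of partitions; the second says whether to shuffle and defaults to false. Going from fewer to more partitions needs shuffle = true; going from more to fewer works with shuffle = false.
repartition(numPartitions): repartitions the RDD and always shuffles; the argument is the target number of partitions. Typically used to increase the partition count.
repartitionAndSortWithinPartitions(partitioner): repartitions and sorts in one step. It operates on (K, V) RDDs and is more efficient than repartitioning first and sorting afterwards, because the sort is pushed down into the shuffle.
foldByKey(zeroValue)(func): folds key-value pairs, merging the values that share a key. The merge can be a sum, a product, and so on.
For example: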

// sum
rdd1.foldByKey(0)(_+_)
// product
rdd1.foldByKey(1)(_*_)

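combineByKey: merges the values that share a key. It is one of Spark's core higher-level functions; many of the higher-level key-value functions are implemented on top of combineByKey.
For example, it can compute the average score per key.
Another example: grouping by gender into a list of names and counting the people in each group.
partitionBy(partitioner): partitions the RDD; the partitioner can be a HashPartitioner.
cache/persist: cache the RDD to avoid recomputation and so save time. The difference: cache calls persist internally and has a single storage level, MEMORY_ONLY, while persist lets you choose the storage level.
rdd1.subtract(rdd2): returns an RDD of the elements that are in rdd1 but not in rdd2.
leftOuterJoin: left outer join.
rightOuterJoin: right outer join.
subtractByKey: similar to subtract, except that subtract takes RDDs of (k) and subtractByKey takes RDDs of (k, v). (It is subtract, not substract!)
keys: called on a key-value RDD, returns an RDD of the keys. Write rdd1.keys, not rdd1.keys()!
values: called on a key-value RDD, returns an RDD of the values. Write rdd1.values, not rdd1.values()!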
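The average-score use case of combineByKey can be sketched as follows. This is not the code from the original screenshots; the student names, scores and app name are hypothetical, and Spark 2.x is assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("combineByKey-sketch").setMaster("local[2]"))

// Hypothetical (student, score) records.
val scores = sc.parallelize(Seq(
  ("tom", 80.0), ("jerry", 90.0), ("tom", 100.0), ("jerry", 70.0), ("tom", 90.0)))

// The combiner is a running (sum, count) pair per key.
val sumCount = scores.combineByKey(
  // createCombiner: first value seen for a key within a partition
  (score: Double) => (score, 1),
  // mergeValue: further values for that key within the same partition
  (acc: (Double, Int), score: Double) => (acc._1 + score, acc._2 + 1),
  // mergeCombiners: merge the partial results of different partitions
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))

// tom: 270.0 / 3, jerry: 160.0 / 2
val averages = sumCount.mapValues { case (sum, n) => sum / n }.collectAsMap()
```

Keeping (sum, count) rather than a running average is what makes the per-partition results safe to merge.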

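Common Action operators
Actions trigger execution; a piece of Spark code needs at least one action for anything to run.

collect(): returns all elements of the dataset as an array.
count(): returns the number of elements in the dataset.
If the RDD comes from a file, count() gives the number of lines.
reduce(func): aggregates all elements of the RDD with func.
For example, rdd1.reduce((x, y) => x + y) returns the sum of all elements of rdd1 as an Int.
first(): returns the first element of the RDD.
take(n): returns an array of the first n elements of the RDD.
takeSample(withReplacement, num, [seed]): returns an array of num randomly sampled elements of the RDD.
takeOrdered(n, [ordering]): returns the first n elements in sorted order.
saveAsTextFile(path): saves the elements of the dataset as a text file on HDFS or another supported file system. Spark calls toString on each element to convert it to a line of text.
saveAsSequenceFile(path): saves the elements of the dataset as a Hadoop SequenceFile under the given directory, on HDFS or another Hadoop-supported file system.
saveAsObjectFile(path): serializes the elements of the RDD as objects and stores them in a file. On HDFS it is saved as a SequenceFile by default.
countByKey(): for an RDD of (K, V), returns a Map[K, Long] giving the number of elements for each key.
foreach(func): runs func on each element of the dataset, usually for its side effects.
aggregate(zeroValue)(seqOp, combOp): aggregation.
For example (this uses a Scala parallel collection, List.par, whose aggregate has the same shape as RDD.aggregate):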

scala> val rdd = List(1,2,3,4,5,6,7,8,9)
rdd: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> rdd.par.aggregate((0,0))(
(acc,number) => (acc._1 + number, acc._2 + 1),
(par1,par2) => (par1._1 + par2._1, par1._2 + par2._2)
)
res0: (Int, Int) = (45,9)
scala> res0._1 / res0._2
res1: Int = 5
// In a distributed computation, (acc, number) runs within each partition, and (par1, par2) adds up the per-partition results.

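top(num): rdd.top(num) returns the num largest elements, in descending order.
keyBy(func): rdd1 = [(10,20,30,40)], rdd1.keyBy(_ / 10) = [(1,10),(2,20),(3,30),(4,40)]. The result of func on each element becomes the key and the element itself the value. (Strictly speaking this is a transformation, since it returns an RDD.)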
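A few of the operators listed above, run against small in-memory RDDs. This is a minimal sketch assuming Spark 2.x; the sample values and the app name are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists; otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("actions-sketch").setMaster("local[2]"))

val nums = sc.parallelize(Seq(5, 3, 8, 1, 9, 2), 2)

val total    = nums.reduce(_ + _)      // 28
val firstTwo = nums.take(2)            // Array(5, 3): first elements in RDD order
val smallest = nums.takeOrdered(3)     // Array(1, 2, 3)
val largest  = nums.top(2)             // Array(9, 8)

// aggregate on an RDD: seqOp builds (sum, count) per partition, combOp merges partitions.
val (sum, count) = nums.aggregate((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))

// countByKey returns a Map[K, Long] on the driver.
val perKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).countByKey()

// keyBy returns an RDD, so an action (collect) is needed to see the pairs.
val keyed = sc.parallelize(Seq(10, 20, 30, 40)).keyBy(_ / 10).collect()
```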

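Commonly used:
.foreach(println)
.foreach(print)
.foreach(line => println(line))

Observations:
With an RDD read from a text file:
.foreach(println) printed the lines in RDD order, from first to last (serial, or parallel plus sort?)
.foreach(line => println(line)) printed them in some order other than the RDD's
The two calls are equivalent, so the difference most likely comes from scheduling: foreach runs on each partition separately and makes no ordering guarantee across partitions. To print in RDD order, use rdd.collect().foreach(println).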

spark sql

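Spark SQL is a distributed SQL query engine built for structured data. Its programming abstraction is the DataFrame.
Features:

Integration: SQL is integrated with Spark programs. Structured data can be queried as RDDs, with APIs for Python, Scala and Java.
Unified data access: load and query data from a variety of sources, such as Hive tables, JSON files and Parquet files.
Hive compatibility.
Standard connectivity: JDBC and ODBC.
SchemaRDD: can be thought of as a temporary table; it is what is now called a DataFrame.

SparkSession: the entry class of Spark SQL, corresponding to SparkContext in spark-core.
spark.implicits._ is mainly used for implicit conversions, such as RDD to DataFrame.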

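Create a JSON file (write it with vim on the local file system, then put it to HDFS)
[Figure 3: creating employee.json and putting it to HDFS]
Load the JSON from HDFS, use show to query the data and printSchema to view the schema
[Figure 4: loading the JSON, show and printSchema]
spark.read.json("hdfs:/test/employee.json") is equivalent to spark.read.format("json").load("hdfs:/test/employee.json")
select().show queries a single column (like SELECT column_name)
[Figure 5: select]
filter queries with a condition (like SELECT ... WHERE)
[Figure 6: filter]
groupBy aggregates
[Figure 7: groupBy]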
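Since the screenshots are not available, here is a rough sketch of select and filter on a DataFrame. It assumes Spark 2.x and builds the rows in memory (the same values as the employee.json output shown in the spark session section) so that no HDFS is needed; the app name is made up.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one (assumption).
val spark = SparkSession.builder()
  .appName("dataframe-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._   // enables toDF, as[...] and the $"col" syntax

// RDD -> DataFrame through the implicit conversion.
val df = spark.sparkContext
  .parallelize(Seq(
    (1201, "satish", 25), (1202, "krishna", 28), (1203, "amith", 39),
    (1204, "javed", 23), (1205, "prudvi", 23)))
  .toDF("id", "name", "age")

df.printSchema()

// select: like SELECT name FROM employee
val names = df.select("name").as[String].collect()

// filter: like SELECT name FROM employee WHERE age > 25
val olderThan25 = df.filter($"age" > 25).select("name").as[String].collect()
```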
Spark SQL data sources
Spark SQL provides generic methods for loading and saving data: load() and save()
Load (read):
spark.read.format("parquet").load("hdfs:/test/employee_p.parquet")
Save (write), where df is a DataFrame:
df.write.format("parquet").save("hdfs:/test/employee_p.parquet")
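A round trip through save() and load() might look like the sketch below. It assumes Spark 2.x; the temporary local directory stands in for the hdfs:/ path used in this post, and the two sample rows are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one (assumption).
val spark = SparkSession.builder()
  .appName("load-save-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Hypothetical output location: a file:// URI under a fresh temp directory.
val path = Files.createTempDirectory("employee_p")
  .resolve("employee_p.parquet").toUri.toString

val df = Seq(("satish", 25L), ("krishna", 28L)).toDF("name", "age")

// save() belongs to the DataFrameWriter returned by df.write.
df.write.format("parquet").save(path)

// load() belongs to the DataFrameReader returned by spark.read.
val loaded = spark.read.format("parquet").load(path)
val rowCount = loaded.count()
val columns = loaded.columns.toSeq
```

The writer hangs off a DataFrame and the reader off the session, which is why the save call starts from df rather than spark.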

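Parquet file format: a columnar storage format
ORC file format: a columnar storage format
JSON
Hive tables
JDBC:

JDBC approach: (first put mysql-connector-java.jar into the spark/jars/ directory, then start spark-shell as usual)
Read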

scala> import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext
scala> val sqlContext = new SQLContext(sc)
warning: one deprecation (since 2.0.0); for details, enable `:setting -deprecation' or `:replay -deprecation'
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4655e059
scala> val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:mysql://localhost:3306/test?useSSL=false", "driver" -> "com.mysql.jdbc.Driver", "dbtable" -> "employee", "user" -> "root", "password" -> "123456")).load()
jdbcDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

Write

// Imports and setup
scala> import java.util.Properties
scala> import org.apache.spark.sql.{SQLContext, Row}
scala> import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType, LongType}
scala> val sqlContext = new SQLContext(sc)
scala> val studentRDD = sc.parallelize(Array("Tom 21","Jerry 23")).map(_.split(" "))
studentRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at map at <console>:26
// Define the schema
scala> val schema = StructType(List(StructField("name", StringType, true),StructField("age", LongType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,LongType,true))
// RDD -> Row. Create Row objects; each Row object is one row of rowRDD
scala> val rowRDD = studentRDD.map(p => Row(p(0).trim, p(1).toLong))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[2] at map at <console>:27
// Combine the RDD with the schema
val studentDataFrame = sqlContext.createDataFrame(rowRDD, schema)
// Create a prop variable to hold the JDBC connection parameters
val prop = new Properties()
prop.put("user", "root") 
prop.put("password", "123456") 
prop.put("driver","com.mysql.jdbc.Driver")
prop.put("dbtable", "employee")
prop.put("url", "jdbc:mysql://localhost:3306/test?useSSL=false")
studentDataFrame.write.mode("append").jdbc("jdbc:mysql://localhost:3306/test?useSSL=false", "employee", prop)

docker mysql

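Install Docker with the official script
curl -sSL https://get.daocloud.io/docker | sh

Start the Docker service
systemctl start docker

Install MySQL with Docker
Pull the MySQL image (docker run pulls it automatically if it is not present locally)
Start it in a container named mysql-test
docker run -itd --name mysql-test -p 3306:3306 -e MYSQL_ROOT_PASSWORD=123456 mysql:5.7.30
Enter the mysql-test container
docker exec -it mysql-test /bin/bash
Access MySQL with the client
mysql -u root -p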

spark session

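The Spark SQL examples above all interact with Spark through SparkContext; since Spark 2.0, SparkSession is the recommended entry point.

SparkSession is the starting point of all Spark functionality. A basic SparkSession is created with SparkSession.builder().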

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Import the implicits only after spark exists (enables e.g. RDD -> DataFrame conversion)
import spark.implicits._

With a SparkSession, DataFrames can be created from RDDs, Hive tables and many other data sources.

scala> sparkSession.read.json("/test/employee.json").show
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+

val df = sparkSession.read.json("/test/employee.json")
df.printSchema
df.select("name").show

scala> sparkSession.read.json("/test/employee.json").filter($"age" > 21).show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+

groupBy
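A groupBy on the employee data could be sketched like this. It assumes Spark 2.x and rebuilds the five rows shown above in memory instead of reading /test/employee.json; the app name is made up.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the spark-shell session if one exists; otherwise starts a local one (assumption).
val spark = SparkSession.builder()
  .appName("groupBy-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// The same rows as the employee.json output above, built in memory.
val df = Seq(
  (25, 1201, "satish"), (28, 1202, "krishna"), (39, 1203, "amith"),
  (23, 1204, "javed"), (23, 1205, "prudvi")).toDF("age", "id", "name")

// Like SELECT age, COUNT(*) AS count FROM employee GROUP BY age
val perAge = df.groupBy("age").count()
perAge.orderBy("age").show()

// Bring the (age, count) pairs back to the driver as a Map.
val countByAge = perAge.collect().map(r => (r.getInt(0), r.getLong(1))).toMap
```

Only age 23 occurs twice in this data, so it is the only group with a count of 2.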
