Spark cluster installation and basic usage

Spark download page on the official site: http://spark.apache.org/downloads.html

[Screenshot: Spark download page]



I downloaded spark-1.6.3-bin-hadoop2.4, the 1.6.3 release built against Hadoop 2.4.



1. Download and extract
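For reference, this step boils down to two commands. The sketch below assumes the Apache archive URL and the /home/zzq/app target directory used throughout this post; your mirror and paths may differ:

wget https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.4.tgz
tar -zxvf spark-1.6.3-bin-hadoop2.4.tgz -C /home/zzq/app/
cd /home/zzq/app/spark-1.6.3-bin-hadoop2.4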

The directory structure after extraction:

[Screenshot: directory structure of spark-1.6.3-bin-hadoop2.4]


2. Edit the configuration files

[zzq@weekend110 spark-1.6.3-bin-hadoop2.4]$ cd conf/
[zzq@weekend110 conf]$ ll
total 36
-rw-r--r--. 1 zzq zzq  987 Nov  2 15:25 docker.properties.template
-rw-r--r--. 1 zzq zzq 1105 Nov  2 15:25 fairscheduler.xml.template
-rw-r--r--. 1 zzq zzq 1734 Nov  2 15:25 log4j.properties.template
-rw-r--r--. 1 zzq zzq 6671 Nov  2 15:25 metrics.properties.template
-rw-r--r--. 1 zzq zzq  878 Jan  6 07:22 slaves
-rw-r--r--. 1 zzq zzq 1292 Nov  2 15:25 spark-defaults.conf.template
-rwxr-xr-x. 1 zzq zzq 4345 Jan  6 06:53 spark-env.sh


In the conf directory, I removed the .template suffix from spark-env.sh.template and slaves.template (which is why only spark-env.sh and slaves show up in the listing above), then edited these two files.
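In other words, the renames amount to the following, run from inside conf/:

mv spark-env.sh.template spark-env.sh
mv slaves.template slaves
# cp instead of mv also works if you want to keep the original templates around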


vim spark-env.sh
Append the JDK path, the machine's IP, and the Spark master port:
export JAVA_HOME=/usr/local/jdk1.7.0_79
export SPARK_MASTER_IP=192.168.16.130
export SPARK_MASTER_PORT=7077

[zzq@weekend110 conf]$ cat spark-env.sh 
#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of executors to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
export JAVA_HOME=/usr/local/jdk1.7.0_79
export SPARK_MASTER_IP=192.168.16.130
export SPARK_MASTER_PORT=7077
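The template comments above also document per-worker limits such as SPARK_WORKER_CORES and SPARK_WORKER_MEMORY. They are optional, and the values below are only illustrative:

export SPARK_WORKER_CORES=1       # cores each worker offers to executors
export SPARK_WORKER_MEMORY=1g     # total memory each worker offers to executors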

Edit the slaves file to specify the worker nodes:

[zzq@weekend110 conf]$ vim slaves 

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
weekend111
weekend112

I added hosts entries for the worker names:

weekend111 is 192.168.16.135

weekend112 is 192.168.16.136
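These mappings need to resolve on every node so that the hostnames in slaves (and weekend110 itself) can be found. A minimal sketch of appending them to /etc/hosts on each machine, using the IPs from this setup:

sudo tee -a /etc/hosts <<'EOF'
192.168.16.130 weekend110
192.168.16.135 weekend111
192.168.16.136 weekend112
EOF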
Remote copy the Spark directory to both workers:

scp -r spark-1.6.3-bin-hadoop2.4/  weekend111:/home/zzq/app
scp -r spark-1.6.3-bin-hadoop2.4/  weekend112:/home/zzq/app
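Note that sbin/start-all.sh logs in to each worker over SSH, so passwordless SSH from the master to weekend111 and weekend112 is normally required. If it is not already configured, a typical setup for the zzq user looks roughly like this:

ssh-keygen -t rsa                 # accept the defaults
ssh-copy-id zzq@weekend111
ssh-copy-id zzq@weekend112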


3. Using Spark

Start Spark from the master:

[zzq@weekend110 spark-1.6.3-bin-hadoop2.4]$ ./sbin/start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /home/zzq/app/spark-1.6.3-bin-hadoop2.4/logs/spark-zzq-org.apache.spark.deploy.master.Master-1-weekend110.out
weekend111: starting org.apache.spark.deploy.worker.Worker, logging to /home/zzq/app/spark-1.6.3-bin-hadoop2.4/logs/spark-zzq-org.apache.spark.deploy.worker.Worker-1-weekend111.out
weekend112: starting org.apache.spark.deploy.worker.Worker, logging to /home/zzq/app/spark-1.6.3-bin-hadoop2.4/logs/spark-zzq-org.apache.spark.deploy.worker.Worker-1-weekend112.out
weekend111: failed to launch org.apache.spark.deploy.worker.Worker:
weekend111: full log in /home/zzq/app/spark-1.6.3-bin-hadoop2.4/logs/spark-zzq-org.apache.spark.deploy.worker.Worker-1-weekend111.out
weekend112: failed to launch org.apache.spark.deploy.worker.Worker:
weekend112: full log in /home/zzq/app/spark-1.6.3-bin-hadoop2.4/logs/spark-zzq-org.apache.spark.deploy.worker.Worker-1-weekend112.out

Check the processes with jps: one Master and two Workers (despite the "failed to launch" messages above, the workers did come up on weekend111 and weekend112).

[zzq@weekend110 spark-1.6.3-bin-hadoop2.4]$ jps
2156 Jps
2093 Master
[zzq@weekend111 ~]$ jps
2019 Jps
1969 Worker
[zzq@weekend112 ~]$ jps
2129 Jps
2079 Worker


The Spark master web UI (port 8080 by default):

[Screenshot: Spark master web UI]



Using the spark-shell

[zzq@weekend110 spark-1.6.3-bin-hadoop2.4]$ ./bin/spark-shell --master spark://192.168.16.130:7077 --executor-memory 1G --total-executor-cores 2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.3
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) Client VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
17/02/03 06:00:15 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
17/02/03 06:00:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
17/02/03 06:00:26 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/02/03 06:00:26 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
Java HotSpot(TM) Client VM warning: You have loaded library /tmp/libnetty-transport-native-epoll5980461953900249225.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/02/03 06:00:31 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
17/02/03 06:00:32 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> 


spark-shell parameters explained

--executor-memory specifies how much memory each executor uses.
--total-executor-cores is the total number of CPU cores used across all executors.
--executor-cores is the number of CPU cores each executor uses (a combined example follows below).
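For instance, a launch that sets all three flags explicitly could look like the following; the values are illustrative, not recommendations:

./bin/spark-shell --master spark://192.168.16.130:7077 \
  --executor-memory 1G \
  --executor-cores 1 \
  --total-executor-cores 2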


Example: the input is the wc.txt file on HDFS, and the reduce result is also written back to HDFS.

scala> sc.textFile("hdfs://weekend110:9000/hadoop/data/wc.txt").flatMap(_.split(" "))
res17: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at flatMap at <console>:28

scala> res17.map((_,1)).reduceByKey(_ + _)
res18: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:30


scala> res18.saveAsTextFile("hdfs://weekend110:9000/hadoop/data/output")
[Screenshot: job result]

This is the result of the run.
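To double-check from the command line rather than the screenshot, you can dump the output directory; the part-* names follow saveAsTextFile's usual naming, and the expected counts below are derived from the wc.txt contents shown next:

hadoop fs -ls /hadoop/data/output
hadoop fs -cat /hadoop/data/output/part-*
# expected pairs, in some order: (hello,2) (jetty,1) (hadoop,1) (apple,1) (thank,1)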


Contents of wc.txt:
[zzq@weekend110 ~]$ hadoop fs -cat /hadoop/data/wc.txt
hello jetty hadoop apple
thank hello



A Scala program example


package Test2

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object TestSpark {
  def main(args: Array[String]) {

    // expect two arguments: the input path and the output path
    if (args.length < 2) {
      println("args params error: usage <input path> <output path>")
      sys.exit(1)
    }

    /** Spark word-count example */
    val masterAddr = "spark://weekend110:7077"
    val conf = new SparkConf().setAppName("wordCount").setMaster(masterAddr)
    val sc = new SparkContext(conf)

    // split lines into words, map each word to (word, 1), sum counts per word, save to HDFS
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).saveAsTextFile(args(1))

    sc.stop()
  }
}
Then package it into a jar, upload it to the Spark master machine, and run it with spark-submit.
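A sketch of the packaging and upload steps, assuming a standard Maven layout and that the shaded artifact is (re)named wordCount.jar to match the spark-submit command used later; adjust names and paths to your project:

mvn clean package
scp target/wordCount.jar zzq@weekend110:/home/zzq/app/spark-1.6.3-bin-hadoop2.4/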

With the jar built by Maven I ran into an exception:

Java HotSpot(TM) Client VM warning: You have loaded library /tmp/libnetty-transport-native-epoll9051232252112870247.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: c
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:65)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:331)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$reduceByKey$3.apply(PairRDDFunctions.scala:331)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:330)
	at Test2.TestSpark$.main(TestSpark.scala:20)
	at Test2.TestSpark.main(TestSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I modified the build section of pom.xml. The part that matters for the exception above is merging the META-INF/services/org.apache.hadoop.fs.FileSystem entries when the jar is shaded, so the Hadoop FileSystem implementations can still be discovered:

 
  <build>
    <sourceDirectory>src/main/scala/</sourceDirectory>

    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>2.10.3</scalaVersion>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>cn.chinahadoop.spark.Analysis</mainClass>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>META-INF/services/org.apache.hadoop.fs.FileSystem</resource>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>


Run it:

[zzq@weekend110 spark-1.6.3-bin-hadoop2.4]$  ./bin/spark-submit --master spark://weekend110:7077 --name WordCountByscala --class Test2.TestSpark --executor-memory 1G --total-executor-cores 2 ./wordCount.jar hdfs://weekend110:9000/hadoop/data/wc.txt hdfs://weekend110:9000/hadoop/data/output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/03 08:40:14 INFO SparkContext: Running Spark version 1.6.3
17/02/03 08:40:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/03 08:40:15 INFO SecurityManager: Changing view acls to: zzq
17/02/03 08:40:15 INFO SecurityManager: Changing modify acls to: zzq
17/02/03 08:40:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zzq); users with modify permissions: Set(zzq)
17/02/03 08:40:15 INFO Utils: Successfully started service 'sparkDriver' on port 47906.
17/02/03 08:40:15 INFO Slf4jLogger: Slf4jLogger started
17/02/03 08:40:15 INFO Remoting: Starting remoting
17/02/03 08:40:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:54718]
17/02/03 08:40:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54718.
17/02/03 08:40:15 INFO SparkEnv: Registering MapOutputTracker
17/02/03 08:40:15 INFO SparkEnv: Registering BlockManagerMaster
17/02/03 08:40:15 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ee825a30-c3e4-45d8-90dc-ea822974efbc
17/02/03 08:40:15 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
17/02/03 08:40:16 INFO SparkEnv: Registering OutputCommitCoordinator
17/02/03 08:40:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/02/03 08:40:16 INFO SparkUI: Started SparkUI at http://192.168.16.130:4040
17/02/03 08:40:16 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2ec0fcd4-587e-4620-bd54-6cacfddc54ee/httpd-2043fbfd-e881-4f58-865c-c4eba096324a
17/02/03 08:40:16 INFO HttpServer: Starting HTTP Server
17/02/03 08:40:16 INFO Utils: Successfully started service 'HTTP file server' on port 39554.
17/02/03 08:40:16 INFO SparkContext: Added JAR file:/home/zzq/app/spark-1.6.3-bin-hadoop2.4/./wordCount.jar at http://192.168.16.130:39554/jars/wordCount.jar with timestamp 1486140016635
17/02/03 08:40:16 INFO AppClient$ClientEndpoint: Connecting to master spark://weekend110:7077...
17/02/03 08:40:17 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20170203084017-0002
17/02/03 08:40:18 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/0 on worker-20170203055642-192.168.16.135-59890 (192.168.16.135:59890) with 1 cores
17/02/03 08:40:18 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/0 on hostPort 192.168.16.135:59890 with 1 cores, 1024.0 MB RAM
17/02/03 08:40:18 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42663.
17/02/03 08:40:18 INFO NettyBlockTransferService: Server created on 42663
17/02/03 08:40:18 INFO BlockManagerMaster: Trying to register BlockManager
17/02/03 08:40:18 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/1 on worker-20170203055645-192.168.16.136-42516 (192.168.16.136:42516) with 1 cores
17/02/03 08:40:18 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/1 on hostPort 192.168.16.136:42516 with 1 cores, 1024.0 MB RAM
17/02/03 08:40:18 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.16.130:42663 with 517.4 MB RAM, BlockManagerId(driver, 192.168.16.130, 42663)
17/02/03 08:40:18 INFO BlockManagerMaster: Registered BlockManager
17/02/03 08:40:19 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/1 is now RUNNING
17/02/03 08:40:19 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
17/02/03 08:40:19 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/0 is now RUNNING
17/02/03 08:40:21 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
17/02/03 08:40:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 134.3 KB, free 517.3 MB)
17/02/03 08:40:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.4 KB, free 517.3 MB)
17/02/03 08:40:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.16.130:42663 (size: 12.4 KB, free: 517.4 MB)
17/02/03 08:40:22 INFO SparkContext: Created broadcast 0 from textFile at TestSpark.scala:20
Java HotSpot(TM) Client VM warning: You have loaded library /tmp/libnetty-transport-native-epoll5070917405505758915.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
17/02/03 08:40:26 INFO FileInputFormat: Total input paths to process : 1
17/02/03 08:40:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/02/03 08:40:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/02/03 08:40:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/02/03 08:40:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/02/03 08:40:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/02/03 08:40:27 INFO SparkContext: Starting job: saveAsTextFile at TestSpark.scala:20
17/02/03 08:40:27 INFO DAGScheduler: Registering RDD 3 (map at TestSpark.scala:20)
17/02/03 08:40:27 INFO DAGScheduler: Got job 0 (saveAsTextFile at TestSpark.scala:20) with 2 output partitions
17/02/03 08:40:27 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at TestSpark.scala:20)
17/02/03 08:40:27 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/02/03 08:40:27 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/02/03 08:40:27 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at TestSpark.scala:20), which has no missing parents
17/02/03 08:40:27 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 517.3 MB)
17/02/03 08:40:27 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 517.3 MB)
17/02/03 08:40:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.16.130:42663 (size: 2.3 KB, free: 517.4 MB)
17/02/03 08:40:27 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/02/03 08:40:27 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at TestSpark.scala:20)
17/02/03 08:40:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
17/02/03 08:40:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/03 08:40:57 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/03 08:41:08 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/1 is now EXITED (Command exited with code 1)
17/02/03 08:41:08 INFO SparkDeploySchedulerBackend: Executor app-20170203084017-0002/1 removed: Command exited with code 1
17/02/03 08:41:08 INFO BlockManagerMaster: Removal of executor 1 requested
17/02/03 08:41:08 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
17/02/03 08:41:08 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/2 on worker-20170203055645-192.168.16.136-42516 (192.168.16.136:42516) with 1 cores
17/02/03 08:41:08 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/2 on hostPort 192.168.16.136:42516 with 1 cores, 1024.0 MB RAM
17/02/03 08:41:08 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/02/03 08:41:12 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/03 08:41:14 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/2 is now RUNNING
17/02/03 08:41:27 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/03 08:41:37 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/0 is now EXITED (Command exited with code 1)
17/02/03 08:41:37 INFO SparkDeploySchedulerBackend: Executor app-20170203084017-0002/0 removed: Command exited with code 1
17/02/03 08:41:37 INFO BlockManagerMaster: Removal of executor 0 requested
17/02/03 08:41:37 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
17/02/03 08:41:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
17/02/03 08:41:37 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/3 on worker-20170203055642-192.168.16.135-59890 (192.168.16.135:59890) with 1 cores
17/02/03 08:41:37 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/3 on hostPort 192.168.16.135:59890 with 1 cores, 1024.0 MB RAM
17/02/03 08:41:41 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/3 is now RUNNING
17/02/03 08:41:42 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/02/03 08:41:56 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (weekend112:60400) with ID 2
17/02/03 08:41:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, weekend112, partition 0,ANY, 2188 bytes)
17/02/03 08:42:00 INFO BlockManagerMasterEndpoint: Registering block manager weekend112:39103 with 517.4 MB RAM, BlockManagerId(2, weekend112, 39103)
17/02/03 08:42:26 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/3 is now EXITED (Command exited with code 1)
17/02/03 08:42:26 INFO SparkDeploySchedulerBackend: Executor app-20170203084017-0002/3 removed: Command exited with code 1
17/02/03 08:42:26 INFO BlockManagerMaster: Removal of executor 3 requested
17/02/03 08:42:26 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 3
17/02/03 08:42:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/02/03 08:42:26 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/4 on worker-20170203055642-192.168.16.135-59890 (192.168.16.135:59890) with 1 cores
17/02/03 08:42:26 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/4 on hostPort 192.168.16.135:59890 with 1 cores, 1024.0 MB RAM
17/02/03 08:42:28 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/4 is now RUNNING
17/02/03 08:42:43 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on weekend112:39103 (size: 2.3 KB, free: 517.4 MB)
17/02/03 08:42:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on weekend112:39103 (size: 12.4 KB, free: 517.4 MB)
17/02/03 08:43:03 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/4 is now EXITED (Command exited with code 1)
17/02/03 08:43:03 INFO SparkDeploySchedulerBackend: Executor app-20170203084017-0002/4 removed: Command exited with code 1
17/02/03 08:43:03 INFO BlockManagerMaster: Removal of executor 4 requested
17/02/03 08:43:03 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 4
17/02/03 08:43:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/02/03 08:43:03 INFO AppClient$ClientEndpoint: Executor added: app-20170203084017-0002/5 on worker-20170203055642-192.168.16.135-59890 (192.168.16.135:59890) with 1 cores
17/02/03 08:43:03 INFO SparkDeploySchedulerBackend: Granted executor ID app-20170203084017-0002/5 on hostPort 192.168.16.135:59890 with 1 cores, 1024.0 MB RAM
17/02/03 08:43:04 INFO AppClient$ClientEndpoint: Executor updated: app-20170203084017-0002/5 is now RUNNING
17/02/03 08:43:08 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, weekend112, partition 1,ANY, 2188 bytes)
17/02/03 08:43:08 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 72590 ms on weekend112 (1/2)
17/02/03 08:43:08 INFO DAGScheduler: ShuffleMapStage 0 (map at TestSpark.scala:20) finished in 161.108 s
17/02/03 08:43:08 INFO DAGScheduler: looking for newly runnable stages
17/02/03 08:43:08 INFO DAGScheduler: running: Set()
17/02/03 08:43:08 INFO DAGScheduler: waiting: Set(ResultStage 1)
17/02/03 08:43:08 INFO DAGScheduler: failed: Set()
17/02/03 08:43:08 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at TestSpark.scala:20), which has no missing parents
17/02/03 08:43:08 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 173 ms on weekend112 (2/2)
17/02/03 08:43:08 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/02/03 08:43:09 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 57.2 KB, free 517.2 MB)
17/02/03 08:43:09 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.8 KB, free 517.2 MB)
17/02/03 08:43:09 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.16.130:42663 (size: 19.8 KB, free: 517.4 MB)
17/02/03 08:43:09 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/02/03 08:43:09 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at TestSpark.scala:20)
17/02/03 08:43:09 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
17/02/03 08:43:09 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, weekend112, partition 0,NODE_LOCAL, 1950 bytes)
17/02/03 08:43:09 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on weekend112:39103 (size: 19.8 KB, free: 517.4 MB)
17/02/03 08:43:09 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to weekend112:60400
17/02/03 08:43:09 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 151 bytes
17/02/03 08:43:11 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, weekend112, partition 1,NODE_LOCAL, 1950 bytes)
17/02/03 08:43:11 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 2464 ms on weekend112 (1/2)
17/02/03 08:43:11 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at TestSpark.scala:20) finished in 2.685 s
17/02/03 08:43:11 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 237 ms on weekend112 (2/2)
17/02/03 08:43:11 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
17/02/03 08:43:11 INFO DAGScheduler: Job 0 finished: saveAsTextFile at TestSpark.scala:20, took 164.809417 s
17/02/03 08:43:12 INFO SparkUI: Stopped Spark web UI at http://192.168.16.130:4040
17/02/03 08:43:12 INFO SparkDeploySchedulerBackend: Shutting down all executors
17/02/03 08:43:12 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
17/02/03 08:43:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/02/03 08:43:12 INFO MemoryStore: MemoryStore cleared
17/02/03 08:43:12 INFO BlockManager: BlockManager stopped
17/02/03 08:43:12 INFO BlockManagerMaster: BlockManagerMaster stopped
17/02/03 08:43:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/02/03 08:43:12 INFO SparkContext: Successfully stopped SparkContext
17/02/03 08:43:12 INFO ShutdownHookManager: Shutdown hook called
17/02/03 08:43:12 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/02/03 08:43:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-2ec0fcd4-587e-4620-bd54-6cacfddc54ee
17/02/03 08:43:12 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
17/02/03 08:43:12 INFO ShutdownHookManager: Deleting directory /tmp/spark-2ec0fcd4-587e-4620-bd54-6cacfddc54ee/httpd-2043fbfd-e881-4f58-865c-c4eba096324a



[Screenshot: spark-submit run result]


The job ran successfully.




