textFileStream: streaming from a local directory to HDFS with Spark

I am trying to stream the contents of a local directory to HDFS. A script modifies this local directory, adding new content every 5 seconds. My Spark program is supposed to stream the local directory's contents and save them to HDFS. However, when I start the stream, nothing happens. I checked the logs but found no hint of the problem.

Let me explain the scenario. A shell script moves a file containing some data into the local directory every 5 seconds. The streaming context's batch duration is also 5 seconds. Since the script moves each new file into place, the write should be atomic, if I am not mistaken. The stream should therefore pick up the data every five seconds and produce DStream batches. While searching for how to stream a local directory, I found that the path should be given as "file:///my/path". I have not tried that format yet. But if that is the case, how would the Spark executors on different nodes share a consistent view of the given local path?
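One check worth sketching here (an illustrative snippet added for this question; the resolution behavior it relies on is standard Hadoop behavior, not something confirmed in my logs): on a YARN cluster, a path without a scheme is resolved against fs.defaultFS, which usually points to HDFS, so "/home/karteekkhadoop/ch06input" may not refer to the local disk at all. From the same spark-shell session:

import org.apache.hadoop.fs.{FileSystem, Path}

// fs.defaultFS on a YARN cluster is typically hdfs://..., so this prints the
// filesystem that a schemeless path would actually be resolved against
println(FileSystem.get(sc.hadoopConfiguration).getUri)

// the filesystem this specific path would be read from
println(new Path("/home/karteekkhadoop/ch06input").getFileSystem(sc.hadoopConfiguration).getUri)

With that in mind, here is my code: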

import org.apache.spark._
import org.apache.spark.streaming._
import java.sql.Timestamp
import java.text.SimpleDateFormat

// 5-second batch interval, matching the script that drops a file every 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

val filestream = ssc.textFileStream("/home/karteekkhadoop/ch06input")

case class Order(time: java.sql.Timestamp, orderId: Long, clientId: Long, symbol: String, amount: Int, price: Double, buy: Boolean)

// parse each CSV line into an Order, dropping malformed lines
val orders = filestream.flatMap(line => {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
  val s = line.split(",")
  try {
    assert(s(6) == "B" || s(6) == "S")
    List(Order(new Timestamp(dateFormat.parse(s(0)).getTime()), s(1).toLong, s(2).toLong, s(3), s(4).toInt, s(5).toDouble, s(6) == "B"))
  } catch {
    case e: Throwable =>
      println("Wrong line format(" + e + ") : " + line)
      List()
  }
})

// count buy vs. sell orders per batch
val numPerType = orders.map(o => (o.buy, 1L)).reduceByKey((x, y) => x + y)

numPerType.repartition(1).saveAsTextFiles("/user/karteekkhadoop/ch06output/output", "txt")

ssc.awaitTermination()
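For comparison, a minimal sketch of the driver skeleton Spark Streaming expects (the file:// scheme and the print() output operation are assumptions for illustration, not taken from the code above). Note that awaitTermination() only blocks the caller: no batches are scheduled until StreamingContext.start() is invoked, and the snippet above never calls it.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))

// explicit file:// scheme so the stream watches the local directory instead of
// whatever fs.defaultFS points to (path and scheme are assumptions)
val filestream = ssc.textFileStream("file:///home/karteekkhadoop/ch06input")

filestream.print() // at least one output operation must be registered before start()

ssc.start()            // without this call, no streaming jobs are ever launched
ssc.awaitTermination() // blocks the main thread; it does not start the computation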

The path given above definitely exists. The relevant YARN logs are included below.

[karteekkhadoop@gw03 stream]$ yarn logs -applicationId application_1540458187951_12531

18/11/21 11:12:35 INFO client.RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8050

18/11/21 11:12:35 INFO client.AHSProxy: Connecting to Application History server at rm01.itversity.com/172.16.1.106:10200

Container: container_e42_1540458187951_12531_01_000001 on wn02.itversity.com:45454

LogAggregationType: LOCAL

LogType:stderr

LogLastModifiedTime:Wed Nov 21 10:52:00 -0500 2018

LogLength:5320

LogContents:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/hdp01/hadoop/yarn/local/filecache/2693/spark2-hdp-yarn-archive.tar.gz/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/usr/hdp/2.6.5.0-292/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for TERM

18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for HUP

18/11/21 10:51:57 INFO SignalUtils: Registered signal handler for INT

18/11/21 10:51:57 INFO SecurityManager: Changing view acls to: yarn,karteekkhadoop

18/11/21 10:51:57 INFO SecurityManager: Changing modify acls to: yarn,karteekkhadoop

18/11/21 10:51:57 INFO SecurityManager: Changing view acls groups to:

18/11/21 10:51:57 INFO SecurityManager: Changing modify acls groups to:

18/11/21 10:51:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, karteekkhadoop); groups with view permissions: Set(); users with modify permissions: Set(yarn, karteekkhadoop); groups with modify permissions: Set()

18/11/21 10:51:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

18/11/21 10:51:58 INFO ApplicationMaster: Preparing Local resources

18/11/21 10:51:59 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

18/11/21 10:51:59 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1540458187951_12531_000001

18/11/21 10:51:59 INFO ApplicationMaster: Waiting for Spark driver to be reachable.

18/11/21 10:51:59 INFO ApplicationMaster: Driver now available: gw03.itversity.com:38932

18/11/21 10:51:59 INFO TransportClientFactory: Successfully created connection to gw03.itversity.com/172.16.1.113:38932 after 90 ms (0 ms spent in bootstraps)

18/11/21 10:51:59 INFO ApplicationMaster:

YARN executor launch context:

env:

CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>/usr/hdp/2.6.5.0-292/hadoop/conf<CPS>/usr/hdp/2.6.5.0-292/hadoop/*<CPS>/usr/hdp/2.6.5.0-292/hadoop/lib/*<CPS>/usr/hdp/current/hadoop-hdfs-client/*<CPS>/usr/hdp/current/hadoop-hdfs-client/lib/*<CPS>/usr/hdp/current/hadoop-yarn-client/*<CPS>/usr/hdp/current/hadoop-yarn-client/lib/*<CPS>/usr/hdp/current/ext/hadoop/*<CPS>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/2.6.5.0-292/hadoop/lib/hadoop-lzo-0.6.0.2.6.5.0-292.jar:/etc/hadoop/conf/secure:/usr/hdp/current/ext/hadoop/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__

SPARK_YARN_STAGING_DIR -> *********(redacted)

SPARK_USER -> *********(redacted)

command:

LD_LIBRARY_PATH="/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH" \

{{JAVA_HOME}}/bin/java \

-server \

-Xmx1024m \

-Djava.io.tmpdir={{PWD}}/tmp \

'-Dspark.history.ui.port=18081' \

'-Dspark.driver.port=38932' \

'-Dspark.port.maxRetries=100' \

-Dspark.yarn.app.container.log.dir=<LOG_DIR> \

-XX:OnOutOfMemoryError='kill %p' \

org.apache.spark.executor.CoarseGrainedExecutorBackend \

--driver-url \

spark://CoarseGrainedScheduler@gw03.itversity.com:38932 \

--executor-id \

<executorId> \

--hostname \

<hostname> \

--cores \

1 \

--app-id \

application_1540458187951_12531 \

--user-class-path \

file:$PWD/__app__.jar \

1><LOG_DIR>/stdout \

2><LOG_DIR>/stderr

resources:

__spark_libs__ -> resource { scheme: "hdfs" host: "nn01.itversity.com" port: 8020 file: "/hdp/apps/2.6.5.0-292/spark2/spark2-hdp-yarn-archive.tar.gz" } size: 202745446 timestamp: 1533325894570 type: ARCHIVE visibility: PUBLIC

__spark_conf__ -> resource { scheme: "hdfs" host: "nn01.itversity.com" port: 8020 file: "/user/karteekkhadoop/.sparkStaging/application_1540458187951_12531/__spark_conf__.zip" } size: 248901 timestamp: 1542815515889 type: ARCHIVE visibility: PRIVATE

===============================================================================

18/11/21 10:51:59 INFO RMProxy: Connecting to ResourceManager at rm01.itversity.com/172.16.1.106:8030

18/11/21 10:51:59 INFO YarnRMClient: Registering the ApplicationMaster

18/11/21 10:51:59 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances

18/11/21 10:52:00 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals

End of LogType:stderr.This log file belongs to a running container (container_e42_1540458187951_12531_01_000001) and so may not be complete.

What is wrong with my code?
