Connecting Flume to Spark Streaming (test walkthrough)

Install Flume

       Edit the Flume configuration file to define the data source and where the data should be sent; a sketch of such a file is given below.
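       For reference, here is a minimal sketch of what flume-to-spark.conf could look like for this setup (an assumption, not copied from the original file): a netcat source listening on port 33333 and an Avro sink pushing events to the Spark Streaming receiver on port 44444, the two ports used later in this walkthrough.

              # flume-to-spark.conf (sketch): agent a1 = netcat source -> memory channel -> avro sink
              a1.sources = r1
              a1.sinks = k1
              a1.channels = c1

              # netcat source: telnet to localhost 33333 to feed events in
              a1.sources.r1.type = netcat
              a1.sources.r1.bind = localhost
              a1.sources.r1.port = 33333

              # avro sink: pushes events to the Spark Streaming FlumeUtils receiver
              a1.sinks.k1.type = avro
              a1.sinks.k1.hostname = localhost
              a1.sinks.k1.port = 44444

              # in-memory channel buffering events between source and sink
              a1.channels.c1.type = memory
              a1.channels.c1.capacity = 1000000
              a1.channels.c1.transactionCapacity = 1000000

              # wire source and sink to the channel
              a1.sources.r1.channels = c1
              a1.sinks.k1.channel = c1

       Note that with an Avro sink the Spark Streaming receiver has to be up before Flume starts; otherwise the sink reports connection refused, as seen later in this post.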

安装telnet

       apt-getinstall xinetd telnetd

      

       After installing, trying to use it showed:

       root@master:/usr/local/hadoop-2.7.5/sbin# telnet

       bash: telnet: command not found

       Since telnet depends on xinetd to start, xinetd has to be started first.

       root@master:/etc/xinetd.d# service xinetd status

        * xinetd is not running

       root@master:/etc/xinetd.d# service xinetd staart

       Usage: /etc/init.d/xinetd {start|stop|reload|force-reload|restart|status}

       root@master:/etc/xinetd.d# service xinetd start

        * Starting internet superserver xinetd                                   [ OK ]

       root@master:/etc/xinetd.d#

       A problem appeared

              root@master:/etc/xinetd.d# apt-get install telnetd

              Reading package lists... Done

              Building dependency tree

              Reading state information... Done

              telnetd is already the newest version (0.17-40).

              0 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.

              root@master:/etc/xinetd.d# telnetd

              bash: telnetd: command not found

              root@master:/etc/xinetd.d#

       Then I searched Baidu and, of all things, landed on a blog post I had written myself earlier...

       Solution

              It is not apt-get install xinetd telnetd; it is apt-get install telnet.

              After installing with apt-get install telnet, the command worked. Same mistake, twice...

Set up the jar needed to link Flume with Spark Streaming

       First check the local Scala and Spark versions

              root@master:/usr/local/spark/bin# spark-shell

              Setting default log level to "WARN".

              To adjust logging level use sc.setLogLevel(newLevel).

              18/04/26 01:28:22 WARN spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.

              Spark context Web UI available at http://172.17.0.2:4040

              Spark context available as 'sc' (master = local[*], app id = local-1524706101693).

              Spark session available as 'spark'.

              Welcome to

                    ____              __
                   / __/__  ___ _____/ /__
                  _\ \/ _ \/ _ `/ __/  '_/
                 /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
                    /_/

              Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)

              Type in expressions to have them evaluated.

              Type :help for more information.

              scala>

       Then download the matching package from the Maven repository

              http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume_2.11/2.0.2
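       If downloading from the command line is more convenient, something like the following should fetch the same artifact, assuming the standard Maven Central directory layout (this command is not part of the original steps):

              wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-flume_2.11/2.0.2/spark-streaming-flume_2.11-2.0.2.jar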

       Then copy it into the Docker container

              sudo docker cp spark-streaming-flume_2.11-2.0.2.jar master:/root/build

       Put it under Spark's jars directory

              root@master:/usr/local/spark/jars# mkdir flume

              root@master:/usr/local/spark/jars# cd flume/

              root@master:/usr/local/spark/jars/flume# ll

              total 8

              drwxr-xr-x 2 root root 4096 Apr 26 01:33 ./

              drwxr-xr-x 3  500  500 4096 Apr 26 01:33 ../

              root@master:/usr/local/spark/jars/flume# cp /root/build/spark-streaming-flume_2.11-2.0.2.jar .

              root@master:/usr/local/spark/jars/flume# ll

              total 112

              drwxr-xr-x 2 root root   4096 Apr 26 01:34 ./

              drwxr-xr-x 3  500  500   4096 Apr 26 01:33 ../

              -rw------- 1 root root 105087 Apr 26 01:34 spark-streaming-flume_2.11-2.0.2.jar

              root@master:/usr/local/spark/jars/flume#

       Modify the SPARK_DIST_CLASSPATH variable in spark-env.sh

              It was: export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-2.7.5/bin/hadoop classpath)

              Change it to: export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop-2.7.5/bin/hadoop classpath):/usr/local/spark/jars/flume/*:/usr/local/flume/lib/*

Write a Spark program for testing

       Found a piece of code online; source: http://dblab.xmu.edu.cn/blog/1745-2/

       from __future__ import print_function

       import sys

       from pyspark import SparkContext
       from pyspark.streaming import StreamingContext
       from pyspark.streaming.flume import FlumeUtils
       import pyspark

       if __name__ == "__main__":
           if len(sys.argv) != 3:
               print("Usage: flume_wordcount.py <hostname> <port>", file=sys.stderr)
               exit(-1)

           sc = SparkContext(appName="FlumeEventCount")
           ssc = StreamingContext(sc, 2)

           hostname = sys.argv[1]
           port = int(sys.argv[2])
           stream = FlumeUtils.createStream(ssc, hostname, port, pyspark.StorageLevel.MEMORY_AND_DISK_SER_2)
           stream.count().map(lambda cnt: "Recieve " + str(cnt) + " Flume events!!!!").pprint()

           ssc.start()
           ssc.awaitTermination()
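       The script above only counts events. If an actual word count were wanted (the file is named flume_wordcount.py, after all), the following is a hedged sketch of how the body of each event could be tokenized and counted; it assumes each element of the DStream is a (headers, body) pair, which matches the output shown later in this post.

       # sketch only: replace the stream.count()...pprint() line above with something like this
       from operator import add

       words = stream.map(lambda event: event[1]) \
                     .flatMap(lambda body: body.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(add)
       words.pprint()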

Run the Spark application with spark-submit

       root@master:~/pyworkspace# spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* flumetest.py localhost 44444

       This step essentially opens a server: with the push-based FlumeUtils.createStream, the receiver listens on the given local port and waits for the messages collected by Flume to be pushed over.

Start Flume

       Go into the Flume directory and run

              root@master:/usr/local/flume/conf# bin/flume-ng agent --conf ./conf --conf-file ./conf/flume-to-spark.conf --name a1 -Dflume.root.logger=INFO,console

              Two arguments must be supplied: the path and file name of the configuration file, and the name of the agent as defined inside that configuration file.

Send data to Flume with telnet

       root@master:/# telnet localhost 33333

       Trying 127.0.0.1...

       Connected to localhost.

       Escape character is '^]'.

       vvvvvvvvvvvvvvv  gggggggg     ttttttt

       OK

       uuuuuuuuuuu

       OK

A problem appeared

       On the Spark output terminal, the received data was not printed; instead it printed WARN BlockManager: Block input-0-1524707416800 replicated to only 0 peer(s) instead of 1 peers

              At that point only the Hadoop cluster was running, not the Spark cluster; after starting the Spark cluster the same warning still appeared.

       Solution:

              The explanation found on Baidu: "Do not run Spark Streaming programs locally with master configured as local or local[1]. This allocates only one CPU for tasks and if a receiver is running on it, there is no resource left to process the received data. Use at least local[2] to have more cores." A sketch of the equivalent fix in code follows below.
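              For illustration only (not part of the original test), the same constraint can also be satisfied by setting the master inside the program instead of on the spark-submit command line; this is a minimal sketch assuming the same FlumeEventCount app:

                     # sketch: give the app at least two local cores so one can run the
                     # Flume receiver and another can process the received batches
                     from pyspark import SparkConf, SparkContext

                     conf = SparkConf().setAppName("FlumeEventCount").setMaster("local[2]")
                     sc = SparkContext(conf=conf)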

             

       Test: run root@master:~/pyworkspace# spark-submit --master yarn --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* flumetest.py localhost 44444 to have YARN manage the job

              This time both Flume and Spark reported errors

              Flume's error

                     org.apache.flume.EventDeliveryException: Failed to send events

              Spark's errors

                     WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

                     WARN DFSClient: Caught exception

                     java.lang.InterruptedException

                     WARN TransportChannelHandler: Exception in connection from /172.17.0.3:60354

                     java.io.IOException: Connection reset by peer

                     ERROR SparkContext: Error initializing SparkContext.

                     org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

                     WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!

                     WARN MetricsSystem: Stopping a MetricsSystem that is not running

                     Traceback (most recent call last):

                     py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.

       Test: run root@master:~/pyworkspace# spark-submit --master local[4] --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* flumetest.py localhost 44444

              This time it did print output. The WARN BlockManager: Block input-0-1524813488200 replicated to only 0 peer(s) instead of 1 peers warning still showed up, but it does not seem to cause any harm (presumably because, with a single local receiver, there is no second block manager to replicate the received block to).

       Test: run spark-submit on YARN

              To run the program on Hadoop YARN, the HADOOP_CONF_DIR environment variable must be set first

                     export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.5/etc/hadoop

              With that added, spark-submit itself ran without problems

                     spark-submit --master yarn --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* flumetest.py localhost 44444

              But starting Flume then failed

                     (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:205)] Rpc sink k1: Building RpcClient with hostname: localhost, port: 44444

                     (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:126)] Attempting to create Avro Rpc client.

                     (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:634)] Using default maxIOWorkers

                     (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.

                     org.apache.flume.EventDeliveryException: Failed to send events

                     Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 44444 }: RPC connection error

                     Caused by: java.io.IOException: Error connecting to localhost/127.0.0.1:44444

                     Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:44444

              Modify the Flume configuration file

                     a1.sources=r1
                     a1.sinks=k1
                     a1.channels=c1

                     # Describe/configure the source
                     a1.sources.r1.type=netcat
                     a1.sources.r1.bind=localhost
                     a1.sources.r1.port=33333

                     # Describe the sink
                     a1.sinks.k1.type=logger
                     #a1.sinks.k1.hostname=localhost
                     #a1.sinks.k1.port=44444

                     # Use a channel which buffers events in memory
                     a1.channels.c1.type=memory
                     a1.channels.c1.capacity=1000000
                     a1.channels.c1.transactionCapacity=1000000

                     # Bind the source and sink to the channel
                     a1.sources.r1.channels=c1
                     a1.sinks.k1.channel=c1

              That is, with Flume no longer sending messages to Spark (a logger sink instead of the Avro sink), Flume works normally with no errors. A plausible explanation for the earlier connection refusals: under --master yarn the Flume receiver runs inside a YARN executor on some cluster node, so an Avro sink pointed at localhost:44444 on the master may have nothing listening there.

             

             

       Problem: WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME

      

       Approach 1:

      

              root@master:~/pyworkspace# hadoop fs -mkdir spark_jars

              root@master:~/pyworkspace# hadoop fs -ls

              Found 2 items

              drwxr-xr-x   - root supergroup          0 2018-04-27 13:42 .sparkStaging

              drwxr-xr-x   - root supergroup          0 2018-04-27 13:44 spark_jars

              root@master:~/pyworkspace#

              root@master:~/pyworkspace# hadoop fs -copyFromLocal /usr/local/spark/jars/* spark_jars

              Add to spark-defaults.conf under Spark's conf directory:

                     spark.yarn.jars       hdfs://master:9000/spark_jars/*

              The warning WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME then went away

              But new errors appeared

                     WARN DFSClient: Caught exception

                     java.lang.InterruptedException

                            at java.lang.Object.wait(Native Method)

                     ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!

                     ERROR TransportClient: Failed to send RPC 9028375459380775738 to /172.17.0.3:55780: java.nio.channels.ClosedChannelException

                     java.io.IOException: Failed to send RPC 9028375459380775738 to /172.17.0.3:55780: java.nio.channels.ClosedChannelException

                     Caused by: java.io.IOException: Failed to send RPC 9028375459380775738 to /172.17.0.3:55780: java.nio.channels.ClosedChannelException

                     Fix attempt 1:

                            Add to yarn-site.xml

                                   <property>
                                     <name>yarn.nodemanager.pmem-check-enabled</name>
                                     <value>false</value>
                                   </property>
                                   <property>
                                     <name>yarn.nodemanager.vmem-check-enabled</name>
                                     <value>false</value>
                                   </property>

                            Running it produced errors

                                   ERROR SparkContext: Error initializing SparkContext.

                                   org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.

                                   ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map()) to AM was unsuccessful

                                   java.io.IOException: Failed to send RPC 8340030922793312011 to /172.17.0.4:52390: java.nio.channels.ClosedChannelException

                                   ERROR TransportClient: Failed to send RPC 8340030922793312011 to /172.17.0.4:52390: java.nio.channels.ClosedChannelException

                                   java.nio.channels.ClosedChannelException

                            After checking carefully, the master IP configured in spark-env.sh was wrong

                           

                            Copy the configuration file to the slaves

                                   root@master:~/pyworkspace# scp /usr/local/spark/conf/spark-defaults.conf slave01:/usr/local/spark/conf/

                                   spark-defaults.conf                              100% 1429     1.4KB/s   00:00

                                   root@master:~/pyworkspace# scp /usr/local/spark/conf/spark-defaults.conf slave02:/usr/local/spark/conf/

                                   spark-defaults.conf                              100% 1429     1.4KB/s   00:00

                                   root@master:~/pyworkspace#

                            Open Spark's log configuration file (conf/log4j.properties) and change

                                   log4j.rootCategory=DEBUG,console

                            Restart the Hadoop and Spark clusters

                    

                     Fix attempt 2:

                            Configure queue permissions in yarn-site.xml

                                   <property>
                                     <name>yarn.scheduler.capacity.root.queues</name>
                                     <value>default</value>
                                   </property>
                                   <property>
                                     <name>yarn.scheduler.capacity.root.capacity</name>
                                     <value>100</value>
                                   </property>
                                   <property>
                                     <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
                                     <value>root</value>
                                   </property>
                                   <property>
                                     <name>yarn.scheduler.capacity.root.acl_administer_queue</name>
                                     <value>root</value>
                                   </property>

                            Running spark-submit again, the problem above was gone, but this one appeared

                                   ERROR SparkContext: Error initializing SparkContext.

                                   org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /user/root/.sparkStaging/application_1524905120483_0001. Name node is in safe mode.

                                   The reported blocks 0 needs additional 101 blocks to reach the threshold 0.9990 of total blocks 101.

                                   The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

                                   The error says the namenode is in safe mode, in which file operations are apparently not allowed, so leave safe mode

                                          root@master:~/pyworkspace# hadoop dfsadmin -safemode leave

                                          DEPRECATED: Use of this script to execute hdfs command is deprecated.

                                          Instead use the hdfs command for it.

                                          Safe mode is OFF

                                   Re-run spark-submit. Then this appeared:

                                   WARN DFSClient: DataStreamer Exception

                                   org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/root/.sparkStaging/application_1524905120483_0002/pyspark.zip could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.

                                  

                                   Checking a slave, jps showed nothing at all, so the slaves had not started any services

                                          When shutting down the cluster it reported there was no datanode to stop

                                                 root@master:~/pyworkspace# . /usr/local/hadoop-2.7.5/sbin/stop-all.sh

                                                 This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh

                                                 Stopping namenodes on [master]

                                                 master: stopping namenode

                                                 slave02: no datanode to stop

                                                 slave01: no datanode to stop

                                                 Stopping secondary namenodes [0.0.0.0]

                                                 0.0.0.0: stopping secondarynamenode

                                                 stopping yarn daemons

                                                 stopping resourcemanager

                                                 slave02: no nodemanager to stop

                                                 slave01: no nodemanager to stop

                                                 no proxyserver to stop

                                   After fixing that, new errors appeared

                                          WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed:

                                          ERROR master.Master: RECEIVED SIGNAL TERM

       Approach 2: did not work

              Modify the yarn-site.xml file

              Add

                     <property>
                       <name>yarn.nodemanager.pmem-check-enabled</name>
                       <value>false</value>
                     </property>
                     <property>
                       <name>yarn.nodemanager.vmem-check-enabled</name>
                       <value>false</value>
                     </property>

              It did not help

      

              If --master yarn is not used:

                     spark-submit --driver-class-path /usr/local/spark/jars/*:/usr/local/spark/jars/flume/* flumetest.py --conf spark.yarn.jars="hdfs://master:9000/usr/local/spark/jars/* "

              The correct output shows up, but it cannot actually run

      

             

Test

       Show the text typed via telnet in the Spark Streaming output

              One line was added to the Spark application

                     stream.map(lambda cn: "Recieve " + str(cn)).pprint()

              The terminal then showed

                     Recieve({}, 'gggggg\r')
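              The printed value suggests each stream element is a (headers, body) pair: an empty header dict and the raw telnet line. For illustration (an assumption based on that output, not part of the original post), the body alone could be pulled out and cleaned up like this:

                     # sketch: print only the event body, stripped of the trailing telnet '\r'
                     stream.map(lambda event: "Receive " + event[1].strip()).pprint()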

Periodically clean up the files of stopped applications in Spark

       Add to spark-env.sh

              export SPARK_WORKER_OPTS="
              -Dspark.worker.cleanup.enabled=true
              -Dspark.worker.cleanup.interval=1800
              -Dspark.worker.cleanup.appDataTtl=604800"

       (The interval and TTL are in seconds, so this cleans every 30 minutes and keeps application data for 7 days.)

             

      

Check port usage on Linux

       netstat -tunlp
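       For example, to check whether anything is listening on the Avro port used above (illustrative, not from the original post):

              netstat -tunlp | grep 44444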

