Parsing offline logs with Spark: a daily Spark job reads that day's logs from HDFS, writes the parsed output back to an HDFS directory, and a Hive external table is created over it for analysis.

1. Log data and code

Each log line is a single JSON object, for example:

{"date":"20190312095854","uid":"d0e213542e032203","reason":"2","sver":"7.1.2"}
{"date":"20190312095855","uid":"c43632a682d79c64","reason":"2","sver":"7.1.3"}
package com.tv.sohu.spark

import net.sf.json.JSONObject
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OnlineNoUse {

  def main(args: Array[String]): Unit = {
    val idate = args(0)                // day to process, e.g. 20190312
    println(idate)
    val conf = new SparkConf()
    conf.setAppName("onlineNoUse")
//    conf.setMaster("local")          // local testing only; a master set here would override --master yarn at spark-submit time
    val sc = new SparkContext(conf)
    val lines: RDD[String] = sc.textFile("hdfs://video_actionLiveInfo/" + idate + "*")
//    val lines: RDD[String] = sc.textFile("src/onlineNoUse.txt")   // local test input
    val res: RDD[String] = lines.map(line => {
      val obj: JSONObject = JSONObject.fromObject(line)
      val date = obj.get("date")
      val uid = obj.get("uid")
      val reason = obj.get("reason")
      val sver = obj.get("sver")
      date + "\t" + uid + "\t" + reason + "\t" + sver              // emit one tab-separated line
    })
//    res.saveAsTextFile("d://out15")  // local test output
    res.saveAsTextFile("hdfs://warehouse/onlineNoUse/p_day=" + idate)
    sc.stop()
  }
}
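One fragility worth noting: JSONObject.fromObject throws on malformed input, so a single corrupt line fails the whole job. A minimal defensive variant (a sketch of mine, not from the original post) wraps the parse in scala.util.Try and drops lines that do not parse:

import scala.util.Try

// Same mapping as above, but a line that fails to parse (or lacks a
// field) yields None and is dropped by flatMap instead of killing the job.
val res: RDD[String] = lines.flatMap { line =>
  Try {
    val obj = JSONObject.fromObject(line)
    Seq("date", "uid", "reason", "sver")
      .map(k => obj.get(k).toString)
      .mkString("\t")
  }.toOption
}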

2. Maven build (pom.xml)



<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.tv.sohu</groupId>
    <artifactId>spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <!-- json-lib parses each log line; note the jdk15 classifier -->
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.3</version>
            <classifier>jdk15</classifier>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>commons-codec</groupId>
            <artifactId>commons-codec</artifactId>
            <version>1.9</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- compiles the Scala sources -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>2.11.8</scalaVersion>
                </configuration>
            </plugin>
            <!-- builds the jar-with-dependencies used for spark-submit -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.tv.sohu.spark.OnlineNoUse</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

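One caveat worth adding (my note, not in the original post): with the configuration above, spark-core_2.11 is bundled into the jar-with-dependencies artifact, which makes the jar large and can conflict with the Spark classes already on the cluster. If the jar is only ever launched through spark-submit, marking the dependency as provided keeps it out of the fat jar:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
    <!-- provided: supplied by the cluster at runtime, excluded from the fat jar -->
    <scope>provided</scope>
</dependency>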

3. Package all dependencies into a single jar and upload it

Add the following to the pom.xml (this is the assembly-plugin block already shown in section 2):


<plugin>
    <artifactId>maven-assembly-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <mainClass>com.tv.sohu.spark.OnlineNoUse</mainClass>
            </manifest>
        </archive>
        <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
    </configuration>
</plugin>

Then run the assembly from the project root:

D:\projects\sparkCompute>mvn assembly:assembly

The build writes target/spark-1.0-SNAPSHOT-jar-with-dependencies.jar (artifactId + version + classifier); upload this jar to the Spark client machine for the next step.

4. Run on the Spark client

/opt/spark-2.1.1/bin/spark-submit \
--deploy-mode client \
--master yarn \
--queue video \
--num-executors 9 \
--class com.tv.sohu.spark.OnlineNoUse \
/data/opt/userdata/vetl/spark-1.0-SNAPSHOT-jar-with-dependencies.jar $idate
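The $idate argument is assumed to be set by the caller. A hypothetical wrapper script (the date arithmetic is my assumption, not from the original post) that processes yesterday's logs:

#!/bin/bash
# Hypothetical wrapper: derive yesterday's date in yyyyMMdd form and submit.
idate=$(date -d "1 day ago" +%Y%m%d)

/opt/spark-2.1.1/bin/spark-submit \
  --deploy-mode client \
  --master yarn \
  --queue video \
  --num-executors 9 \
  --class com.tv.sohu.spark.OnlineNoUse \
  /data/opt/userdata/vetl/spark-1.0-SNAPSHOT-jar-with-dependencies.jar "$idate"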

Or, in cluster mode (this variant comes from a different build of the same job, hence the different class path and jar name):

$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--queue spark \
--num-executors 9 \
--class com.tv.sohu.spark.streaming.dm.webp2p.OnlineNoUse \
./spark-learn.jar $idate

5. Create the Hive table

Create an external table over the job's output directory. Note that the LOCATION below must match the path used in saveAsTextFile above; the two paths in this post differ, so adjust one of them until they agree.

CREATE EXTERNAL TABLE if not exists `onlineNoUse`(
  `date` string, 
  `uid` string, 
  `reason` string, 
  `sver` string
)
PARTITIONED BY ( 
  `p_day` bigint)
row format delimited fields terminated by '\t'
LOCATION
  'hdfs://hive/warehouse/onlineNoUse';
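Each new p_day directory must be registered as a partition before it is visible to queries. Besides the MSCK repair shown in the next step, a common per-day alternative is an explicit ADD PARTITION (my note, not from the original post):

-- Register a single day's output directory as a partition.
ALTER TABLE onlineNoUse ADD IF NOT EXISTS PARTITION (p_day=20190312);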

6. Repair partitions

MSCK REPAIR TABLE scans the table's HDFS location and registers any p_day=... directories that Hive does not yet know about:

msck repair table onlineNoUse;
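Once the partitions are registered, the table can be queried normally. An illustrative query (my example, using the fields from the sample logs in section 1):

-- Count distinct devices per client version for one day.
SELECT sver, COUNT(DISTINCT uid) AS devices
FROM onlineNoUse
WHERE p_day = 20190312
GROUP BY sver;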
