Flink: reading and writing Kafka topics, with data cleaning

半截入土

    • DataStream program architecture
    • Maven dependencies
    • Straightforward procedural version
    • Refactored OOP version
      • Abstract traits -- read, write, transform
      • Kafka source and sink implementations
      • Business-specific transform trait
      • Executor that mixes in the transform trait
      • Dynamic mixin at the call site

DataStream program architecture

The DataStream API is what Flink exposes to users for stream and batch computation. It wraps the underlying streaming execution model so that programs can be written against a higher-level abstraction.
A typical program follows these steps (a minimal skeleton is sketched after the list):

  1. Obtain an execution environment (Execution Environment)
  2. Load or create the initial data (Source)
  3. Specify transformations on the data (Transformation)
  4. Specify where to put the results (Sink)
  5. Trigger program execution; this call is required for streaming jobs, while batch jobs do not need it
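
The sketch below strings the five steps together; the in-memory sample data, the print sink, and the job name are made up purely for illustration and are not part of the article's pipeline:

import org.apache.flink.streaming.api.scala._

object SkeletonExample {
  def main(args: Array[String]): Unit = {
    // 1. obtain the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. load/create the initial data (Source)
    val source = env.fromElements("a,b", "c,d")
    // 3. specify transformations on the data (Transformation)
    val result = source.flatMap(_.split(","))
    // 4. specify where the results go (Sink) -- printing to stdout here
    result.print()
    // 5. trigger execution (required for streaming jobs)
    env.execute("skeleton-example")
  }
}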

Maven dependencies

The Flink and Kafka versions are declared as properties and referenced below as ${flink.version} and ${kafka.version}:

<properties>
    <flink.version>1.7.2</flink.version>
    <kafka.version>2.0.0</kafka.version>
</properties>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-2-uber</artifactId>
    <version>2.4.1-9.0</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>${kafka.version}</version>
</dependency>

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.62</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>${kafka.version}</version>
</dependency>

Straightforward procedural version

  1. Read: a Kafka consumer acts as the Flink source
  2. Write: a Kafka producer receives the results
object FlinkReadWriteKafka {
  def main(args: Array[String]): Unit = {
    // obtain the Flink streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Kafka consumer properties
    val prop = new Properties()
    // Kafka broker address
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    // consumer group id
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "md")
    // key/value deserializers
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    // if no offset has been committed yet, start consuming from the beginning
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")


    val ds = env.addSource(
      new FlinkKafkaConsumer[String](
        "event_attendees",
        // schema that (de)serializes records as plain strings
        new SimpleStringSchema(),
        prop
      ).setStartFromEarliest() // reset the consumer to the earliest offset
    )

    // transformations: each input record is expected to look like
    //   event,yes_users,maybe_users,invited_users,no_users
    // (user lists are space separated); explode it into (event, user, answer)
    // triples and drop entries with an empty user id
    val dataStream = ds.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))

    val prop2 = new Properties()
    prop2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop2.setProperty(ProducerConfig.RETRIES_CONFIG, "0")
    prop2.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop2.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    dataStream.addSink(new FlinkKafkaProducer[String](
      "single:9092",
      "event_attendees_ff",
      new SimpleStringSchema()))
    // trigger the streaming job
    env.execute("event_attendees_xf")
  }
}

Refactored OOP version

The procedural version above mixes reading, transformation, and writing in a single method. The refactoring below splits these concerns into separate traits so that each one can be swapped independently.

Abstract traits -- read, write, transform

trait Read[T] {
  def read(prop:Properties,tableName:String):DataStream[T]
}
trait Write[T] {
  def write(localhost:String,tableName:String,dataStream:DataStream[T]):Unit
}
trait Transform[T,V] {
  def transform(in:DataStream[T]):DataStream[V]
}
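
Because the traits carry no Kafka-specific details, other backends can satisfy them without touching the rest of the pipeline. A minimal sketch of an in-memory source for local testing (CollectionSource is invented here and is not part of the article's code):

import java.util.Properties
import org.apache.flink.streaming.api.scala._

// hypothetical in-memory source, useful for local tests
class CollectionSource(env: StreamExecutionEnvironment, data: Seq[String]) extends Read[String] {
  // the properties and table name are ignored; records come from the given collection
  override def read(prop: Properties, tableName: String): DataStream[String] =
    env.fromCollection(data)
}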

Kafka source and sink implementations

class KafkaSource(env: StreamExecutionEnvironment) extends Read[String] {
  override def read(prop: Properties, tableName: String): DataStream[String] = {
    env.addSource(
      new FlinkKafkaConsumer[String](
        tableName,
        new SimpleStringSchema(),
        prop
      )
    )
  }
}

object KafkaSource {
  def apply(env: StreamExecutionEnvironment): KafkaSource = new KafkaSource(env)
}
class KafkaSink extends Write[String] {
  override def write(localhost: String, tableName: String, dataStream: DataStream[String]): Unit = {
    dataStream.addSink(new FlinkKafkaProducer[String](
      localhost,
      tableName,
      new SimpleStringSchema()
    ))
  }
}

object KafkaSink {
  def apply(): KafkaSink = new KafkaSink()
}

Business-specific transform trait

trait FlikTransform extends Transform[String, String] {
  override def transform(in: DataStream[String]): DataStream[String] = {
    in.map(x => {
      val info = x.split(",", -1)
      Array(
        (info(0), info(1).split(" "), "yes"),
        (info(0), info(2).split(" "), "maybe"),
        (info(0), info(3).split(" "), "invited"),
        (info(0), info(4).split(" "), "no")
      )
    }).flatMap(x => x).flatMap(x => x._2.map(y => (x._1, y, x._3))).filter(_._2 != "")
      .map(_.productIterator.mkString(","))
  }
}
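
To make the cleaning step concrete, the snippet below (not part of the original article; the sample record is invented) feeds one record through the transform and prints the result:

import org.apache.flink.streaming.api.scala._

object TransformSample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // columns: event, yes users, maybe users, invited users, no users
    val sample = env.fromElements("123,11 22,33,44 55,")
    val out = (new FlikTransform {}).transform(sample)
    // expected output lines:
    //   123,11,yes   123,22,yes   123,33,maybe   123,44,invited   123,55,invited
    out.print()
    env.execute("transform-sample")
  }
}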

Executor that mixes in the transform trait

class KTExcutor(readConf: Properties, writelocalhost: String) {
  // self-type annotation: any concrete KTExcutor must also mix in FlikTransform
  tran: FlikTransform =>
  def worker(intopic: String, outputtopic: String) = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val kr = new KafkaSource(env).read(readConf, intopic)
    val ds = tran.transform(kr)
    KafkaSink().write(writelocalhost, outputtopic, ds)
    env.execute()
  }
}

Dynamic mixin at the call site

object EAtest {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "md")
    prop.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "1000")
    prop.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    val localhost = "single:9092"
    (new KTExcutor(prop, localhost) with FlikTransform)
      .worker("event_attendees", "attendees_AA")
  }
}
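
Because the transform arrives as a mixin, different cleaning logic can be plugged in without modifying KTExcutor: mix in a trait that overrides transform. The sketch below is illustrative only; UpperCaseTransform and the attendees_upper topic are invented, and if the executor's self-type were widened to Transform[String, String], the new trait would not even need to extend FlikTransform:

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.kafka.clients.consumer.ConsumerConfig

// hypothetical alternative cleaning rule; the override wins through trait linearization
trait UpperCaseTransform extends FlikTransform {
  override def transform(in: DataStream[String]): DataStream[String] =
    in.map(_.toUpperCase)
}

object EAtestUpper {
  def main(args: Array[String]): Unit = {
    val prop = new Properties()
    prop.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "single:9092")
    prop.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "md")
    prop.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    // same executor, different transform mixed in at the call site
    (new KTExcutor(prop, "single:9092") with UpperCaseTransform)
      .worker("event_attendees", "attendees_upper")
  }
}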
